Figure 1 shows a multi-order acoustic simulation for replay voice spoofing detection. In this study, we assume that a clean signal is audio recorded in a non-reverberant environment, such as a studio. We also assume that the original audio corresponds to 1st-order audio, which results from applying one recording process (speaker, room path, and microphone) to the clean signal, and that the replay audio corresponds to 2nd-order audio, which results from two recording processes. In addition, because audio acquires an acoustic configuration during recording, we assume that the 1st-order audio has one acoustic configuration and the 2nd-order audio has two. Multi-order acoustic simulation uses an existing clean signal and an RIR dataset to generate audio that simulates the acoustic configurations of the 1st-order and 2nd-order audio. To simulate the 1st-order audio, the clean signal is convolved with one RIR; to simulate the 2nd-order audio, it is convolved with two RIRs. Denoting the simulated 1st-order audio by $x_1$ and the simulated 2nd-order audio by $x_2$, both can be expressed in terms of a clean signal and RIRs as:

$$x_1[n] = s[n] * h_1[n] \tag{1}$$

$$x_2[n] = x_1[n] * h_2[n] \tag{2}$$

where $n$ is the index of the signal, $s$ is the clean signal, $*$ denotes convolution, and $h_1$ and $h_2$ are different RIRs. Equation 1 generates $x_1$ by convolving the temporal characteristics of $s$, such as frequency, phase, and amplitude, with the acoustic configuration of $h_1$, such as the microphone type, sound reduction, reverberation, and noise. Equation 2 generates $x_2$ by convolving the temporal characteristics of $x_1$ with the acoustic configuration of $h_2$. Convolving clean signals with RIRs to generate audio with an acoustic configuration has been used in various applications [21]. Techniques such as the image method and fast-RIR are being studied to generate RIRs that simulate room acoustics in various environments without restrictions [22, 23]. These RIR generation techniques can easily generate impulse responses that account for room size, sound reduction, time delay, reverberation, etc., and they perform well in simulating room acoustics [24]. However, RIRs generated by these techniques may not be suitable for simulating original and replay audio because they do not account for factors such as the non-linearity or distortion caused by the microphone.
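The two-stage convolution described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `simulate_orders`, its argument names, and the synthetic toy signals are hypothetical, and in practice the inputs would be a clean utterance and two different measured RIRs at the same sampling rate.

```python
import numpy as np


def simulate_orders(clean, rir_first, rir_second):
    """Return (first_order, second_order) audio simulated from a clean signal.

    Hypothetical helper for illustration; names are not taken from the paper.
    """
    clean = np.asarray(clean, dtype=np.float64)
    rir_first = np.asarray(rir_first, dtype=np.float64)
    rir_second = np.asarray(rir_second, dtype=np.float64)

    # 1st-order audio: one recording process, i.e. the clean signal
    # convolved with a single RIR.
    first_order = np.convolve(clean, rir_first, mode="full")

    # 2nd-order audio: the 1st-order audio passes through a second
    # recording process, i.e. it is convolved with a second RIR.
    second_order = np.convolve(first_order, rir_second, mode="full")
    return first_order, second_order


# Synthetic stand-ins (assumptions): white noise as the "clean" signal and
# exponentially decaying noise as two different toy RIRs.
rng = np.random.default_rng(0)
toy_clean = rng.standard_normal(1600)
toy_rir_a = np.exp(-np.arange(200) / 40.0) * rng.standard_normal(200)
toy_rir_b = np.exp(-np.arange(300) / 60.0) * rng.standard_normal(300)

first, second = simulate_orders(toy_clean, toy_rir_a, toy_rir_b)
```

Because convolution is associative, the 2nd-order output equals the clean signal convolved with the combined response of the two RIRs. For long recordings, an FFT-based routine such as `scipy.signal.fftconvolve` is typically faster than direct convolution.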
Considering these problems, this study used an existing RIR dataset acquired with smartphones, which are the most accessible recording devices. Smartphone hardware is evolving rapidly, and the performance of built-in microphones is improving; therefore, the threat of replay voice spoofing using smartphones may increase. For this reason, we used the Aachen impulse response dataset, whose RIRs were acquired through a physical recording process using a smartphone. The dataset provides 214 RIRs that reproduce a user talking or listening in a meeting or lecture in various places, such as offices, kitchens, corridors, stairways, lecture rooms, and meeting rooms, measured with a HEAD acoustics HMS II.3 artificial head and omnidirectional Beyerdynamic MM1 measurement microphones. In addition, we treated the VCTK Corpus as the source of clean signals because the ASVspoof2019 PA dataset was created from the VCTK Corpus.