Table 1 Breakdown of the dataset, showing the composition of the training, development, and evaluation splits in terms of number of samples, speakers, and synthesis methods

From: Synthetic speech detection through short-term and long-term prediction traces

|  |  | \({\mathcal {D}_{\text {tr}}}\) | \({\mathcal {D}_{\text {dev}}}\) | \({\mathcal {D}_{\text {eval}}}\) | Category |
| --- | --- | --- | --- | --- | --- |
| Samples | Bona fide | 2580 | 2548 | 7355 |  |
|  | Synthetic | 22800 | 22296 | 63882 |  |
| Speakers | Bona fide | 20 | 10 | 48 |  |
| Synthetic Methods | A01 | \(\checkmark \) | \(\checkmark \) |  | NN |
|  | A02 | \(\checkmark \) | \(\checkmark \) |  | VC |
|  | A03 | \(\checkmark \) | \(\checkmark \) |  | VC |
|  | A04 = A16 | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | WC |
|  | A05 | \(\checkmark \) | \(\checkmark \) |  | VC |
|  | A06 = A19 | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | VC |
|  | A07 |  |  | \(\checkmark \) | NN |
|  | A08 |  |  | \(\checkmark \) | NN |
|  | A09 |  |  | \(\checkmark \) | VC |
|  | A10 |  |  | \(\checkmark \) | NN |
|  | A11 |  |  | \(\checkmark \) | NN |
|  | A12 |  |  | \(\checkmark \) | NN |
|  | A13 |  |  | \(\checkmark \) | NN |
|  | A14 |  |  | \(\checkmark \) | VC |
|  | A15 |  |  | \(\checkmark \) | VC |
|  | A17 |  |  | \(\checkmark \) | VC |
|  | A18 |  |  | \(\checkmark \) | VC |

1. The column "Category" roughly indicates the approach used for waveform generation by each synthetic speech generation algorithm, where NN = neural network, VC = vocoder, and WC = waveform concatenation
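To make the class imbalance in Table 1 concrete, the sample counts can be encoded and the synthetic-to-bona-fide ratio computed per split. This is a minimal sketch; the split names and dictionary structure are illustrative, not part of the original dataset release.

```python
# Sample counts taken directly from Table 1; the SPLITS layout itself
# is a hypothetical encoding chosen for this sketch.
SPLITS = {
    "train": {"bona_fide": 2580, "synthetic": 22800},
    "dev":   {"bona_fide": 2548, "synthetic": 22296},
    "eval":  {"bona_fide": 7355, "synthetic": 63882},
}

def imbalance_ratio(split: str) -> float:
    """Return the number of synthetic samples per bona fide sample in a split."""
    counts = SPLITS[split]
    return counts["synthetic"] / counts["bona_fide"]

for name in SPLITS:
    print(f"{name}: {imbalance_ratio(name):.2f} synthetic per bona fide")
```

All three splits are skewed toward synthetic speech by roughly a factor of nine, which is worth keeping in mind when choosing training losses or evaluation metrics.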