Table 1 Breakdown of the dataset used, showing the composition of the training, development, and evaluation splits in terms of number of samples, speakers, and synthesis methods

From: Synthetic speech detection through short-term and long-term prediction traces

  

|  |  | \({\mathcal {D}_{\text {tr}}}\) | \({\mathcal {D}_{\text {dev}}}\) | \({\mathcal {D}_{\text {eval}}}\) | Category |
|---|---|---|---|---|---|
| Samples | Bona fide | 2580 | 2548 | 7355 |  |
|  | Synthetic | 22800 | 22296 | 63882 |  |
| Speakers | Bona fide | 20 | 10 | 48 |  |
| Synthetic methods | A01 | \(\checkmark \) | \(\checkmark \) |  | NN |
|  | A02 | \(\checkmark \) | \(\checkmark \) |  | VC |
|  | A03 | \(\checkmark \) | \(\checkmark \) |  | VC |
|  | A04 = A16 | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | WC |
|  | A05 | \(\checkmark \) | \(\checkmark \) |  | VC |
|  | A06 = A19 | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | VC |
|  | A07 |  |  | \(\checkmark \) | NN |
|  | A08 |  |  | \(\checkmark \) | NN |
|  | A09 |  |  | \(\checkmark \) | VC |
|  | A10 |  |  | \(\checkmark \) | NN |
|  | A11 |  |  | \(\checkmark \) | NN |
|  | A12 |  |  | \(\checkmark \) | NN |
|  | A13 |  |  | \(\checkmark \) | NN |
|  | A14 |  |  | \(\checkmark \) | VC |
|  | A15 |  |  | \(\checkmark \) | VC |
|  | A17 |  |  | \(\checkmark \) | VC |
|  | A18 |  |  | \(\checkmark \) | VC |

1. The column "Category" roughly indicates the approach used for waveform generation by the synthetic speech generation algorithm, where NN = neural network, VC = vocoder, and WC = waveform concatenation
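The sample counts in the table can be summarized programmatically, e.g. to check the class imbalance that a detector trained on these splits must cope with. The sketch below uses only the per-split counts from the table; the split names and dictionary layout are illustrative, not part of the original dataset release.

```python
# Sample counts per split, taken directly from Table 1.
splits = {
    "train": {"bona_fide": 2580, "synthetic": 22800},
    "dev":   {"bona_fide": 2548, "synthetic": 22296},
    "eval":  {"bona_fide": 7355, "synthetic": 63882},
}

for name, counts in splits.items():
    total = counts["bona_fide"] + counts["synthetic"]
    # Synthetic samples outnumber bona fide ones by roughly 9 to 1 in every split.
    ratio = counts["synthetic"] / counts["bona_fide"]
    print(f"{name}: {total} samples, synthetic/bona-fide ratio = {ratio:.1f}")
```

Note that the imbalance ratio is nearly identical across the three splits (about 8.7 to 8.8), so a classifier's operating point tuned on the development split transfers to evaluation without rebalancing.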