Table 1 Breakdown of the dataset used, showing the composition of the training, development, and evaluation splits in terms of number of samples, speakers, and synthesis methods

From: Synthetic speech detection through short-term and long-term prediction traces

  

|  |  | \({\mathcal {D}_{\text {tr}}}\) | \({\mathcal {D}_{\text {dev}}}\) | \({\mathcal {D}_{\text {eval}}}\) | Category |
|---|---|---|---|---|---|
| Samples | Bona fide | 2580 | 2548 | 7355 |  |
|  | Synthetic | 22800 | 22296 | 63882 |  |
| Speakers | Bona fide | 20 | 10 | 48 |  |
| Synthetic methods | A01 | \(\checkmark \) | \(\checkmark \) |  | NN |
|  | A02 | \(\checkmark \) | \(\checkmark \) |  | VC |
|  | A03 | \(\checkmark \) | \(\checkmark \) |  | VC |
|  | A04 = A16 | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | WC |
|  | A05 | \(\checkmark \) | \(\checkmark \) |  | VC |
|  | A06 = A19 | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | VC |
|  | A07 |  |  | \(\checkmark \) | NN |
|  | A08 |  |  | \(\checkmark \) | NN |
|  | A09 |  |  | \(\checkmark \) | VC |
|  | A10 |  |  | \(\checkmark \) | NN |
|  | A11 |  |  | \(\checkmark \) | NN |
|  | A12 |  |  | \(\checkmark \) | NN |
|  | A13 |  |  | \(\checkmark \) | NN |
|  | A14 |  |  | \(\checkmark \) | VC |
|  | A15 |  |  | \(\checkmark \) | VC |
|  | A17 |  |  | \(\checkmark \) | VC |
|  | A18 |  |  | \(\checkmark \) | VC |

1. The column "Category" roughly indicates the approach used for waveform generation by the synthetic speech generation algorithm, where NN = neural network, VC = vocoder, and WC = waveform concatenation
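The sample counts in the table can be summarized programmatically, e.g. to check the class imbalance that a detector trained on these splits must cope with. The sketch below uses only the per-split counts from the table; the split names and dictionary layout are illustrative, not part of the original dataset release.

```python
# Sample counts per split, taken directly from Table 1.
splits = {
    "train": {"bona_fide": 2580, "synthetic": 22800},
    "dev":   {"bona_fide": 2548, "synthetic": 22296},
    "eval":  {"bona_fide": 7355, "synthetic": 63882},
}

for name, counts in splits.items():
    total = counts["bona_fide"] + counts["synthetic"]
    # Synthetic samples outnumber bona fide ones by roughly 9 to 1 in every split.
    ratio = counts["synthetic"] / counts["bona_fide"]
    print(f"{name}: {total} samples, synthetic/bona-fide ratio = {ratio:.1f}")
```

Note that the imbalance ratio is nearly identical across the three splits (about 8.7 to 8.8), so a classifier's operating point tuned on the development split transfers to evaluation without rebalancing.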