Non-native speech database

Table 1: Abbreviations for languages used in Table 2

Arabic	A	Japanese	J
Chinese	C	Korean	K
Czech	Cze	Malaysian	M
Danish	D	Norwegian	N
Dutch	Dut	Portuguese	P
English	E	Russian	R
French	F	Spanish	S
German	G	Swedish	Swe
Greek	Gre	Thai	T
Indonesian	Ind	Vietnamese	V
Italian	I

The actual table with information about the different databases is shown in Table 2.

Table 2: Overview of non-native Databases

Corpus	Author	Available at	Languages	#Speakers	Native Language	#Utt.	Duration	Date	Remarks
AMI ^[2]		EU	E		Dut and other		100h		meeting recordings
ATR-Gruhn ^[3]	Gruhn	ATR	E	96	C G F J Ind	15000		2004	proficiency rating
BAS Strange Corpus 1+10 ^[4]		ELRA	G	139	50 countries	7500		1998
Berkeley Restaurant ^[5]		ICSI	E	55	G I H C F S J	2500		1994
Broadcast News ^[6]		LDC	E					1997
Cambridge-Witt ^[7]	Witt	U. Cambridge	E	10	J I K S	1200		1999
Cambridge-Ye ^[8]	Ye	U. Cambridge	E	20	C	1600		2005
Children News ^[9]	Tomokiyo	CMU	E	62	J C	7500		2000	partly spontaneous
CLIPS-IMAG ^[10]	Tan	CLIPS-IMAG	F	15	C V		6h	2006
CLSU ^[11]		LDC	E		22 countries	5000		2007	telephone, spontaneous
CMU ^[12]		CMU	E	64	G	452	0.9h		not available
Cross Towns ^[13]	Schaden	U. Bochum	E F G I Cze Dut	161	E F G I S	72000	133h	2006	city names
Duke-Arslan ^[14]	Arslan	Duke University	E	93	15 countries	2200		1995	partly telephone speech
ERJ ^[15]	Minematsu	U. Tokyo	E	200	J	68000		2002	proficiency rating
Fischer ^[16]		LDC	E		many		200h		telephone speech
Fitt ^[17]	Fitt	U. Edinburgh	F I N Gre	10	E	700		1995	city names
Fraenki ^[18]		U. Erlangen	E	19	G	2148
Hispanic ^[19]	Byrne		E	22	S		20h	1998	partly spontaneous
HLTC ^[20]		HKUST	E	44	C		3h	2010	available on request
IBM-Fischer ^[21]		IBM	E	40	S F G I	2000		2002	digits
iCALL ^[22]^[23]	Chen	I²R, A*STAR	C	305	24 countries	90841	142h	2015	phonetic and tonal transcriptions (in Pinyin), proficiency ratings
ISLE ^[24]	Atwell	EU/ELDA	E	46	G I	4000	18h	2000
Jupiter ^[25]	Zue	MIT	E	unknown	unknown	5146		1999	telephone speech
K-SEC ^[26]	Rhee	SiTEC	E	unknown	K			2004
LDC WSJ1 ^[27]		LDC		10		800	1h	1994
LeaP ^[28]	Gut	University of Münster	E G	127	41 different ones	73,941 words	12h	2003
MIST ^[29]		ELRA	E F G	75	Dut	2200		1996
NATO HIWIRE ^[30]		NATO	E	81	F Gre I S	8100		2007	clean speech
NATO M-ATC ^[31]	Pigeon	NATO	E	622	F G I S	9833	17h	2007	heavy background noise
NATO N4 ^[32]		NATO	E	115	unknown		7.5h	2006	heavy background noise
Onomastica ^[33]			D Dut E F G Gre I N P S Swe			(121000)		1995	only lexicon
PF-STAR ^[34]		U. Erlangen	E	57	G	4627	3.4h	2005	children speech
Sunstar ^[35]		EU	E	100	G S I P D	40000		1992	parliament speech
TC-STAR ^[36]	Heuvel	ELDA	E S	unknown	EU countries		13h	2006	multiple data sets
TED ^[37]	Lamel	ELDA	E	40(188)	many		10h(47h)	1994	eurospeech 93
TLTS ^[38]		DARPA	A		E		1h	2004
Tokyo-Kikuko ^[39]		U. Tokyo	J	140	10 countries	35000		2004	proficiency rating
Verbmobil ^[40]		U. Munich	E	44	G		1.5h	1994	very spontaneous
VODIS ^[41]		EU	F G	178	F G	2500		1998	about car navigation
WP Arabic ^[42]	Rocca	LDC	A	35	E	800	1h	2002
WP Russian ^[43]	Rocca	LDC	R	26	E	2500	2h	2003
WP Spanish ^[44]	Morgan	LDC	S		E			2006
WSJ Spoke ^[45]			E	10	unknown	800		1993

Legend

The table of non-native databases uses abbreviations for language names; these are listed in Table 1. Table 2 gives the following information about each corpus: the name of the corpus; the institution where the corpus can be obtained, or where further information should at least be available; the language actually spoken by the speakers; the number of speakers; the native language of the speakers; the total number of non-native utterances the corpus contains; the duration in hours of the non-native part; the date of the first public reference to the corpus; free text highlighting special aspects of the database; and a reference to another publication. In most cases this reference is to the paper in which the original collectors describe the corpus. Where no such paper could be identified, a paper that uses the corpus is referenced instead.
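As an illustration of how the columns and abbreviations fit together, the following is a minimal Python sketch that reads one tab-separated row of Table 2 and expands the language abbreviations from Table 1. The function names `parse_row` and `expand_languages` are hypothetical, the footnote marker is omitted from the example row, and the sketch assumes that cells are separated by tab characters.

```python
# Abbreviations copied from Table 1.
LANGUAGE_ABBREVIATIONS = {
    "A": "Arabic", "C": "Chinese", "Cze": "Czech", "D": "Danish",
    "Dut": "Dutch", "E": "English", "F": "French", "G": "German",
    "Gre": "Greek", "Ind": "Indonesian", "I": "Italian", "J": "Japanese",
    "K": "Korean", "M": "Malaysian", "N": "Norwegian", "P": "Portuguese",
    "R": "Russian", "S": "Spanish", "Swe": "Swedish", "T": "Thai",
    "V": "Vietnamese",
}

# Column names as they appear in the header row of Table 2.
COLUMNS = ["Corpus", "Author", "Available at", "Languages", "#Speakers",
           "Native Language", "#Utt.", "Duration", "Date", "Remarks"]


def parse_row(line):
    """Split one tab-separated row into a dict keyed by column name (hypothetical helper)."""
    cells = line.rstrip("\n").split("\t")
    # Trailing empty cells may be missing from a row; pad so every column is present.
    cells += [""] * (len(COLUMNS) - len(cells))
    return dict(zip(COLUMNS, cells))


def expand_languages(cell):
    """Expand space-separated abbreviations; tokens not in Table 1 are kept unchanged."""
    return [LANGUAGE_ABBREVIATIONS.get(token, token) for token in cell.split()]


# Example row taken from Table 2 (ATR-Gruhn), without the footnote marker.
row = parse_row("ATR-Gruhn\tGruhn\tATR\tE\t96\tC G F J Ind\t15000\t\t2004\tproficiency rating")
print(expand_languages(row["Languages"]))
print(expand_languages(row["Native Language"]))
```

Tokens that are not abbreviations, such as "50 countries" or "many", pass through unchanged, so free-text cells in the Native Language column are not damaged by the expansion.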

Some entries are left blank and others are marked "unknown". A blank entry means that the value is simply not known, whereas an "unknown" entry indicates that the database itself provides no information about the attribute. For example, the Jupiter weather database^[46] gives no information about the origin of its speakers. This data is therefore less useful for evaluating accent detection or similar tasks.
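The distinction between blank and "unknown" entries can be made explicit when the table is processed automatically. The following is a minimal sketch under that reading of the legend; the names `CellStatus` and `cell_status` are hypothetical and not part of any corpus tool.

```python
from enum import Enum


class CellStatus(Enum):
    KNOWN = "known"      # a value is given in the table
    BLANK = "blank"      # the value is simply not known
    UNKNOWN = "unknown"  # the database itself provides no information


def cell_status(cell):
    """Classify one table cell according to the legend (hypothetical helper)."""
    text = cell.strip()
    if not text:
        return CellStatus.BLANK
    if text.lower() == "unknown":
        return CellStatus.UNKNOWN
    return CellStatus.KNOWN


# Values taken from the "Native Language" column of Table 2.
print(cell_status("unknown"))  # Jupiter
print(cell_status(""))         # Broadcast News
print(cell_status("J C"))      # Children News
```

A filter built on this could, for instance, keep only corpora whose native-language cell is `KNOWN` when selecting data for accent-related experiments.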

Where possible, the name given is the standard name of the corpus. Some of the smaller corpora, however, had no established name, so an identifier had to be created. In such cases, a combination of the institution and the collector of the database is used.

Where a database contains both native and non-native speech, only the attributes of the non-native part of the corpus are listed. Most of the corpora are collections of read speech. If a corpus instead consists partly or completely of spontaneous utterances, this is mentioned in the Remarks column.
