Datasets

Speech datasets

The table below lists speech datasets with detailed attributes and links to software baselines and evaluation results. Each dataset may be used for one or more applications, such as automatic speech recognition, speaker identification and verification, source localization, or speech enhancement and separation. The meaning of each attribute is detailed below the table.

Disclaimer: Only datasets that are publicly available, (at least partially) annotated, suitable for research on robustness, and longer than 5 min are listed. Other relevant datasets are listed in the "Other datasets" section.

If you would like to refer to this table, please cite J. Le Roux and E. Vincent, "A categorization of robust speech processing datasets", Mitsubishi Electric Research Laboratories Technical Report, TR2014-116, Aug. 2014.

Datasets	General attributes								Speech							Channel				Noise		Ground truth
Datasets	rel. year	use case	total time (h)	sam. rate (kHz)	dist. or noisy mics	video cams	cost (non-memb)	links	speak. time (h)	uniq. speak.	lang.	uniq. words (k)	speak. style	speak. / rec.	overl. type	chan. type	speak. radiat.	speak. loc.	speak. moves	noise type	avg. SNR	ref. signal	speak. loc., orient.	words	non-verb. traits	noise events
ShATR	1994	meeting	0.6	48	3	no	free	download paper	0.6	5	UK English	1	spontaneous	5	multiple dialogs	reverb	human	quasi-fixed	head	meeting	high	headset	yes	yes	no	yes
LLSEC	1996	dialog	1.4	16	4	no	free	download	?	12	N/S	N/S	read, spontaneous	2	dialog	reverb	human	quasi-fixed	head	hallway, restaurant (scenarized)	medium	no	yes	no	no	no
RWCP Spoken Dialog Corpus	1996 - 1997	dialog	10	16	2	no	free	download paper	10	39	Japanese	?	spontaneous	1 - 2	dialog	reverb (low)	human	quasi-fixed	head	stationary background	high	no	no	yes	no	no
Aurora-2	2000	public spaces	33	8 - 16	1	no	free given TIDigits (0.5 k$)	purchase (incl. HTK) paper features	33	214	US English	0.01	digits	1	no	simulated phone	human	N/S	no	various real environments (rescaled)	low	original	N/S	yes	no	yes
SPINE1, SPINE2	2000 - 2001	military	38	16	2	no	7.4 k$	purchase paper	?	100	US English	1	command, spontaneous	1 - 2	no	simulated radio	human	quasi-fixed	head	military (rescaled)	low	no	no	yes	no	no
Aurora-3 (subset of SpeechDat-Car)	2000 - 2003	car	?	16	4	no	1 k€	purchase (incl. HTK) papers	?	730	various	0.01	digits	1	no	reverb	human	quasi-fixed	head	car	low	headset	no	yes	no	no
RWCP Meeting Speech Corpus	2001	meeting	3.5	16 - 48	1	3	free	download paper	3.5	?	Japanese	?	spontaneous	1 - 5	meeting	reverb (low)	human	quasi-fixed	head	stationary background	high	headset	no	yes	no	no
RWCP Real Environment Speech Database	2001	domestic, office	?	16 - 48	84	no	free	download paper	?	5	US English, Japanese	?	read	1	no	real rir, reverb	loudspeaker	various	no, pivoting arm	various (sum of events)	medium	original	yes	yes	no	yes
SpeechDat-Car	2001 - 2011	car	?	16	4	no	39 - 182 k€ per lang	purchase paper	?	300 per lang	various	?	digits, command, read, spontaneous	1	no	reverb	human	quasi-fixed	head	car	low	headset	no	yes	no	no
Aurora-4	2002	public spaces	?	8 - 16	1	no	free given WSJ0 (1.5 k$)	purchase paper HTK	?	101	US English	10	read	1	no	simulated phone	human	N/S	no	various real environments (rescaled)	low	original	N/S	yes	no	yes
TED	2002	seminar	47	16	1	no	0.5 k$	purchase paper	47	188	non-native English	?	lecture	1 or more	seminar	reverb	human	quasi-fixed	head	stationary background	high	lapel	no	partial	no	no
CUAVE	2002	speech overlap	3	44	1	1	free	download paper	3	36	US English	0.01	digits	1 - 2	full	reverb	human	quasi-fixed	head	stationary background	high	no	no	yes	no	no
CU-Move Microphone Array Data	2002 - 2011	car	286	44	6 - 8	no	25 k$	purchase paper	286	172	US English	12	digits, command, read, dialog	1	no	reverb	human	quasi-fixed	head	car	low	no	no	yes	no	no
CENSREC-1 (Aurora-2J)	2003	public spaces	?	8	1	no	free	download paper	?	214	Japanese	0.01	digits	1	no	simulated phone	human	N/S	no	various real environments (rescaled)	low	original	N/S	yes	no	yes
AVICAR	2004	car	40	16	7	4	free	download paper	40	87	US English, non-native English	1	read	1	no	reverb	human	quasi-fixed	head	moving car, windows open or closed	low	no	no	yes	no	no
AV16.3	2004	meeting	1.5	16	16	3	free	download paper	1.5	12	N/S	N/S	spontaneous	1 - 3	full	reverb	human	various	head, walk	stationary background	high	no	partial	no	no	no
ICSI Meeting Corpus	2004	meeting	72	16	6	no	2.8 k$	purchase info paper	72	53	US English, other English	13	meeting	3 - 10	meeting	reverb	human	quasi-fixed	head	meeting	high	headset, lapel	no	yes	yes	ad-hoc
NIST Meeting Pilot Corpus Speech	2004	meeting	15	16	7	no	5.5 k$	purchase paper	15	61	US English	6	meeting	3 - 9	meeting	reverb	human	various	head, walk	stationary background	high	headset, lapel	no	yes	no	no
CHIL Meetings	2004 - 2007	seminar, meeting	60	44	79 - 147	6 - 9	3.5 k€	purchase paper	?	?	non-native English	?	seminar, meeting	3 - 20	seminar, meeting	reverb	human	quasi-fixed	head	meeting (scenarized)	high	headset	yes	yes	yes	no
SPEECON	2004 - 2011	public space, domestic, office, car	?	16	3	no	75 k€ per lang	purchase paper	?	600 per lang	various	?	command, read, spontaneous	1	no	reverb	human	quasi-fixed	head	various real environments	medium	headset	no	yes	no	no
CENSREC-2	2005	car	?	16	1	no	free	download paper	?	214	Japanese	0.01	digits	1	no	reverb	human	quasi-fixed	head	car	low	headset	no	yes	no	no
CENSREC-3	2005	car	?	16	1	no	21 k¥	purchase paper	?	311	Japanese	0.05	read	1	no	reverb	human	quasi-fixed	head	car	low	headset	no	yes	no	no
Aurora-5	2006	public spaces, domestic, office, car	?	8	1	no	free given TIDigits (0.5 k$)	purchase (incl. HTK) paper	?	225	US English	0.01	digits	1	no	no, simulated rir, real rir	loudspeaker	fixed	no	various real environments (rescaled)	low	original	no	yes	no	yes
AMI	2006	meeting	100	16	16	6	free	download paper	?	189	UK English, other English	8	meeting	most often 4	meeting (18% overlap)	reverb	human	quasi-fixed	head	stationary background	high	headset, lapel	yes	yes	yes	no
PASCAL SSC	2006	speech overlap	8.8	25	1	no	free	download paper	8.8	34	UK English	0.05	command	2	full	no	human	N/S	no	no	N/S	original	N/S	yes	no	no
HIWIRE	2007	airplane	21	16	1	no	0.05 k€	purchase paper	21	81	non-native English	0.1	command	1	no	no	human	N/S	no	airplane (rescaled)	low	original	N/S	yes	no	no
NOIZEUS	2007	public spaces	0.6	8	1	no	free	download paper	0.6	6	US English	0.1	read	1	no	simulated phone	human	N/S	no	various real environments (rescaled)	low	original	N/S	no	no	no
UT-Drive	2007	car	40	25	5	2	25 k$	download paper	40	25	US English	2.4	command, dialog	1 - 2	dialog	reverb	human	quasi-fixed	head	car	low	headset (low quality)	no	partial	no	no
SASSEC, SiSEC under-determined	2007 - 2011	cocktail party	0.3	16	2	no	free	download paper	0.3	16	N/S	N/S	read	3 - 4	full	simulated rir, real rir, reverb	no, loudspeaker	fixed	no	no	N/S	original, spatial image	yes	no	no	no
MC-WSJ-AV, PASCAL SSC2, 2012_MMA, REVERB RealData	2007 - 2014	speech overlap	10	16	8 - 40	partial	1.5 k$	purchase paper paper info video HTK Kaldi results results	?	45	UK English	10	read	1 - 2	full	reverb	human	various	head, walk	stationary background	high	headset, lapel	yes	yes	no	no
CENSREC-4 (Simulated)	2008	public spaces, domestic, office, car	?	16	1	no	free	download paper	?	214	Japanese	0.01	digits	1	no	real rir	dummy	fixed	no	various real environments (rescaled)	low	original	no	yes	no	yes
CENSREC-4 (Real)	2008	public spaces, domestic, office, car	?	16	1	no	free	download paper	?	10	Japanese	0.01	digits	1	no	reverb	human	quasi-fixed	head	various real environments	low	headset	no	yes	no	yes
DICIT	2008	domestic	6	48	16	2	free	download paper	1	?	Italian	?	command	4	no	reverb	human	various	head, walk	domestic (scenarized)	medium	headset, tv	yes	yes	no	yes
SiSEC head-geometry	2008	speech overlap	1.9	16	2	no	free	download paper	1.9	?	N/S	N/S	read	2	full	real rir	loudspeaker	various	no	no	N/S	original, spatial image	yes	no	no	no
COSINE	2009	dialog	38	48	20	no	free	download paper	11	91	US English, non-native English	5	spontaneous	2 - 7	dialog	reverb	human	various	head, walk	various real environments	low	headset, throat mic	no	yes	no	no
SiSEC real-world noise	2010	public spaces	0.3	16	2 - 4	no	free	download paper	0.3	6	N/S	N/S	read	1 - 3	full	no, reverb (other room)	loudspeaker	various	no	various real environments (rescaled)	low	original, spatial image	yes	no	no	no
SiSEC dynamic	2010 - 2011	cocktail party	0.2	16	2 - 4	no	free	download paper	0.2	?	N/S	N/S	read	?	full (2 at a time)	reverb	loudspeaker	various	simulated	no	N/S	original, spatial image	yes	no	no	no
CHiME 1, CHiME 2 Grid	2011 - 2012	domestic	70	16 - 48	2	no	free	download paper HTK results results	12	34	UK English	0.05	command	1	no	real rir	dummy	quasi-fixed	simulated head	domestic (added without rescaling)	low	yes	yes	yes	no	no
CHiME 2 WSJ0	2012	domestic	78	16	2	no	free given WSJ0 (1.5 k$)	download paper HTK Kaldi results	33	101	US English	11	read	1	no	real rir	dummy	fixed	no	domestic (added without rescaling)	low	yes	yes	yes	no	no
ETAPE	2012	TV/radio debates, outdoor interviews	42	16	1	1	?	download paper	32	347	French	16	spontaneous	1 or more	dialog (up to 10% overlap)	reverb (some)	human	quasi-fixed	head	various real environments	high	no	N/S	yes	no	yes
GALE	2013	TV dialog	120 - 251 per lang	16	1	no	3.5 - 7 k$ per lang	purchase	108 - 234 per lang	?	Mandarin, Arabic	?	spontaneous	1 or more	dialog	no	human	quasi-fixed	head	no	N/S	no	N/S	yes	no	no
REVERB SimData	2013	domestic, office	25	16	8	no	free given WSJCAM0 (1.75 k$)	purchase paper HTK Kaldi results results	25	130	UK English	10	read	1	no	real rir	loudspeaker	various	no	random noise	high	original, spatial image	yes	yes	no	yes
Sheffield Wargames Corpus	2013	cocktail party	7	48	92	3	free	download paper	?	9	UK English	?	spontaneous	4	multiple dialogs	reverb	human	various	head, walk	background music	medium	headset	yes	yes	no	no
DIRHA	2014	domestic	11	48	40	no	free (partial avail.)	download paper	4	90	various	3.8	command, read, spontaneous	1 or more	simulated	real rir	loudspeaker	various	no	domestic (added without rescaling)	low	yes	yes	yes	no	yes

General attributes:

year of release
scenario: car, cocktail party, domestic, lecture, meeting, office, public space, TV...
total duration (h) (multiple channels counted only once)
sampling rate (kHz)
number of distant or noisy microphones
number of video cameras
cost for non-members of ELRA and LDC (cost for members is lower or free)
links: download data, reference papers, software baselines, evaluation results...

Speech attributes:

duration of speech (h) (overlapping speech counted only once)
number of unique speakers
language
number of unique words (differs from assumed vocabulary size, which is somewhat arbitrary)
speaking style: digits, command, read, spontaneous...
number of speakers present in the room
type of speaker overlap: no overlap, simulated overlap, dialogue, meeting, full overlap...
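The rule that overlapping speech is counted only once amounts to measuring the union of all speech segments rather than their sum. The following is a minimal sketch of that computation, assuming speech activity is available as (start, end) pairs in seconds; the function names and the sample segments are hypothetical and are not taken from any of the listed datasets.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def merge_intervals(segments: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching segments into disjoint intervals."""
    merged: List[Interval] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def speech_duration_hours(segments: Iterable[Interval]) -> float:
    """Total speech time in hours, counting overlapping speech only once."""
    seconds = sum(end - start for start, end in merge_intervals(segments))
    return seconds / 3600.0


# Hypothetical segments from two speakers; the first two overlap by 2 s.
segments = [(0.0, 10.0), (8.0, 15.0), (20.0, 30.0)]
print(merge_intervals(segments))
print(speech_duration_hours(segments) * 3600.0, "seconds of speech")
```

Summing the raw segment lengths here would give 27 s, whereas the merged intervals cover 25 s, which is the figure the "speak. time (h)" column is meant to reflect.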

Channel attributes:

channel type: none, simulated room impulse response, convolution by a recorded room impulse response, reverberant recording...
speaker radiation: loudspeaker, dummy head with mouth simulator, human...
speaker location: at a fixed position in the room, at a quasi-fixed position (e.g., seated), at different positions...
speaker movements: no movement, head movements, walking...
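For the channel type "convolution by a recorded room impulse response", the reverberant signal is obtained by convolving a clean (dry) signal with the impulse response. The sketch below illustrates the operation in plain Python with toy values; the signal and impulse response are made up for illustration, and in practice a library routine such as numpy.convolve or scipy.signal.fftconvolve would normally be used on real recordings.

```python
from typing import List, Sequence


def convolve(signal: Sequence[float], rir: Sequence[float]) -> List[float]:
    """Direct-form linear convolution; output length is len(signal) + len(rir) - 1."""
    if not signal or not rir:
        return []
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(rir):
            # Each input sample excites a delayed, scaled copy of the impulse response.
            out[n + k] += s * h
    return out


# Toy values: a direct path plus one echo two samples later at half amplitude.
dry = [1.0, 0.5, 0.0, 0.0]
rir = [1.0, 0.0, 0.5]
wet = convolve(dry, rir)
print(wet)
```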

Noise attributes:

noise type: stationary background noise (e.g., air-conditioning), car noise, meeting noises, domestic noises, outdoor noises...
average signal-to-noise ratio (SNR): low, medium, high
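Several entries in the table describe their noise as "rescaled", and the "avg. SNR" column reports a coarse signal-to-noise level. The page does not define "rescaled"; the sketch below assumes it means that a noise recording is scaled by a constant gain so that the mixture reaches a chosen SNR. The function names and sample values are hypothetical.

```python
import math
from typing import List, Sequence


def power(x: Sequence[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(power(speech) / power(noise))


def rescale_noise(speech: Sequence[float], noise: Sequence[float],
                  target_snr_db: float) -> List[float]:
    """Scale the noise so that speech power over noise power equals the target SNR."""
    target_ratio = 10.0 ** (target_snr_db / 10.0)
    gain = math.sqrt(power(speech) / (power(noise) * target_ratio))
    return [gain * v for v in noise]


# Toy signals, for illustration only.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, 0.2, -0.1, -0.2]
scaled = rescale_noise(speech, noise, target_snr_db=5.0)
noisy = [s + n for s, n in zip(speech, scaled)]
print(round(snr_db(speech, scaled), 6))
```

By contrast, the entries marked "added without rescaling" would correspond to summing speech and noise directly, so the SNR is whatever the recorded levels happen to give.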

Available ground truth:

reference speech signal: original (at the mouth), headset or lapel (slightly differs from the signal at the mouth), spatial image (at the microphones)...
speaker location and orientation
words uttered
paralinguistic attributes: nodding, gaze, communication intent, emotion... (excluding speaker attributes such as age, gender, or native language)
noise events: type and time of individual noise events

Impulse response datasets

The table below provides a list of impulse response (IR) datasets with detailed attributes. The meaning of each attribute is detailed below.

Disclaimer: Only datasets that are publicly available and include some reverberation (not only HRTFs) are listed.

Datasets	General attributes							Channel
Datasets	rel. year	envir.	total IRs	sam. rate (kHz)	mics	cost	links	chan. type	rooms	speak. radiat.	speak. loc.	speak. moves	mic. direc.	mic. loc.	mic. moves
RWCP Real Environment Acoustic Database	2001	varechoic room, office	364	16 - 48	84	free	download paper	real	7	dummy	9 (far)	yes	omni	fixed	no
SASSEC, SiSEC under-determined	2007 - 2011	office	?	16	2	free	download paper	simulated, real	4	no, loudspeaker	?	no	omni	fixed	no
SiSEC head-geometry	2008	office	38	16	2	free (partial avail.)	download paper	real	1	loudspeaker	19 (far)	no	binaural	fixed	no
Aachen Impulse Response	2009 - 2012	various	214	48	2	free	download paper	real	8	loudspeaker	13 (far)	no	omni, binaural, phone	fixed	no
CAMIL	2010 - 2012	office	32400	16	2	free	download paper	real	1	loudspeaker	fixed	no	binaural	16200 (close)	yes
CHiME 2 Grid	2012	domestic	242	16 - 48	2	free	download paper	real	1	dummy	121 (close)	no	binaural	fixed	no
AVASM	2013	office	864	16	2	free	download paper	real	1	loudspeaker	432 (close)	no	binaural	fixed	no
DIRHA	2014	domestic	9200	48	40	free (partial avail.)	download paper	real	5	loudspeaker	57 (far)	no	omni	fixed	no

General attributes:

year of release
recording environment: car, domestic, lecture, meeting, office, public space...
total IRs: total number of single-channel impulse responses
sampling rate (kHz)
number of microphones
cost
links: download data, reference papers, software baselines, evaluation results...

Channel attributes:

channel type: simulated or real impulse response
speaker radiation: loudspeaker, mouth simulator
speaker location: at a fixed position in the room, or number of different positions (closely spaced or far)
speaker movements: no movement, moves while recording
microphone directivity: omnidirectional, cardioid, binaural...
microphone location: at a fixed position in the room, or number of different positions (closely spaced or far)
microphone movements: no movement, moves while recording
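Since "total IRs" counts single-channel impulse responses, each measured position contributes one response per microphone. For several single-room entries in the table above (SiSEC head-geometry, CAMIL, CHiME 2 Grid, AVASM), the listed total matches the number of positions multiplied by the number of microphones. This is an observation from the listed values rather than a stated rule, and it does not hold for entries such as the RWCP database or DIRHA, which span several rooms. A short check, with a hypothetical helper name:

```python
from typing import List, Tuple


def expected_total_irs(positions: int, mics: int) -> int:
    """Single-channel IR count if every position is measured on every microphone."""
    return positions * mics


# (dataset, number of positions, microphones, listed total IRs), copied from the table.
rows: List[Tuple[str, int, int, int]] = [
    ("SiSEC head-geometry", 19, 2, 38),
    ("CAMIL", 16200, 2, 32400),
    ("CHiME 2 Grid", 121, 2, 242),
    ("AVASM", 432, 2, 864),
]

for name, positions, mics, listed in rows:
    print(name, expected_total_irs(positions, mics) == listed)
```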

Text datasets

Other datasets

This section lists other relevant datasets that do not meet the criteria above, for example because they have not been annotated or made publicly available yet.

Speech datasets:

BABEL (not yet available)
Broadcast news, HUB4 (no noise and 4.5% speaker overlap, less than ETAPE)
CIAIR In-Car Speech Database (availability unknown)
Dyrholm/Sawada/Parra (about 1 min long)
NEMISIG (unavailable)
RATS (not yet available)
Rich Transcription (RT) (dataset gathered from other sets, e.g. CHIL, ICSI, ISL, AMI...)
Settlers of Catan (unannotated, more info)
Flying MEMS microphone array (unannotated, more info)

Contribute a dataset

To contribute a new dataset, please

create an account and log in
go to the section above corresponding to your type of dataset; if the table does not exist yet, you may create it
click on the "Edit" link at the top of the table and add a new line for your dataset (the lines are ordered by year of release)
fill in all columns as completely as possible, following the detailed list of attributes below the table
click on the "Save page" link at the bottom of the page to save your modifications

We currently cannot provide storage space for large datasets. Please host the dataset at a stable URL on the website of your institution or elsewhere and provide only the URL. If this is not possible, please contact the resources sharing working group.

Contribute a software baseline

To contribute a new software baseline, please

create an account and log in
fill in an entry for your software on the Software page, if not done yet
go to the section above corresponding to the dataset for which your baseline was designed
click on the "Edit" link at the top of the table and add a link to your software in the corresponding "links" cell
click on the "Save page" link at the bottom of the page to save your modifications

We currently cannot provide storage space for large software packages. Please host your software at a stable URL on the website of your institution or elsewhere and provide only the URL. If this is not possible, please contact the resources sharing working group.

Contribute an evaluation result

To contribute a new evaluation result, please

create an account and log in
go to the section above corresponding to the dataset for which this result was obtained
click on the "Edit" link at the top of the table and add a link to your result in the corresponding "links" cell
make sure that the link (e.g., a paper or another webpage) contains the following information: authors, link to a paper/report containing objective evaluation results, link to derived data (output transcriptions, intermediary data, etc.)
click on the "Save page" link at the bottom of the page to save your modifications

To save storage space, please do not upload the paper to this wiki. Instead, link to it wherever possible in your institutional archive, in another public archive (e.g., arXiv), or on the publisher's website (e.g., IEEE Xplore).
