Datasets
Speech datasets
The table below lists speech datasets with detailed attributes and links to software baselines and research results. Each dataset may be used for one or more applications, such as automatic speech recognition, speaker identification and verification, source localization, and speech enhancement and separation.
Disclaimer: Only datasets that are publicly available, suitable for robust speech processing research, and longer than 5 min are listed.
The meaning of each attribute is detailed below.
Datasets | General attributes | | | | | | | | Speech | | | | | | | Channel | | | | Noise | Ground truth | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dataset | rel. year | use case | total time (h) | sam. rate (kHz) | dist. or noisy mics | video cams | cost | links | speak. time (h) | uniq. speak. | lang. | uniq. words (k) | speak. style | speak. / rec. | overl. type | chan. type | speak. radiat. | speak. loc. | speak. moves | noise type | ref. signal | speak. loc., orient. | words | non-verb. traits | noise events |
ShATR | 1994 | meeting | 0.6 | 48 | 3 | no | free | download | 0.6 | 5 | UK English | 1 | spontaneous | 5 | multiple dialogs | reverb | human | quasi-fixed | head | meeting | headset | yes | yes | no | yes |
LLSEC | 1996 | dialog | 1.4 | 16 | 4 | no | free | download | ? | 12 | N/S | N/S | read, spontaneous | 2 | dialog | reverb | human | quasi-fixed | head | hallway, restaurant (scenarized) | no | yes | no | no | no |
RWCP Spoken Dialog Corpus | 1996 - 1997 | dialog | 10 | 16 | 2 | no | free | download | 10 | 39 | Japanese | ? | spontaneous | 1 - 2 | dialog | reverb (low) | human | quasi-fixed | head | stationary background | no | no | yes | no | no |
Aurora-2 | 2000 | public spaces | 33 | 8 - 16 | 1 | no | free given TIDigits (0.5 k$) | purchase (incl. HTK) | 33 | 214 | US English | 0.01 | digits | 1 | no | simulated phone | human | N/S | no | various real environments | original | N/S | yes | no | yes |
SPINE1, SPINE2 | 2000 - 2001 | military | 38 | 16 | 2 | no | 7.4 k$ | purchase | ? | 100 | US English | 1 | command, spontaneous | 1 - 2 | no | simulated radio | human | quasi-fixed | head | military | no | no | yes | no | no |
Aurora-3 (subset of SpeechDat-Car) | 2000 - 2003 | car | ? | 16 | 4 | no | 1 k€ | purchase (incl. HTK) | ? | ? | various | ? | digits, command, read, spontaneous | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no |
RWCP Meeting Speech Corpus | 2001 | meeting | 3.5 | 16 - 48 | 1 | 3 | free | download | 3.5 | ? | Japanese | ? | spontaneous | 1 - 5 | meeting | reverb (low) | human | quasi-fixed | head | stationary background | headset | no | yes | no | no |
RWCP Real Environment Speech and Acoustic Database | 2001 | domestic, office | ? | 16 - 48 | 30 | no | free | download | ? | 5 | Japanese | ? | read | 1 | no | real rir, reverb | loudspeaker | various | no, pivoting arm | stationary background | original | yes | yes | no | yes |
SpeechDat-Car | 2001 - 2011 | car | ? | 16 | 4 | no | 39 - 182 k€ per lang | purchase | ? | 300 per lang | various | ? | digits, command, read, spontaneous | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no |
Aurora-4 | 2002 | public spaces | ? | 8 - 16 | 1 | no | free given WSJ0 (1.5 k$) | purchase | ? | 101 | US English | 10 | read | 1 | no | simulated phone | human | N/S | no | various real environments | original | N/S | yes | no | yes |
TED | 2002 | seminar | 47 | 16 | 1 | no | 0.5 k$ | purchase | 47 | 188 | non-native English | ? | lecture | 1 or more | seminar | reverb | human | quasi-fixed | head | stationary background | lapel | no | partial | no | no |
CUAVE | 2002 | cocktail party | 3 | 44 | 1 | 1 | free | download | 3 | 36 | US English | 0.01 | digits | 1 - 2 | full | reverb | human | quasi-fixed | head | stationary background | no | no | yes | no | no |
CU-Move Microphone Array Data | 2002 - 2011 | car | 286 | 44 | 6 - 8 | no | 25 k$ | purchase | 286 | 172 | US English | 12 | digits, command, read, dialog | 1 | no | reverb | human | quasi-fixed | head | car | no | no | yes | no | no |
CENSREC-1 (Aurora-2J) | 2003 | public spaces | ? | 8 | 1 | no | free | download | ? | 214 | Japanese | 0.01 | digits | 1 | no | simulated phone | human | N/S | no | various real environments | original | N/S | yes | no | yes |
AVICAR | 2004 | car | 29 | 16 | 7 | 4 | free | download | 29 | 86 | US English, non-native English | 1 | read | 1 | no | reverb | human | quasi-fixed | head | car | no | no | yes | no | no |
AV16.3 | 2004 | meeting | 1.5 | 16 | 16 | 3 | free | download | 1.5 | 12 | N/S | N/S | spontaneous | 1 - 3 | full | reverb | human | various | walk | stationary background | no | yes | no | no | no |
ICSI Meeting Corpus | 2004 | meeting | 72 | 16 | 6 | no | 2.8 k$ | purchase | 72 | 53 | US English | 13 | meeting | 3 - 10 | meeting | reverb | human | quasi-fixed | head | stationary background | headset, lapel | no | yes | yes | no |
NIST Meeting Pilot Corpus Speech | 2004 | meeting | 15 | 16 | 7 | no | 5.5 k$ | purchase | 15 | 61 | US English | 6 | meeting | 3 - 9 | meeting | reverb | human | various | walk | stationary background | headset, lapel | no | yes | no | no |
CHIL Meetings | 2004 - 2007 | seminar, meeting | 60 | 44 | 79 - 147 | 6 - 9 | 3.5 k€ | purchase | ? | ? | non-native English | ? | seminar, meeting | 3 - 20 | seminar, meeting | reverb | human | quasi-fixed | head | meeting (scenarized) | headset | yes | yes | yes | no |
SPEECON | 2004 - 2011 | public spaces, domestic, office, car | ? | 16 | 3 | no | 75 k€ per lang | purchase | ? | 600 per lang | various | ? | command, read, spontaneous | 1 | no | reverb | human | quasi-fixed | head | various real environments | headset | no | yes | no | no |
CENSREC-2 | 2005 | car | ? | 16 | 1 | no | free | download | ? | 214 | Japanese | 0.01 | digits | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no |
CENSREC-3 | 2005 | car | ? | 16 | 1 | no | 21 k¥ | purchase | ? | 311 | Japanese | 0.05 | read | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no |
Aurora-5 | 2006 | public spaces, domestic, office, car | ? | 8 | 1 | no | free given TIDigits (0.5 k$) | purchase (incl. HTK) | ? | 225 | US English | 0.01 | digits | 1 | no | no, simulated rir, real rir | loudspeaker | N/S | no | various real environments | original | no | yes | no | yes |
AMI | 2006 | meeting | 100 | 16 | 16 | 6 | free | download | ? | 189 | UK English | 8 | meeting | 4 | meeting (18% overlap) | reverb | human | quasi-fixed | head | stationary background | headset, lapel | yes | yes | yes | no |
PASCAL SSC | 2006 | cocktail party | 8.8 | 25 | 1 | no | free | download | 8.8 | 34 | UK English | 0.05 | command | 2 | full | no | human | N/S | no | no | original | N/S | yes | no | no |
HIWIRE | 2007 | airplane | 21 | 16 | 1 | no | 0.05 k€ | purchase | 21 | 81 | non-native English | 0.1 | command | 1 | no | no | human | N/S | head | airplane | original | N/S | yes | no | no |
UT-Drive | 2007 | car | 40 | 25 | 5 | 2 | 25 k$ | download | 40 | 25 | US English | 2.4 | command, dialog | 1 - 2 | dialog | reverb | human | quasi-fixed | head | car | headset (low quality) | no | partial | no | no |
SASSEC, SiSEC underdetermined | 2007 - 2011 | cocktail party | 0.3 | 16 | 2 | no | free | download | 0.3 | 16 | N/S | N/S | read | 3 - 4 | full | simulated rir, real rir, reverb | no, loudspeaker | fixed | no | no | original, spatial image | yes | no | no | no |
MC-WSJ-AV, PASCAL SSC2, 2012_MMA, REVERB RealData | 2007 - 2014 | cocktail party | 10 | 16 | 8 - 40 | no | 1.5 k$ | purchase | ? | 45 | UK English | 10 | read | 1 - 2 | full | reverb | human | various | walk | stationary background | headset, lapel | yes | yes | no | no |
CENSREC-4 (Simulated) | 2008 | public spaces, domestic, office, car | ? | 16 | 1 | no | free | download | ? | 214 | Japanese | 0.01 | digits | 1 | no | real rir | dummy | fixed | no | various real environments | original | no | yes | no | yes |
CENSREC-4 (Real) | 2008 | public spaces, domestic, office, car | ? | 16 | 1 | no | free | download | ? | 10 | Japanese | 0.01 | digits | 1 | no | reverb | human | quasi-fixed | head | various real environments | headset | no | yes | no | yes |
DICIT | 2008 | domestic | 6 | 48 | 16 | 2 | free | download | 1 | ? | Italian | ? | command | 4 | no | reverb | human | various | walk | domestic (scenarized) | headset, tv | yes | yes | no | yes |
SiSEC head-geometry | 2008 | cocktail party | 1.9 | 16 | 2 | no | free | download | 1.9 | ? | N/S | N/S | read | 2 | full | real rir | loudspeaker | various | no | no | original, spatial image | yes | no | no | no |
COSINE | 2009 | dialog | 38 | 48 | 20 | no | free | download | 11 | 91 | US English, non-native English | 5 | spontaneous | 2 - 7 | dialog | reverb | human | various | walk | various real environments | headset, throat mic | no | yes | no | no |
SiSEC real-world noise | 2010 | public spaces | 0.3 | 16 | 2 - 4 | no | free | download | 0.3 | 6 | N/S | N/S | read | 1 - 3 | full | no, reverb (other room) | loudspeaker | various | no | various real environments | original, spatial image | yes | no | no | no |
SiSEC dynamic | 2010 - 2011 | cocktail party | 0.2 | 16 | 2 - 4 | no | free | download | 0.2 | ? | N/S | N/S | read | many but only 2 simultaneous | full | reverb | loudspeaker | various | simulated | no | original, spatial image | yes | no | no | no |
CHiME 1, CHiME 2 Grid | 2011 - 2012 | domestic | 70 | 16 - 48 | 2 | no | free | download | 12 | 34 | UK English | 0.05 | command | 1 | no | real rir | dummy | quasi-fixed | simulated head | domestic | yes | yes | yes | no | no |
CHiME 2 WSJ0 | 2012 | domestic | 78 | 16 | 2 | no | free given WSJ0 (1.5 k$) | download | 33 | 101 | US English | 11 | read | 1 | no | real rir | dummy | fixed | no | domestic | yes | yes | yes | no | no |
ETAPE | 2012 | TV/radio debates, outdoor interviews | 42 | 16 | 1 | 1 | ? | download | 32 | 347 | French | 16 | spontaneous | 1 or more | dialog (up to 10% overlap) | reverb (some) | human | quasi-fixed | head | various real environments | no | N/S | yes | no | yes |
GALE (Chinese broadcast conversation) | 2013 | TV dialog | 120 | 16 | 1 | no | 3.5 k$ | purchase | 108 | ? | Mandarin | ? | spontaneous | 1 or more | dialog | no | human | quasi-fixed | head | no | no | N/S | yes | no | no |
GALE (Arabic broadcast conversation) | 2013 | TV dialog | 251 | 16 | 1 | no | 7 k$ | purchase | 234 | ? | Arabic | ? | spontaneous | 1 or more | dialog | no | human | quasi-fixed | head | no | no | N/S | yes | no | no |
REVERB SimData | 2013 | domestic, office | 25 | 16 | 8 | no | free given WSJCAM0 (1.75 k$) | purchase | 25 | 130 | UK English | 10 | read | 1 | no | real rir | loudspeaker | fixed | no | stationary background | original, spatial image | yes | yes | no | yes |
DIRHA | 2014 | domestic | 3.8 | 48 | 40 | no | free | download | 1.3 | 30 | various | ? | command, read, spontaneous | 1 or more | simulated | real rir | loudspeaker | various | no | domestic (sum of events) | yes | yes | yes | no | yes |
General attributes:
- year of release
- use case
- total duration
- sampling rate
- number of distant or noisy microphones
- number of video cameras
- cost
- links
Speech attributes:
- speaking duration
- number of unique speakers
- language
- number of unique words
- speaking style
- number of speakers present in the room
- type of speaker overlap
Channel attributes:
- channel type
- speaker radiation
- speaker location
- speaker movements
Noise attributes:
- noise type
Available ground truth:
- reference speech signal
- speaker location and orientation
- words uttered
- nonverbal traits
- noise events
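The attributes above make it possible to filter the table programmatically. The following is a minimal Python sketch, assuming the rows have been copied as pipe-delimited text exactly as they appear in the table. The column labels follow the table header, except for `dataset`, which is a label chosen here for the first column. The helper names (`parse_row`, `lower_bound`, `free_multimic`) are hypothetical and only serve the illustration.

```python
# Column labels in table order; "dataset" is an assumed label for the first column.
COLUMNS = [
    "dataset",
    # General attributes
    "rel. year", "use case", "total time (h)", "sam. rate (kHz)",
    "dist. or noisy mics", "video cams", "cost", "links",
    # Speech
    "speak. time (h)", "uniq. speak.", "lang.", "uniq. words (k)",
    "speak. style", "speak. / rec.", "overl. type",
    # Channel
    "chan. type", "speak. radiat.", "speak. loc.", "speak. moves",
    # Noise
    "noise type",
    # Ground truth
    "ref. signal", "speak. loc., orient.", "words", "non-verb. traits",
    "noise events",
]

# Four rows copied verbatim from the table, used as sample input.
SAMPLE_ROWS = """\
ShATR | 1994 | meeting | 0.6 | 48 | 3 | no | free | download | 0.6 | 5 | UK English | 1 | spontaneous | 5 | multiple dialogs | reverb | human | quasi-fixed | head | meeting | headset | yes | yes | no | yes |
CU-Move Microphone Array Data | 2002 - 2011 | car | 286 | 44 | 6 - 8 | no | 25 k$ | purchase | 286 | 172 | US English | 12 | digits, command, read, dialog | 1 | no | reverb | human | quasi-fixed | head | car | no | no | yes | no | no |
AMI | 2006 | meeting | 100 | 16 | 16 | 6 | free | download | ? | 189 | UK English | 8 | meeting | 4 | meeting (18% overlap) | reverb | human | quasi-fixed | head | stationary background | headset, lapel | yes | yes | yes | no |
DIRHA | 2014 | domestic | 3.8 | 48 | 40 | no | free | download | 1.3 | 30 | various | ? | command, read, spontaneous | 1 or more | simulated | real rir | loudspeaker | various | no | domestic (sum of events) | yes | yes | yes | no | yes |
"""


def parse_row(line):
    """Split one pipe-delimited table row into a dict keyed by column label."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    return dict(zip(COLUMNS, cells))


def lower_bound(value):
    """Return the first number of a cell such as '16' or '6 - 8'.

    Cells without a leading number ('?', 'N/S') yield None.
    """
    first = value.split("-")[0].strip()
    try:
        return float(first)
    except ValueError:
        return None


def free_multimic(rows, min_mics):
    """Names of datasets whose cost is exactly 'free' and that have at
    least min_mics distant or noisy microphones.

    Entries such as 'free given WSJ0 (1.5 k$)' are deliberately excluded,
    since they depend on a paid corpus.
    """
    return [
        row["dataset"]
        for row in rows
        if row["cost"] == "free"
        and (lower_bound(row["dist. or noisy mics"]) or 0) >= min_mics
    ]


if __name__ == "__main__":
    rows = [parse_row(line) for line in SAMPLE_ROWS.splitlines()]
    print(free_multimic(rows, min_mics=8))  # prints ['AMI', 'DIRHA']
```

Ranges such as "6 - 8" are reduced to their lower bound so that the filter stays conservative; a different policy (upper bound, or both ends) may suit other uses better.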
Text datasets
Other datasets
Contribute a dataset
To contribute a new dataset, please:
- create an account and log in
- go to the wiki page above corresponding to your application; if it does not exist yet, you may create it
- click on the "Edit" link at the top of the page and add a new section for your dataset (the datasets are ordered by year of collection)
- click on the "Save page" link at the bottom of the page to save your modifications
Please make sure to provide the following information:
- name of the dataset and year of collection
- authors, institution, contact information
- link to the dataset and to side resources (lexicon, language model, etc.)
- short description (nature of the data, license, etc.) and link to a paper/report describing the dataset, if any
- at least 1 research result obtained for this dataset (see below)
We currently cannot provide storage space for large datasets. Please host the dataset at a stable URL on your institution's website or elsewhere, and provide only that URL. If this is not possible, please contact the resources sharing working group.
Contribute a research result
To contribute a new research result, please:
- create an account and log in
- go to the wiki page and the section corresponding to the dataset for which this result was obtained
- click on the "Edit" link on the right of the section header and add a new item for your result
- click on the "Save page" link at the bottom of the page to save your modifications
Please make sure to provide the following information:
- authors, paper/report title, means of publication
- link to the pdf of the paper
- link to derived data (output transcriptions, intermediary data, etc.)
- code and instructions to reproduce experiments (if available)
To save storage space, please do not upload the paper to this wiki. Instead, link to it on your institutional archive, another public archive (e.g., arXiv), or the publisher's website (e.g., IEEE Xplore).
We currently cannot provide storage space for large datasets. Please host the derived data at a stable URL on your institution's website or elsewhere, and provide only that URL. If this is not possible, please contact the resources sharing working group.