Difference between revisions of "Datasets"
Revision as of 20:02, 14 August 2014
Speech datasets
The table below lists speech datasets with detailed attributes and links to software baselines and evaluation results. Each dataset may be used for one or more applications, such as automatic speech recognition, speaker identification and verification, source localization, and speech enhancement and separation. The meaning of each attribute is explained below the table.
Disclaimer: Only datasets that are publicly available, (at least partially) annotated, suitable for research on robustness, and longer than 5 min are listed. Other relevant datasets are listed below.
If you would like to refer to this table, please cite J. Le Roux and E. Vincent, "A categorization of robust speech processing datasets", Mitsubishi Electric Research Laboratories Technical Report, Aug. 2014.
Datasets | General attributes | | | | | | | | Speech | | | | | | | Channel | | | | Noise | | Ground truth | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dataset | rel. year | use case | total time (h) | sam. rate (kHz) | dist. or noisy mics | video cams | cost (non-memb.) | links | speak. time (h) | uniq. speak. | lang. | uniq. words (k) | speak. style | speak. / rec. | overl. type | chan. type | speak. radiat. | speak. loc. | speak. moves | noise type | avg. SNR | ref. signal | speak. loc., orient. | words | non-verb. traits | noise events |
ShATR | 1994 | meeting | 0.6 | 48 | 3 | no | free | download | 0.6 | 5 | UK English | 1 | spontaneous | 5 | multiple dialogs | reverb | human | quasi-fixed | head | meeting | high | headset | yes | yes | no | yes |
LLSEC | 1996 | dialog | 1.4 | 16 | 4 | no | free | download | ? | 12 | N/S | N/S | read, spontaneous | 2 | dialog | reverb | human | quasi-fixed | head | hallway, restaurant (scenarized) | medium | no | yes | no | no | no |
RWCP Spoken Dialog Corpus | 1996 - 1997 | dialog | 10 | 16 | 2 | no | free | download | 10 | 39 | Japanese | ? | spontaneous | 1 - 2 | dialog | reverb (low) | human | quasi-fixed | head | stationary background | high | no | no | yes | no | no |
Aurora-2 | 2000 | public spaces | 33 | 8 - 16 | 1 | no | free given TIDigits (0.5 k$) | purchase (incl. HTK) | 33 | 214 | US English | 0.01 | digits | 1 | no | simulated phone | human | N/S | no | various real environments (added) | low | original | N/S | yes | no | yes |
SPINE1, SPINE2 | 2000 - 2001 | military | 38 | 16 | 2 | no | 7.4 k$ | purchase | ? | 100 | US English | 1 | command, spontaneous | 1 - 2 | no | simulated radio | human | quasi-fixed | head | military (added) | low | no | no | yes | no | no |
Aurora-3 (subset of SpeechDat-Car) | 2000 - 2003 | car | ? | 16 | 4 | no | 1 k€ | purchase (incl. HTK) | ? | 730 | various | 0.01 | digits | 1 | no | reverb | human | quasi-fixed | head | car | low | headset | no | yes | no | no |
RWCP Meeting Speech Corpus | 2001 | meeting | 3.5 | 16 - 48 | 1 | 3 | free | download | 3.5 | ? | Japanese | ? | spontaneous | 1 - 5 | meeting | reverb (low) | human | quasi-fixed | head | stationary background | high | headset | no | yes | no | no |
RWCP Real Environment Speech and Acoustic Database | 2001 | domestic, office | ? | 16 - 48 | 84 | no | free | download | ? | 5 | US English, Japanese | ? | read | 1 | no | real rir, reverb | loudspeaker | various | no, pivoting arm | various (sum of events) | medium | original | yes | yes | no | yes |
SpeechDat-Car | 2001 - 2011 | car | ? | 16 | 4 | no | 39 - 182 k€ per lang | purchase | ? | 300 per lang | various | ? | digits, command, read, spontaneous | 1 | no | reverb | human | quasi-fixed | head | car | low | headset | no | yes | no | no |
Aurora-4 | 2002 | public spaces | ? | 8 - 16 | 1 | no | free given WSJ0 (1.5 k$) | purchase | ? | 101 | US English | 10 | read | 1 | no | simulated phone | human | N/S | no | various real environments (added) | low | original | N/S | yes | no | yes |
TED | 2002 | seminar | 47 | 16 | 1 | no | 0.5 k$ | purchase | 47 | 188 | non-native English | ? | lecture | 1 or more | seminar | reverb | human | quasi-fixed | head | stationary background | high | lapel | no | partial | no | no |
CUAVE | 2002 | speech overlap | 3 | 44 | 1 | 1 | free | download | 3 | 36 | US English | 0.01 | digits | 1 - 2 | full | reverb | human | quasi-fixed | head | stationary background | high | no | no | yes | no | no |
CU-Move Microphone Array Data | 2002 - 2011 | car | 286 | 44 | 6 - 8 | no | 25 k$ | purchase | 286 | 172 | US English | 12 | digits, command, read, dialog | 1 | no | reverb | human | quasi-fixed | head | car | low | no | no | yes | no | no |
CENSREC-1 (Aurora-2J) | 2003 | public spaces | ? | 8 | 1 | no | free | download | ? | 214 | Japanese | 0.01 | digits | 1 | no | simulated phone | human | N/S | no | various real environments (added) | low | original | N/S | yes | no | yes |
AVICAR | 2004 | car | 29 | 16 | 7 | 4 | free | download | 29 | 86 | US English, non-native English | 1 | read | 1 | no | reverb | human | quasi-fixed | head | car | low | no | no | yes | no | no |
AV16.3 | 2004 | meeting | 1.5 | 16 | 16 | 3 | free | download | 1.5 | 12 | N/S | N/S | spontaneous | 1 - 3 | full | reverb | human | various | head, walk | stationary background | high | no | partial | no | no | no |
ICSI Meeting Corpus | 2004 | meeting | 72 | 16 | 6 | no | 2.8 k$ | purchase | 72 | 53 | US English, other English | 13 | meeting | 3 - 10 | meeting | reverb | human | quasi-fixed | head | meeting | high | headset, lapel | no | yes | yes | ad-hoc |
NIST Meeting Pilot Corpus Speech | 2004 | meeting | 15 | 16 | 7 | no | 5.5 k$ | purchase | 15 | 61 | US English | 6 | meeting | 3 - 9 | meeting | reverb | human | various | head, walk | stationary background | high | headset, lapel | no | yes | no | no |
CHIL Meetings | 2004 - 2007 | seminar, meeting | 60 | 44 | 79 - 147 | 6 - 9 | 3.5 k€ | purchase | ? | ? | non-native English | ? | seminar, meeting | 3 - 20 | seminar, meeting | reverb | human | quasi-fixed | head | meeting (scenarized) | high | headset | yes | yes | yes | no |
SPEECON | 2004 - 2011 | public space, domestic, office, car | ? | 16 | 3 | no | 75 k€ per lang | purchase | ? | 600 per lang | various | ? | command, read, spontaneous | 1 | no | reverb | human | quasi-fixed | head | various real environments | medium | headset | no | yes | no | no |
CENSREC-2 | 2005 | car | ? | 16 | 1 | no | free | download | ? | 214 | Japanese | 0.01 | digits | 1 | no | reverb | human | quasi-fixed | head | car | low | headset | no | yes | no | no |
CENSREC-3 | 2005 | car | ? | 16 | 1 | no | 21 k¥ | purchase | ? | 311 | Japanese | 0.05 | read | 1 | no | reverb | human | quasi-fixed | head | car | low | headset | no | yes | no | no |
Aurora-5 | 2006 | public spaces, domestic, office, car | ? | 8 | 1 | no | free given TIDigits (0.5 k$) | purchase (incl. HTK) | ? | 225 | US English | 0.01 | digits | 1 | no | no, simulated rir, real rir | loudspeaker | fixed | no | various real environments (added) | low | original | no | yes | no | yes |
AMI | 2006 | meeting | 100 | 16 | 16 | 6 | free | download | ? | 189 | UK English, other English | 8 | meeting | most often 4 | meeting (18% overlap) | reverb | human | quasi-fixed | head | stationary background | high | headset, lapel | yes | yes | yes | no |
PASCAL SSC | 2006 | speech overlap | 8.8 | 25 | 1 | no | free | download | 8.8 | 34 | UK English | 0.05 | command | 2 | full | no | human | N/S | no | no | N/S | original | N/S | yes | no | no |
HIWIRE | 2007 | airplane | 21 | 16 | 1 | no | 0.05 k€ | purchase | 21 | 81 | non-native English | 0.1 | command | 1 | no | no | human | N/S | no | airplane (added) | low | original | N/S | yes | no | no |
NOIZEUS | 2007 | public spaces | 0.6 | 8 | 1 | no | free | download | 0.6 | 6 | US English | 0.1 | read | 1 | no | simulated phone | human | N/S | no | various real environments (added) | low | original | N/S | no | no | no |
UT-Drive | 2007 | car | 40 | 25 | 5 | 2 | 25 k$ | download | 40 | 25 | US English | 2.4 | command, dialog | 1 - 2 | dialog | reverb | human | quasi-fixed | head | car | low | headset (low quality) | no | partial | no | no |
SASSEC, SiSEC under-determined | 2007 - 2011 | cocktail party | 0.3 | 16 | 2 | no | free | download | 0.3 | 16 | N/S | N/S | read | 3 - 4 | full | simulated rir, real rir, reverb | no, loudspeaker | fixed | no | no | N/S | original, spatial image | yes | no | no | no |
MC-WSJ-AV, PASCAL SSC2, 2012_MMA, REVERB RealData | 2007 - 2014 | speech overlap | 10 | 16 | 8 - 40 | partial | 1.5 k$ | purchase | ? | 45 | UK English | 10 | read | 1 - 2 | full | reverb | human | various | head, walk | stationary background | high | headset, lapel | yes | yes | no | no |
CENSREC-4 (Simulated) | 2008 | public spaces, domestic, office, car | ? | 16 | 1 | no | free | download | ? | 214 | Japanese | 0.01 | digits | 1 | no | real rir | dummy | fixed | no | various real environments (added) | low | original | no | yes | no | yes |
CENSREC-4 (Real) | 2008 | public spaces, domestic, office, car | ? | 16 | 1 | no | free | download | ? | 10 | Japanese | 0.01 | digits | 1 | no | reverb | human | quasi-fixed | head | various real environments | low | headset | no | yes | no | yes |
DICIT | 2008 | domestic | 6 | 48 | 16 | 2 | free | download | 1 | ? | Italian | ? | command | 4 | no | reverb | human | various | head, walk | domestic (scenarized) | medium | headset, tv | yes | yes | no | yes |
SiSEC head-geometry | 2008 | speech overlap | 1.9 | 16 | 2 | no | free | download | 1.9 | ? | N/S | N/S | read | 2 | full | real rir | loudspeaker | various | no | no | N/S | original, spatial image | yes | no | no | no |
COSINE | 2009 | dialog | 38 | 48 | 20 | no | free | download | 11 | 91 | US English, non-native English | 5 | spontaneous | 2 - 7 | dialog | reverb | human | various | head, walk | various real environments | low | headset, throat mic | no | yes | no | no |
SiSEC real-world noise | 2010 | public spaces | 0.3 | 16 | 2 - 4 | no | free | download | 0.3 | 6 | N/S | N/S | read | 1 - 3 | full | no, reverb (other room) | loudspeaker | various | no | various real environments (added) | low | original, spatial image | yes | no | no | no |
SiSEC dynamic | 2010 - 2011 | cocktail party | 0.2 | 16 | 2 - 4 | no | free | download | 0.2 | ? | N/S | N/S | read | ? | full (2 at a time) | reverb | loudspeaker | various | simulated | no | N/S | original, spatial image | yes | no | no | no |
CHiME 1, CHiME 2 Grid | 2011 - 2012 | domestic | 70 | 16 - 48 | 2 | no | free | download | 12 | 34 | UK English | 0.05 | command | 1 | no | real rir | dummy | quasi-fixed | simulated head | domestic | low | yes | yes | yes | no | no |
CHiME 2 WSJ0 | 2012 | domestic | 78 | 16 | 2 | no | free given WSJ0 (1.5 k$) | download | 33 | 101 | US English | 11 | read | 1 | no | real rir | dummy | fixed | no | domestic | low | yes | yes | yes | no | no |
ETAPE | 2012 | TV/radio debates, outdoor interviews | 42 | 16 | 1 | 1 | ? | download | 32 | 347 | French | 16 | spontaneous | 1 or more | dialog (up to 10% overlap) | reverb (some) | human | quasi-fixed | head | various real environments | high | no | N/S | yes | no | yes |
GALE (Chinese broadcast conversation) | 2013 | TV dialog | 120 | 16 | 1 | no | 3.5 k$ | purchase | 108 | ? | Mandarin | ? | spontaneous | 1 or more | dialog | no | human | quasi-fixed | head | no | N/S | no | N/S | yes | no | no |
GALE (Arabic broadcast conversation) | 2013 | TV dialog | 251 | 16 | 1 | no | 7 k$ | purchase | 234 | ? | Arabic | ? | spontaneous | 1 or more | dialog | no | human | quasi-fixed | head | no | N/S | no | N/S | yes | no | no |
REVERB SimData | 2013 | domestic, office | 25 | 16 | 8 | no | free given WSJCAM0 (1.75 k$) | purchase | 25 | 130 | UK English | 10 | read | 1 | no | real rir | loudspeaker | various | no | random noise (added) | high | original, spatial image | yes | yes | no | yes |
Sheffield Wargames Corpus | 2013 | cocktail party | 7 | 48 | 92 | 3 | free | download | ? | 9 | UK English | ? | spontaneous | 4 | multiple dialogs | reverb | human | various | head, walk | background music | medium | headset | yes | yes | no | no |
DIRHA | 2014 | domestic | 3.8 | 48 | 40 | no | free | download | 1.3 | 30 | various | ? | command, read, spontaneous | 1 or more | simulated | real rir | loudspeaker | various | no | domestic (sum of events) | low | yes | yes | yes | no | yes |
General attributes:
- year of release
- use case (scenario): car, cocktail party, domestic, lecture, meeting, office, public space, TV...
- total duration (h) (multiple channels counted only once)
- sampling rate (kHz)
- number of distant or noisy microphones
- number of video cameras
- cost for non-members of ELRA and LDC (cost for members is lower or free)
- links: download data, reference papers, software baselines, evaluation results...
Speech attributes:
- duration of speech (h) (overlapping speech counted only once)
- number of unique speakers
- language
- number of unique words (differs from assumed vocabulary size, which is somewhat arbitrary)
- speaking style: digits, command, read, spontaneous...
- number of speakers present in the room
- type of speaker overlap: no overlap, simulated overlap, dialogue, meeting, full overlap...
Channel attributes:
- channel type: none, simulated room impulse response, convolution by a recorded room impulse response, reverberant recording...
- speaker radiation: loudspeaker, dummy head with mouth simulator, human...
- speaker location: at a fixed position in the room, at a quasi-fixed position (e.g., seated), at different positions...
- speaker movements: no movement, head movements, walking...
Noise attributes:
- noise type: stationary background noise (e.g., air-conditioning), car noise, meeting noises, domestic noises, outdoor noises...
- average signal-to-noise ratio (SNR): low, medium, high
Available ground truth:
- reference speech signal: original (at the mouth), headset or lapel (slightly differs from the signal at the mouth), spatial image (at the microphones)...
- speaker location and orientation
- words uttered
- paralinguistic attributes: nodding, gaze, communication intent, emotion...
- noise events: type and time of individual noise events
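For readers who copy the table into a script, the following is a minimal sketch, in Python, of how rows could be turned into records and filtered by attribute. It assumes the rows are available as pipe-delimited text with one cell per column, as rendered above. The key names in `COLUMNS` and the helper functions `parse_row` and `select` are illustrative choices, not part of this wiki; the three sample rows are copied from the table.

```python
# Keys follow the column labels of the table; "dataset" is our own label
# for the first column. All names here are illustrative assumptions.
COLUMNS = [
    "dataset", "rel. year", "use case", "total time (h)", "sam. rate (kHz)",
    "dist. or noisy mics", "video cams", "cost (non-memb.)", "links",
    "speak. time (h)", "uniq. speak.", "lang.", "uniq. words (k)",
    "speak. style", "speak. / rec.", "overl. type",
    "chan. type", "speak. radiat.", "speak. loc.", "speak. moves",
    "noise type", "avg. SNR",
    "ref. signal", "speak. loc., orient.", "words", "non-verb. traits",
    "noise events",
]

# Three rows copied verbatim from the table above.
ROWS = [
    "ShATR | 1994 | meeting | 0.6 | 48 | 3 | no | free | download | 0.6 | 5 | UK English | 1 | spontaneous | 5 | multiple dialogs | reverb | human | quasi-fixed | head | meeting | high | headset | yes | yes | no | yes |",
    "NOIZEUS | 2007 | public spaces | 0.6 | 8 | 1 | no | free | download | 0.6 | 6 | US English | 0.1 | read | 1 | no | simulated phone | human | N/S | no | various real environments (added) | low | original | N/S | no | no | no |",
    "AVICAR | 2004 | car | 29 | 16 | 7 | 4 | free | download | 29 | 86 | US English, non-native English | 1 | read | 1 | no | reverb | human | quasi-fixed | head | car | low | no | no | yes | no | no |",
]


def parse_row(line):
    """Split one pipe-delimited table row into a {column: value} dict."""
    # Drop the trailing pipe, then split on the remaining pipes.
    cells = [cell.strip() for cell in line.strip().rstrip("|").split("|")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    return dict(zip(COLUMNS, cells))


def select(records, criteria):
    """Keep the records whose values match every (column, value) pair exactly."""
    return [r for r in records
            if all(r[col] == val for col, val in criteria.items())]


records = [parse_row(line) for line in ROWS]

# Example query: free datasets recorded or simulated at low average SNR.
free_low_snr = select(records, {"cost (non-memb.)": "free", "avg. SNR": "low"})
print([r["dataset"] for r in free_low_snr])
```

Matching is by exact string, so range-valued cells such as "16 - 48" or unknown entries marked "?" would need extra handling before any numeric comparison.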
Text datasets
Other datasets
This section lists other relevant datasets that do not meet the criteria above, for instance because they are unannotated, not (yet) publicly available, or too short.
Speech datasets:
- BABEL (not yet available)
- Broadcast news, HUB4 (no noise and 4.5% speaker overlap, less than ETAPE)
- CIAIR In-Car Speech Database (availability unknown)
- Dyrholm/Sawada/Parra (about 1 min long)
- NEMISIG (unavailable)
- RATS (not yet available)
- Rich Transcription (RT) (dataset gathered from other sets, e.g. CHIL, ICSI, ISL, AMI...)
- Settlers of Catan (unannotated, more info)
- Flying MEMS microphone array (unannotated, more info)
Contribute a dataset
To contribute a new dataset, please:
- create an account and log in
- go to the section above corresponding to your type of dataset; if the table does not exist yet, you may create it
- click on the "Edit" link at the top of the table and add a new line for your dataset (the lines are ordered by year of release)
- fill in all columns as completely as possible, following the detailed list of attributes below the table
- click on the "Save page" link at the bottom of the page to save your modifications
We currently cannot provide storage space for large datasets. Please host the dataset at a stable URL on the website of your institution or elsewhere and provide only its URL. If this is not possible, please contact the resources sharing working group.
Contribute a software baseline
To contribute a new software baseline, please:
- create an account and log in
- fill in an entry for your software on the Software page, if not done yet
- go to the section above corresponding to the dataset for which your baseline was designed
- click on the "Edit" link at the top of the table and add a link to your software in the corresponding "links" cell
- click on the "Save page" link at the bottom of the page to save your modifications
We currently cannot provide storage space for large software. Please host your software at a stable URL on the website of your institution or elsewhere and provide only its URL. If this is not possible, please contact the resources sharing working group.
Contribute an evaluation result
To contribute a new evaluation result, please:
- create an account and log in
- go to the section above corresponding to the dataset for which this result was obtained
- click on the "Edit" link at the top of the table and add a link to your result in the corresponding "links" cell
- make sure that the link (e.g., a paper or another webpage) contains the following information: authors, link to a paper/report containing objective evaluation results, link to derived data (output transcriptions, intermediary data, etc.)
- click on the "Save page" link at the bottom of the page to save your modifications
To save storage space, please do not upload the paper to this wiki. Instead, link to it whenever possible in your institutional archive, in another public archive (e.g., arXiv), or on the publisher's website (e.g., IEEE Xplore).