Datasets
This page aims to provide a list of datasets with detailed attributes and links to corresponding research results (papers, numerical results, output transcriptions, intermediate data, etc.). Each dataset may be used for one or more applications, such as automatic speech recognition, speaker identification and verification, source localization, and speech enhancement and separation.
Disclaimer: Only publicly available datasets with a total duration longer than 5 min are listed.
Datasets | Data | Speech | Channel | Noise | Ground truth | |||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dataset | release | scenario | total duration | sampling rate (Hz) | mixture channels | cameras | available | cost | URL | contact | reference paper | speech duration | unique speakers | language | unique words | speaking style | simultaneous speakers | speaker overlap | channel type | radiation | speaker location | speaker movements | noise type | speech signal | speaker location and orientation | words | nonverbal traits | noise events |
ShATR | 1994 | meeting | 37 min | 48000 | 3 (distant) | no | yes | free | http://spandh.dcs.shef.ac.uk/projects/shatrweb/ | g.brown@dcs.shef.ac.uk | Malcolm Crawford, Guy J. Brown, Martin Cooke and Phil Green, "Design, collection and analysis of a multi-simultaneous-speaker corpus," Proceedings of The Institute of Acoustics, 16(5):183-190. | 37 min | 5 | UK English | 1k | colloquial | 5 | multiple conversations | reverb | human | quasi-fixed | head | meeting | headset | yes | yes | no | yes |
LLSEC | 1996 | conversation | 1.4 h | 16000 | 4 (distant) | no | yes | free | https://www.ll.mit.edu/mission/cybersec/HLT/corpora/SpeechCorpora.html | jpc@ll.mit.edu | ? | ? | 12 | N/S | N/S | read/colloquial | 2 | conversation | reverb | human | quasi-fixed | head | hallway, restaurant | no | yes | no | no | no |
RWCP Spoken Dialog Corpus | 1996-1997 | conversation | 10 h | 16000 | 2 (close but cross-talk) | no | yes | free | http://research.nii.ac.jp/src/en/RWCP-SP96.html | src@nii.ac.jp | Kazuyo Tanaka, Satoru Hayamizu, Yoichi Yamashita, Kiyohiro Shikano, Shuichi Itahashi and Ryuichi Oka, "Design and data collection for a spoken dialog database in the Real World Computing (RWC) program," J. Acoust. Soc. Am. 100, 2759 (1996) | 10 h | 39 | Japanese | ? | colloquial | 1 or 2 | conversation | reverb | human | quasi-fixed | head | stationary background noise | no | no | yes | no | no |
Aurora-2 | 2000 | public spaces | 33 h | 8000-16000 | 1 (close) | no | yes | TIDigits | http://aurora.hsnr.de/download.html | hans-guenter.hirsch@hs-niederrhein.de | Hans-Günter Hirsch, David Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," Proc. Interspeech 2000 | 33 h | 214 | US English | 11 | digits | 1 | no | no (simulated telephone channel) | human | N/S | no | various real environments | original | N/S | yes | no | yes |
SPINE1/SPINE2 | 2000-2001 | military | 38 h | 16000 | 2 (close) | no | yes | 2 x ($800 (audio) + $500 (transcripts)) + 3 x ($1000 (audio) + $600 (transcripts)) | https://catalog.ldc.upenn.edu/LDC2000S87 | jdwright@ldc.upenn.edu | T.H. Crystal et al., "Speech in noisy environments (SPINE) adds new dimension to speech recognition R&D", Proc. HLT 2002 | ? | 100 | US English | 1k | command/colloquial | 1 or 2 | no | no (simulated transmission channels) | human | quasi-fixed | head | military (pre-recorded noise played in sound booth while recording speech) | no | no | yes | no | no |
Aurora-3 (subset of SpeechDat-Car) | 2000-2003 | car | ? | 16000 | 3 (+1 GSM) (distant) | no | yes | 5 x 200 (Academics) / 5 x 1,000 (Companies) | http://catalog.elra.info/index.php?cPath=37_40 | ? | ? | Finnish, German, Spanish, Danish, Italian | ? | command (read/digits/keywords/spontaneous) | 1 | no | reverb | human | quasi-fixed | head | car | close-talk | no | yes | no | no | ||
RWCP Meeting Speech Corpus | 2001 | meeting | 3.5 h | 16000-48000 | 1 (distant) | 3 | yes | free | http://research.nii.ac.jp/src/en/RWCP-SP01.html | src@nii.ac.jp | Kazuyo Tanaka, Katunobu Itou, Masanori Ihara, Ryuichi Oka, "Constructing a Meeting Speech Corpus", IPSJ, 37-15, 2001 | 3.5 h | ? | Japanese | ? | colloquial | 1 to 5 | meeting | low reverb | human | quasi-fixed | head | stationary background noise | headset | no | yes | no | no |
RWCP Real Environment Speech and Acoustic Database | 2001 | domestic/office | ? | 16000-48000 | 30 (distant) | no | yes | free | http://research.nii.ac.jp/src/en/RWCP-SSD.html | s-nakamura@is.naist.jp | Satoshi Nakamura, Kazuo Hiyane, Futoshi Asano, Takanobu Nishiura, and Takeshi Yamada, "Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition," LREC 2000. | ? | 5 | Japanese | ? | read | 1 | no | real rir/reverb | loudspeaker | various | no/pivoting arm | stationary background noise | original | yes | yes | no | yes |
SpeechDat-Car | 2001-2011 | car | ? | 16000 | 3 (+1 GSM) (distant) | no | yes | 1.1 Million for all 10 languages. Each costs 39k to 182k | http://catalog.elra.info/index.php?cPath=37_41 | A. Moreno et al., "SPEECHDAT-CAR. A Large Speech Database for Automotive Environments," Proc. LREC 2000 | ? | 300/language | Multiple | ? | command (read/digits/keywords/spontaneous) | 1 | no | reverb | human | quasi-fixed | head | car | close-talk | no | yes | no | no | |
Aurora-4 | 2002 | public spaces | ? | 8000-16000 | 1 (close) | no | yes | WSJ0 | http://aurora.hsnr.de/download.html | hans-guenter.hirsch@hs-niederrhein.de | N. Parihar and J. Picone, "Aurora Working Group: DSR Front End LVCSR Evaluation AU/384/02," Tech. Rep., Inst. for Signal and Information Processing, Mississippi State University, 2002 | ? | 101 | US English | 10k | read | 1 | no | no (simulated telephone channel) | human | N/S | no | various real environments | original | N/S | yes | no | yes |
TED | 2002 | seminar | 47 h | 16000 | 1 (distant) | no | yes | $275 (audio) + $250 (transcripts) | https://catalog.ldc.upenn.edu/LDC2002S04 | | L. Lamel, F. Schiel, A. Fourcin, J. Mariani, and H. Tillmann, "The Translanguage English Database (TED)," Proc. ICSLP, 1994 | 47 h | 188 | English (mostly non-native) | ? | lecture | 1 or more | seminar | reverb | human | quasi-fixed | head | stationary background noise | lapel | no | yes (partial) | no | no |
CUAVE | 2002 | cocktail party | 3 h | 44100 | 1 (distant) | 1 | yes | free | http://www.clemson.edu/ces/speech/cuave.htm | ksampat@clemson.edu | Eric K Patterson, Sabri Gurbuz, Zekeriya Tufekci and John N Gowdy, "Moving-Talker, Speaker-Independent Feature Study, and Baseline Results Using the CUAVE Multimodal Speech Corpus," EURASIP Journal on Advances in Signal Processing 2002, 2002:208541 | 3 h | 36 | US English | 10 | digits | 1 or 2 | full | reverb | human | quasi-fixed | head | stationary background noise | no | no | yes | no | no |
CU-Move ("Microphone Array Data"; downsampled data with more speakers but fewer channels exists) | 2002-2011 | car | 286 h | 44100 | 6 to 8 (distant) | no | yes | $25k with UT-Drive | http://crss.utdallas.edu/ | john.hansen@utdallas.edu | John H.L. Hansen, Pongtep Angkititrakul, Jay Plucienkowski, Stephen Gallant, Umit Yapanel, Bryan Pellom, Wayne Ward, and Ron Cole, "'CU-Move': Analysis & Corpus Development for Interactive In-Vehicle Speech Systems", Interspeech 2001 | 286 h | 172 | US English | 12k | command/digits/read/dialogue | 1 | no | reverb | human | quasi-fixed | head | car | no | no | yes | no | no |
CENSREC-1 (Aurora-2J) | 2003 | public spaces | ? | 8000 | 1 (close) | no | yes | free | http://research.nii.ac.jp/src/en/CENSREC-1.html | | S. Nakamura, K. Takeda, K. Yamamoto, T. Yamada, S. Kuroiwa, N. Kitaoka, T. Nishiura, A. Sasou, M. Mizumachi, C. Miyajima, M. Fujimoto, and T. Endo, "Aurora-2J, an evaluation framework for Japanese noisy speech recognition," IEICE Transactions on Information and Systems, vol. E88-D, no. 3, pp. 535-544, 2005 | | 214 | Japanese | 11 | digits | 1 | no | various microphones and simulated channels | human | N/S | no | various real environments | original | N/S | yes | no | yes |
AVICAR | 2004 | car | 29 h | 16000 | 7 (distant) | 4 | yes | free | http://www.isle.illinois.edu/sst/AVICAR/ | jhasegaw@illinois.edu | Bowon Lee, Mark Hasegawa-Johnson, Camille Goudeseune, Suketu Kamdar, Sarah Borys, Ming Liu, Thomas Huang, "AVICAR: Audio-Visual Speech Corpus in a Car Environment", Proc. Interspeech, 2004 | 29 h | 86 | US/non-native English | 1k | read | 1 | no | reverb | human | quasi-fixed | head | car | no | no | yes | no | no |
AV16.3 | 2004 | meeting | 1.5 h | 16000 | 16 (distant) | 3 | yes | free | http://www.idiap.ch/dataset/av16-3/ | odobez@idiap.ch | "AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking", by Guillaume Lathoud, Jean-Marc Odobez and Daniel Gatica-Perez, in Proceedings of the MLMI'04 Workshop, 2004. | 1.5 h | 12 | N/S | N/S | colloquial | 1 to 3 | full | reverb | human | various | walk | stationary background noise | no | yes | no | no | no |
ICSI Meeting Corpus | 2004 | meeting | 72 h | 16000 | 6 (distant) | no | yes | $1900 (audio) + $900 (transcripts) | https://catalog.ldc.upenn.edu/LDC2004S02 | mrcontact@icsi.berkeley.edu | A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, C. Wooters, "The ICSI meeting corpus," Proc. ICASSP, Apr. 2003 | 72 h | 53 | US English | 13k | meeting | 3 to 10 | meeting | reverb | human | quasi-fixed | head | stationary background noise | headset (some lapel) | no | yes | yes | no |
NIST Meeting Pilot Corpus Speech | 2004 | meeting | 15 h | 16000 | 7 (distant) | no (released but not currently available for download) | yes | $4000 (audio) + $1500 (transcripts) | https://catalog.ldc.upenn.edu/LDC2004S09 | john.garofolo@nist.gov | John S. Garofolo, Christophe D. Laprun, Martial Michel, Vincent M. Stanford and Elham Tabassi, "The NIST Meeting Room Pilot Corpus," Proc. LREC, 2004 | 15 h | 61 | US English | 6k | meeting | 3 to 9 | meeting | reverb | human | various | walk | stationary background noise | headset+lapel | no | yes | no | no |
CHIL Meetings | 2004-2007 | seminar/meeting | 60 h | 44100 | 79 to 147 (distant) | 6 to 9 | yes | 3 500 | http://catalog.elra.info/search.php | choukri@elda.org | D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. Chu, A. Tyagi, J. Casas, J. Turmo, L. Cristoforetti, F. Tobia, A. Pnevmatikakis, V. Mylonakis, F. Talantzis, S. Burger, R. Stiefelhagen, K. Bernardin, C. Rochet, The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms, in LANGUAGE RESOURCES AND EVALUATION, vol. 41, n. 3-4, 2007, pp. 389-407 | ? | ? | non-native English | ? | lecture/meeting | 3 to 20 | seminar/meeting | reverb | human | quasi-fixed | head | meeting (scenarized) | headset | yes | yes | yes | no |
SPEECON | 2004-2011 | public space/domestic/office/car | ? | 16000 | 3 (distant) | no | yes | 29 x 75000 for all languages | http://catalog.elra.info/index.php?cPath=37 | diskra@appen.com | Dorota Iskra, Beate Grosskopf, Krzysztof Marasek, Henk van den Heuvel, Frank Diehl, Andreas Kiessling, "SPEECON Speech Databases for Consumer Devices: Database Specification and Validation", LREC p. 329-333, 2002. | ? | 600/language | Multiple | ? | command/read/spontaneous | 1 | no | reverb | human | quasi-fixed | head | various real environments | headset | no | yes | no | no |
CENSREC-2 | 2005 | car | ? | 16000 | 1 (distant) | no | yes | free | http://research.nii.ac.jp/src/en/CENSREC-2.html | src@nii.ac.jp | S. Nakamura, M. Fujimoto, and K. Takeda, "CENSREC2: Corpus and evaluation environments for in car continuous digit speech recognition," Proc. ICSLP 2006 | ? | 214 | Japanese | 11 | digits | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no |
CENSREC-3 | 2005 | car | ? | 16000 | 1 (distant) | no | yes | free except phonetically balanced training set: JPY 21000 (Universities) / JPY 105000 (Companies) | http://research.nii.ac.jp/src/en/CENSREC-3.html | src@nii.ac.jp | M. Fujimoto, K. Takeda, and S. Nakamura, "CENSREC-3: An evaluation framework for Japanese speech recognition in real driving-car environments," IEICE Transactions on Information and Systems, vol. E89-D, no. 11, pp. 2783-2793, 2006 | ? | 18 (+293 in training) | Japanese | 50 in evaluation; unknown but larger in phonetically-balanced utterances of training set | read | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no |
Aurora-5 | 2006 | public spaces/domestic/office/car | ? | 8000 | 1 (distant) | no | yes | TIDigits | http://aurora.hsnr.de/download.html | hans-guenter.hirsch@hs-niederrhein.de | Hans-Günter Hirsch, "Aurora-5 experimental framework for the performance evaluation of speech recognition in case of a hands-free speech input in noisy environments," Tech Report, Niederrhein Univ. of Applied Sciences, 2007 | ? | 225 | US English | 11 | digits | 1 | no | real rir/simu/no + simulated telephone channel | loudspeaker | N/S | no | various real environments | original | no | yes | no | yes |
AMI | 2006 | meeting | 100 h | 16000 | 16 (distant) | 6 | yes | free | http://groups.inf.ed.ac.uk/ami/ | amicorpus@amiproject.org | Steve Renals, Thomas Hain, and Hervé Bourlard. Interpretation of multiparty meetings: The AMI and AMIDA projects. In IEEE Workshop on Hands-Free Speech Communication and Microphone Arrays, 2008. HSCMA 2008, pages 115-118, 2008 | ? | 189 | UK English | 8k | meeting | 4 (18% overlap) | meeting | reverb | human | quasi-fixed | head | stationary background noise | headset+lapel | yes | yes | yes | no |
PASCAL SSC | 2006 | cocktail party | 18.5 min (+ 8.5h clean training data) | 25000 | 1 (mixing console) | no | yes (website to be restored) | free | | m.cooke@ikerbasque.org | Martin Cooke, John R. Hershey, Steven J. Rennie, "Monaural speech separation and recognition challenge," Computer, Speech and Language, 2010 | 18.5 min (+ 8.5h clean training data) | 34 | UK English | 51 | command | 2 | full | no | human | N/S | no | no | original | N/S | yes | no | no |
HIWIRE | 2007 | airplane | 21 h | 16000 | 1 (close) | no | yes | 50 | http://catalog.elra.info/product_info.php?products_id=1088&language=en | segura@ugr.es | J.C. Segura, T. Ehrette, A. Potamianos, D. Fohr, I. Illina, P.-A. Breton, V. Clot, R. Gemello, M. Matassoni, P. Maragos, "The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication" | 21 h | 81 | non-native English | 133 | command | 1 | no | no | human | N/S | head | airplane | original | N/S | yes | no | no |
UT-Drive | 2007 | car | 40 h | 25000 | 5 (distant) | 2 | yes | $25k with CU-Move | http://crss.utdallas.edu/ | john.hansen@utdallas.edu | P. Angkititrakul, M. Petracca, A. Sathyanarayana, J.H.L. Hansen, "UTDrive: Driver Behavior and Speech Interactive Systems for In-Vehicle Environments," Intelligent Vehicles Symposium, 2007 | 40 h | 25 (more exist but not included in latest release 3.0) | US English | 2.4k (but transcription is incomplete) | command/conversation | 1 to 2 | conversation | reverb | human | quasi-fixed | head | car | headset (but problem w/ recording quality) | no | yes (partial) | no | no |
SASSEC/SiSEC underdetermined | 2007-2011 | cocktail party | 19 min | 16000 | 2 (distant) | no | yes | free | http://sisec2011.wiki.irisa.fr/tiki-index.php?page=Underdetermined+speech+and+music+mixtures | araki.shoko@lab.ntt.co.jp | The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936 | 19 min | 16 | N/S | N/S | read | 3 or 4 | full | reverb/real rir/simu | no | fixed | no | no | original+spatial image | yes | no | no | no |
MC-WSJ-AV/PASCAL SSC2/2012_MMA/REVERB RealData | 2007-2014 | cocktail party | 10 h | 16000 | 8 to 40 (distant) | no | yes | $1 500 | https://catalog.ldc.upenn.edu/LDC2014S03 | mike.lincoln@quoratetechnology.com | M. Lincoln, I. McCowan, J. Vepa, and H. K. Maganti, The multi-channel wall street journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments, in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2005. + E. Zwyssig, F. Faubel, S. Renals and M. Lincoln, "Recognition of overlapping speech using digital MEMS microphone arrays", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013 | ? | 45 | UK English | 10k | read | 1 or 2 | full | reverb | human | various | walk | stationary background noise | headset+lapel | yes | yes | no | no |
CENSREC-4 (Simulated) | 2008 | public spaces/domestic/office/car | ? | 16000 | 1 (distant) | no | yes | free | http://research.nii.ac.jp/src/en/CENSREC-4.html | src@nii.ac.jp | T. Nishiura et al., "Evaluation Framework for Distant-talking Speech Recognition under Reverberant Environments Newest Part of the CENSREC Series", Proc. LREC 2008 | ? | 214 | Japanese | 11 | digits | 1 | no | real rir | mouth simulator | fixed | no | various real environments | original | no | yes | no | yes |
CENSREC-4 (Real) | 2008 | public spaces/domestic/office/car | ? | 16000 | 1 (distant) | no | yes | free | http://research.nii.ac.jp/src/en/CENSREC-4.html | src@nii.ac.jp | T. Nishiura et al., "Evaluation Framework for Distant-talking Speech Recognition under Reverberant Environments Newest Part of the CENSREC Series", Proc. LREC 2008 | ? | 10 | Japanese | 11 | digits | 1 | no | reverb | human | quasi-fixed | head | various real environments | headset | no | yes | no | yes |
DICIT | 2008 | domestic | 6 h | 48000 | 16 (distant) | 2 | yes | free | http://shine.fbk.eu/resources/dicit-acoustic-woz-data | omologo@fbk.eu | Alessio Brutti, Luca Cristoforetti, Walter Kellermann, Lutz Marquardt and Maurizio Omologo, WOZ Acoustic Data Collection for Interactive TV, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008. | 1 h | ? | Italian | ? | command | 4 | no | reverb | human | various | walk | domestic (scenarized) | headset+tv | yes | yes | no | yes |
SiSEC head-geometry | 2008 | cocktail party | 1.9 h | 16000 | 2 (distant) | no | yes | free | http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Head-geometry%20mixtures%20of%20two%20speech%20sources%20in%20real%20environments,%20impinging%20from%20many%20directions | hendrik.kayser@uni-oldenburg.de | The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936 | 1.9 h | ? | N/S | N/S | read | 2 | full | real rir | loudspeaker | various | no | no | original+spatial image | yes | no | no | no |
COSINE | 2009 | conversation | 38 h | 48000 | 20 (distant) | no | yes | free | http://melodi.ee.washington.edu/cosine/ | cosine@melodi.ee.washington.edu | Alex Stupakov, Evan Hanusa, Deepak Vijaywargi, Dieter Fox, and Jeff Bilmes. The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments. Computer Speech and Language, 26:52-66, 2011. | 11 h | 91 | US/non-native English | 5k | colloquial | 2 to 7 | conversation | reverb | human | various | walk | various real environments | headset+throat mic | no | yes | no | no |
SiSEC real-world noise | 2010 | public spaces | 20 min | 16000 | 2 to 4 (distant) | no | yes | free | http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Source+separation+in+the+presence+of+real-world+background+noise | ito.nobutaka@lab.ntt.co.jp | The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936 | 20 min | 6 | N/S | N/S | read | 1 or 3 | full | no | loudspeaker | various | no | various real environments | original+spatial image | yes | no | no | no |
SiSEC dynamic | 2010-2011 | cocktail party | 11 min | 16000 | 2 to 4 (distant) | no | yes | free | http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Determined+convolutive+mixtures+under+dynamic+conditions | francesco.nesta@gmail.com | The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936 | 11 min | ? | N/S | N/S | read | Many but only 2 simultaneous | simu | reverb | loudspeaker | various | simu | no | original+spatial image | yes | no | no | no |
CHiME 1/CHiME 2 Grid | 2011-2012 | domestic | 70 h with some overlap | 16000 | 2 (distant) | no | yes | free | http://spandh.dcs.shef.ac.uk/chime_challenge/chime2_task1.html | emmanuel.vincent@inria.fr | Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M., "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013, Vancouver | 12 h | 34 | UK English | 51 | command | 1 | no | real rir | dummy | quasi-fixed | simu | domestic | yes | yes | yes | no | no |
CHiME 2 WSJ0 | 2012 | domestic | 78 h with some overlap | 16000 | 2 (distant) | no | yes | WSJ0 | http://spandh.dcs.shef.ac.uk/chime_challenge/chime2_task2.html | francesco.nesta@gmail.com | Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M., "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013, Vancouver | 33 h | 101 | US English | 11k | read | 1 | no | real rir | dummy | fixed | no | domestic | yes | yes | yes | no | no |
ETAPE | 2012 | debates, outdoor interviews, and other TV/radio broadcasts selected for large speaker overlap and/or noise | 42 h | 16000 | 1 (mixing console) | 1 | yes | ? | ? | guillaume.gravier@irisa.fr | Guillaume Gravier, Gilles Adda, Niklas Paulsson, Matthieu Carré, Aude Giraudel, Olivier Galibert, The ETAPE corpus for the evaluation of speech-based TV content processing in the French language, LREC 2012. | 32 h | 347 | French | 16k | colloquial | 1 or more (7% overlap on average, up to 10% in debates) | conversation | some reverb | human | quasi-fixed | head | various real environments | no | N/S | yes | no | yes |
GALE (Chinese broadcast conversation) | 2013 | conversation (TV Broadcast) | 120 h | 16000 | 1 (mixing console) | no | yes | $2000 (audio) + $1500 (transcripts) | https://catalog.ldc.upenn.edu/LDC2013S04 | strassel@ldc.upenn.edu | | 108 h | ? | Mandarin | ? | colloquial | 1 or more | conversation | no | human | quasi-fixed | head | no | no | N/S | yes | no | no |
GALE (Arabic broadcast conversation) | 2013 | conversation (TV Broadcast) | 251 h | 16000 | 1 (mixing console) | no | yes | 2 x [$2000 (audio) + $1500 (transcripts)] | https://catalog.ldc.upenn.edu/LDC2013S02 | strassel@ldc.upenn.edu | | 234 h | ? | Arabic | ? | colloquial | 1 or more | conversation | no | human | quasi-fixed | head | no | no | N/S | yes | no | no |
REVERB SimData | 2013 | domestic/office | 25 h | 16000 | 8 (distant) | no | yes | WSJCAM0 | http://reverb2014.dereverberation.com/ | REVERB-challenge@lab.ntt.co.jp | Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani, Emanuel Habets, Reinhold Haeb-Umbach, Volker Leutnant, Armin Sehr, Walter Kellermann, Roland Maas, Sharon Gannot, Bhiksha Raj, "The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech", Proc. WASPAA 2013 | 25 h | 130 | UK English | 10k | read | 1 | no | real rir | loudspeaker | fixed | no | experimental room | original+spatial image | yes | yes | no | yes |
DIRHA | 2014 | domestic | 3.8 h | 48000 | 40 (distant) | no | yes | free | http://shine.fbk.eu/resources/dirha-ii-simulated-corpus | mravanelli@fbk.eu | Alessio Brutti, Mirco Ravanelli, Piergiorgio Svaizer, Maurizio Omologo, A speech event detection and localization task for multiroom environments, HSCMA 2014. | 1.3 h | 30 | Italian, German, Greek, Portuguese | various | various | 1 or more | simu | real rir | loudspeaker | various | no | domestic (sum of individual noises) | yes | yes | yes | no | yes |
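For readers who want to query the table programmatically, the following is a minimal sketch of parsing one pipe-delimited row into named fields and converting a duration cell to minutes. The column labels "dataset" and "contact" are assumed names for the first and eleventh cells, and the function names `parse_row` and `duration_minutes` are hypothetical. The sketch assumes a row carries all 29 cells in order; rows with missing cells need manual checking before use.

```python
import re

# Column names follow the table header; "dataset" and "contact" are
# assumed labels for the first and eleventh cells of each row.
COLUMNS = [
    "dataset", "release", "scenario", "total duration", "sampling rate",
    "mixture channels", "cameras", "available", "cost", "URL", "contact",
    "reference paper", "speech duration", "unique speakers", "language",
    "unique words", "speaking style", "simultaneous speakers",
    "speaker overlap", "channel type", "radiation", "speaker location",
    "speaker movements", "noise type", "speech signal",
    "speaker location and orientation", "words", "nonverbal traits",
    "noise events",
]


def parse_row(line):
    """Split one pipe-delimited table row into a dict keyed by column name."""
    cells = [cell.strip() for cell in line.split("|")]
    # Rows end with a trailing "|", which leaves one empty string behind.
    if cells and cells[-1] == "":
        cells.pop()
    if len(cells) != len(COLUMNS):
        raise ValueError(
            "expected %d cells, found %d" % (len(COLUMNS), len(cells))
        )
    return dict(zip(COLUMNS, cells))


def duration_minutes(text):
    """Return the leading duration of a cell in minutes, or None if unknown.

    Handles cells such as "37 min", "1.4 h" or "70 h with some overlap";
    cells such as "?" yield None.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*(h|min)\b", text)
    if match is None:
        return None
    value = float(match.group(1))
    return value * 60.0 if match.group(2) == "h" else value
```

A filter such as "total duration longer than 5 min" then reduces to comparing `duration_minutes(row["total duration"])` with 5 and skipping rows where the result is `None`.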
Automatic speech recognition
1st CHiME Challenge (2011)
Artificially distorted version of the small vocabulary GRID audio-visual corpus (audio only). Binaural reverberated speech with speaker situated in front of the microphones. Additive household noises impinging from different directions. Clean-training, noisy-training, development and evaluation sets available, see
- Jon Barker, E. Vincent, N. Ma, H. Christensen, P. Green, "The PASCAL CHiME speech separation and recognition challenge", Computer Speech & Language, Volume 27, Issue 3, May 2013, Pages 621-633.
Available from Computer Speech and Language here
Corpus available here (no cost)
Resources
Baselines
- See the paper above for results for a wide range of techniques.
AURORA 5 (2007)
Artificially distorted version of the TIDigits digit corpus. Additive-noise sets and sets combining additive noise with reverberant speech. Variable SNR range. Various mixed training sets, no evaluation set, see
- G. Hirsch "Aurora-5 Experimental Framework for the Performance Evaluation of Speech Recognition in Case of a Hands-free Speech Input in Noisy Environments", Niederrhein University of Applied Sciences, 2007.
Paper available online here (no cost)
Corpus available from LDC here
Resources
- Training recipe for HTK is provided with the corpora.
Baselines
- Reproducible baseline: The above cited paper includes a baseline for the ETSI Advanced Front-End.
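As an illustration of what "additive noise at a given SNR" means for the distorted sets described above, here is a minimal sketch that scales a noise signal so that the speech-to-noise power ratio matches a requested value in dB. It is not the procedure used to build the corpus, which is documented in the cited report and may differ (for example in how signal levels are measured). The function name `mix_at_snr` and the use of plain Python lists of samples are assumptions made for this example.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR.

    Both inputs are equal-length sequences of samples. Powers are plain
    mean squares over the whole signal (an assumption of this sketch).
    """
    if len(speech) != len(noise) or len(speech) == 0:
        raise ValueError("speech and noise must be non-empty and equal in length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("signals must have non-zero power")
    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Lower values of `snr_db` give a larger noise gain and therefore a noisier mixture.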
AURORA 4 (2002)
Artificially distorted version of the 5K-word Wall Street Journal corpus (WSJ0). Stationary and non-stationary noises added. A second set of recordings made with a mismatched distant microphone. Clean-training, mixed-training, noisy-training and test sets available. No evaluation set, see
- G. Hirsch "Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends on a Large Vocabulary Task", ETSI STQ Aurora DSR Working Group, 2002.
Paper available with the corpus.
Corpora available from ELRA here and here
Resources
- Training recipe for HTK available here. Note that this recipe is for Wall Street Journal (WSJ0), which is the clean speech version of AURORA4. Small changes are needed in the feature extraction scripts to account for different file extensions.
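The kind of small change mentioned above can be sketched as rewriting the extensions in a list of audio file paths before feature extraction. This is only an illustration: the function name `retarget_extensions` is hypothetical, and the extensions in the usage note are placeholders, not the actual extensions used by either corpus.

```python
from pathlib import PurePosixPath


def retarget_extensions(paths, old_ext, new_ext):
    """Return paths with old_ext replaced by new_ext; other paths are unchanged.

    Extensions include the leading dot, e.g. ".abc".
    """
    result = []
    for original in paths:
        path = PurePosixPath(original)
        if path.suffix == old_ext:
            result.append(str(path.with_suffix(new_ext)))
        else:
            result.append(original)
    return result
```

For example, `retarget_extensions(file_list, ".abc", ".xyz")` would map every entry ending in the placeholder extension `.abc` to `.xyz` and leave the remaining entries untouched.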
Speaker identification and verification
Speech enhancement and separation
Other applications
Contribute a dataset
To contribute a new dataset, please
- create an account and login
- go to the wiki page above corresponding to your application; if it does not exist yet, you may create it
- click on the "Edit" link at the top of the page and add a new section for your dataset (the datasets are ordered by year of collection)
- click on the "Save page" link at the bottom of the page to save your modifications
Please make sure to provide the following information:
- name of the dataset and year of collection
- authors, institution, contact information
- link to the dataset and to side resources (lexicon, language model, etc.)
- short description (nature of the data, license, etc.) and link to a paper/report describing the dataset, if any
- at least 1 research result obtained for this dataset (see below)
We currently cannot provide storage space for large datasets. Please upload the dataset at a stable URL on the website of your institution or elsewhere and provide its URL only. If this is not possible, please contact the resources sharing working group.
Contribute a research result
To contribute a new research result, please
- create an account and login
- go to the wiki page and the section corresponding to the dataset for which this result was obtained
- click on the "Edit" link on the right of the section header and add a new item for your result
- click on the "Save page" link at the bottom of the page to save your modifications
Please make sure to provide the following information:
- authors, paper/report title, means of publication
- link to the pdf of the paper
- link to derived data (output transcriptions, intermediate data, etc.)
- code and instructions to reproduce experiments (if available)
To save storage space, please do not upload the paper to this wiki. Instead, link to it wherever possible in your institutional archive, another public archive (e.g., arXiv) or the publisher's website (e.g., IEEE Xplore).
We currently cannot provide storage space for large files. Please upload the derived data at a stable URL on the website of your institution or elsewhere and provide its URL only. If this is not possible, please contact the resources sharing working group.