This page provides a number of datasets grouped by application, as well as research results (papers, numerical results, output transcriptions, intermediary data, etc) corresponding to each dataset.
 
  
{| class="wikitable sortable"
+
== [[Speech datasets]] ==
|
+
The table below aims to provide a list of speech datasets with detailed attributes and links to software baselines and evaluation results. Each dataset may be used for one or more applications: automatic speech recognition, speaker identification and verification, source localization, speech enhancement and separation... The meaning of each attribute is detailed [[#speech_attributes|below]].
|Data
+
 
|
+
Disclaimer: Only datasets that are '''publicly available''', (at least partially) '''annotated''', '''suitable for research on robustness''', and '''longer than 5 min''' are listed. Other relevant datasets are listed [[#Other datasets|below]].
|
+
 
|
+
If you would like to refer to this table, please cite
|
+
'''J. Le Roux and E. Vincent, "A categorization of robust speech processing datasets", Mitsubishi Electric Research Laboratories Technical Report, TR2014-116, Aug. 2014.'''
|
+
 
|
+
 
|
+
{| class="wikitable sortable" style="font-size:72%; border:gray solid 1px; text-align:center; width:auto; table-layout:fixed;"
|
+
|-
|
+
!style="width: 40px" rowspan="2" class="unsortable"|Datasets
|
+
!colspan="8" |General attributes
|Speech
+
!colspan="7" |Speech
|
+
!colspan="4" |Channel
|
+
!colspan="2" |Noise
|
+
!colspan="5" |Ground truth
|
+
|-
|
+
!scope="col" width="40px" | rel. year
|
+
!scope="col" width="40px" | use case
|Channel
+
!scope="col" width="40px" | total time (h)
|
+
!scope="col" width="40px" | sam. rate (kHz)
|
+
!scope="col" width="40px" | dist. or noisy mics
|
+
!scope="col" width="40px" | video cams
|Noise
+
!scope="col" width="40px" | cost (non- memb)
|Ground truth
+
!scope="col" width="40px" class="unsortable" | links
|
+
!scope="col" width="40px" | speak. time (h)
|
+
!scope="col" width="40px" | uniq. speak.
|
+
!scope="col" width="40px" | lang.
|
+
!scope="col" width="40px" | uniq. words (k)
|
+
!scope="col" width="40px" | speak. style
|
+
!scope="col" width="40px" | speak. / rec.
|----
+
!scope="col" width="40px" | overl. type
|
+
!scope="col" width="40px" | chan. type
|release
+
!scope="col" width="40px" | speak. radiat.
|scenario
+
!scope="col" width="40px" | speak. loc.
|duration
+
!scope="col" width="40px" | speak. moves
|sampling
+
!scope="col" width="40px" | noise type
|mixture channels
+
!scope="col" width="40px" | avg. SNR
|cameras
+
!scope="col" width="40px" | ref. signal
|available
+
!scope="col" width="40px" | speak. loc., orient.
|cost
+
!scope="col" width="40px" | words
|URL
+
!scope="col" width="40px" | non- verb. traits
|email
+
!scope="col" width="40px" | noise events
|reference
+
|-
|duration
+
!ShATR
|speakers
 
|language
 
|vocab (unique words)
 
|style
 
|sources
 
|overlap
 
|type
 
|radiation
 
|location
 
|move
 
|type
 
|signal
 
|loc./ori.
 
|words
 
|nonverbal
 
|noise
 
|
 
|
 
|----
 
|ShATR
 
 
|1994
 
|1994
 
|meeting
 
|meeting
|37 min
+
|{{no|0.6}}
|48000
+
|{{yes|48}}
|3 (distant)
+
|{{some|3}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://spandh.dcs.shef.ac.uk/projects/shatrweb/ download]
|http://spandh.dcs.shef.ac.uk/projects/shatrweb/
+
[http://spandh.dcs.shef.ac.uk/projects/shatrweb/papers/ioa94.html paper]
|g.brown@dcs.shef.ac.uk
+
|{{no|0.6}}
|Malcolm Crawford, Guy J. Brown, Martin Cooke and Phil Green, "Design, collection and analysis of a multi-simultaneous-speaker corpus," Proceedings of The Institute of Acoustics, 16(5):183-190.
+
|{{no|5}}
|37 min
 
|5
 
 
|UK English
 
|UK English
|1k
+
|{{some|1}}
|colloquial
+
|{{yes|spontaneous}}
 
|5
 
|5
|multiple conversations
+
|multiple dialogs
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|meeting
+
|{{yes|meeting}}
|headset
+
|{{no|high}}
 +
|{{some|headset}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!LLSEC
|----
 
|LLSEC
 
 
|1996
 
|1996
|conversation
+
|dialog
|1.4 h
+
|{{some|1.4}}
|16000
+
|{{some|16}}
|4 (distant)
+
|{{yes|4}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[https://www.ll.mit.edu/mission/cybersec/HLT/corpora/SpeechCorpora.html download]
|https://www.ll.mit.edu/mission/cybersec/HLT/corpora/SpeechCorpora.html
 
|jpc@ll.mit.edu
 
|{{dunno}}
 
 
|{{dunno}}
 
|{{dunno}}
|12
+
|{{some|12}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
|read/colloquial
+
|{{yes|read, spontaneous}}
 
|2
 
|2
|conversation
+
|dialog
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|hallway, restaurant
+
|{{some|hallway, restaurant (scenarized)}}
 +
|{{some|medium}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
Line 127: Line 101:
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!MicArray
|----
+
|1996
|RWCP Spoken Dialog Corpus
+
|office
|1996-1997
+
|{{no|0.2}}
|conversation
+
|{{some|16}}
|10 h
+
|{{yes|9 - 16}}
|16000
+
|{{no}}
|2 (close but cross-talk)
+
|{{yes|free}}
 +
|[http://www.speech.cs.cmu.edu/databases/micarray/ download]
 +
[http://www.cs.cmu.edu/afs/cs/user/robust/www/Thesis/tms_thesis.pdf paper]
 +
|{{no|0.2}}
 +
|{{some|14}}
 +
|US English
 +
|{{no|0.07}}
 +
|{{no|digits, command}}
 +
|1
 +
|no
 +
|{{yes|reverb}}
 +
|{{yes|human}}
 +
|{{some|quasi-fixed}}
 +
|{{yes|head}}
 +
|{{yes|stationary background}}
 +
|{{some|medium}}
 +
|{{some|headset}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|free
+
|{{no}}
|http://research.nii.ac.jp/src/en/RWCP-SP96.html
+
|{{no}}
|src@nii.ac.jp
+
|-
|Kazuyo Tanaka, Satoru Hayamizu, Yoichi Yamashita, Kiyohiro Shikano, Shuichi Itahashi and Ryuichi Oka, "Design and data collection for a spoken dialog database in the Real World Computing (RWC) program," J. Acoust. Soc. Am. 100, 2759 (1996)
+
!RWCP Spoken Dialog Corpus
|10 h
+
|1996 - 1997
|39
+
|dialog
 +
|{{yes|10}}
 +
|{{some|16}}
 +
|{{some|2}}
 +
|{{no}}
 +
|{{yes|free}}
 +
|[http://research.nii.ac.jp/src/en/RWCP-SP96.html download]
 +
[http://scitation.aip.org/content/asa/journal/jasa/100/4/10.1121/1.416338 paper]
 +
|{{yes|10}}
 +
|{{some|39}}
 
|Japanese
 
|Japanese
 
|{{dunno}}
 
|{{dunno}}
|colloquial
+
|{{yes|spontaneous}}
|1 or 2
+
|1 - 2
|conversation
+
|dialog
|reverb
+
|{{yes|reverb (low)}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|stationary background noise
+
|{{yes|stationary background}}
 +
|{{no|high}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
Line 159: Line 159:
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!SUSAS
|----
+
|1999
|Aurora-2
+
|stress
 +
|{{dunno}}
 +
|{{no|8}}
 +
|{{no|1}}
 +
|{{no}}
 +
|{{some|0.5k$}}
 +
|[https://catalog.ldc.upenn.edu/LDC99S78 download]
 +
[https://catalog.ldc.upenn.edu/LDC99T33 download]
 +
[https://catalog.ldc.upenn.edu/docs/LDC99S78/susas_rev1b4.ps paper]
 +
|{{dunno}}
 +
|{{some|36}}
 +
|US English
 +
|{{no|0.035}}
 +
|{{no|command}}
 +
|1
 +
|no
 +
|{{yes|reverb}}
 +
|{{yes|human}}
 +
|{{some|quasi-fixed}}
 +
|{{yes|head}}
 +
|{{yes|stationary background}}
 +
|{{no|high}}
 +
|{{no}}
 +
|{{no}}
 +
|{{yes}}
 +
|{{yes}}
 +
|{{no}}
 +
|-
 +
!Aurora-2
 
|2000
 
|2000
 
|public spaces
 
|public spaces
|33 h
+
|{{yes|33}}
|8000-16000
+
|{{some|8 - 16}}
|1 (close)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|free given TIDigits (0.5 k$)}}
|TIDigits
+
|[http://catalog.elra.info/product_info.php?cPath=37_40&products_id=693 purchase] (incl. HTK)
|http://aurora.hsnr.de/download.html
+
[http://www.isca-speech.org/archive_open/asr2000/asr0_181.html paper]
|hans-guenter.hirsch@hs-niederrhein.de
+
[http://aurora.hsnr.de/download.html features]
|Hans-Gnter Hirsch, David Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions,", Proc. Interspeech 2000
+
|{{yes|33}}
|33 h
+
|{{yes|214}}
|214
 
 
|US English
 
|US English
|11
+
|{{no|0.01}}
|digits
+
|{{no|digits}}
 
|1
 
|1
|{{no}}
+
|no
|{{no}} (simulated telephone channel)
+
|{{some|simulated phone}}
|human
+
|{{yes|human}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{no}}
 
|{{no}}
|various real environments
+
|{{some|various real environments (rescaled)}}
|original
+
|{{yes|low}}
 +
|{{yes|original}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!SPINE1, SPINE2
|----
+
|2000 - 2001
|SPINE1/SPINE2
 
|2000-2001
 
 
|military
 
|military
|38 h
+
|{{yes|38}}
|16000
+
|{{some|16}}
|2 (close)
+
|{{some|2}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{no|7.4 k$}}
|2 x ($800 (audio) + $500 (transcripts)) + 3 x ($1000 (audio) + $600 (transcripts))
+
|[https://catalog.ldc.upenn.edu/search?q%5Bname_cont%5D=SPINE purchase]
|https://catalog.ldc.upenn.edu/LDC2000S87
+
[http://dl.acm.org/citation.cfm?id=1289199 paper]
|jdwright@ldc.upenn.edu
 
|T.H. Crystal et al., "Speech in noisy environments (SPINE) adds new dimension to speech recognition R&D", Proc. HLT 2002
 
 
|{{dunno}}
 
|{{dunno}}
|100
+
|{{yes|100}}
 
|US English
 
|US English
|1k
+
|{{some|1}}
|command/colloquial
+
|{{yes|command, spontaneous}}
|1 or 2
+
|1 - 2
|{{no}}
+
|no
|{{no}} (simulated transmission channels)
+
|{{some|simulated radio}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|military (pre-recorded noise played in sound booth while recording speech)
+
|{{some|military (rescaled)}}
 +
|{{yes|low}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
Line 223: Line 248:
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!Aurora-3 (subset of SpeechDat- Car)
|----
+
|2000 - 2003
|Aurora-3 (subset of SpeechDat-Car)
 
|2000-2003
 
 
|car
 
|car
 
|{{dunno}}
 
|{{dunno}}
|16000
+
|{{some|16}}
|3 (+1 GSM)  (distant)
+
|{{yes|4}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|1 k€}}
|5 x 200 (Academics) / 5 x 1,000 (Companies)
+
|[http://catalog.elra.info/index.php?cPath=37_40 purchase] (incl. HTK)
|http://catalog.elra.info/index.php?cPath=37_40
+
[http://aurora.hsnr.de/aurora-3/reports.html papers]
|
 
|
 
 
|{{dunno}}
 
|{{dunno}}
|{{dunno}}
+
|{{yes|730}}
|Finnish, German, Spanish, Danish, Italian
+
|various
|{{dunno}}
+
|{{no|0.01}}
|command (read/digits/keywords/spontaneous)
+
|{{no|digits}}
 
|1
 
|1
|{{no}}
+
|no
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|car
+
|{{yes|car}}
|close-talk
+
|{{yes|low}}
 +
|{{some|headset}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!RWCP Meeting Speech Corpus
|----
 
|RWCP Meeting Speech Corpus
 
 
|2001
 
|2001
 
|meeting
 
|meeting
|3.5 h
+
|{{some|3.5}}
|16000-48000
+
|{{yes|16 - 48}}
|1 (distant)
+
|{{no|1}}
|3
+
|{{yes|3}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://research.nii.ac.jp/src/en/RWCP-SP01.html download]
|http://research.nii.ac.jp/src/en/RWCP-SP01.html
+
[http://id.nii.ac.jp/1001/00057420/ paper]
|src@nii.ac.jp
+
|{{some|3.5}}
|Kazuyo Tanaka, Katunobu Itou, Masanori Ihara, Ryuichi Oka, "Constructing a Meeting Speech Corpus", IPSJ, 37-15, 2001
 
|3.5 h
 
 
|{{dunno}}
 
|{{dunno}}
 
|Japanese
 
|Japanese
 
|{{dunno}}
 
|{{dunno}}
|colloquial
+
|{{yes|spontaneous}}
|1 to 5
+
|1 - 5
 
|meeting
 
|meeting
|low reverb
+
|{{yes|reverb (low)}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|stationary background noise
+
|{{yes|stationary background}}
|headset
+
|{{no|high}}
 +
|{{some|headset}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!RWCP Real Environment Speech Database
|----
 
|RWCP Real Environment Speech and Acoustic Database
 
 
|2001
 
|2001
|domestic/office
+
|domestic, office
 
|{{dunno}}
 
|{{dunno}}
|16000-48000
+
|{{yes|16 - 48}}
|30 (distant)
+
|{{yes|84}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://research.nii.ac.jp/src/en/RWCP-SSD.html download]
|http://research.nii.ac.jp/src/en/RWCP-SSD.html
+
[http://www.lrec-conf.org/proceedings/lrec2000/html/summary/356.htm paper]
|s-nakamura@is.naist.jp
 
|Satoshi Nakamura, Kazuo Hiyane, Futoshi Asano, Takanobu Nishiura, and Takeshi Yamada, "Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition," LREC 2000.
 
 
|{{dunno}}
 
|{{dunno}}
|5
+
|{{no|5}}
|Japanese
+
|US English, Japanese
 
|{{dunno}}
 
|{{dunno}}
|read
+
|{{some|read}}
 
|1
 
|1
|{{no}}
+
|no
|real rir/reverb
+
|{{yes|real rir, reverb}}
|loudspeaker
+
|{{no|loudspeaker}}
|various
+
|{{yes|various}}
|{{no}}/pivoting arm
+
|{{some|no, pivoting arm}}
|stationary background noise
+
|{{some|various (sum of events)}}
|original
+
|{{some|medium}}
 +
|{{yes|original}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!SpeechDat- Car
|----
+
|2001 - 2011
|SpeechDat-Car
 
|2001-2011
 
 
|car
 
|car
 
|{{dunno}}
 
|{{dunno}}
|16000
+
|{{some|16}}
|3 (+1 GSM)  (distant)
+
|{{yes|4}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{no|39 - 182 k€ per lang}}
|1.1 Million  for all 10 languages. Each costs 39k  to 182k
+
|[http://catalog.elra.info/search.php purchase]
|http://catalog.elra.info/index.php?cPath=37_41
+
[http://www.lrec-conf.org/proceedings/lrec2000/html/summary/373.htm paper]
|
 
|A. Moreno et al., "SPEECHDAT-CAR. A Large Speech Database for Automotive Environments," Proc. LREC 2000
 
 
|{{dunno}}
 
|{{dunno}}
|300/language
+
|{{yes|300 per lang}}
|Multiple
+
|various
 
|{{dunno}}
 
|{{dunno}}
|command (read/digits/keywords/spontaneous)
+
|{{yes|digits, command, read, spontaneous}}
 
|1
 
|1
|{{no}}
+
|no
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|car
+
|{{yes|car}}
|close-talk
+
|{{yes|low}}
 +
|{{some|headset}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!Aurora-4
|----
 
|Aurora-4
 
 
|2002
 
|2002
 
|public spaces
 
|public spaces
 
|{{dunno}}
 
|{{dunno}}
|8000-16000
+
|{{some|8 - 16}}
|1 (close)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|free given WSJ0 (1.5 k$)}}
|WSJ0
+
|[http://catalog.elra.info/index.php?cPath=37_40 purchase]
|http://aurora.hsnr.de/download.html
+
[http://aurora.hsnr.de/aurora-4/reports.html paper]
|hans-guenter.hirsch@hs-niederrhein.de
+
[http://www.keithv.com/software/htk/ HTK]
|N. Parihar and J. Picone, "Aurora Working Group: DSR Front End LVCSR Evaluation AU/384/02," Tech. Rep., Inst. for Signal and Information Process, Mississippi State University, 2002
 
 
|{{dunno}}
 
|{{dunno}}
|101
+
|{{yes|101}}
 
|US English
 
|US English
|10k
+
|{{yes|10}}
|read
+
|{{some|read}}
 
|1
 
|1
|{{no}}
+
|no
|{{no}} (simulated telephone channel)
+
|{{some|simulated phone}}
|human
+
|{{yes|human}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{no}}
 
|{{no}}
|various real environments
+
|{{some|various real environments (rescaled)}}
|original
+
|{{yes|low}}
 +
|{{yes|original}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!TED
|----
 
|TED
 
 
|2002
 
|2002
 
|seminar
 
|seminar
|47 h
+
|{{yes|47}}
|16000
+
|{{some|16}}
|1 (distant)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|0.5 k$}}
|$275 (audio) + $250 (transcripts)
+
|[https://catalog.ldc.upenn.edu/LDC2002S04 purchase]
|https://catalog.ldc.upenn.edu/LDC2002S04
+
[http://perso.limsi.fr/lamel/icslp94ted.pdf paper]
|
+
|{{yes|47}}
|L. Lamel, F. Schiel, A. Fourcin, J. Mariani, and H. Tillman, "The translingual English database (TED)," Proc. ICSLP, 1994
+
|{{yes|188}}
|47 h
+
|non-native English
|188
 
|English (mostly non-native)
 
 
|{{dunno}}
 
|{{dunno}}
|lecture
+
|{{some|lecture}}
 
|1 or more
 
|1 or more
 
|seminar
 
|seminar
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|stationary background noise
+
|{{yes|stationary background}}
|lapel
+
|{{no|high}}
 +
|{{some|lapel}}
 
|{{no}}
 
|{{no}}
|{{yes}} (partial)
+
|{{some|partial}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!CUAVE
|----
 
|CUAVE
 
 
|2002
 
|2002
|cocktail party
+
|speech overlap
|3 h
+
|{{some|3}}
|44100
+
|{{yes|44}}
|1 (distant)
+
|{{no|1}}
|1
+
|{{some|1}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://media.clemson.edu/cuave/CUAVE-092908.iso download]
|http://www.clemson.edu/ces/speech/cuave.htm
+
[http://asp.eurasipjournals.com/content/2002/11/208541 paper]
|ksampat@clemson.edu
+
|{{some|3}}
|Eric K Patterson, Sabri Gurbuz, Zekeriya Tufekci and John N Gowdy, "Moving-Talker, Speaker-Independent Feature Study, and Baseline Results Using the CUAVE Multimodal Speech Corpus," EURASIP Journal on Advances in Signal Processing 2002, 2002:208541
+
|{{some|36}}
|3 h
 
|36
 
 
|US English
 
|US English
|10
+
|{{no|0.01}}
|digits
+
|{{no|digits}}
|1 or 2
+
|1 - 2
 
|full
 
|full
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|stationary background noise
+
|{{yes|stationary background}}
 +
|{{no|high}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
Line 447: Line 452:
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!CU-Move Microphone Array Data
|----
+
|2002 - 2011
|CU-Move ("Microphone Array Data"; downsampled data with more speakers but less channels exist)
 
|2002-2011
 
 
|car
 
|car
|286 h
+
|{{yes|286}}
|44100
+
|{{yes|44}}
|6 to 8 (distant)
+
|{{yes|6 - 8}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{no|25 k$}}
|$25k with UT-Drive
+
|[http://crss.utdallas.edu/ purchase]
|http://crss.utdallas.edu/
+
[http://www.isca-speech.org/archive/eurospeech_2001/e01_2023.html paper]
|john.hansen@utdallas.edu
+
|{{yes|286}}
|John H.L. Hansen, Pongtep Angkititrakul, Jay Plucienkowski, Stephen Gallant, Umit Yapanel, Bryan Pellom, Wayne Ward, and Ron Cole, ""CU-Move" : Analysis & Corpus Development for Interactive In-Vehicle Speech Systems", Interspeech 2001
+
|{{yes|172}}
|286 h
 
|172
 
 
|US English
 
|US English
|12k
+
|{{yes|12}}
|command/digits/read/dialogue
+
|{{yes|digits, command, read, dialog}}
 
|1
 
|1
 +
|no
 +
|{{yes|reverb}}
 +
|{{yes|human}}
 +
|{{some|quasi-fixed}}
 +
|{{yes|head}}
 +
|{{yes|car}}
 +
|{{yes|low}}
 +
|{{no}}
 +
|{{no}}
 +
|{{yes}}
 +
|{{no}}
 
|{{no}}
 
|{{no}}
|reverb
+
|-
|human
+
!PDA
|quasi-fixed
+
|2003
|head
+
|office
|car
+
|{{some|1.6-3}}
 +
|{{some|11 - 16}}
 +
|{{some|1 - 4}}
 
|{{no}}
 
|{{no}}
 +
|{{yes|free}}
 +
|[http://www.speech.cs.cmu.edu/databases/pda/ download]
 +
[http://www.sapaworkshops.org/2004/papers/52.pdf paper]
 +
[http://www.cs.cmu.edu/afs/cs/user/robust/www/Thesis/mseltzer_phdthesis.pdf paper]
 +
|{{some|1.6 - 3}}
 +
|{{some|11 - 16}}
 +
|US English
 +
|{{some|1 - 2}}
 +
|{{some|read}}
 +
|1
 +
|no
 +
|{{yes|reverb}}
 +
|{{yes|human}}
 +
|{{some|quasi-fixed}}
 +
|{{yes|head}}
 +
|{{yes|stationary background}}
 +
|{{yes|low}}
 +
|{{some|headset}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!CENSREC-1 (Aurora-2J)
|----
 
|CENSREC-1 (Aurora-2J)
 
 
|2003
 
|2003
 
|public spaces
 
|public spaces
 
|{{dunno}}
 
|{{dunno}}
|8000
+
|{{no|8}}
|1 (close)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://research.nii.ac.jp/src/en/CENSREC-1.html download]
|http://research.nii.ac.jp/src/en/CENSREC-1.html
+
[http://ir.nul.nagoya-u.ac.jp/jspui/bitstream/2237/15046/1/425.pdf paper]
|
+
|{{dunno}}
|S. Nakamura, K. Takeda, K. Yamamoto, T. Yamada, S. Kuroiwa, N. Kitaoka, T. Nishiura, A. Sasou, M. Mizumachi, C. Miyajima, M. Fujimoto, and T. Endo, "Aurora-2J, an evaluation framework for Japanese noisy speech recognition," IEICE Transactions on Information and Systems, vol. E88-D, no. 3:pp. 535544, 2005
+
|{{yes|214}}
|
 
|214
 
 
|Japanese
 
|Japanese
|11
+
|{{no|0.01}}
|digits
+
|{{no|digits}}
 
|1
 
|1
|{{no}}
+
|no
|various microphones and simulated channels
+
|{{some|simulated phone}}
|human
+
|{{yes|human}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{no}}
 
|{{no}}
|various real environments
+
|{{some|various real environments (rescaled)}}
|original
+
|{{yes|low}}
 +
|{{yes|original}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!AVICAR
|----
 
|AVICAR
 
 
|2004
 
|2004
 
|car
 
|car
|29 h
+
|{{yes|40}}
|16000
+
|{{some|16}}
|7 (distant)
+
|{{yes|7}}
|4
+
|{{yes|4}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://www.isle.illinois.edu/sst/AVICAR/ download]
|http://www.isle.illinois.edu/sst/AVICAR/
+
[http://www.isca-speech.org/archive/interspeech_2004/i04_2489.html paper]
|jhasegaw@illinois.edu
+
|{{yes|40}}
|Bowon Lee, Mark Hasegawa-Johnson, Camille Goudeseune, Suketu Kamdar, Sarah Borys, Ming Liu, Thomas Huang, "AVICAR: Audio-Visual Speech Corpus in a Car Environment", Proc. Interspeech, 2004
+
|{{some|87}}
|29 h
+
|US English, non-native English
|86
+
|{{some|1}}
|US/non-native English
+
|{{some|read}}
|1k
 
|read
 
 
|1
 
|1
|{{no}}
+
|no
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|car
+
|{{yes|moving car, windows open or closed}}
 +
|{{yes|low}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
Line 543: Line 569:
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!AV16.3
|----
 
|AV16.3
 
 
|2004
 
|2004
 
|meeting
 
|meeting
|1.5 h
+
|{{some|1.5}}
|16000
+
|{{some|16}}
|16 (distant)
+
|{{yes|16}}
|3
+
|{{yes|3}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://www.idiap.ch/dataset/av16-3/ download]
|http://www.idiap.ch/dataset/av16-3/
+
[http://publications.idiap.ch/index.php/publications/show/353 paper]
|odobez@idiap.ch
+
|{{some|1.5}}
|"AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking", by Guillaume Lathoud, Jean-Marc Odobez and Daniel Gatica-Perez, in Proceedings of the MLMI'04 Workshop, 2004.
+
|{{some|12}}
|1.5 h
 
|12
 
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
|colloquial
+
|{{yes|spontaneous}}
|1 to 3
+
|1 - 3
 
|full
 
|full
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|various
+
|{{yes|various}}
|walk
+
|{{yes|head, walk}}
|stationary background noise
+
|{{yes|stationary background}}
 +
|{{no|high}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|partial}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!ICSI Meeting Corpus
|----
 
|ICSI Meeting Corpus
 
 
|2004
 
|2004
 
|meeting
 
|meeting
|72 h
+
|{{yes|72}}
|16000
+
|{{some|16}}
|6 (distant)
+
|{{yes|6}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{no|2.8 k$}}
|$1900 (audio) + $900 (transcripts)
+
|[https://catalog.ldc.upenn.edu/search?q%5Bname_cont%5D=ICSI purchase]
|https://catalog.ldc.upenn.edu/LDC2004S02
+
[http://www1.icsi.berkeley.edu/Speech/mr/ info]
|mrcontact@icsi.berkeley.edu
+
[http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1198793 paper]
|A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, C. Wooters, "The ICSI meeting corpus," Proc. ICASSP, Apr. 2003
+
|{{yes|72}}
|72 h
+
|{{some|53}}
|53
+
|US English, other English
|US English
+
|{{yes|13}}
|13k
+
|{{yes|meeting}}
|meeting
+
|3 - 10
|3 to 10
 
 
|meeting
 
|meeting
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|stationary background noise
+
|{{yes|meeting}}
|headset (some lapel)
+
|{{no|high}}
 +
|{{some|headset, lapel}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
|{{no}}
+
|{{some|ad-hoc}}
|
+
|-
|
+
!NIST Meeting Pilot Corpus Speech
|----
 
|NIST Meeting Pilot Corpus Speech
 
 
|2004
 
|2004
 
|meeting
 
|meeting
|15 h
+
|{{yes|15}}
|16000
+
|{{some|16}}
|7 (distant)
+
|{{yes|7}}
|{{no}} (released but not currently available for download)
+
|{{no}}
|{{yes}}
+
|{{no|5.5 k$}}
|$4000 (audio) + $1500 (transcripts)
+
|[https://catalog.ldc.upenn.edu/search?q%5Bname_cont%5D=NIST%20Meeting purchase]
|https://catalog.ldc.upenn.edu/LDC2004S09
+
[http://www.lrec-conf.org/proceedings/lrec2004/summaries/137.htm paper]
|john.garofolo@nist.gov
+
|{{yes|15}}
|John S. Garofolo, Christophe D. Laprun, Martial Michel, Vincent M. Stanford and Elham Tabassi, "The NIST Meeting Room Pilot Corpus," Proc. LREC, 2004
+
|{{some|61}}
|15 h
 
|61
 
 
|US English
 
|US English
|6k
+
|{{some|6}}
|meeting
+
|{{yes|meeting}}
|3 to 9
+
|3 - 9
 
|meeting
 
|meeting
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|various
+
|{{yes|various}}
|walk
+
|{{yes|head, walk}}
|stationary background noise
+
|{{yes|stationary background}}
|headset+lapel
+
|{{no|high}}
 +
|{{some|headset, lapel}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!CHIL Meetings
|----
+
|2004 - 2007
|CHIL Meetings
+
|seminar, meeting
|2004-2007
+
|{{yes|60}}
|seminar/meeting
+
|{{yes|44}}
|60 h
+
|{{yes|79 - 147}}
|44100
+
|{{yes|6 - 9}}
|79 to 147 (distant)
+
|{{no|3.5 k€}}
|6 to 9
+
|[http://catalog.elra.info/search.php purchase]
|{{yes}}
+
[http://link.springer.com/article/10.1007%2Fs10579-007-9054-4 paper]
|3 500
 
|http://catalog.elra.info/search.php
 
|choukri@elda.org
 
|D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. Chu, A. Tyagi, J. Casas, J. Turmo, L. Cristoforetti, F. Tobia, A. Pnevmatikakis, V. Mylonakis, F. Talantzis, S. Burger, R. Stiefelhagen, K. Bernardin, C. Rochet, The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms, in LANGUAGE RESOURCES AND EVALUATION, vol. 41, n. 3-4, 2007, pp. 389-407
 
 
|{{dunno}}
 
|{{dunno}}
 
|{{dunno}}
 
|{{dunno}}
 
|non-native English
 
|non-native English
 
|{{dunno}}
 
|{{dunno}}
|lecture/meeting
+
|{{yes|seminar, meeting}}
|3 to 20
+
|3 - 20
|seminar/meeting
+
|seminar, meeting
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|meeting (scenarized)
+
|{{some|meeting (scenarized)}}
|headset
+
|{{no|high}}
 +
|{{some|headset}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!SPEECON
|----
+
|2004 - 2011
|SPEECON
+
|public space, domestic, office, car
|2004-2011
 
|public space/domestic/office/car
 
 
|{{dunno}}
 
|{{dunno}}
|16000
+
|{{some|16}}
|3 (distant)
+
|{{some|3}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{no|75 k€ per lang}}
|29 x 75000  for all languages
+
|[http://catalog.elra.info/search.php purchase]
|http://catalog.elra.info/index.php?cPath=37
+
[http://www.lrec-conf.org/proceedings/lrec2002/sumarios/177.htm paper]
|diskra@appen.com
 
|Dorota Iskra, Beate Grosskopf, Krzysztof Marasek, Henk van den Heuvel, Frank Diehl, Andreas Kiessling, "SPEECON  Speech Databases for Consumer Devices: Database Specification and Validation", LREC p. 329-333, 2002.
 
 
|{{dunno}}
 
|{{dunno}}
|600/language
+
|{{yes|600 per lang}}
|Multiple
+
|various
 
|{{dunno}}
 
|{{dunno}}
|command/read/spontaneous
+
|{{yes|command, read, spontaneous}}
 
|1
 
|1
|{{no}}
+
|no
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|various real environments
+
|{{yes|various real environments}}
|headset
+
|{{some|medium}}
 +
|{{some|headset}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!CENSREC-2
|----
 
|CENSREC-2
 
 
|2005
 
|2005
 
|car
 
|car
 
|{{dunno}}
 
|{{dunno}}
|16000
+
|{{some|16}}
|1 (distant)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://research.nii.ac.jp/src/en/CENSREC-2.html download]
|http://research.nii.ac.jp/src/en/CENSREC-2.html
+
[http://www.isca-speech.org/archive/interspeech_2006/i06_1726.html paper]
|src@nii.ac.jp
 
|S. Nakamura, M. Fujimoto, and K. Takeda, "CENSREC2: Corpus and evaluation environments for in car continuous digit speech recognition," Proc. ICSLP 2006
 
 
|{{dunno}}
 
|{{dunno}}
|214
+
|{{yes|214}}
 
|Japanese
 
|Japanese
|11
+
|{{no|0.01}}
|digits
+
|{{no|digits}}
 
|1
 
|1
|{{no}}
+
|no
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|car
+
|{{yes|car}}
|headset
+
|{{yes|low}}
 +
|{{some|headset}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!CENSREC-3
|----
 
|CENSREC-3
 
 
|2005
 
|2005
 
|car
 
|car
 
|{{dunno}}
 
|{{dunno}}
|16000
+
|{{some|16}}
|1 (distant)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|21 k¥}}
|free except phonetically balanced training set: JPY 21000 (Universities) / JPY 105000 (Companies)
+
|[http://research.nii.ac.jp/src/en/CENSREC-3.html purchase]
|http://research.nii.ac.jp/src/en/CENSREC-3.html
+
[http://ir.nul.nagoya-u.ac.jp/jspui/bitstream/2237/15050/1/429.pdf paper]
|src@nii.ac.jp
 
|M. Fujimoto, K. Takeda, and S. Nakamura, "CENSREC-3: An evaluation framework for Japanese speech recognition in real driving-car environments," IEICE Transactions on Information and Systems, vol. E89-D, no. 11:pp. 27832793, 2006
 
 
|{{dunno}}
 
|{{dunno}}
|18 (+293 in training)
+
|{{yes|311}}
 
|Japanese
 
|Japanese
|50 in evaluation; unknown but larger in phonetically-balanced utterances of training set
+
|{{no|0.05}}
|read
+
|{{some|read}}
 
|1
 
|1
|{{no}}
+
|no
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|car
+
|{{yes|car}}
|headset
+
|{{yes|low}}
 +
|{{some|headset}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!Aurora-5
|----
 
|Aurora-5
 
 
|2006
 
|2006
|public spaces/domestic/office/car
+
|public spaces, domestic, office, car
 
|{{dunno}}
 
|{{dunno}}
|8000
+
|{{no|8}}
|1 (distant)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|free given TIDigits (0.5 k$)}}
|TIDigits
+
|[http://catalog.elra.info/product_info.php?cPath=37_40&products_id=1015 purchase] (incl. HTK)
|http://aurora.hsnr.de/download.html
+
[http://aurora.hsnr.de/aurora-5/reports.html paper]
|hans-guenter.hirsch@hs-niederrhein.de
 
|Hans-Gnter Hirsch, "Aurora-5 experimental framework for the performance evaluation of speech recognition in case of a hands-free speech input in noisy environments,", Tech Report, Niederrhein Univ. of Applied Sciences, 2007
 
 
|{{dunno}}
 
|{{dunno}}
|225
+
|{{yes|225}}
 
|US English
 
|US English
|11
+
|{{no|0.01}}
|digits
+
|{{no|digits}}
 
|1
 
|1
 +
|no
 +
|{{some|no, simulated rir, real rir}}
 +
|{{no|loudspeaker}}
 +
|{{no|fixed}}
 
|{{no}}
 
|{{no}}
|real rir/simu/no + simulated telephone channel
+
|{{some|various real environments (rescaled)}}
|loudspeaker
+
|{{yes|low}}
|{{n/s}}
+
|{{yes|original}}
|{{no}}
 
|various real environments
 
|original
 
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!AMI
|----
 
|AMI
 
 
|2006
 
|2006
 
|meeting
 
|meeting
|100 h
+
|{{yes|100}}
|16000
+
|{{some|16}}
|16 (distant)
+
|{{yes|16}}
|6
+
|{{yes|6}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://groups.inf.ed.ac.uk/ami/ download]
|http://groups.inf.ed.ac.uk/ami/
+
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=4538700 paper]
|amicorpus@amiproject.org
 
|Steve Renals, Thomas Hain, and Herv Bourlard. Interpretation of multiparty meetings: The AMI and AMIDA projects. In IEEE Workshop on Hands-Free Speech Communication and Microphone Arrays, 2008. HSCMA 2008, pages 115-118, 2008
 
 
|{{dunno}}
 
|{{dunno}}
|189
+
|{{yes|189}}
|UK English
+
|UK English, other English
|8k
+
|{{some|8}}
|meeting
+
|{{yes|meeting}}
|4 (18% overlap)
+
|most often 4
|meeting
+
|meeting (18% overlap)
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|stationary background noise
+
|{{yes|stationary background}}
|headset+lapel
+
|{{no|high}}
 +
|{{some|headset, lapel}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!PASCAL SSC
|----
 
|PASCAL SSC
 
 
|2006
 
|2006
|cocktail party
+
|speech overlap
|18.5 min (+ 8.5h clean training data)
+
|{{some|8.8}}
|25000
+
|{{yes|25}}
|1 (mixing console)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}} (website to be restored)
+
|{{yes|free}}
|free
+
|[http://staffwww.dcs.shef.ac.uk/people/M.Cooke/SpeechSeparationChallenge.htm download]
|
+
[http://www.sciencedirect.com/science/article/pii/S0885230809000205 paper]
|m.cooke@ikerbasque.org
+
|{{some|8.8}}
|Martin Cooke, John R. Hershey, Steven J. Rennie, "Monaural speech separation and recognition challenge," Computer, Speech and Language, 2010
+
|{{some|34}}
|18.5 min (+ 8.5h clean training data)
 
|34
 
 
|UK English
 
|UK English
|51
+
|{{no|0.05}}
|command
+
|{{no|command}}
 
|2
 
|2
 
|full
 
|full
 
|{{no}}
 
|{{no}}
|human
+
|{{yes|human}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|original
+
|{{n/s}}
 +
|{{yes|original}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!HIWIRE
|----
 
|HIWIRE
 
 
|2007
 
|2007
 
|airplane
 
|airplane
|21 h
+
|{{yes|21}}
|16000
+
|{{some|16}}
|1 (close)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|0.05 k€}}
|50
+
|[http://catalog.elra.info/product_info.php?products_id=1088&language=en purchase]
|http://catalog.elra.info/product_info.php?products_id=1088&language=en
+
[http://cvsp.cs.ntua.gr/projects/pub/HIWIRE/WebHome/HIWIRE_db_description_paper.pdf paper]
|segura@ugr.es
+
|{{yes|21}}
|J.C. Segura, T. Ehrette, A. Potamianos, D. Fohr, I. Illina, P.-A. Breton, V. Clot, R. Gemello, M. Matassoni, P. Maragos, "The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication"
+
|{{some|81}}
|21 h
 
|81
 
 
|non-native English
 
|non-native English
|133
+
|{{no|0.1}}
|command
+
|{{no|command}}
 
|1
 
|1
 +
|no
 +
|{{no}}
 +
|{{yes|human}}
 +
|{{n/s}}
 +
|{{no}}
 +
|{{some|airplane (rescaled)}}
 +
|{{yes|low}}
 +
|{{yes|original}}
 +
|{{n/s}}
 +
|{{yes}}
 +
|{{no}}
 
|{{no}}
 
|{{no}}
 +
|-
 +
!NOIZEUS
 +
|2007
 +
|public spaces
 +
|{{no|0.6}}
 +
|{{no|8}}
 +
|{{no|1}}
 
|{{no}}
 
|{{no}}
|human
+
|{{yes|free}}
 +
|[http://ecs.utdallas.edu/loizou/speech/noizeus/ download]
 +
[http://www.sciencedirect.com/science/article/pii/S0167639306001920 paper]
 +
|{{no|0.6}}
 +
|{{no|6}}
 +
|US English
 +
|{{no|0.1}}
 +
|{{some|read}}
 +
|1
 +
|no
 +
|{{some|simulated phone}}
 +
|{{yes|human}}
 
|{{n/s}}
 
|{{n/s}}
|head
+
|{{no}}
|airplane
+
|{{some|various real environments (rescaled)}}
|original
+
|{{yes|low}}
 +
|{{yes|original}}
 
|{{n/s}}
 
|{{n/s}}
|{{yes}}
 
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|{{no}}
|
+
|-
|----
+
!UT-Drive
|UT-Drive
 
 
|2007
 
|2007
 
|car
 
|car
|40 h
+
|{{yes|40}}
|25000
+
|{{yes|25}}
|5 (distant)
+
|{{yes|5}}
|2
+
|{{yes|2}}
|{{yes}}
+
|{{no|25 k$}}
|$25k with CU-Move
+
|[http://crss.utdallas.edu/ download]
|http://crss.utdallas.edu/
+
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=4290175 paper]
|john.hansen@utdallas.edu
+
|{{yes|40}}
|P. Angkititrakul, M. Petracca, A. Sathyanarayana, J.H.L. Hansen, "UTDrive: Driver Behavior and Speech Interactive Systems for In-Vehicle Environments," Intelligent Vehicles Symposium, 2007
+
|{{some|25}}
|40 h
 
|25 (more exist but not included in latest release 3.0)
 
 
|US English
 
|US English
|2.4k (but transcription is incomplete)
+
|{{some|2.4}}
|command/conversation
+
|{{yes|command, dialog}}
|1 to 2
+
|1 - 2
|conversation
+
|dialog
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|car
+
|{{yes|car}}
|headset (but problem w/ recording quality)
+
|{{yes|low}}
 +
|{{some|headset (low quality)}}
 
|{{no}}
 
|{{no}}
|{{yes}} (partial)
+
|{{some|partial}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!SASSEC, SiSEC under- determined
|----
+
|2007 - 2011
|SASSEC/SiSEC underdetermined
 
|2007-2011
 
 
|cocktail party
 
|cocktail party
|19 min
+
|{{no|0.3}}
|16000
+
|{{some|16}}
|2 (distant)
+
|{{some|2}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://sisec2011.wiki.irisa.fr/tiki-index.php?page=Underdetermined+speech+and+music+mixtures download]
|http://sisec2011.wiki.irisa.fr/tiki-index.php?page=Underdetermined+speech+and+music+mixtures
+
[http://www.sciencedirect.com/science/article/pii/S0165168411003604 paper]
|araki.shoko@lab.ntt.co.jp
+
|{{no|0.3}}
|The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936
+
|{{some|16}}
|19 min
 
|16
 
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
|read
+
|{{some|read}}
|3 or 4
+
|3 - 4
 
|full
 
|full
|reverb/real rir/simu
+
|{{yes|simulated rir, real rir, reverb}}
|loudspeaker/{{no}}
+
|{{no|no, loudspeaker}}
|fixed
+
|{{no|fixed}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|original+spatial image
+
|{{n/s}}
 +
|{{yes|original, spatial image}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!MC-WSJ-AV, PASCAL SSC2, 2012_MMA, REVERB RealData
|----
+
|2007 - 2014
|MC-WSJ-AV/PASCAL SSC2/2012_MMA/REVERB RealData
+
|speech overlap
|2007-2014
+
|{{yes|10}}
|cocktail party
+
|{{some|16}}
|10 h
+
|{{yes|8 - 40}}
|16000
+
|{{some|partial}}
|8 to 40 (distant)
+
|{{some|1.5 k$}}
|{{no}}
+
|[https://catalog.ldc.upenn.edu/LDC2014S03 purchase]
|{{yes}}
+
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=1566470 paper]
|$1 500
+
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=6639033 paper]
|https://catalog.ldc.upenn.edu/LDC2014S03
+
[http://www.cstr.ed.ac.uk/corpora/2012_MMA/ info]
|mike.lincoln@quoratetechnology.com
+
[http://scholar.google.co.uk/citations?view_op=view_citation&hl=en&user=8J_nG0wAAAAJ&citation_for_view=8J_nG0wAAAAJ:08ZZubdj9fEC video]
|M. Lincoln, I. McCowan, J. Vepa, and H. K. Maganti, The multi-channel wall street journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments, in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2005. + E. Zwyssig, F. Faubel, S. Renals and M. Lincoln, "Recognition of overlapping speech using digital MEMS microphone arrays", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013
+
[http://reverb2014.dereverberation.com/tools/REVERB_TOOLS_FOR_ASR_ver2.0.tgz HTK]
 +
[http://www.mmk.ei.tum.de/~wen/REVERB_2014/kaldi_baseline.tar.gz Kaldi]
 +
[http://reverb2014.dereverberation.com/result_se.html results]
 +
[http://reverb2014.dereverberation.com/result_asr.html results]
 
|{{dunno}}
 
|{{dunno}}
|45
+
|{{some|45}}
 
|UK English
 
|UK English
|10k
+
|{{yes|10}}
|read
+
|{{some|read}}
|1 or 2
+
|1 - 2
 
|full
 
|full
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|various
+
|{{yes|various}}
|walk
+
|{{yes|head, walk}}
|stationary background noise
+
|{{yes|stationary background}}
|headset+lapel
+
|{{no|high}}
 +
|{{some|headset, lapel}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!CENSREC-4 (Simulated)
|----
 
|CENSREC-4 (Simulated)
 
 
|2008
 
|2008
|public spaces/domestic/office/car
+
|public spaces, domestic, office, car
 
|{{dunno}}
 
|{{dunno}}
|16000
+
|{{some|16}}
|1 (distant)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://research.nii.ac.jp/src/en/CENSREC-4.html download]
|http://research.nii.ac.jp/src/en/CENSREC-4.html
+
[http://www.lrec-conf.org/proceedings/lrec2008/summaries/468.html paper]
|src@nii.ac.jp
 
|T. Nishiura et al., "Evaluation Framework for Distant-talking Speech Recognition under Reverberant Environments  Newest Part of the CENSREC Series", Proc. LREC 2008
 
 
|{{dunno}}
 
|{{dunno}}
|214
+
|{{yes|214}}
 
|Japanese
 
|Japanese
|11
+
|{{no|0.01}}
|digits
+
|{{no|digits}}
 
|1
 
|1
 +
|no
 +
|{{some|real rir}}
 +
|{{some|dummy}}
 +
|{{no|fixed}}
 
|{{no}}
 
|{{no}}
|real rir
+
|{{some|various real environments (rescaled)}}
|mouth simulator
+
|{{yes|low}}
|fixed
+
|{{yes|original}}
|{{no}}
 
|various real environments
 
|original
 
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!CENSREC-4 (Real)
|----
 
|CENSREC-4 (Real)
 
 
|2008
 
|2008
|public spaces/domestic/office/car
+
|public spaces, domestic, office, car
 
|{{dunno}}
 
|{{dunno}}
|16000
+
|{{some|16}}
|1 (distant)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://research.nii.ac.jp/src/en/CENSREC-4.html download]
|http://research.nii.ac.jp/src/en/CENSREC-4.html
+
[http://www.lrec-conf.org/proceedings/lrec2008/summaries/468.html paper]
|src@nii.ac.jp
 
|T. Nishiura et al., "Evaluation Framework for Distant-talking Speech Recognition under Reverberant Environments  Newest Part of the CENSREC Series", Proc. LREC 2008
 
 
|{{dunno}}
 
|{{dunno}}
|10
+
|{{some|10}}
 
|Japanese
 
|Japanese
|11
+
|{{no|0.01}}
|digits
+
|{{no|digits}}
 
|1
 
|1
|{{no}}
+
|no
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|various real environments
+
|{{yes|various real environments}}
|headset
+
|{{yes|low}}
 +
|{{some|headset}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!DICIT
|----
 
|DICIT
 
 
|2008
 
|2008
 
|domestic
 
|domestic
|6 h
+
|{{some|6}}
|48000
+
|{{yes|48}}
|16 (distant)
+
|{{yes|16}}
|2
+
|{{yes|2}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://shine.fbk.eu/resources/dicit-acoustic-woz-data download]
|http://shine.fbk.eu/resources/dicit-acoustic-woz-data
+
[http://www.lrec-conf.org/proceedings/lrec2008/summaries/584.html paper]
|omologo@fbk.eu
+
|{{some|1}}
|Alessio Brutti, Luca Cristoforetti, Walter Kellermann, Lutz Marquardt and Maurizio Omologo, WOZ Acoustic Data Collection for Interactive TV, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008.
 
|1 h
 
 
|{{dunno}}
 
|{{dunno}}
 
|Italian
 
|Italian
 
|{{dunno}}
 
|{{dunno}}
|command
+
|{{no|command}}
 
|4
 
|4
|{{no}}
+
|no
|reverb
+
|{{yes|reverb}}
|human
+
|{{yes|human}}
|various
+
|{{yes|various}}
|walk
+
|{{yes|head, walk}}
|domestic (scenarized)
+
|{{some|domestic (scenarized)}}
|headset+tv
+
|{{some|medium}}
 +
|{{some|headset, tv}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!SiSEC head-geometry
|----
 
|SiSEC head-geometry
 
 
|2008
 
|2008
|cocktail party
+
|speech overlap
|1.9 h
+
|{{some|1.9}}
|16000
+
|{{some|16}}
|2 (distant)
+
|{{some|2}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Head-geometry%20mixtures%20of%20two%20speech%20sources%20in%20real%20environments,%20impinging%20from%20many%20directions download]
|http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Head-geometry%20mixtures%20of%20two%20speech%20sources%20in%20real%20environments,%20impinging%20from%20many%20directions
+
[http://www.sciencedirect.com/science/article/pii/S0165168411003604 paper]
|hendrik.kayser@uni-oldenburg.de
+
|{{some|1.9}}
|The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936
 
|1.9 h
 
 
|{{dunno}}
 
|{{dunno}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
|read
+
|{{some|read}}
 
|2
 
|2
 
|full
 
|full
|real rir
+
|{{some|real rir}}
|loudspeaker
+
|{{no|loudspeaker}}
|various
+
|{{yes|various}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|original+spatial image
+
|{{n/s}}
 +
|{{yes|original, spatial image}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!COSINE
|----
 
|COSINE
 
 
|2009
 
|2009
|conversation
+
|dialog
|38 h
+
|{{yes|38}}
|48000
+
|{{yes|48}}
|20 (distant)
+
|{{yes|20}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://melodi.ee.washington.edu/cosine/ download]
|http://melodi.ee.washington.edu/cosine/
+
[http://www.sciencedirect.com/science/article/pii/S0885230811000143 paper]
|cosine@melodi.ee.washington.edu
+
|{{yes|11}}
|Alex Stupakov,  Evan Hanusa,  Deepak Vijaywargi,  Dieter Fox, and  Jeff Bilmes.  The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments.  Computer Speech and Langauge,  26:5266, 2011.
+
|{{some|91}}
|11 h
+
|US English, non-native English
|91
+
|{{some|5}}
|US/non-native English
+
|{{yes|spontaneous}}
|5k
+
|2 - 7
|colloquial
+
|dialog
|2 to 7
+
|{{yes|reverb}}
|conversation
+
|{{yes|human}}
|reverb
+
|{{yes|various}}
|human
+
|{{yes|head, walk}}
|various
+
|{{yes|various real environments}}
|walk
+
|{{yes|low}}
|various real environments
+
|{{some|headset, throat mic}}
|headset+throat mic
 
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!SiSEC real-world noise
|----
 
|SiSEC real-world noise
 
 
|2010
 
|2010
 
|public spaces
 
|public spaces
|20 min
+
|{{no|0.3}}
|16000
+
|{{some|16}}
|2 to 4 (distant)
+
|{{yes|2 - 4}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Source+separation+in+the+presence+of+real-world+background+noise download]
|http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Source+separation+in+the+presence+of+real-world+background+noise
+
[http://www.sciencedirect.com/science/article/pii/S0165168411003604 paper]
|ito.nobutaka@lab.ntt.co.jp
+
|{{no|0.3}}
|The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936
+
|{{no|6}}
|20 min
 
|6
 
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
|read
+
|{{some|read}}
|1 or 3
+
|1 - 3
 
|full
 
|full
|reverb (other room)/{{no}}
+
|{{some|no, reverb (other room)}}
|loudspeaker
+
|{{no|loudspeaker}}
|various
+
|{{yes|various}}
 
|{{no}}
 
|{{no}}
|various real environments
+
|{{some|various real environments (rescaled)}}
|original+spatial image
+
|{{yes|low}}
 +
|{{yes|original, spatial image}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!SiSEC dynamic
|----
+
|2010 - 2011
|SiSEC dynamic
 
|2010-2011
 
 
|cocktail party
 
|cocktail party
|11 min
+
|{{no|0.2}}
|16000
+
|{{some|16}}
|2 to 4 (distant)
+
|{{yes|2 - 4}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Determined+convolutive+mixtures+under+dynamic+conditions download]
|http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Determined+convolutive+mixtures+under+dynamic+conditions
+
[http://www.sciencedirect.com/science/article/pii/S0165168411003604 paper]
|francesco.nesta@gmail.com
+
|{{no|0.2}}
|The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936
 
|11 min
 
 
|{{dunno}}
 
|{{dunno}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
 
|{{n/s}}
|read
+
|{{some|read}}
|Many but only 2 simultaneous
+
|{{dunno}}
|simu
+
|full (2 at a time)
|reverb
+
|{{yes|reverb}}
|loudspeaker
+
|{{no|loudspeaker}}
|various
+
|{{yes|various}}
|simu
+
|{{some|simulated}}
 
|{{no}}
 
|{{no}}
|original+spatial image
+
|{{n/s}}
 +
|{{yes|original, spatial image}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!CHiME 1, CHiME 2 Grid
|----
+
|2011 - 2012
|CHiME 1/CHiME 2 Grid
 
|2011-2012
 
 
|domestic
 
|domestic
|70 h with some overlap
+
|{{yes|70}}
|16000
+
|{{yes|16 - 48}}
|2 (distant)
+
|{{some|2}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{yes|free}}
|free
+
|[http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/chime2_task1.html download]
|http://spandh.dcs.shef.ac.uk/chime_challenge/chime2_task1.html
+
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=6637622 paper]
|emmanuel.vincent@inria.fr
+
[http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/chime2_task1.html#tools HTK]
|Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M., "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines'' In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013, Vancouver
+
[http://spandh.dcs.shef.ac.uk/projects/chime/PCC/results.html results]
|12 h
+
[http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/track1_results.html results]
|34
+
|{{yes|12}}
 +
|{{some|34}}
 
|UK English
 
|UK English
|51
+
|{{no|0.05}}
|command
+
|{{no|command}}
 
|1
 
|1
|{{no}}
+
|no
|real rir
+
|{{some|real rir}}
|dummy
+
|{{some|dummy}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|simu
+
|{{some|simulated head}}
|domestic
+
|{{yes|domestic (added without rescaling)}}
 +
|{{yes|low}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
Line 1,247: Line 1,247:
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!CHiME 2 WSJ0
|----
 
|CHiME 2 WSJ0
 
 
|2012
 
|2012
 
|domestic
 
|domestic
|78 h with some overlap
+
|{{yes|78}}
|16000
+
|{{some|16}}
|2 (distant)
+
|{{some|2}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|free given WSJ0 (1.5 k$)}}
|WSJ0
+
|[http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/chime2_task2.html download]
|http://spandh.dcs.shef.ac.uk/chime_challenge/chime2_task2.html
+
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=6637622 paper]
|francesco.nesta@gmail.com
+
[http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/chime2_task2.html#tools HTK]
|Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M., "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines'' In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013, Vancouver
+
[http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/WSJ0public/CHiME2012-WSJ0-Kaldi_0.03.tar.gz Kaldi]
|33 h
+
[http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/track2_results.html results]
|101
+
|{{yes|33}}
 +
|{{yes|101}}
 
|US English
 
|US English
|11k
+
|{{yes|11}}
|read
+
|{{some|read}}
 
|1
 
|1
 +
|no
 +
|{{some|real rir}}
 +
|{{some|dummy}}
 +
|{{no|fixed}}
 
|{{no}}
 
|{{no}}
|real rir
+
|{{yes|domestic (added without rescaling)}}
|dummy
+
|{{yes|low}}
|fixed
 
|{{no}}
 
|domestic
 
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
Line 1,279: Line 1,279:
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!ETAPE
|----
 
|ETAPE
 
 
|2012
 
|2012
|debates, outdoor interviews, and other TV/radio broadcasts selected for large speaker overlap and/or noise
+
|TV/radio debates, outdoor interviews
|42 h
+
|{{yes|42}}
|16000
+
|{{some|16}}
|1 (mixing console)
+
|{{no|1}}
|1
+
|{{some|1}}
|{{yes}}
 
|{{dunno}}
 
 
|{{dunno}}
 
|{{dunno}}
|guillaume.gravier@irisa.fr
+
|[http://www.afcp-parole.org/etape.html download]
|Guillaume Gravier, Gilles Adda, Niklas Paulsson, Matthieu Carr, Aude Giraudel, Olivier Galibert, The ETAPE corpus for the evaluation of speech-based TV content processing in the French language, LREC 2012.
+
[http://www.lrec-conf.org/proceedings/lrec2012/summaries/495.html paper]
|32 h
+
|{{yes|32}}
|347
+
|{{yes|347}}
 
|French
 
|French
|16k
+
|{{yes|16}}
|colloquial
+
|{{yes|spontaneous}}
|1 or more (7% overlap on average, up to 10% in debates)
+
|1 or more
|conversation
+
|dialog (up to 10% overlap)
|some reverb
+
|{{yes|reverb (some)}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
|various real environments
+
|{{yes|various real environments}}
 +
|{{no|high}}
 
|{{no}}
 
|{{no}}
 
|{{n/s}}
 
|{{n/s}}
Line 1,311: Line 1,308:
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!GALE
|----
 
|GALE (Chinese broadcast conversation)
 
 
|2013
 
|2013
|conversation (TV Broadcast)
+
|TV dialog
|120 h
+
|{{yes|120 - 251 per lang}}
|16000
+
|{{some|16}}
|1 (mixing console)
+
|{{no|1}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{no|3.5 - 7 k$ per lang}}
|$2000 (audio) + $1500 (transcripts)
+
|[https://catalog.ldc.upenn.edu/search?q%5Bname_cont%5D=GALE purchase]
|https://catalog.ldc.upenn.edu/LDC2013S04
+
|{{yes|108 - 234 per lang}}
|strassel@ldc.upenn.edu
 
|
 
|108 h
 
 
|{{dunno}}
 
|{{dunno}}
|Mandarin
+
|Mandarin, Arabic
 
|{{dunno}}
 
|{{dunno}}
|colloquial
+
|{{yes|spontaneous}}
 
|1 or more
 
|1 or more
|conversation
+
|dialog
 
|{{no}}
 
|{{no}}
|human
+
|{{yes|human}}
|quasi-fixed
+
|{{some|quasi-fixed}}
|head
+
|{{yes|head}}
 
|{{no}}
 
|{{no}}
 +
|{{n/s}}
 
|{{no}}
 
|{{no}}
 
|{{n/s}}
 
|{{n/s}}
Line 1,343: Line 1,336:
 
|{{no}}
 
|{{no}}
 
|{{no}}
 
|{{no}}
|
+
|-
|
+
!REVERB SimData
|----
 
|GALE (Arabic broadcast conversation)
 
 
|2013
 
|2013
|conversation (TV Broadcast)
+
|domestic, office
|251 h
+
|{{yes|25}}
|16000
+
|{{some|16}}
|1 (mixing console)
+
|{{yes|8}}
 +
|{{no}}
 +
|{{some|free given WSJCAM0 (1.75 k$)}}
 +
|[http://reverb2014.dereverberation.com/ purchase]
 +
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=6701894 paper]
 +
[http://reverb2014.dereverberation.com/tools/REVERB_TOOLS_FOR_ASR_ver2.0.tgz HTK]
 +
[http://www.mmk.ei.tum.de/~wen/REVERB_2014/kaldi_baseline.tar.gz Kaldi]
 +
[http://reverb2014.dereverberation.com/result_se.html results]
 +
[http://reverb2014.dereverberation.com/result_asr.html results]
 +
|{{yes|25}}
 +
|{{yes|130}}
 +
|UK English
 +
|{{yes|10}}
 +
|{{some|read}}
 +
|1
 +
|no
 +
|{{some|real rir}}
 +
|{{no|loudspeaker}}
 +
|{{yes|various}}
 
|{{no}}
 
|{{no}}
 +
|{{some|random noise}}
 +
|{{no|high}}
 +
|{{yes|original, spatial image}}
 +
|{{yes}}
 
|{{yes}}
 
|{{yes}}
|2 x [$2000 (audio) + $1500 (transcripts)]
 
|https://catalog.ldc.upenn.edu/LDC2013S02
 
|strassel@ldc.upenn.edu
 
|
 
|234 h
 
|{{dunno}}
 
|Arabic
 
|{{dunno}}
 
|colloquial
 
|1 or more
 
|conversation
 
|{{no}}
 
|human
 
|quasi-fixed
 
|head
 
|{{no}}
 
 
|{{no}}
 
|{{no}}
|{{n/s}}
 
 
|{{yes}}
 
|{{yes}}
|{{no}}
+
|-
|{{no}}
+
!Sheffield Wargames Corpus
|
 
|
 
|----
 
|REVERB SimData
 
 
|2013
 
|2013
|domestic/office
+
|cocktail party
|25 h
+
|{{some|7}}
|16000
+
|{{yes|48}}
|8 (distant)
+
|{{yes|92}}
|{{no}}
+
|{{yes|3}}
|{{yes}}
+
|{{yes|free}}
|WSJCAM0
+
|[http://mini.dcs.shef.ac.uk/data-2/ download]
|http://reverb2014.dereverberation.com/
+
[http://www.isca-speech.org/archive/interspeech_2013/i13_1116.html paper]
|REVERB-challenge@lab.ntt.co.jp
+
|{{dunno}}
|Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani, Emanuel Habets, Reinhold Haeb-Umbach, Volker Leutnant, Armin Sehr, Walter Kellermann, Roland Maas, Sharon Gannot, Bhiksha Raj, "The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech", Proc. WASPAA 2013
+
|{{no|9}}
|25 h
 
|130
 
 
|UK English
 
|UK English
|10k
+
|{{dunno}}
|read
+
|{{yes|spontaneous}}
|1
+
|4
|{{no}}
+
|multiple dialogs
|real rir
+
|{{yes|reverb}}
|loudspeaker
+
|{{yes|human}}
|fixed
+
|{{yes|various}}
|{{no}}
+
|{{yes|head, walk}}
|experimental room
+
|{{yes|background music}}
|original+spatial image
+
|{{some|medium}}
 +
|{{some|headset}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{no}}
|
+
|-
|
+
!DIRHA
|----
 
|DIRHA
 
 
|2014
 
|2014
 
|domestic
 
|domestic
|3.8 h
+
|{{yes|11}}
|48000
+
|{{yes|48}}
|40 (distant)
+
|{{yes|40}}
 
|{{no}}
 
|{{no}}
|{{yes}}
+
|{{some|free (partial avail.)}}
|free
+
|[http://shine.fbk.eu/resources/dirha-ii-simulated-corpus download]
|http://shine.fbk.eu/resources/dirha-ii-simulated-corpus
+
[http://www.lrec-conf.org/proceedings/lrec2014/summaries/650.html paper]
|mravanelli@fbk.eu
+
|{{some|4}}
|Alessio Brutti, Mirco Ravanelli, Piergiorgio Svaizer, Maurizio Omologo, A speech event detection and localization task for multiroom environments, HSCMA 2014.
+
|{{some|90}}
|1.3 h
 
|30
 
|Italian, German, Greek, Portuguese
 
|various
 
 
|various
 
|various
 +
|{{some|3.8}}
 +
|{{yes|command, read, spontaneous}}
 
|1 or more
 
|1 or more
|simu
+
|simulated
|real rir
+
|{{some|real rir}}
|loudspeaker
+
|{{no|loudspeaker}}
|various
+
|{{yes|various}}
 
|{{no}}
 
|{{no}}
|domestic (sum of individual noises)
+
|{{yes|domestic (added without rescaling)}}
 +
|{{yes|low}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
 
|{{yes}}
Line 1,439: Line 1,427:
 
|{{no}}
 
|{{no}}
 
|{{yes}}
 
|{{yes}}
|
+
|-
|
+
!CHiME 3
|----
+
|2015
 +
|public spaces
 +
|{{yes|48}}
 +
|{{some|16}}
 +
|{{yes|6}}
 +
|{{no}}
 +
|{{some|free given WSJ0 (1.5 k$)}}
 +
|[http://spandh.dcs.shef.ac.uk/chime_challenge/download.html download]
 +
[https://hal.inria.fr/hal-01211376 paper]
 +
|{{yes|28}}
 +
|{{yes|113}}
 +
|US English
 +
|{{yes|11}}
 +
|{{some|read}}
 +
|1
 +
|no
 +
|{{yes|simulated, reverb}}
 +
|{{yes|human}}
 +
|{{yes|various}}
 +
|{{yes|head}}
 +
|{{yes|various real environments}}
 +
|{{yes|low}}
 +
|{{some|headset}}
 +
|{{no}}
 +
|{{yes}}
 +
|{{no}}
 +
|{{no}}
 
|}
 
|}
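
The table above is maintained as wikitext. If you would like to process it programmatically, the minimal sketch below parses the markup into (dataset, cells) records and strips the {{yes}}/{{some}}/{{no}}/{{n/s}}/{{dunno}} colour templates. It is an illustrative sketch only, not an official tool: the local file name <code>speech_datasets.wiki</code> is a placeholder, and continuation lines inside multi-link cells are simply ignored.

<syntaxhighlight lang="python">
# Illustrative parser for the wikitext table above (simplified assumptions).
# Assumes the raw wikitext has been saved locally as "speech_datasets.wiki".
import re

def strip_templates(cell):
    """Reduce {{yes|48}}, {{some|16}}, {{no}}, {{n/s}}, {{dunno}} to plain text."""
    m = re.match(r"\{\{(yes|some|no|n/s|dunno)(?:\|(.*))?\}\}$", cell)
    if m:
        return m.group(2) if m.group(2) else m.group(1)
    return cell

def parse_rows(wikitext):
    """Split the table on row separators; yield one (name, cells) pair per dataset."""
    rows = []
    for chunk in wikitext.split("\n|-"):
        lines = [l.strip() for l in chunk.splitlines() if l.strip()]
        # A dataset row starts with a header cell such as "!ShATR"; header cells
        # that contain "|" (column titles, styling) are skipped.
        name = next((l[1:].strip() for l in lines
                     if l.startswith("!") and "|" not in l), None)
        cells = [strip_templates(l[1:].strip()) for l in lines
                 if l.startswith("|") and not l.startswith("|}")]
        if name and cells:
            rows.append((name, cells))
    return rows

with open("speech_datasets.wiki", encoding="utf-8") as f:
    for name, cells in parse_rows(f.read()):
        print(name, "->", len(cells), "cells:", cells[:3], "...")
</syntaxhighlight>

Applied to the table above, each dataset row should yield 26 attribute cells in the column order defined below; lines continuing a multi-link cell (those starting with "[http") are dropped by this simplified parser.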
  
<span id="speech_attributes"></span>
'''General attributes''':
* year of release
* scenario: car, cocktail party, domestic, lecture, meeting, office, public space, TV...
* total duration (h) (multiple channels counted only once)
* sampling rate (kHz)
* number of distant or noisy microphones
* number of video cameras
* cost for non-members of ELRA and LDC (cost for members is lower or free)
* links: download data, reference papers, software baselines, evaluation results...

'''Speech attributes''':
* duration of speech (h) (overlapping speech counted only once)
* number of unique speakers
* language
* number of unique words (differs from assumed vocabulary size, which is somewhat arbitrary)
* speaking style: digits, command, read, spontaneous...
* number of speakers present in the room
* type of speaker overlap: no overlap, simulated overlap, dialogue, meeting, full overlap...

'''Channel attributes''':
* channel type: none, simulated room impulse response, convolution by a recorded room impulse response, reverberant recording...
* speaker radiation: loudspeaker, dummy head with mouth simulator, human...
* speaker location: at a fixed position in the room, at a quasi-fixed position (e.g., seated), at different positions...
* speaker movements: no movement, head movements, walking...

'''Noise attributes''':
* noise type: stationary background noise (e.g., air-conditioning), car noise, meeting noises, domestic noises, outdoor noises...
* average SNR: approximate signal-to-noise ratio of the mixtures (dB)

'''Available ground truth''':
* reference speech signal: original (at the mouth), headset or lapel (slightly differs from the signal at the mouth), spatial image (at the microphones)...
* speaker location and orientation
* words uttered
* paralinguistic attributes: nodding, gaze, communication intent, emotion... (excluding speaker attributes such as age, gender, or native language)
* noise events: type and time of individual noise events
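
To make the channel and noise attributes concrete: in simulated corpora, the spatial image of the speech is typically obtained by convolving the clean signal with a simulated or recorded room impulse response, and the noise is then either added as recorded ("added without rescaling" in the table) or rescaled so that the mixture reaches a target average SNR ("rescaled"). The following sketch illustrates this generic pipeline for single-channel signals; it is a minimal example rather than the exact recipe of any dataset above, it assumes NumPy, SciPy and the soundfile package are available, and the file names are placeholders.

<pre>
import numpy as np
import soundfile as sf                  # assumed available: pip install soundfile
from scipy.signal import fftconvolve

# Placeholder file names: any clean utterance, measured RIR and noise recording
speech, fs = sf.read("clean_utterance.wav")
rir, fs_rir = sf.read("room_impulse_response.wav")
noise, fs_noise = sf.read("background_noise.wav")
assert fs == fs_rir == fs_noise, "resample first if the sampling rates differ"

# Spatial image of the speech: clean signal convolved with the impulse response
image = fftconvolve(speech, rir)[:len(speech)]

# Rescale the noise so that the mixture reaches a target average SNR (in dB)
target_snr_db = 5.0
noise = np.resize(noise, image.shape)   # repeat or truncate to the same length
gain = np.sqrt(np.sum(image**2) / (np.sum(noise**2) * 10**(target_snr_db / 10)))
mixture = image + gain * noise

sf.write("mixture.wav", mixture, fs)
</pre>

Corpora whose noise is "added without rescaling" simply skip the gain computation and use the noise as recorded, so the SNR then varies from one recording to another.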
  
== [[Impulse response datasets]] ==
The table below provides a list of impulse response (IR) datasets with detailed attributes. The meaning of each attribute is detailed [[#ir_attributes|below]].

Disclaimer: Only datasets that are '''publicly available''' and include some '''reverberation''' (not only HRTFs) are listed.

{| class="wikitable sortable" style="font-size:72%; border:gray solid 1px; text-align:center; width:auto; table-layout:fixed;"
|-
!style="width: 40px" rowspan="2" class="unsortable"|Datasets
!colspan="7" |General attributes
!colspan="8" |Channel
!style="width: 40px" rowspan="2" |Room noise
|-
!scope="col" width="40px" | rel. year
!scope="col" width="40px" | envir.
!scope="col" width="40px" | total IRs
!scope="col" width="40px" | sam. rate (kHz)
!scope="col" width="40px" | mics
!scope="col" width="40px" | cost
!scope="col" width="40px" class="unsortable" | links
!scope="col" width="40px" | chan. type
!scope="col" width="40px" | rooms
!scope="col" width="40px" | speak. radiat.
!scope="col" width="40px" | speak. loc.
!scope="col" width="40px" | speak. moves
!scope="col" width="40px" | mic. direc.
!scope="col" width="40px" | mic. loc.
!scope="col" width="40px" | mic. moves
|-
!RWCP Real Environment Acoustic Database
|2001
|varechoic room, office
|{{some|364}}
|{{yes|16 - 48}}
|{{yes|84}}
|{{yes|free}}
|[http://research.nii.ac.jp/src/en/RWCP-SSD.html download]
[http://www.lrec-conf.org/proceedings/lrec2000/html/summary/356.htm paper]
|{{yes|real}}
|{{some|7}}
|{{some|dummy}}
|{{no|9 (far)}}
|{{yes}}
|omni
|{{no|fixed}}
|{{no}}
|{{no}}
|-
!SASSEC, SiSEC under- determined
|2007 - 2011
|office
|{{dunno}}
|{{some|16}}
|{{some|2}}
|{{yes|free}}
|[http://sisec2011.wiki.irisa.fr/tiki-index.php?page=Underdetermined+speech+and+music+mixtures download]
[http://www.sciencedirect.com/science/article/pii/S0165168411003604 paper]
|{{yes|simulated, real}}
|{{some|4}}
|{{no|no, loudspeaker}}
|{{dunno}}
|{{no}}
|omni
|{{no|fixed}}
|{{no}}
|{{no}}
|-
!SiSEC head-geometry
|2008
|office
|{{no|38}}
|{{some|16}}
|{{some|2}}
|{{some|free (partial avail.)}}
|[http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Head-geometry%20mixtures%20of%20two%20speech%20sources%20in%20real%20environments,%20impinging%20from%20many%20directions download]
[http://www.sciencedirect.com/science/article/pii/S0165168411003604 paper]
|{{yes|real}}
|{{no|1}}
|{{no|loudspeaker}}
|{{some|19 (far)}}
|{{no}}
|binaural
|{{no|fixed}}
|{{no}}
|{{no}}
|-
!Aachen Impulse Response
|2009 - 2012
|various
|{{some|214}}
|{{yes|48}}
|{{some|2}}
|{{yes|free}}
|[http://www.ind.rwth-aachen.de/de/forschung/tools-downloads/aachen-impulse-response-database/ download]
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=5201259 paper]
|{{yes|real}}
|{{some|8}}
|{{no|loudspeaker}}
|{{some|13 (far)}}
|{{no}}
|omni, binaural, phone
|{{no|fixed}}
|{{no}}
|{{no}}
|-
!CAMIL
|2010 - 2012
|office
|{{yes|32400}}
|{{some|16}}
|{{some|2}}
|{{yes|free}}
|[https://team.inria.fr/perception/the-camil-dataset/ download]
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=6637612 paper]
|{{yes|real}}
|{{no|1}}
|{{no|loudspeaker}}
|{{no|fixed}}
|{{no}}
|binaural
|{{yes|16200 (close)}}
|{{yes}}
|{{no}}
|-
!CHiME 2 Grid
|2012
|domestic
|{{some|242}}
|{{yes|16 - 48}}
|{{some|2}}
|{{yes|free}}
|[http://spandh.dcs.shef.ac.uk/chime_challenge/chime2_task1.html download]
[http://ieeexplore.ieee.org/xpl/login.jsp?arnumber=6637622 paper]
|{{yes|real}}
|{{no|1}}
|{{some|dummy}}
|{{yes|121 (close)}}
|{{no}}
|binaural
|{{no|fixed}}
|{{no}}
|{{yes}}
|-
!AVASM
|2013
|office
|{{some|864}}
|{{some|16}}
|{{some|2}}
|{{yes|free}}
|[http://perception.inrialpes.fr/~Deleforge/AVASM_Dataset/ download]
[http://www.eurasip.org/Proceedings/Eusipco/Eusipco2014/HTML/papers/1569923293.pdf paper]
|{{yes|real}}
|{{no|1}}
|{{no|loudspeaker}}
|{{yes|432 (close)}}
|{{no}}
|binaural
|{{no|fixed}}
|{{no}}
|{{no}}
|-
!DIRHA
|2014
|domestic
|{{yes|9200}}
|{{yes|48}}
|{{yes|40}}
|{{some|free (partial avail.)}}
|[http://shine.fbk.eu/resources/dirha-ii-simulated-corpus download]
[http://www.lrec-conf.org/proceedings/lrec2014/summaries/650.html paper]
|{{yes|real}}
|{{some|5}}
|{{no|loudspeaker}}
|{{some|57 (far)}}
|{{no}}
|omni
|{{no|fixed}}
|{{no}}
|{{yes}}
|-
!ACE
|2015
|office, meeting, lecture, lobby
|{{some|700}}
|{{yes|48}}
|{{yes|50}}
|{{yes|free}}
|[http://www.ace-challenge.org download]
[http://www.ace-challenge.org paper]
|{{yes|real}}
|{{some|7}}
|{{no|loudspeaker}}
|{{no|fixed}}
|{{no}}
|omni, laptop, mobile, cruciform, linear, spherical
|{{some|2 (near, far)}}
|{{no}}
|{{yes|ambient, live babble, fan}}
|}
  
<span id="ir_attributes"></span>
'''General attributes''':
* year of release
* recording environment: car, domestic, lecture, meeting, office, public space...
* total IRs: total number of single-channel impulse responses
* sampling rate (kHz)
* number of microphones
* cost
* links: download data, reference papers, software baselines, evaluation results...

'''Channel attributes''':
* channel type: simulated or real impulse response
* number of rooms
* speaker radiation: loudspeaker, mouth simulator
* speaker location: at a fixed position in the room, or number of different positions (closely spaced or far)
* speaker movements: no movement, moves while recording
* microphone directivity: omnidirectional, cardioid, binaural...
* microphone location: at a fixed position in the room, or number of different positions (closely spaced or far)
* microphone movements: no movement, moves while recording

'''Noise attributes''':
* room noise: background noise recorded in the same room as the impulse responses
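
A common first step when working with an IR dataset is to characterize each room, for instance by estimating the reverberation time (RT60) from a measured impulse response using Schroeder backward integration. The sketch below illustrates this standard method on a single-channel IR; it assumes NumPy and the soundfile package, the file name is a placeholder, and a real measurement would typically be bandpass filtered and compensated for the noise floor first.

<pre>
import numpy as np
import soundfile as sf                  # assumed available: pip install soundfile

rir, fs = sf.read("room_impulse_response.wav")   # placeholder single-channel IR

# Schroeder backward integration: energy decay curve (EDC) in dB
edc = np.cumsum(rir[::-1]**2)[::-1]
edc_db = 10 * np.log10(edc / edc[0])

# T30-style estimate: fit the -5 dB to -35 dB part of the decay,
# then extrapolate the fitted slope to -60 dB
t = np.arange(len(rir)) / fs
fit = (edc_db <= -5) & (edc_db >= -35)
slope, intercept = np.polyfit(t[fit], edc_db[fit], 1)  # slope in dB per second
rt60 = -60.0 / slope
print("estimated RT60: %.2f s" % rt60)
</pre>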
  
== [[Text datasets]] ==

== [[Other datasets]] ==
This section lists all other relevant datasets that have not been annotated or made publicly available yet.

Speech datasets:
* [http://www.iarpa.gov/index.php/research-programs/babel BABEL] (not yet available)
* [https://catalog.ldc.upenn.edu/search?q%5Bname_cont%5D=HUB4 Broadcast news, HUB4] (no noise and 4.5% speaker overlap, less than ETAPE)
* [http://www.isca-speech.org/archive/interspeech_2004/i04_2789.html CIAIR In-Car Speech Database] (availability unknown)
* [http://bme.ccny.cuny.edu/faculty/parra/bss/ Dyrholm/Sawada/Parra] (about 1 min long)
* [http://www.ee.columbia.edu/~dpwe/pubs/EllisSC14-proximity.pdf NEMISIG] (unavailable)
* [http://cs.uef.fi/odyssey2014/program/pdfs/21.pdf NFI-FRITS] (unavailable)
* [http://www.darpa.mil/Our_Work/I2O/Programs/Robust_Automatic_Transcription_of_Speech_%28RATS%29.aspx RATS] (not yet available)
* Rich Transcription (RT) (dataset gathered from other sets, e.g. CHIL, ICSI, ISL, AMI...)
* [http://scholar.google.co.uk/citations?view_op=view_citation&hl=en&user=8J_nG0wAAAAJ&citation_for_view=8J_nG0wAAAAJ:08ZZubdj9fEC Settlers of Catan] (unannotated, [http://meetingdiarisation.wordpress.com/2013/05/09/ready-for-recording-settlers-of-cattan-with-the-dmma-2-and-dmma-3/ more info])
* [http://scholar.google.co.uk/citations?view_op=view_citation&hl=en&user=8J_nG0wAAAAJ&citation_for_view=8J_nG0wAAAAJ:08ZZubdj9fEC Flying MEMS microphone array] (unannotated, [http://meetingdiarisation.wordpress.com/2014/08/11/flying-digital-mems-microphone-array-dmma-3/ more info])
 
== Contribute a dataset ==
To contribute a new dataset, please
* [[Main_Page#Contribute|create an account]] and login
* go to the section above corresponding to your type of dataset; if the table does not exist yet, you may create it
* click on the "Edit" link at the top of the table and add a new line for your dataset (the lines are ordered by year of release; see the example row sketched below)
* fill in all columns as much as possible, following the detailed list of attributes below the table
* click on the "Save page" link at the bottom of the page to save your modifications

We currently cannot provide storage space for large datasets. Please upload the dataset at a stable URL on the website of your institution or elsewhere and provide its URL only. If this is not possible, please contact the [[Main_Page#Working group contacts|resources sharing working group]].
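
As an illustration, a new line for the speech table could look as follows in the wiki source, with one cell per column in the column order of that table, using the yes/some/no templates to indicate how favorable each attribute is. All names, values and URLs in this sketch are placeholders, not a real entry:

<pre>
|-
!MyNewCorpus
|2015
|meeting
|{{some|10}}
|{{yes|48}}
|{{some|8}}
|{{no}}
|{{yes|free}}
|[http://example.org/mynewcorpus download]
[http://example.org/mynewcorpus paper]
|{{some|8}}
|{{some|20}}
|UK English
|{{some|5}}
|{{yes|spontaneous}}
|2 - 4
|meeting
|{{yes|reverb}}
|{{yes|human}}
|{{yes|various}}
|{{yes|head, walk}}
|{{yes|meeting}}
|{{some|medium}}
|{{some|headset}}
|{{no}}
|{{yes}}
|{{no}}
|{{no}}
</pre>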
  
== Contribute a software baseline ==
To contribute a new software baseline, please
* [[Main_Page#Contribute|create an account]] and login
* fill in an entry for your software on the [[Software]] page, if not done yet
* go to the section above corresponding to the dataset for which your baseline was designed
* click on the "Edit" link at the top of the table and add a link to your software in the corresponding "links" cell
* click on the "Save page" link at the bottom of the page to save your modifications

We currently cannot provide storage space for large software. Please upload your software at a stable URL on the website of your institution or elsewhere and provide its URL only. If this is not possible, please contact the [[Main_Page#Working group contacts|resources sharing working group]].

== Contribute an evaluation result ==
To contribute a new evaluation result, please
* [[Main_Page#Contribute|create an account]] and login
* go to the section above corresponding to the dataset for which this result was obtained
* click on the "Edit" link at the top of the table and add a link to your result in the corresponding "links" cell
* make sure that the linked page (e.g., a paper or another webpage) contains the following information: authors, a link to a paper/report containing objective evaluation results, and a link to derived data (output transcriptions, intermediary data, etc)
* click on the "Save page" link at the bottom of the page to save your modifications

In order to save storage space, please do not upload the paper on this wiki, but link it as much as possible from your institutional archive, from another public archive (e.g., arXiv) or from the publisher website (e.g., IEEE Xplore).
