Datasets
Revision as of 20:23, 6 August 2014

This page aims to provide a list of datasets with detailed attributes and links to corresponding research results (papers, numerical results, output transcriptions, intermediate data, etc.). Each dataset may be used for one or more applications: automatic speech recognition, speaker identification and verification, source localization, speech enhancement and separation, etc.

Disclaimer: Only publicly available datasets with a total duration longer than 5 min are listed.

{| class="wikitable sortable" style="font-size:85%; border:gray solid 1px; text-align:center; width:auto; table-layout:fixed;"
!style="width: 50px" rowspan="2" |Datasets
!colspan="10" |Data
!colspan="7" |Speech
!colspan="4" |Channel
!Noise
!colspan="5" |Ground truth
|-
!scope="col" width="50px" | release
!scope="col" width="50px" | scenario
!scope="col" width="50px" | total duration
!scope="col" width="50px" | sampling rate
!scope="col" width="50px" | mixture channels
!scope="col" width="50px" | cameras
!scope="col" width="50px" | cost
!scope="col" width="50px" | download
!scope="col" width="50px" | email
!scope="col" width="50px" | reference paper
!scope="col" width="50px" | speech duration
!scope="col" width="50px" | unique speakers
!scope="col" width="50px" | language
!scope="col" width="50px" | unique words
!scope="col" width="50px" | speaking style
!scope="col" width="50px" | simultaneous speakers
!scope="col" width="50px" | speaker overlap
!scope="col" width="50px" | channel type
!scope="col" width="50px" | radiation
!scope="col" width="50px" | speaker location
!scope="col" width="50px" | speaker movements
!scope="col" width="50px" | noise type
!scope="col" width="50px" | speech signal
!scope="col" width="50px" | speaker location and orientation
!scope="col" width="50px" | words
!scope="col" width="50px" | nonverbal traits
!scope="col" width="50px" | noise events
|-
!ShATR
|1994||meeting||37 min||48000||3 (distant)||{{no}}||free||http://spandh.dcs.shef.ac.uk/projects/shatrweb/||g.brown@dcs.shef.ac.uk||Malcolm Crawford, Guy J. Brown, Martin Cooke and Phil Green, "Design, collection and analysis of a multi-simultaneous-speaker corpus," Proceedings of The Institute of Acoustics, 16(5):183-190
|37 min||5||UK English||1k||colloquial||5||multiple conversations
|reverb||human||quasi-fixed||head
|meeting
|headset||{{yes}}||{{yes}}||{{no}}||{{yes}}
|-
!LLSEC
|1996||conversation||1.4 h||16000||4 (distant)||{{no}}||free||https://www.ll.mit.edu/mission/cybersec/HLT/corpora/SpeechCorpora.html||jpc@ll.mit.edu||?
|?||12||N/S||N/S||read/colloquial||2||conversation
|reverb||human||quasi-fixed||head
|hallway, restaurant
|{{no}}||{{yes}}||{{no}}||{{no}}||{{no}}
|-
!RWCP Spoken Dialog Corpus
|1996-1997||conversation||10 h||16000||2 (close but cross-talk)||{{no}}||free||http://research.nii.ac.jp/src/en/RWCP-SP96.html||src@nii.ac.jp||Kazuyo Tanaka, Satoru Hayamizu, Yoichi Yamashita, Kiyohiro Shikano, Shuichi Itahashi and Ryuichi Oka, "Design and data collection for a spoken dialog database in the Real World Computing (RWC) program," J. Acoust. Soc. Am. 100, 2759 (1996)
|10 h||39||Japanese||?||colloquial||1 or 2||conversation
|reverb||human||quasi-fixed||head
|stationary background noise
|{{no}}||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!Aurora-2
|2000||public spaces||33 h||8000-16000||1 (close)||{{no}}||TIDigits||http://aurora.hsnr.de/download.html||hans-guenter.hirsch@hs-niederrhein.de||Hans-Günter Hirsch, David Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," Proc. Interspeech 2000
|33 h||214||US English||11 digits|| ||1||no
|no (simulated telephone channel)||human||N/S||no
|various real environments
|original||N/S||{{yes}}||{{no}}||{{yes}}
|-
!SPINE1/SPINE2
|2000-2001||military||38 h||16000||2 (close)||{{no}}||2 x ($800 (audio) + $500 (transcripts)) + 3 x ($1000 (audio) + $600 (transcripts))||https://catalog.ldc.upenn.edu/LDC2000S87||jdwright@ldc.upenn.edu||T.H. Crystal et al., "Speech in noisy environments (SPINE) adds new dimension to speech recognition R&D," Proc. HLT 2002
|?||100||US English||1k||command/colloquial||1 or 2||no
|no (simulated transmission channels)||human||quasi-fixed||head
|military (pre-recorded noise played in sound booth while recording speech)
|{{no}}||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!Aurora-3 (subset of SpeechDat-Car)
|2000-2003||car||?||16000||3 (+1 GSM) (distant)||{{no}}||5 x 200 € (Academics) / 5 x 1,000 € (Companies)||http://catalog.elra.info/index.php?cPath=37_40||?||?
| || ||Finnish, German, Spanish, Danish, Italian||?||command (read/digits/keywords/spontaneous)||1||no
|reverb||human||quasi-fixed||head
|car
|close-talk||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!RWCP Meeting Speech Corpus
|2001||meeting||3.5 h||16000-48000||1 (distant)||3||free||http://research.nii.ac.jp/src/en/RWCP-SP01.html||src@nii.ac.jp||Kazuyo Tanaka, Katunobu Itou, Masanori Ihara, Ryuichi Oka, "Constructing a Meeting Speech Corpus," IPSJ, 37-15, 2001
|3.5 h||?||Japanese||?||colloquial||1 to 5||meeting
|low reverb||human||quasi-fixed||head
|stationary background noise
|headset||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!RWCP Real Environment Speech and Acoustic Database
|2001||domestic/office||?||16000-48000||30 (distant)||{{no}}||free||http://research.nii.ac.jp/src/en/RWCP-SSD.html||s-nakamura@is.naist.jp||Satoshi Nakamura, Kazuo Hiyane, Futoshi Asano, Takanobu Nishiura, and Takeshi Yamada, "Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition," Proc. LREC 2000
|?||5||Japanese||?||read||1||no
|real rir/reverb||loudspeaker||various||no/pivoting arm
|stationary background noise
|original||{{yes}}||{{yes}}||{{no}}||{{yes}}
|-
!SpeechDat-Car
|2001-2011||car||?||16000||3 (+1 GSM) (distant)||{{no}}||1.1 Million € for all 10 languages; each costs 39k € to 182k €||http://catalog.elra.info/index.php?cPath=37_41|| ||A. Moreno et al., "SPEECHDAT-CAR. A Large Speech Database for Automotive Environments," Proc. LREC 2000
|?||300/language||Multiple||?||command (read/digits/keywords/spontaneous)||1||no
|reverb||human||quasi-fixed||head
|car
|close-talk||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!Aurora-4
|2002||public spaces||?||8000-16000||1 (close)||{{no}}||WSJ0||http://aurora.hsnr.de/download.html||hans-guenter.hirsch@hs-niederrhein.de||N. Parihar and J. Picone, "Aurora Working Group: DSR Front End LVCSR Evaluation AU/384/02," Tech. Rep., Institute for Signal and Information Processing, Mississippi State University, 2002
|?||101||US English||10k||read||1||no
|no (simulated telephone channel)||human||N/S||no
|various real environments
|original||N/S||{{yes}}||{{no}}||{{yes}}
|-
!TED
|2002||seminar||47 h||16000||1 (distant)||{{no}}||$275 (audio) + $250 (transcripts)||https://catalog.ldc.upenn.edu/LDC2002S04|| ||L. Lamel, F. Schiel, A. Fourcin, J. Mariani, and H. Tillman, "The translingual English database (TED)," Proc. ICSLP, 1994
|47 h||188||English (mostly non-native)||?||lecture||1 or more||seminar
|reverb||human||quasi-fixed||head
|stationary background noise
|lapel||{{no}}||{{yes}} (partial)||{{no}}||{{no}}
|-
!CUAVE
|2002||cocktail party||3 h||44100||1 (distant)||1||free||http://www.clemson.edu/ces/speech/cuave.htm||ksampat@clemson.edu||Eric K. Patterson, Sabri Gurbuz, Zekeriya Tufekci and John N. Gowdy, "Moving-Talker, Speaker-Independent Feature Study, and Baseline Results Using the CUAVE Multimodal Speech Corpus," EURASIP Journal on Advances in Signal Processing 2002, 2002:208541
|3 h||36||US English||10 digits|| ||1 or 2||full
|reverb||human||quasi-fixed||head
|stationary background noise
|{{no}}||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!CU-Move ("Microphone Array Data"; downsampled data with more speakers but fewer channels exist)
|2002-2011||car||286 h||44100||6 to 8 (distant)||{{no}}||$25k with UT-Drive||http://crss.utdallas.edu/||john.hansen@utdallas.edu||John H.L. Hansen, Pongtep Angkititrakul, Jay Plucienkowski, Stephen Gallant, Umit Yapanel, Bryan Pellom, Wayne Ward, and Ron Cole, ""CU-Move": Analysis & Corpus Development for Interactive In-Vehicle Speech Systems," Interspeech 2001
|286 h||172||US English||12k||command/digits/read/dialogue||1||no
|reverb||human||quasi-fixed||head
|car
|{{no}}||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!CENSREC-1 (Aurora-2J)
|2003||public spaces||?||8000||1 (close)||{{no}}||free||http://research.nii.ac.jp/src/en/CENSREC-1.html|| ||S. Nakamura, K. Takeda, K. Yamamoto, T. Yamada, S. Kuroiwa, N. Kitaoka, T. Nishiura, A. Sasou, M. Mizumachi, C. Miyajima, M. Fujimoto, and T. Endo, "Aurora-2J, an evaluation framework for Japanese noisy speech recognition," IEICE Transactions on Information and Systems, vol. E88-D, no. 3, pp. 535-544, 2005
| ||214||Japanese||11 digits|| ||1||no
|various microphones and simulated channels||human||N/S||no
|various real environments
|original||N/S||{{yes}}||{{no}}||{{yes}}
|-
!AVICAR
|2004||car||29 h||16000||7 (distant)||4||free||http://www.isle.illinois.edu/sst/AVICAR/||jhasegaw@illinois.edu||Bowon Lee, Mark Hasegawa-Johnson, Camille Goudeseune, Suketu Kamdar, Sarah Borys, Ming Liu, Thomas Huang, "AVICAR: Audio-Visual Speech Corpus in a Car Environment," Proc. Interspeech, 2004
|29 h||86||US/non-native English||1k||read||1||no
|reverb||human||quasi-fixed||head
|car
|{{no}}||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!AV16.3
|2004||meeting||1.5 h||16000||16 (distant)||3||free||http://www.idiap.ch/dataset/av16-3/||odobez@idiap.ch||Guillaume Lathoud, Jean-Marc Odobez and Daniel Gatica-Perez, "AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking," Proc. MLMI'04 Workshop, 2004
|1.5 h||12||N/S||N/S||colloquial||1 to 3||full
|reverb||human||various||walk
|stationary background noise
|{{no}}||{{yes}}||{{no}}||{{no}}||{{no}}
|-
!ICSI Meeting Corpus
|2004||meeting||72 h||16000||6 (distant)||{{no}}||$1900 (audio) + $900 (transcripts)||https://catalog.ldc.upenn.edu/LDC2004S02||mrcontact@icsi.berkeley.edu||A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, C. Wooters, "The ICSI meeting corpus," Proc. ICASSP, Apr. 2003
|72 h||53||US English||13k||meeting||3 to 10||meeting
|reverb||human||quasi-fixed||head
|stationary background noise
|headset (some lapel)||{{no}}||{{yes}}||{{yes}}||{{no}}
|-
!NIST Meeting Pilot Corpus Speech
|2004||meeting||15 h||16000||7 (distant)||{{no}} (released but not currently available for download)||$4000 (audio) + $1500 (transcripts)||https://catalog.ldc.upenn.edu/LDC2004S09||john.garofolo@nist.gov||John S. Garofolo, Christophe D. Laprun, Martial Michel, Vincent M. Stanford and Elham Tabassi, "The NIST Meeting Room Pilot Corpus," Proc. LREC, 2004
|15 h||61||US English||6k||meeting||3 to 9||meeting
|reverb||human||various||walk
|stationary background noise
|headset+lapel||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!CHIL Meetings
|2004-2007||seminar/meeting||60 h||44100||79 to 147 (distant)||6 to 9||3 500 €||http://catalog.elra.info/search.php||choukri@elda.org||D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. Chu, A. Tyagi, J. Casas, J. Turmo, L. Cristoforetti, F. Tobia, A. Pnevmatikakis, V. Mylonakis, F. Talantzis, S. Burger, R. Stiefelhagen, K. Bernardin, C. Rochet, "The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms," Language Resources and Evaluation, vol. 41, no. 3-4, pp. 389-407, 2007
|?||?||non-native English||?||lecture/meeting||3 to 20||seminar/meeting
|reverb||human||quasi-fixed||head
|meeting (scenarized)
|headset||{{yes}}||{{yes}}||{{yes}}||{{no}}
|-
!SPEECON
|2004-2011||public space/domestic/office/car||?||16000||3 (distant)||{{no}}||29 x 75000 € for all languages||http://catalog.elra.info/index.php?cPath=37||diskra@appen.com||Dorota Iskra, Beate Grosskopf, Krzysztof Marasek, Henk van den Heuvel, Frank Diehl, Andreas Kiessling, "SPEECON - Speech Databases for Consumer Devices: Database Specification and Validation," Proc. LREC, pp. 329-333, 2002
|?||600/language||Multiple||?||command/read/spontaneous||1||no
|reverb||human||quasi-fixed||head
|various real environments
|headset||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!CENSREC-2
|2005||car||?||16000||1 (distant)||{{no}}||free||http://research.nii.ac.jp/src/en/CENSREC-2.html||src@nii.ac.jp||S. Nakamura, M. Fujimoto, and K. Takeda, "CENSREC-2: Corpus and evaluation environments for in-car continuous digit speech recognition," Proc. ICSLP 2006
|?||214||Japanese||11 digits|| ||1||no
|reverb||human||quasi-fixed||head
|car
|headset||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!CENSREC-3
|2005||car||?||16000||1 (distant)||{{no}}||free except phonetically balanced training set: JPY 21000 (Universities) / JPY 105000 (Companies)||http://research.nii.ac.jp/src/en/CENSREC-3.html||src@nii.ac.jp||M. Fujimoto, K. Takeda, and S. Nakamura, "CENSREC-3: An evaluation framework for Japanese speech recognition in real driving-car environments," IEICE Transactions on Information and Systems, vol. E89-D, no. 11, pp. 2783-2793, 2006
|?||18 (+293 in training)||Japanese||50 in evaluation; unknown but larger in phonetically balanced utterances of training set||read||1||no
|reverb||human||quasi-fixed||head
|car
|headset||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!Aurora-5
|2006||public spaces/domestic/office/car||?||8000||1 (distant)||{{no}}||TIDigits||http://aurora.hsnr.de/download.html||hans-guenter.hirsch@hs-niederrhein.de||Hans-Günter Hirsch, "Aurora-5 experimental framework for the performance evaluation of speech recognition in case of a hands-free speech input in noisy environments," Tech. Report, Niederrhein Univ. of Applied Sciences, 2007
|?||225||US English||11 digits|| ||1||no
|real rir/simu/no + simulated telephone channel||loudspeaker||N/S||no
|various real environments
|original||{{no}}||{{yes}}||{{no}}||{{yes}}
|-
!AMI
|2006||meeting||100 h||16000||16 (distant)||6||free||http://groups.inf.ed.ac.uk/ami/||amicorpus@amiproject.org||Steve Renals, Thomas Hain, and Hervé Bourlard, "Interpretation of multiparty meetings: the AMI and AMIDA projects," IEEE Workshop on Hands-Free Speech Communication and Microphone Arrays (HSCMA), pp. 115-118, 2008
|?||189||UK English||8k||meeting||4 (18% overlap)||meeting
|reverb||human||quasi-fixed||head
|stationary background noise
|headset+lapel||{{yes}}||{{yes}}||{{yes}}||{{no}}
|-
!PASCAL SSC
|2006||cocktail party||18.5 min (+ 8.5 h clean training data)||25000||1 (mixing console)||{{no}}||free||(website to be restored)||m.cooke@ikerbasque.org||Martin Cooke, John R. Hershey, Steven J. Rennie, "Monaural speech separation and recognition challenge," Computer Speech and Language, 2010
|18.5 min (+ 8.5 h clean training data)||34||UK English||51||command||2||full
|no||human||N/S||no
|no
|original||N/S||{{yes}}||{{no}}||{{no}}
|-
!HIWIRE
|2007||airplane||21 h||16000||1 (close)||{{no}}||50 €||http://catalog.elra.info/product_info.php?products_id=1088&language=en||segura@ugr.es||J.C. Segura, T. Ehrette, A. Potamianos, D. Fohr, I. Illina, P.-A. Breton, V. Clot, R. Gemello, M. Matassoni, P. Maragos, "The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication"
|21 h||81||non-native English||133||command||1||no
|no||human||N/S||head
|airplane
|original||N/S||{{yes}}||{{no}}||{{no}}
|-
!UT-Drive
|2007||car||40 h||25000||5 (distant)||2||$25k with CU-Move||http://crss.utdallas.edu/||john.hansen@utdallas.edu||P. Angkititrakul, M. Petracca, A. Sathyanarayana, J.H.L. Hansen, "UTDrive: Driver Behavior and Speech Interactive Systems for In-Vehicle Environments," Intelligent Vehicles Symposium, 2007
|40 h||25 (more exist but not included in latest release 3.0)||US English||2.4k (but transcription is incomplete)||command/conversation||1 to 2||conversation
|reverb||human||quasi-fixed||head
|car
|headset (but problem w/ recording quality)||{{no}}||{{yes}} (partial)||{{no}}||{{no}}
|-
!SASSEC/SiSEC underdetermined
|2007-2011||cocktail party||19 min||16000||2 (distant)||{{no}}||free||http://sisec2011.wiki.irisa.fr/tiki-index.php?page=Underdetermined+speech+and+music+mixtures||araki.shoko@lab.ntt.co.jp||Emmanuel Vincent, Shoko Araki, Fabian J. Theis, Guido Nolte, Pau Bofill, Hiroshi Sawada, Alexey Ozerov, B. Vikrham Gowreesunker, Dominik Lutter, Ngoc Duong, "The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges," Signal Processing, Elsevier, vol. 92, pp. 1928-1936, 2012
|19 min||16||N/S||N/S||read||3 or 4||full
|reverb/real rir/simu||no||fixed||no
|no
|original+spatial image||{{yes}}||{{no}}||{{no}}||{{no}}
|-
!MC-WSJ-AV/PASCAL SSC2/2012_MMA/REVERB RealData
|2007-2014||cocktail party||10 h||16000||8 to 40 (distant)||{{no}}||$1 500||https://catalog.ldc.upenn.edu/LDC2014S03||mike.lincoln@quoratetechnology.com||M. Lincoln, I. McCowan, J. Vepa, and H. K. Maganti, "The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): specification and initial experiments," Proc. ASRU, 2005 + E. Zwyssig, F. Faubel, S. Renals and M. Lincoln, "Recognition of overlapping speech using digital MEMS microphone arrays," Proc. ICASSP, 2013
|?||45||UK English||10k||read||1 or 2||full
|reverb||human||various||walk
|stationary background noise
|headset+lapel||{{yes}}||{{yes}}||{{no}}||{{no}}
|-
!CENSREC-4 (Simulated)
|2008||public spaces/domestic/office/car||?||16000||1 (distant)||{{no}}||free||http://research.nii.ac.jp/src/en/CENSREC-4.html||src@nii.ac.jp||T. Nishiura et al., "Evaluation Framework for Distant-talking Speech Recognition under Reverberant Environments - Newest Part of the CENSREC Series," Proc. LREC 2008
|?||214||Japanese||11 digits|| ||1||no
|real rir||mouth simulator||fixed||no
|various real environments
|original||{{no}}||{{yes}}||{{no}}||{{yes}}
|-
!CENSREC-4 (Real)
|2008||public spaces/domestic/office/car||?||16000||1 (distant)||{{no}}||free||http://research.nii.ac.jp/src/en/CENSREC-4.html||src@nii.ac.jp||T. Nishiura et al., "Evaluation Framework for Distant-talking Speech Recognition under Reverberant Environments - Newest Part of the CENSREC Series," Proc. LREC 2008
|?||10||Japanese||11 digits|| ||1||no
|reverb||human||quasi-fixed||head
|various real environments
|headset||{{no}}||{{yes}}||{{no}}||{{yes}}
|-
!DICIT
|2008||domestic||6 h||48000||16 (distant)||2||free||http://shine.fbk.eu/resources/dicit-acoustic-woz-data||omologo@fbk.eu||Alessio Brutti, Luca Cristoforetti, Walter Kellermann, Lutz Marquardt and Maurizio Omologo, "WOZ Acoustic Data Collection for Interactive TV," Proc. LREC'08, 2008
|1 h||?||Italian||?||command||4||no
|reverb||human||various||walk
|domestic (scenarized)
|headset+tv||{{yes}}||{{yes}}||{{no}}||{{yes}}
|-
!SiSEC head-geometry
|2008||cocktail party||1.9 h||16000||2 (distant)||{{no}}||free||http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Head-geometry%20mixtures%20of%20two%20speech%20sources%20in%20real%20environments,%20impinging%20from%20many%20directions||hendrik.kayser@uni-oldenburg.de||Emmanuel Vincent, Shoko Araki, Fabian J. Theis, Guido Nolte, Pau Bofill, Hiroshi Sawada, Alexey Ozerov, B. Vikrham Gowreesunker, Dominik Lutter, Ngoc Duong, "The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges," Signal Processing, Elsevier, vol. 92, pp. 1928-1936, 2012
|1.9 h||?||N/S||N/S||read||2||full
|real rir||loudspeaker||various||no
|no
|original+spatial image||{{yes}}||{{no}}||{{no}}||{{no}}
|-
!COSINE
|2009||conversation||38 h||48000||20 (distant)||{{no}}||free||http://melodi.ee.washington.edu/cosine/||cosine@melodi.ee.washington.edu||Alex Stupakov, Evan Hanusa, Deepak Vijaywargi, Dieter Fox, and Jeff Bilmes, "The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments," Computer Speech and Language, 26:52-66, 2011
|11 h||91||US/non-native English||5k||colloquial||2 to 7||conversation
|reverb||human||various||walk
|various real environments
|headset+throat mic||{{no}}||{{yes}}||{{no}}||{{no}}
|-
!SiSEC real-world noise
|2010|| || || ||2 to 4 (distant)||{{no}}||free||http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Source+separation+in+the+presence+of+real-world+background+noise|| ||
| || || || || || ||
| || || ||
|
| || || ||{{no}}||{{no}}
|-
!SiSEC dynamic
|2010-2011|| || || ||2 to 4 (distant)||{{no}}||free||http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Determined+convolutive+mixtures+under+dynamic+conditions|| ||
| || || || || || ||
| || || ||
|
| || || ||{{no}}||{{no}}
|-
!CHiME 1/CHiME 2 Grid
|2011-2012|| || || ||2 (distant)||{{no}}||free||http://spandh.dcs.shef.ac.uk/chime_challenge/chime2_task1.html|| ||
| || || || || || ||
| || || ||
|
| || || ||{{no}}||{{no}}
|-
!CHiME 2 WSJ0
|2012|| || || ||2 (distant)||{{no}}||WSJ0||http://spandh.dcs.shef.ac.uk/chime_challenge/chime2_task2.html|| ||
| || || || || || ||
| || || ||
|
| || || ||{{no}}||{{no}}
|-
!ETAPE
|2012|| || || ||1 (mixing console)||1||{{dunno}}||{{dunno}}|| ||
| || || || || || ||
| || || ||
|
| || || ||{{no}}||{{yes}}
|-
!GALE (Chinese broadcast conversation)
|2013|| || || ||1 (mixing console)||{{no}}||$2000 (audio) + $1500 (transcripts)||https://catalog.ldc.upenn.edu/LDC2013S04|| ||
| || || || || || ||
| || || ||
|
| || || ||{{no}}||{{no}}
|-
!GALE (Arabic broadcast conversation)
|2013|| || || ||1 (mixing console)||{{no}}||2 x [$2000 (audio) + $1500 (transcripts)]||https://catalog.ldc.upenn.edu/LDC2013S02|| ||
| || || || || || ||
| || || ||
|
| || || ||{{no}}||{{no}}
|-
!REVERB SimData
|2013|| || || ||8 (distant)||{{no}}||WSJCAM0||http://reverb2014.dereverberation.com/|| ||
| || || || || || ||
| || || ||
|
| || || ||{{no}}||{{yes}}
|-
!DIRHA
|2014|| || || ||40 (distant)||{{no}}||free||http://shine.fbk.eu/resources/dirha-ii-simulated-corpus|| ||
| || || || || || ||
| || || ||
|
| || || ||{{no}}||{{yes}}
|}
SiSEC real-world noise 2010 public spaces 20 min 16000 2 to 4 (distant) no free http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Source+separation+in+the+presence+of+real-world+background+noise ito.nobutaka@lab.ntt.co.jp The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936 20 min 6 N/S N/S read 1 or 3 full no loudspeaker various no various real environments original+spatial image yes no no no
SiSEC dynamic 2010-2011 cocktail party 11 min 16000 2 to 4 (distant) no free http://sisec2010.wiki.irisa.fr/tiki-index.php?page=Determined+convolutive+mixtures+under+dynamic+conditions francesco.nesta@gmail.com The Signal Separation Evaluation Campaign (2007-2010): Achievements and Remaining Challenges, Emmanuel Vincent; Shoko Araki; Fabian J. Theis; Guido Nolte; Pau Bofill; Hiroshi Sawada; Alexey Ozerov; B. Vikrham Gowreesunker; Dominik Lutter; Ngoc Duong, Signal Processing, Elsevier, 2012, 92, pp. 1928-1936 11 min ? N/S N/S read Many but only 2 simultaneous simu reverb loudspeaker various simu no original+spatial image yes no no no
CHiME 1/CHiME 2 Grid 2011-2012 domestic 70 h with some overlap 16000 2 (distant) no free http://spandh.dcs.shef.ac.uk/chime_challenge/chime2_task1.html emmanuel.vincent@inria.fr Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M., "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013, Vancouver 12 h 34 UK English 51 command 1 no real rir dummy quasi-fixed simu domestic yes yes yes no no
CHiME 2 WSJ0 2012 domestic 78 h with some overlap 16000 2 (distant) no WSJ0 http://spandh.dcs.shef.ac.uk/chime_challenge/chime2_task2.html francesco.nesta@gmail.com Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M., "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013, Vancouver 33 h 101 US English 11k read 1 no real rir dummy fixed no domestic yes yes yes no no
ETAPE 2012 debates, outdoor interviews, and other TV/radio broadcasts selected for large speaker overlap and/or noise 42 h 16000 1 (mixing console) 1 ? ? guillaume.gravier@irisa.fr Guillaume Gravier, Gilles Adda, Niklas Paulsson, Matthieu Carr, Aude Giraudel, Olivier Galibert, The ETAPE corpus for the evaluation of speech-based TV content processing in the French language, LREC 2012. 32 h 347 French 16k colloquial 1 or more (7% overlap on average, up to 10% in debates) conversation some reverb human quasi-fixed head various real environments no N/S yes no yes
GALE (Chinese broadcast conversation) 2013 conversation (TV Broadcast) 120 h 16000 1 (mixing console) no $2000 (audio) + $1500 (transcripts) https://catalog.ldc.upenn.edu/LDC2013S04 strassel@ldc.upenn.edu 108 h ? Mandarin ? colloquial 1 or more conversation no human quasi-fixed head no no N/S yes no no
GALE (Arabic broadcast conversation) 2013 conversation (TV Broadcast) 251 h 16000 1 (mixing console) no 2 x [$2000 (audio) + $1500 (transcripts)] https://catalog.ldc.upenn.edu/LDC2013S02 strassel@ldc.upenn.edu 234 h ? Arabic ? colloquial 1 or more conversation no human quasi-fixed head no no N/S yes no no
REVERB SimData 2013 domestic/office 25 h 16000 8 (distant) no WSJCAM0 http://reverb2014.dereverberation.com/ REVERB-challenge@lab.ntt.co.jp Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani, Emanuel Habets, Reinhold Haeb-Umbach, Volker Leutnant, Armin Sehr, Walter Kellermann, Roland Maas, Sharon Gannot, Bhiksha Raj, "The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech", Proc. WASPAA 2013 25 h 130 UK English 10k read 1 no real rir loudspeaker fixed no experimental room original+spatial image yes yes no yes
DIRHA 2014 domestic 3.8 h 48000 40 (distant) no free http://shine.fbk.eu/resources/dirha-ii-simulated-corpus mravanelli@fbk.eu Alessio Brutti, Mirco Ravanelli, Piergiorgio Svaizer, Maurizio Omologo, A speech event detection and localization task for multiroom environments, HSCMA 2014. 1.3 h 30 Italian, German, Greek, Portuguese various various 1 or more simu real rir loudspeaker various no domestic (sum of individual noises) yes yes yes no yes

== Automatic speech recognition ==

=== 1st CHiME Challenge (2011) ===

Artificially distorted version of the small-vocabulary GRID audio-visual corpus (audio only). Binaural reverberated speech, with the speaker situated in front of the microphones, and additive household noises impinging from different directions. Clean-training, noisy-training, development and evaluation sets are available; see

Jon Barker, E. Vincent, N. Ma, H. Christensen, P. Green, "The PASCAL CHiME speech separation and recognition challenge", Computer Speech & Language, Volume 27, Issue 3, May 2013, Pages 621-633.

Paper available from Computer Speech and Language here

Corpus available here (no cost)
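As a rough illustration of the reverberation step described above (binaural reverberated speech is typically obtained by convolving a mono utterance with a left/right room impulse response pair), here is a NumPy sketch. All signals below are synthetic placeholders, not the actual CHiME impulse responses or tooling:

```python
import numpy as np

def binaural_reverberate(clean, rir_left, rir_right):
    """Convolve a mono utterance with a left/right impulse-response pair.

    Returns a 2-channel signal of length len(clean) + len(rir) - 1.
    """
    return np.stack([np.convolve(clean, rir_left),
                     np.convolve(clean, rir_right)])

# Toy inputs: 1 s of noise-like "speech" and 0.25 s exponentially
# decaying random impulse responses (placeholders for measured RIRs).
fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)
decay = np.exp(-np.arange(fs // 4) / (0.05 * fs))
rir_l = decay * rng.standard_normal(fs // 4)
rir_r = decay * rng.standard_normal(fs // 4)
binaural = binaural_reverberate(clean, rir_l, rir_r)
```

Additive noise at a controlled level would then be mixed into the two channels to produce the final corpus signals.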

==== Resources ====

* Training recipe of the challenge for HTK here.

==== Baselines ====

* See the paper above for results for a wide range of techniques.


=== AURORA 5 (2007) ===

Artificially distorted version of the TI-DIGITS digit corpus. Two conditions: additive noise, and additive noise plus reverberant speech. Variable SNR range. Various mixed training sets; no evaluation set. See

G. Hirsch, "Aurora-5 Experimental Framework for the Performance Evaluation of Speech Recognition in Case of a Hands-free Speech Input in Noisy Environments", Niederrhein University of Applied Sciences, 2007.

Paper available online here (no cost)

Corpus available from LDC here
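Mixing noise into clean speech at a prescribed SNR, as in the variable-SNR sets described above, amounts to scaling the noise so that the speech-to-noise power ratio hits the target. A hedged NumPy sketch with synthetic signals (not the actual AURORA 5 noise recordings or scripts):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db, then add."""
    noise = noise[:len(speech)]                      # truncate noise to utterance length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Toy example: white-noise stand-ins for a clean utterance and a noise recording.
rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(20000)
mixture = mix_at_snr(speech, noise, 5.0)  # mixture at 5 dB SNR
```

Varying `snr_db` over a range (e.g. one value per copy of the corpus) yields the kind of multi-condition training material these frameworks provide.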

==== Resources ====

* Training recipe for HTK is provided with the corpus.

==== Baselines ====

* Reproducible baseline: the above-cited paper includes a baseline for the ETSI Advanced Front-End.


=== AURORA 4 (2002) ===

Artificially distorted version of the 5k-word Wall Street Journal corpus (WSJ0). Stationary and non-stationary noises are added, and a second set of recordings uses a distant, mismatched microphone. Clean-training, mixed-training, noisy-training and test sets are available; no evaluation set. See

G. Hirsch, "Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends on a Large Vocabulary Task", ETSI STQ Aurora DSR Working Group, 2002.

Paper available with the corpus.

Corpora available from ELRA here and here

==== Resources ====

* Training recipe for HTK available here. Note that this recipe is for the Wall Street Journal corpus (WSJ0), the clean-speech version of AURORA 4; small changes to the feature extraction scripts are needed to account for the different file extensions.

== Speaker identification and verification ==

== Speech enhancement and separation ==

== Other applications ==

== Contribute a dataset ==

To contribute a new dataset, please

* create an account and log in
* go to the wiki page above corresponding to your application; if it does not exist yet, you may create it
* click on the "Edit" link at the top of the page and add a new section for your dataset (datasets are ordered by year of collection)
* click on the "Save page" link at the bottom of the page to save your modifications

Please make sure to provide the following information:

* name of the dataset and year of collection
* authors, institution, contact information
* link to the dataset and to side resources (lexicon, language model, etc.)
* short description (nature of the data, license, etc.) and link to a paper/report describing the dataset, if any
* at least one research result obtained for this dataset (see below)

We currently cannot provide storage space for large datasets. Please upload the dataset at a stable URL on the website of your institution or elsewhere and provide its URL only. If this is not possible, please contact the resources sharing working group.

== Contribute a research result ==

To contribute a new research result, please

* create an account and log in
* go to the wiki page and the section corresponding to the dataset for which this result was obtained
* click on the "Edit" link on the right of the section header and add a new item for your result
* click on the "Save page" link at the bottom of the page to save your modifications

Please make sure to provide the following information:

* authors, paper/report title, means of publication
* link to the PDF of the paper
* link to derived data (output transcriptions, intermediary data, etc.)
* code and instructions to reproduce the experiments (if available)

To save storage space, please do not upload the paper to this wiki; instead, link to it from your institutional archive, from another public archive (e.g., arXiv), or from the publisher's website (e.g., IEEE Xplore).

We currently cannot provide storage space for large datasets. Please upload the derived data at a stable URL on the website of your institution or elsewhere and provide its URL only. If this is not possible, please contact the resources sharing working group.