Difference between revisions of "Datasets"

From rosp
Revision as of 23:46, 8 August 2014

Speech datasets

The table below lists speech datasets with detailed attributes and links to software baselines and research results. Each dataset may be used for one or more applications: automatic speech recognition, speaker identification and verification, source localization, speech enhancement and separation, etc.

Disclaimer: Only datasets that are publicly available, suitable for robust speech processing research, and longer than 5 min are listed.

The meaning of each attribute is detailed below.

Column groups: General attributes (release year to links), Speech (speech time to overlap type), Channel (channel type to speaker moves), Noise (noise type), Ground truth (reference signal to noise events).

dataset | release year | use case | total time (h) | sampling rate (kHz) | distant or noisy mics | video cameras | cost | links | speech time (h) | unique speakers | language | unique words (k) | speaking style | speakers per recording | overlap type | channel type | speaker radiation | speaker location | speaker moves | noise type | reference signal | speaker location, orientation | words | nonverbal traits | noise events
ShATR | 1994 | meeting | 0.6 | 48 | 3 | no | free | download, paper | 0.6 | 5 | UK English | 1 | spontaneous | 5 | multiple dialogs | reverb | human | quasi-fixed | head | meeting | headset | yes | yes | no | yes
LLSEC | 1996 | dialog | 1.4 | 16 | 4 | no | free | download | ? | 12 | N/S | N/S | read, spontaneous | 2 | dialog | reverb | human | quasi-fixed | head | hallway, restaurant (scenarized) | no | yes | no | no | no
RWCP Spoken Dialog Corpus | 1996 - 1997 | dialog | 10 | 16 | 2 | no | free | download, paper | 10 | 39 | Japanese | ? | spontaneous | 1 - 2 | dialog | reverb (low) | human | quasi-fixed | head | stationary background | no | no | yes | no | no
Aurora-2 | 2000 | public spaces | 33 | 8 - 16 | 1 | no | free given TIDigits (0.5 k$) | purchase (incl. HTK), features, paper | 33 | 214 | US English | 0.01 | digits | 1 | no | simulated phone | human | N/S | no | various real environments | original | N/S | yes | no | yes
SPINE1, SPINE2 | 2000 - 2001 | military | 38 | 16 | 2 | no | 7.4 k$ | purchase, paper | ? | 100 | US English | 1 | command, spontaneous | 1 - 2 | no | simulated radio | human | quasi-fixed | head | military | no | no | yes | no | no
Aurora-3 (subset of SpeechDat-Car) | 2000 - 2003 | car | ? | 16 | 4 | no | 1 k€ | purchase (incl. HTK), papers | ? | ? | various | ? | digits, command, read, spontaneous | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no
RWCP Meeting Speech Corpus | 2001 | meeting | 3.5 | 16 - 48 | 1 | 3 | free | download, paper | 3.5 | ? | Japanese | ? | spontaneous | 1 - 5 | meeting | reverb (low) | human | quasi-fixed | head | stationary background | headset | no | yes | no | no
RWCP Real Environment Speech and Acoustic Database | 2001 | domestic, office | ? | 16 - 48 | 30 | no | free | download, paper | ? | 5 | Japanese | ? | read | 1 | no | real rir, reverb | loudspeaker | various | no, pivoting arm | stationary background | original | yes | yes | no | yes
SpeechDat-Car | 2001 - 2011 | car | ? | 16 | 4 | no | 39 - 182 k€ per lang | purchase, paper | ? | 300 per lang | various | ? | digits, command, read, spontaneous | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no
Aurora-4 | 2002 | public spaces | ? | 8 - 16 | 1 | no | free given WSJ0 (1.5 k$) | purchase, HTK, paper | ? | 101 | US English | 10 | read | 1 | no | simulated phone | human | N/S | no | various real environments | original | N/S | yes | no | yes
TED | 2002 | seminar | 47 | 16 | 1 | no | 0.5 k$ | purchase, paper | 47 | 188 | non-native English | ? | lecture | 1 or more | seminar | reverb | human | quasi-fixed | head | stationary background | lapel | no | partial | no | no
CUAVE | 2002 | cocktail party | 3 | 44 | 1 | 1 | free | download, paper | 3 | 36 | US English | 0.01 | digits | 1 - 2 | full | reverb | human | quasi-fixed | head | stationary background | no | no | yes | no | no
CU-Move Microphone Array Data | 2002 - 2011 | car | 286 | 44 | 6 - 8 | no | 25 k$ | purchase, paper | 286 | 172 | US English | 12 | digits, command, read, dialog | 1 | no | reverb | human | quasi-fixed | head | car | no | no | yes | no | no
CENSREC-1 (Aurora-2J) | 2003 | public spaces | ? | 8 | 1 | no | free | download, paper | ? | 214 | Japanese | 0.01 | digits | 1 | no | simulated phone | human | N/S | no | various real environments | original | N/S | yes | no | yes
AVICAR | 2004 | car | 29 | 16 | 7 | 4 | free | download, paper | 29 | 86 | US English, non-native English | 1 | read | 1 | no | reverb | human | quasi-fixed | head | car | no | no | yes | no | no
AV16.3 | 2004 | meeting | 1.5 | 16 | 16 | 3 | free | download, paper | 1.5 | 12 | N/S | N/S | spontaneous | 1 - 3 | full | reverb | human | various | walk | stationary background | no | yes | no | no | no
ICSI Meeting Corpus | 2004 | meeting | 72 | 16 | 6 | no | 2.8 k$ | purchase, paper | 72 | 53 | US English | 13 | meeting | 3 - 10 | meeting | reverb | human | quasi-fixed | head | stationary background | headset, lapel | no | yes | yes | no
NIST Meeting Pilot Corpus Speech | 2004 | meeting | 15 | 16 | 7 | no | 5.5 k$ | purchase, paper | 15 | 61 | US English | 6 | meeting | 3 - 9 | meeting | reverb | human | various | walk | stationary background | headset, lapel | no | yes | no | no
CHIL Meetings | 2004 - 2007 | seminar, meeting | 60 | 44 | 79 - 147 | 6 - 9 | 3.5 k€ | purchase, paper | ? | ? | non-native English | ? | seminar, meeting | 3 - 20 | seminar, meeting | reverb | human | quasi-fixed | head | meeting (scenarized) | headset | yes | yes | yes | no
SPEECON | 2004 - 2011 | public space, domestic, office, car | ? | 16 | 3 | no | 75 k€ per lang | purchase, paper | ? | 600 per lang | various | ? | command, read, spontaneous | 1 | no | reverb | human | quasi-fixed | head | various real environments | headset | no | yes | no | no
CENSREC-2 | 2005 | car | ? | 16 | 1 | no | free | download, paper | ? | 214 | Japanese | 0.01 | digits | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no
CENSREC-3 | 2005 | car | ? | 16 | 1 | no | 21 k¥ | purchase, paper | ? | 311 | Japanese | 0.05 | read | 1 | no | reverb | human | quasi-fixed | head | car | headset | no | yes | no | no
Aurora-5 | 2006 | public spaces, domestic, office, car | ? | 8 | 1 | no | free given TIDigits (0.5 k$) | purchase (incl. HTK), paper | ? | 225 | US English | 0.01 | digits | 1 | no | no, simulated rir, real rir | loudspeaker | N/S | no | various real environments | original | no | yes | no | yes
AMI | 2006 | meeting | 100 | 16 | 16 | 6 | free | download, paper | ? | 189 | UK English | 8 | meeting | 4 | meeting (18% overlap) | reverb | human | quasi-fixed | head | stationary background | headset, lapel | yes | yes | yes | no
PASCAL SSC | 2006 | cocktail party | 8.8 | 25 | 1 | no | free | download, paper | 8.8 | 34 | UK English | 0.05 | command | 2 | full | no | human | N/S | no | no | original | N/S | yes | no | no
HIWIRE | 2007 | airplane | 21 | 16 | 1 | no | 0.05 k€ | purchase, paper | 21 | 81 | non-native English | 0.1 | command | 1 | no | no | human | N/S | head | airplane | original | N/S | yes | no | no
UT-Drive | 2007 | car | 40 | 25 | 5 | 2 | 25 k$ | download, paper | 40 | 25 | US English | 2.4 | command, dialog | 1 - 2 | dialog | reverb | human | quasi-fixed | head | car | headset (low quality) | no | partial | no | no
SASSEC, SiSEC underdetermined | 2007 - 2011 | cocktail party | 0.3 | 16 | 2 | no | free | download, paper | 0.3 | 16 | N/S | N/S | read | 3 - 4 | full | simulated rir, real rir, reverb | no, loudspeaker | fixed | no | no | original, spatial image | yes | no | no | no
MC-WSJ-AV, PASCAL SSC2, 2012_MMA, REVERB RealData | 2007 - 2014 | cocktail party | 10 | 16 | 8 - 40 | no | 1.5 k$ | purchase, paper, paper, HTK, Kaldi, results, results | ? | 45 | UK English | 10 | read | 1 - 2 | full | reverb | human | various | walk | stationary background | headset, lapel | yes | yes | no | no
CENSREC-4 (Simulated) | 2008 | public spaces, domestic, office, car | ? | 16 | 1 | no | free | download, paper | ? | 214 | Japanese | 0.01 | digits | 1 | no | real rir | dummy | fixed | no | various real environments | original | no | yes | no | yes
CENSREC-4 (Real) | 2008 | public spaces, domestic, office, car | ? | 16 | 1 | no | free | download, paper | ? | 10 | Japanese | 0.01 | digits | 1 | no | reverb | human | quasi-fixed | head | various real environments | headset | no | yes | no | yes
DICIT | 2008 | domestic | 6 | 48 | 16 | 2 | free | download, paper | 1 | ? | Italian | ? | command | 4 | no | reverb | human | various | walk | domestic (scenarized) | headset, tv | yes | yes | no | yes
SiSEC head-geometry | 2008 | cocktail party | 1.9 | 16 | 2 | no | free | download, paper | 1.9 | ? | N/S | N/S | read | 2 | full | real rir | loudspeaker | various | no | no | original, spatial image | yes | no | no | no
COSINE | 2009 | dialog | 38 | 48 | 20 | no | free | download, paper | 11 | 91 | US English, non-native English | 5 | spontaneous | 2 - 7 | dialog | reverb | human | various | walk | various real environments | headset, throat mic | no | yes | no | no
SiSEC real-world noise | 2010 | public spaces | 0.3 | 16 | 2 - 4 | no | free | download, paper | 0.3 | 6 | N/S | N/S | read | 1 - 3 | full | no, reverb (other room) | loudspeaker | various | no | various real environments | original, spatial image | yes | no | no | no
SiSEC dynamic | 2010 - 2011 | cocktail party | 0.2 | 16 | 2 - 4 | no | free | download, paper | 0.2 | ? | N/S | N/S | read | many but only 2 simultaneous | full | reverb | loudspeaker | various | simulated | no | original, spatial image | yes | no | no | no
CHiME 1, CHiME 2 Grid | 2011 - 2012 | domestic | 70 | 16 - 48 | 2 | no | free | download, paper, HTK, results, results | 12 | 34 | UK English | 0.05 | command | 1 | no | real rir | dummy | quasi-fixed | simulated head | domestic | yes | yes | yes | no | no
CHiME 2 WSJ0 | 2012 | domestic | 78 | 16 | 2 | no | free given WSJ0 (1.5 k$) | download, paper, HTK, Kaldi, results | 33 | 101 | US English | 11 | read | 1 | no | real rir | dummy | fixed | no | domestic | yes | yes | yes | no | no
ETAPE | 2012 | TV/radio debates, outdoor interviews | 42 | 16 | 1 | 1 | ? | download, paper | 32 | 347 | French | 16 | spontaneous | 1 or more | dialog (up to 10% overlap) | reverb (some) | human | quasi-fixed | head | various real environments | no | N/S | yes | no | yes
GALE (Chinese broadcast conversation) | 2013 | TV dialog | 120 | 16 | 1 | no | 3.5 k$ | purchase | 108 | ? | Mandarin | ? | spontaneous | 1 or more | dialog | no | human | quasi-fixed | head | no | no | N/S | yes | no | no
GALE (Arabic broadcast conversation) | 2013 | TV dialog | 251 | 16 | 1 | no | 7 k$ | purchase | 234 | ? | Arabic | ? | spontaneous | 1 or more | dialog | no | human | quasi-fixed | head | no | no | N/S | yes | no | no
REVERB SimData | 2013 | domestic, office | 25 | 16 | 8 | no | free given WSJCAM0 (1.75 k$) | purchase, paper, HTK, Kaldi, results, results | 25 | 130 | UK English | 10 | read | 1 | no | real rir | loudspeaker | fixed | no | stationary background | original, spatial image | yes | yes | no | yes
DIRHA | 2014 | domestic | 3.8 | 48 | 40 | no | free | download, paper | 1.3 | 30 | various | ? | command, read, spontaneous | 1 or more | simulated | real rir | loudspeaker | various | no | domestic (sum of events) | yes | yes | yes | no | yes
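
Numeric cells in the table sometimes hold a range (for example a sampling rate of "8 - 16") or a placeholder ("?" or "N/S"). Anyone loading the table into a script has to normalize these first. The following is a minimal sketch of one way to do so in Python. The function name `parse_numeric_cell` is hypothetical, and since this page does not define "?" or "N/S", the sketch simply assumes both mean that no usable value is available.

```python
from typing import Optional, Tuple

# Assumption: both markers mean "no usable value"; the wiki page does not define them.
MISSING_MARKERS = {"?", "N/S", ""}


def parse_numeric_cell(cell: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) for a numeric table cell, or None if the value is missing.

    A single number such as "48" is returned as a degenerate range (48.0, 48.0).
    A range such as "8 - 16" is returned as (8.0, 16.0).
    """
    text = cell.strip()
    if text in MISSING_MARKERS:
        return None
    parts = [part.strip() for part in text.split(" - ")]
    if len(parts) == 1:
        parts = parts * 2  # a single value is both the low and the high bound
    if len(parts) != 2:
        raise ValueError(f"unexpected cell format: {cell!r}")
    # float() raises ValueError for annotated cells such as "300 per lang".
    return float(parts[0]), float(parts[1])


# Example: keep only rows whose sampling rate reaches at least 44 kHz.
rates = {"ShATR": "48", "Aurora-2": "8 - 16", "CUAVE": "44"}
wideband = [name for name, cell in rates.items()
            if (rng := parse_numeric_cell(cell)) is not None and rng[1] >= 44]
```

Annotated cells such as "300 per lang" or "free given WSJ0 (1.5 k$)" are deliberately left unhandled here; they would need a separate, dataset-specific rule.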

General attributes:

  • year of release
  • scenario: car, cocktail party, domestic, lecture, meeting, office, public space, TV...
  • total duration (h) (multiple channels counted only once)
  • sampling rate (kHz)
  • number of distant or noisy microphones
  • number of video cameras
  • cost
  • links: download data, reference papers, software baselines, evaluation results...

Speech attributes:

  • duration of speech (h) (overlapping speech counted only once)
  • number of unique speakers
  • language
  • number of unique words (k) (differs from the assumed vocabulary size, which is somewhat arbitrary)
  • speaking style: digits, command, read, spontaneous...
  • number of speakers present in the room
  • type of speaker overlap: no overlap, simulated overlap, dialogue, meeting, full overlap...
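
The rule that overlapping speech is counted only once amounts to measuring the union of the speech segments rather than their sum. The sketch below illustrates that computation in Python. It assumes that speech activity is available as (start, end) pairs in seconds, which is a hypothetical input format and not something the datasets listed here necessarily provide; the function name `speech_duration` is likewise illustrative.

```python
from typing import Iterable, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, an assumed format


def speech_duration(segments: Iterable[Segment]) -> float:
    """Total duration covered by the segments, counting overlaps only once."""
    total = 0.0
    current_start: Optional[float] = None
    current_end: Optional[float] = None
    for start, end in sorted(segments):
        if end <= start:
            continue  # ignore empty or inverted segments
        if current_end is None or start > current_end:
            # A gap was found: close the running interval and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # The segment overlaps the running interval: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Two speakers talk over each other between 5 s and 10 s.
segments = [(0.0, 10.0), (5.0, 15.0), (20.0, 25.0)]
seconds = speech_duration(segments)
hours = seconds / 3600.0  # the table reports durations in hours
```

The same union logic applies to the total duration attribute, where several channels recording the same time span are counted only once.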

Channel attributes:

  • channel type: none, simulated room impulse response, convolution by a recorded room impulse response, reverberant recording...
  • speaker radiation: loudspeaker, dummy head with mouth simulator, human...
  • speaker location: at a fixed position in the room, at a quasi-fixed position (e.g., seated), at different positions...
  • speaker movements: no movement, head movements, walking...

Noise attributes:

  • noise type: stationary background noise (e.g., air-conditioning), car noise, meeting noises, domestic noises, outdoor noises...

Available ground truth:

  • reference speech signal: original (at the mouth), headset or lapel (slightly differs from the signal at the mouth), spatial image (at the microphones)...
  • speaker location and orientation
  • words uttered
  • nonverbal traits
  • noise events
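
For readers who want to handle the table programmatically, the five attribute groups above map naturally onto a flat record type. The following Python sketch is one possible layout, not a format defined by this wiki: every field name is hypothetical, and most fields are kept as strings because cells may contain ranges, lists, or placeholders. The example values are copied from the ShATR row of the table.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class SpeechDataset:
    """One table row; field names are illustrative, not defined by the wiki."""
    name: str
    # General attributes
    release_year: str
    scenario: str
    total_duration_h: str
    sampling_rate_khz: str
    microphones: str
    video_cameras: str
    cost: str
    links: List[str]
    # Speech attributes
    speech_duration_h: str
    unique_speakers: str
    language: str
    unique_words_k: str
    speaking_style: str
    speakers_in_room: str
    overlap_type: str
    # Channel attributes
    channel_type: str
    speaker_radiation: str
    speaker_location: str
    speaker_movements: str
    # Noise attributes
    noise_type: str
    # Available ground truth
    reference_signal: str
    location_and_orientation: str
    words: str
    nonverbal_traits: str
    noise_events: str


# Values taken from the ShATR row of the table.
shatr = SpeechDataset(
    name="ShATR", release_year="1994", scenario="meeting",
    total_duration_h="0.6", sampling_rate_khz="48", microphones="3",
    video_cameras="no", cost="free", links=["download", "paper"],
    speech_duration_h="0.6", unique_speakers="5", language="UK English",
    unique_words_k="1", speaking_style="spontaneous", speakers_in_room="5",
    overlap_type="multiple dialogs", channel_type="reverb",
    speaker_radiation="human", speaker_location="quasi-fixed",
    speaker_movements="head", noise_type="meeting",
    reference_signal="headset", location_and_orientation="yes",
    words="yes", nonverbal_traits="no", noise_events="yes",
)

record = asdict(shatr)  # plain dict, convenient for export to CSV or JSON
```

Keeping one field per table column makes it straightforward to check that a new entry fills in every attribute before it is added to the page.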

Text datasets

Other datasets

Contribute a dataset

To contribute a new dataset, please

  • create an account and log in
  • go to the wiki page above corresponding to your application; if it does not exist yet, you may create it
  • click on the "Edit" link at the top of the page and add a new section for your dataset (the datasets are ordered by year of collection)
  • click on the "Save page" link at the bottom of the page to save your modifications

Please make sure to provide the following information:

  • name of the dataset and year of collection
  • authors, institution, contact information
  • link to the dataset and to side resources (lexicon, language model, etc.)
  • short description (nature of the data, license, etc.) and link to a paper/report describing the dataset, if any
  • at least one research result obtained for this dataset (see below)

We currently cannot provide storage space for large datasets. Please upload the dataset at a stable URL on the website of your institution or elsewhere and provide its URL only. If this is not possible, please contact the resources sharing working group.

Contribute a research result

To contribute a new research result, please

  • create an account and log in
  • go to the wiki page and the section corresponding to the dataset for which this result was obtained
  • click on the "Edit" link on the right of the section header and add a new item for your result
  • click on the "Save page" link at the bottom of the page to save your modifications

Please make sure to provide the following information:

  • authors, paper/report title, means of publication
  • link to the pdf of the paper
  • link to derived data (output transcriptions, intermediary data, etc)
  • code and instructions to reproduce the experiments (if available)

To save storage space, please do not upload the paper to this wiki. Instead, link to it wherever possible in your institutional archive, in another public archive (e.g., arXiv), or on the publisher's website (e.g., IEEE Xplore).

We currently cannot provide storage space for large datasets. Please upload the derived data at a stable URL on the website of your institution or elsewhere and provide its URL only. If this is not possible, please contact the resources sharing working group.