All examples are under the egs directory, and each example is named after its dataset. Datasets whose names start with "mock" are used for testing.
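
The naming convention above can be checked with a few lines of Python. This is only a minimal sketch: `split_examples` is a hypothetical helper (not part of this repository), and it assumes each dataset is a direct subdirectory of egs and that the "mock" prefix is matched literally.

```python
from pathlib import Path


def split_examples(egs_root):
    """Return (real, mock) lists of dataset directory names under egs_root.

    Hypothetical helper: assumes every example is a direct subdirectory
    of egs_root and that test-only datasets start with "mock".
    """
    # Only directories count as examples; stray files are ignored.
    names = sorted(p.name for p in Path(egs_root).iterdir() if p.is_dir())
    mock = [n for n in names if n.startswith("mock")]
    real = [n for n in names if not n.startswith("mock")]
    return real, mock

# Example call (assumes the current directory contains egs):
#   real, mock = split_examples("egs")
```

Calling the helper on the egs directory would return the regular datasets and the test-only mock datasets as two separate sorted lists.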

| DataSet | Supported Tasks | Description |
|---|---|---|
| ATIS | Sequence labeling / Text classification / NLU joint learning | Air Travel Information System (ATIS) pilot corpus. |
| CoNLL2003 | Sequence labeling | The CoNLL 2003 NER task consists of newswire text from the Reuters RCV1 corpus tagged with four different entity types (PER, LOC, ORG, MISC). |
| MSRA_NER | Sequence labeling | The MSRA NER dataset, drawn from the news domain. |
| SNLI | Sentence Matching | The Stanford Natural Language Inference (SNLI) corpus is a freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. |
| Quora_QP | Sentence Matching | Data collected from the Quora platform, a place to gain and share knowledge about anything. |
| Yahoo_Answer | Document Classification | The Yahoo Answers data is obtained from Zhang et al. (2015). This is a topic classification task with 10 classes. The documents we use include question titles, question contexts and best answers. |
| Trec | Document Classification | A data collection for learning question classification experiments, including the question class definitions. |


| DataSet | Supported Tasks | Description |
|---|---|---|
| hkust | ASR | HKUST Mandarin Telephone Speech. |
| voxceleb | Speaker Verification | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
| iemocap | Emotion | The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal and multispeaker database collected at the SAIL lab at USC. |