List of text corpora

Overview of data sets of languages. From Wikipedia, the free encyclopedia.

Text corpora (singular: text corpus) are large, structured sets of texts that have been systematically collected. AI developers use them to train large language models, and corpus linguists and researchers in other branches of linguistics use them for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.[1]

English language

European languages

Slavic

East Slavic

South Slavic

West Slavic

German

Middle Eastern languages

  • Corpus Inscriptionum Semiticarum
  • Kanaanäische und Aramäische Inschriften
  • Hamshahri Corpus (Persian)
  • Persian in MULTEXT-EAST corpus (Persian)[15]
  • Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
  • TEP: Tehran English-Persian Parallel Corpus[16]
  • PTC: Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 2005, 322 pp. ISBN 964-8699-32-1
  • Kurdish-corpus.uok.ac.ir (Kurdish corpus, Sorani dialect), University of Kurdistan, Department of English Language and Linguistics
  • Bijankhan Corpus: a contemporary Persian corpus for NLP research, University of Tehran, 2012
  • Neo-Assyrian Text Corpus Project
  • Quranic Arabic Corpus (Classical Arabic)
  • Electronic Text Corpus of Sumerian Literature
  • Open Richly Annotated Cuneiform Corpus
  • Asosoft text corpus[17] (Central Kurdish, Sorani)
  • Thesaurus Linguae Aegyptiae (ancient Egyptian, Afro-Asiatic)

Turkic languages

Devanagari

East Asian languages

South Asian languages

African languages

Parallel corpora of diverse languages

Comparable corpora

L2 (English) corpora

  • Cambridge Learner Corpus[44]
  • Corpus of Academic Written and Spoken English (CAWSE),[45] a collection of Chinese students' English language samples from academic settings, freely downloadable online.
  • English as a Lingua Franca in Academic Settings (ELFA),[46] an academic ELF corpus.[47][48]
  • International Corpus of Learner English (ICLE),[49] a corpus of learner written English.
  • Louvain International Database of Spoken English Interlanguage (LINDSEI),[50] a corpus of learner spoken English.
  • Trinity Lancaster Corpus, one of the largest corpora of L2 spoken English.[51][52]
  • University of Pittsburgh English Language Institute Corpus (PELIC)[53]
  • Vienna-Oxford International Corpus of English (VOICE),[54] an ELF corpus.[47]

References

See also
