General Internet Corpus of Russian

Type of site	Educational/scientific project
Available in	Russian
Created by	Vladimir Selegey, Vladimir Belikov, Serge Sharoff
URL	www.webcorpora.ru/en
Commercial	No
Registration	Needed; given by request
Launched	2012
Current status	Beta-testing

The General Internet Corpus of Russian (GICR) is a corpus of Russian internet texts that has been available on request through an online query interface since 2013. It draws its texts from the blogosphere, social networks, major news sources and literary magazines.


Corpus segment	Words (millions)	Documents
Mail.Ru Blogs	707	9,882,120
VKontakte	9,820	193,770,717
LiveJournal	8,110	73,229,158
Russian Magazine Hall	313	56,547
News (RIA, Regnum, Lenta.ru, Rosbalt)	851	2,964,897
All corpora	19,801	279,903,439

Corpus	Languages	Access	Site	Size	Facilities
COW: Free, Large Web Corpora in European Languages	English, French, German, Spanish, Swedish, Dutch	Free after registration; trial access is possible without registration		30 billion words	KWIC format, morphological tagging, CQP search, markup and search by date, URL, country, city, etc.
Sketch Engine	English, French, German, Italian, Arabic, Russian, Spanish, Portuguese, Korean, Japanese, Chinese; more languages available at extra charge	Paid access; trial access is possible after registration		86 billion words	Concordances, sketch grammar, thesaurus, KWIC, morphological tagging, CQP search
Aranea Corpora	English, Russian, Finnish, French, German, Hungarian, Spanish, Italian, Dutch, Polish, Slovak	Free after registration; trial access is possible without registration		14 billion words	NoSketch Engine, concordances, sketch grammar, thesaurus, KWIC, morphological tagging, CQP search, comparable query results in different languages
GICR (General Internet Corpus of Russian)	Russian	Free; registration on request		20 billion words	Concordances, thesaurus, KWIC, morphological tagging, CQP search, markup and search by date, country, city, internet segment, and the author's sex, year and place of birth; “query mail” for users
GloWbE (Corpus of Global Web-Based English)	English, with varieties from 20 countries	No registration		1.9 billion words	KWIC, concordances, collocates, results comparable by dialect, CQP search; the corpus can be downloaded

