The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling.
Here we host GC4, the German colossal, cleaned Common Crawl corpus.
This is a German text corpus based on Common Crawl. It has been cleaned and preprocessed and can be used for various NLP tasks, for example the self-supervised training of language models.
GC4 was created by Philipp Reißel from ambeRoad with support from Philip May.
In a very simplified manner, one can say:
- HEAD: Consists of high-quality text (e.g. newspapers, government websites)
- MIDDLE: More colloquial language such as forum entries and comment sections
- TAIL: The dark side of the Internet (not hosted here)
The classification is based on comparing n-gram occurrences with n-grams from the German Wikipedia; in our practical experience, this has worked quite well.
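The exact scoring pipeline is described on the project page linked below. Purely as a toy illustration of the general idea (not the GC4 implementation; the reference text, scoring function, and thresholds here are made up), a document could be scored by the fraction of its word trigrams that also occur in a Wikipedia-derived n-gram set:

```python
def word_ngrams(text, n=3):
    """Yield word n-grams from whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# Reference n-grams would be built from a German Wikipedia dump;
# this tiny inline sample is only a stand-in.
wikipedia_sample = (
    "die stadt liegt am rhein und ist seit dem mittelalter ein "
    "wichtiges zentrum des handels und der wissenschaft"
)
reference = set(word_ngrams(wikipedia_sample))

def overlap_score(document):
    """Fraction of the document's trigrams that also occur in the reference."""
    grams = list(word_ngrams(document))
    if not grams:
        return 0.0
    return sum(g in reference for g in grams) / len(grams)

# Documents could then be bucketed into HEAD / MIDDLE / TAIL by
# thresholding this score; the actual GC4 thresholds are not shown here.
print(overlap_score("die stadt liegt am rhein und hat viele museen"))
```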
For more information about the GC4 corpus, see: https://german-nlp-group.github.io/projects/gc4-corpus.html
Below you can find the download links for this dataset. For your convenience, we provide a text file containing all links.
Use the command
$ for f in *.tar.gz; do tar xvzf "$f"; done
to extract the files. (Plain tar xfvz *.tar.gz only works for a single archive; with several archives, tar would treat all but the first file as member names to extract, so loop over the files instead.)
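If you prefer to script the whole process, here is a minimal Python sketch; it assumes the link files provided below (headpart.txt / middlepart.txt) contain one download URL per line:

```python
import tarfile
import urllib.request
from pathlib import Path

def download_and_extract(link_file, target_dir="gc4"):
    """Download every archive listed in link_file and unpack it into target_dir."""
    out = Path(target_dir)
    out.mkdir(exist_ok=True)
    for url in Path(link_file).read_text().splitlines():
        url = url.strip()
        if not url:
            continue  # skip blank lines
        archive = out / url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, archive)  # fetch one .tar.gz
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(out)  # equivalent to `tar xvzf`

# Example: fetch and unpack the HEAD part.
# download_and_extract("headpart.txt")
```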
Head Part
Download the list of links as a plain text file: headpart.txt
Middle Part
Download the list of links as a plain text file: middlepart.txt