• Evaluation of German Language Models

    Questions, ground truth, rating hints, and answers of tested models up to February 2024. Download the evaluation here

  • Evaluation of medium-sized language models (June 2023)

    Large language models (LLMs) have garnered significant attention, but the definition of «large» lacks clarity. This dataset focuses on medium-sized language models (MLMs), defined as having at least six billion but fewer than 100 billion parameters. The corresponding study evaluates MLMs on zero-shot generative question answering, which requires models to provide elaborate answers without…

  • Handtools image classification

    Dataset for handtools image classification. Photos were taken with different cameras and smartphones.

  • ASR Bundestag

    A dataset for automatic speech recognition (ASR) systems, consisting of multiple subsets (pending publishing). The dataset consists of over 1,000 hours of audio-transcript pairs from political speeches of the German Bundestag. Source of the raw data: Terms of use. Terms of use: The content must not be used for «commercial or business advertising purposes».

  • Common Crawl German

    The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. Here, we host the German colossal, cleaned Common Crawl corpus. This German text corpus is based on Common Crawl; it has been cleaned and preprocessed and can be used for various NLP tasks. For example,…

  • HUI-Audio-Corpus-German

    A high-quality text-to-speech dataset created by researchers at IISYS. The paper can be found here. The dataset contains several speakers. The five largest are listed individually; the rest are grouped as «other». All audio files have a sampling rate of 44.1 kHz. For each speaker, there is a clean variant in addition…