Datasets

  • Evaluation of German Language Models

    Questions, ground truth, rating hints, and answers of the models tested up to February 2024. Download the evaluation here.


  • Evaluation of medium-sized language models (June 2023)

    Large language models (LLMs) have garnered significant attention, but the definition of «large» lacks clarity. This dataset focuses on medium-sized language models (MLMs), defined as having at least six billion but fewer than 100 billion parameters. The corresponding study (https://doi.org/10.48550/arXiv.2305.11991) evaluates MLMs on zero-shot generative question answering, which requires models to provide elaborate answers without…


  • Handtools image classification

    Dataset for hand-tool image classification. Photos were taken with different cameras and smartphones.


  • ASR Bundestag

    A dataset for Automatic Speech Recognition (ASR) systems, consisting of multiple subsets (publication pending). The dataset consists of over 1,000 hours of audio with transcripts of political speeches from the German Bundestag. Source of the raw data: https://www.bundestag.de/mediathek Terms of use: https://www.bundestag.de/resource/blob/296016/301050a2c21ce66e24014805c235f9c7/nutzungsbedingungen_de-data.pdf The content may not be used for «commercial or business advertising purposes».


  • Common Crawl German

    The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. Here, we host the German colossal, cleaned Common Crawl corpus. This is a German text corpus based on Common Crawl. It has been cleaned and preprocessed and can be used for various NLP tasks. For example,…


  • HUI-Audio-Corpus-German

    A high-quality Text-To-Speech dataset. This dataset was created by researchers at IISYS. The paper can be found here. The dataset contains several speakers. The 5 largest are listed individually; the rest are summarized as other. All audio files have a sampling rate of 44.1 kHz. For each speaker, there is a clean variant in addition…