Datasets

Evaluation of German Language Models

Questions, ground truth, rating hints, and answers of the models tested up to February 2024. Download the evaluation here

February 28, 2024
Evaluation of medium-sized language models (June 2023)

Large language models (LLMs) have garnered significant attention, but the definition of «large» lacks clarity. This dataset focuses on medium-sized language models (MLMs), defined as having at least six billion but fewer than 100 billion parameters. The corresponding study (https://doi.org/10.48550/arXiv.2305.11991) evaluates MLMs on zero-shot generative question answering, which requires models to provide elaborate answers without…

July 5, 2023
Handtools image classification

Dataset for hand tool image classification. The photos were taken with different cameras and smartphones.

March 21, 2023
ASR Bundestag

A dataset for automatic speech recognition (ASR) systems, consisting of multiple subsets (publication pending). The dataset comprises over 1,000 hours of transcribed audio from political speeches in the German Bundestag. Source of the raw data: https://www.bundestag.de/mediathek Terms of use: https://www.bundestag.de/resource/blob/296016/301050a2c21ce66e24014805c235f9c7/nutzungsbedingungen_de-data.pdf The content must not be used for «business or commercial advertising purposes».

March 21, 2023
Common Crawl German

The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. Here, we host the German colossal, cleaned Common Crawl corpus, a German text corpus based on Common Crawl. It has been cleaned and preprocessed and can be used for various NLP tasks. For example,…

March 21, 2023
HUI-Audio-Corpus-German

A high-quality text-to-speech dataset created by researchers at IISYS. The paper can be found here. The dataset contains several speakers. The five largest are listed individually; the rest are grouped as "other". All audio files have a sampling rate of 44.1 kHz. For each speaker, there is a clean variant in addition…

March 21, 2023
