-
Evaluation of German Language Models
Questions, ground truth, rating hints, and answers of the models tested up to February 2024. Download the evaluation here.
-
Evaluation of medium-sized language models (June 2023)
Large language models (LLMs) have garnered significant attention, but the definition of «large» lacks clarity. This dataset focuses on medium-sized language models (MLMs), defined as having at least six billion but fewer than 100 billion parameters. The corresponding study (https://doi.org/10.48550/arXiv.2305.11991) evaluates MLMs on zero-shot generative question answering, which requires models to provide elaborate answers without…
-
Handtools image classification
Dataset for hand tool image classification. Photos were taken with different cameras and smartphones.
-
ASR Bundestag
A dataset for automatic speech recognition (ASR) systems, consisting of multiple subsets (publication pending). The dataset contains over 1,000 hours of transcribed audio from political speeches in the German Bundestag. Source of the raw data: https://www.bundestag.de/mediathek Terms of use: https://www.bundestag.de/resource/blob/296016/301050a2c21ce66e24014805c235f9c7/nutzungsbedingungen_de-data.pdf The content must not be used for «trade or commercial advertising purposes».
-
Common Crawl German
The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. Here, we host the German colossal, cleaned Common Crawl corpus, a German text corpus based on Common Crawl. It has been cleaned and preprocessed and can be used for various NLP tasks. For example,…
-
HUI-Audio-Corpus-German
A high-quality text-to-speech dataset created by researchers at IISYS. The paper can be found here. The dataset contains several speakers. The five largest are listed individually; the rest are grouped as «other». All audio files have a sampling rate of 44.1 kHz. For each speaker, there is a clean variant in addition…