tinytorch/datasets/tinytalks/DATASHEET.md
Following "Datasheets for Datasets" by Gebru et al. (2018)
TinyTalks was created to provide an educational, lightweight conversational Q&A dataset specifically designed for teaching transformer architectures. The primary goal is to enable students to train their first transformer model and see meaningful learning in under 5 minutes, creating an "aha!" moment that demonstrates how transformers learn patterns.
TinyTalks was created by the TinyTorch Contributors as part of the TinyTorch educational deep learning framework. It was developed specifically for the Transformer milestone (Module 13 / Milestone 05) of the TinyTorch curriculum.
This dataset was created as an open-source educational resource without specific funding. It is part of the broader TinyTorch project.
Each instance represents a question-answer pair in natural language. Questions are conversational or factual queries, and answers are appropriate responses that an AI assistant might provide.
350 question-answer pairs distributed across 5 difficulty levels.
This is a curated sample. It represents a pedagogically-designed subset of possible conversational Q&A pairs, specifically selected for educational value and training efficiency.
Each instance consists of a question string and its corresponding answer string, stored as plain text.
Example:

```text
Q: What color is the sky?
A: The sky is blue during the day.
```
Yes. In a Q&A format, the question serves as input and the answer serves as the target label for supervised learning. For autoregressive language modeling, the entire text sequence serves as both input and target (shifted by one token).
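The "shifted by one token" target construction described above can be sketched as follows. This is an illustrative helper, not part of the TinyTorch API; the function name and the use of plain integer token ids are assumptions.

```python
# Sketch of autoregressive input/target construction: the model sees
# tokens 0..n-2 as input and must predict tokens 1..n-1 as targets.

def make_lm_pair(token_ids):
    """Given a token id sequence, return (inputs, targets) shifted by one."""
    inputs = token_ids[:-1]   # all tokens except the last
    targets = token_ids[1:]   # all tokens except the first
    return inputs, targets

# Toy example with integer token ids
tokens = [5, 12, 7, 9, 2]
x, y = make_lm_pair(tokens)
print(x)  # [5, 12, 7, 9]
print(y)  # [12, 7, 9, 2]
```

At every position, the target is simply the next token in the sequence, which is what makes the whole text serve as both input and label.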
No. Each Q&A pair is complete. However, certain content is intentionally excluded by design to keep the dataset simple and focused.
Partially. Level 5 (Multi-turn Context) contains sequential Q&A pairs where the answer to one question sets up context for the next. However, most Q&A pairs (Levels 1-4) are independent.
Yes, we provide recommended train/validation/test splits. The splits maintain proportional representation of all 5 difficulty levels and are deterministic (the same split is produced every time).
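A deterministic, level-proportional split can be sketched as below. This is a hypothetical illustration, not the actual TinyTorch split logic; the function name, tuple layout, and split fractions are assumptions.

```python
import random

def split_by_level(pairs, seed=42, train_frac=0.8, val_frac=0.1):
    """pairs: list of (level, question, answer) tuples.
    Returns (train, val, test) lists, stratified by difficulty level."""
    rng = random.Random(seed)  # fixed seed => same split every run
    train, val, test = [], [], []
    for level in sorted({p[0] for p in pairs}):
        # Sort before shuffling so the result is independent of input order
        group = sorted(p for p in pairs if p[0] == level)
        rng.shuffle(group)
        n_train = int(len(group) * train_frac)
        n_val = int(len(group) * val_frac)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```

Stratifying per level guarantees each difficulty level appears in every split, and seeding the RNG makes the split reproducible across runs.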
Fully self-contained. No external resources, URLs, or references required.
No. All data is original or public-domain factual knowledge. No confidential, proprietary, or sensitive information is included.
No. The dataset was explicitly designed to avoid such content.
All Q&A pairs were manually authored by TinyTorch contributors. No scraping, crowdsourcing, or automated generation was used for v1.0.
Not applicable. This is an original curated dataset, not a sample from a larger corpus.
TinyTorch contributors (open-source volunteers). No monetary compensation. Contributors are acknowledged in project documentation.
December 2024 - January 2025 (v1.0 release)
An informal ethical review was conducted by TinyTorch maintainers.
No formal IRB review was required as no human subjects or sensitive data were involved.
Minimal preprocessing: pairs are serialized into a `Q: ... \n A: ... \n\n` text format. No automated cleaning or labeling was required, as the data was manually authored.
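The serialization format above can be sketched as follows. The helper name is an assumption for illustration, not the TinyTorch API, and the exact whitespace around the markers is a guess at the intended layout.

```python
# Illustrative serialization of Q/A pairs into the training text format.

def serialize(pairs):
    """Join (question, answer) tuples into 'Q: ...\\nA: ...\\n\\n' text."""
    return "".join(f"Q: {q}\nA: {a}\n\n" for q, a in pairs)

text = serialize([("What color is the sky?",
                   "The sky is blue during the day.")])
print(text)
# Q: What color is the sky?
# A: The sky is blue during the day.
```

The trailing blank line separates consecutive pairs, so the concatenated corpus can be split back into examples on `\n\n`.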
Since the data was manually authored, the "raw" data is the authored text itself. The generation script (scripts/generate_tinytalks.py) serves as the source of truth and can regenerate the dataset identically.
Yes:

- `scripts/generate_tinytalks.py` (Python)
- `scripts/validate_dataset.py` (Python)
- `scripts/stats.py` (Python)

All scripts are open-source (MIT license) and included in the repository.
Yes. The primary use case is training a student's first transformer model as part of the TinyTorch curriculum (Module 13 / Milestone 05).
The dataset is hosted at: https://github.com/VJ/TinyTorch/tree/main/datasets/tinytalks
Usage examples:

- `milestones/05_2017_transformer/tinybot_demo.py` - Main training script
- `examples/demo_usage.py` - Data loading examples

Potential uses:
Considerations: the dataset's constraints (small size, simple language) are intentional design choices for educational clarity, not limitations.
Not suitable for: production systems or use as a general-purpose language corpus.

Designed for: education, i.e., teaching and demonstrating how transformers learn.
Yes. TinyTalks is open-source and freely available to everyone under the CC BY 4.0 license.
Yes. Creative Commons Attribution 4.0 International (CC BY 4.0)
No. All content is original or public-domain factual knowledge.
No export controls or regulatory restrictions apply.
TinyTorch Contributors (maintainers of the TinyTorch project)
Primary maintainer: VJ (@profvjreddi on GitHub)
Not yet. Any discovered errors will be documented in the project's GitHub issue tracker (`dataset` + `tinytalks` labels).

Yes, planned updates:
Updates will follow semantic versioning (MAJOR.MINOR.PATCH).
Not applicable. The dataset does not contain any personal data, PII, or information about real individuals.
Yes. All versions will remain available via Git tags:
```bash
git checkout tags/tinytalks-v1.0.0
```

Yes:
See CONTRIBUTING.md for guidelines.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for datasets. arXiv preprint arXiv:1803.09010.
Last updated: January 2025