tinytorch/datasets/tinytalks/README.md
A carefully curated question-answering dataset designed for learning transformer architectures
TinyTalks is a lightweight, pedagogically designed conversational dataset for training transformer models in educational settings. Unlike large-scale datasets that require hours of training, TinyTalks enables students to see their first transformer learn meaningful patterns in under 5 minutes.
- Fast Training - Trains in 3-5 minutes on a laptop
- Verifiable Learning - Clear success metrics (correct vs. incorrect answers)
- Progressive Difficulty - 5 levels from greetings to reasoning
- Educational Focus - Designed for "aha!" moments, not benchmarks
- Zero Dependencies - Ships with TinyTorch, no downloads needed
- Reproducible - Deterministic generation, versioned releases
TinyTalks is organized into 5 progressive difficulty levels:
Basic conversational patterns and self-identification.
Q: Hello!
A: Hi there! How can I help you today?
Q: What is your name?
A: I am TinyBot, a simple AI assistant.
Learning Goal: Model learns conversation structure and identity.
Factual knowledge about the world (colors, animals, objects).
Q: What color is the sky?
A: The sky is blue during the day.
Q: How many legs does a dog have?
A: A dog has four legs.
Learning Goal: Model learns to retrieve and state facts.
Simple arithmetic operations (single-digit addition, subtraction).
Q: What is 2 plus 3?
A: 2 plus 3 equals 5.
Q: What is 7 minus 4?
A: 7 minus 4 equals 3.
Learning Goal: Model learns basic numerical reasoning.
Purpose, usage, and logical relationships.
Q: What do you use an umbrella for?
A: You use an umbrella to stay dry in the rain.
Q: Where do fish live?
A: Fish live in water.
Learning Goal: Model learns causal and functional relationships.
Simple context tracking across exchanges.
Q: I like pizza.
A: Pizza is delicious! What toppings do you like?
Q: I like pepperoni.
A: Pepperoni is a popular choice!
Learning Goal: Model begins to track conversational context.
```python
# Load full dataset
with open('datasets/tinytalks/tinytalks_v1.txt', 'r') as f:
    text = f.read()

# Or use pre-split versions
with open('datasets/tinytalks/splits/train.txt', 'r') as f:
    train_text = f.read()
```
```python
# See milestones/05_2017_transformer/tinybot_demo.py for full example
from tinytorch.models.transformer import GPT
from tinytorch.text.tokenization import CharTokenizer

# Initialize model
tokenizer = CharTokenizer()
tokenizer.fit(train_text)

model = GPT(
    vocab_size=len(tokenizer),
    embed_dim=128,
    num_layers=4,
    num_heads=4,
    max_seq_len=64,
)

# Train for 5 minutes and see meaningful results!
```
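The training loop itself lives in tinybot_demo.py. As a rough sketch of how the raw text becomes training examples, the snippet below samples random next-character windows; it assumes CharTokenizer exposes an encode() method returning integer token ids, which is not shown above.

```python
import random

def make_batch(text, tokenizer, batch_size=32, seq_len=64):
    """Sample random (input, target) windows for next-character prediction."""
    ids = tokenizer.encode(text)              # assumed API: list of int token ids
    inputs, targets = [], []
    for _ in range(batch_size):
        start = random.randint(0, len(ids) - seq_len - 1)
        window = ids[start : start + seq_len + 1]
        inputs.append(window[:-1])            # characters the model reads
        targets.append(window[1:])            # next character at each position
    return inputs, targets
```

Each target is the input shifted one position to the right, which is exactly the next-token objective the GPT model above is trained on.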
After training for 10-20 epochs (~3-5 minutes), the model begins producing correct, well-formed answers in the Q&A style above. This demonstrates that the transformer has learned patterns, not just memorized text.
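One way to make that claim verifiable is to prompt the trained model in the dataset's own format and decode greedily. The sketch below is illustrative only: it assumes the tokenizer exposes encode()/decode() and that calling the model on a list of token ids returns one row of logits per position (see tinybot_demo.py for the actual API).

```python
def ask(model, tokenizer, question, max_new_tokens=60):
    """Greedy decoding from a 'Q: ... / A:' prompt (illustrative sketch)."""
    ids = tokenizer.encode(f"Q: {question}\nA:")   # assumed API
    for _ in range(max_new_tokens):
        logits = model(ids[-64:])                  # assumed: per-position logits
        ids.append(int(logits[-1].argmax()))       # pick the most likely next char
    return tokenizer.decode(ids)                   # assumed API

print(ask(model, tokenizer, "What color is the sky?"))
```

A correct completion here is exactly the kind of clear success metric the dataset is designed around.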
Simple, human-readable text format:
Q: [Question text]
A: [Answer text]
Q: [Next question]
A: [Next answer]
Rationale:
Delimiter: An empty line separates consecutive Q&A pairs.
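Because the format is just Q:/A: lines separated by blank lines, it can be parsed in a few lines of plain Python. The load_pairs helper below is an illustrative sketch, not part of TinyTorch:

```python
def load_pairs(path):
    """Parse the TinyTalks text format into (question, answer) tuples."""
    with open(path, 'r') as f:
        blocks = f.read().strip().split('\n\n')    # blank line between pairs
    pairs = []
    for block in blocks:
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) >= 2 and lines[0].startswith('Q:') and lines[1].startswith('A:'):
            pairs.append((lines[0][2:].strip(), lines[1][2:].strip()))
    return pairs

pairs = load_pairs('datasets/tinytalks/tinytalks_v1.txt')
print(f"{len(pairs)} Q&A pairs, e.g. {pairs[0]}")
```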
The generation script (scripts/generate_tinytalks.py) produces identical output on every run. To validate the dataset, run:

python datasets/tinytalks/scripts/validate_dataset.py
Checks:
Run scripts/stats.py to generate dataset statistics:
python datasets/tinytalks/scripts/stats.py
Output:
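For a rough cross-check that does not depend on the script, a few basic numbers can be computed directly from the released file:

```python
# Count Q&A pairs and characters straight from the text file.
with open('datasets/tinytalks/tinytalks_v1.txt', 'r') as f:
    text = f.read()

print(f"Q&A pairs:         {sum(line.startswith('Q:') for line in text.splitlines())}")
print(f"Total characters:  {len(text)}")
print(f"Unique characters: {len(set(text))}")
```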
TinyTalks is designed as the canonical dataset for TinyTorch's Transformer milestone:
Creative Commons Attribution 4.0 International (CC BY 4.0)
You are free to:
Under these terms:
See LICENSE for full text.
If you use TinyTalks in your work, please cite:
```bibtex
@dataset{tinytalks2025,
  title={TinyTalks: A Conversational Q\&A Dataset for Educational Transformers},
  author={TinyTorch Contributors},
  year={2025},
  publisher={GitHub},
  url={https://github.com/VJ/TinyTorch/tree/main/datasets/tinytalks},
  version={1.0.0}
}
```
Text citation: TinyTorch Contributors. (2025). TinyTalks: A Conversational Q&A Dataset for Educational Transformers (Version 1.0.0). https://github.com/VJ/TinyTorch
Version 1.0.0 (Current)
Planned:
See CHANGELOG.md for detailed history.
We welcome contributions! Ways to help:
Guidelines:
See CONTRIBUTING.md for details.
Inspired by:
Created for:
The name embodies our philosophy:
Just like TinyTorch makes deep learning accessible, TinyTalks makes conversational AI immediate and tangible.
Built with ❤️ by the TinyTorch community
"The best way to understand transformers is to see them learn."