tinytorch/datasets/tinydigits/README.md
A curated subset of the sklearn digits dataset for rapid ML prototyping and educational demonstrations.
It follows Karpathy's "~1000 samples" philosophy for educational datasets.
datasets/tinydigits/
├── train.pkl # {'images': (1000, 8, 8), 'labels': (1000,)}
└── test.pkl # {'images': (200, 8, 8), 'labels': (200,)}
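The layout above can be checked programmatically before training. The following is a minimal sketch, assuming the `images` and `labels` values are NumPy arrays and that labels are the digits 0 through 9. `validate_split` is a hypothetical helper, not part of TinyTorch, and the data here is a synthetic stand-in rather than the real files.

```python
import numpy as np


def validate_split(split, expected_count):
    """Check that a split dict matches the documented layout (hypothetical helper)."""
    images, labels = split["images"], split["labels"]
    if images.shape != (expected_count, 8, 8):
        raise ValueError(f"unexpected images shape: {images.shape}")
    if labels.shape != (expected_count,):
        raise ValueError(f"unexpected labels shape: {labels.shape}")
    # Assumption: labels are integer digit classes in the range 0-9.
    if labels.min() < 0 or labels.max() > 9:
        raise ValueError("labels must be digits 0-9")
    return True


# Synthetic stand-in with the same structure as train.pkl.
rng = np.random.default_rng(0)
fake_train = {
    "images": rng.random((1000, 8, 8)),
    "labels": rng.integers(0, 10, size=1000),
}
ok = validate_split(fake_train, expected_count=1000)
```

Running the same check with `expected_count=200` on the loaded test split would catch a mismatched or truncated file early.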
import pickle

# Load training data
with open('datasets/tinydigits/train.pkl', 'rb') as f:
    data = pickle.load(f)
train_images = data['images']  # (1000, 8, 8)
train_labels = data['labels']  # (1000,)

# Load test data
with open('datasets/tinydigits/test.pkl', 'rb') as f:
    data = pickle.load(f)
test_images = data['images']  # (200, 8, 8)
test_labels = data['labels']  # (200,)
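The two loading steps can be folded into one small function. The sketch below is illustrative only: `load_split` is a hypothetical helper, and the demo writes a temporary pickle with the documented layout so that it runs without the real dataset files. It also shows two common follow-up steps, flattening each 8×8 image into a 64-element vector and counting samples per class.

```python
import os
import pickle
import tempfile

import numpy as np


def load_split(path):
    """Load one split and return (images, labels) as NumPy arrays (hypothetical helper)."""
    with open(path, "rb") as f:
        data = pickle.load(f)  # only unpickle files from a source you trust
    return np.asarray(data["images"]), np.asarray(data["labels"])


# Stand-in file mirroring the documented test.pkl layout (synthetic data).
rng = np.random.default_rng(0)
demo = {"images": rng.random((200, 8, 8)), "labels": np.arange(200) % 10}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump(demo, f)
    images, labels = load_split(demo_path)

# Flatten each 8x8 image to a 64-element feature vector: (200, 8, 8) -> (200, 64).
features = images.reshape(len(images), -1)

# Count samples per digit class; a balanced split has equal counts.
class_counts = np.bincount(labels, minlength=10)
```

With the real files, `load_split('datasets/tinydigits/train.pkl')` would replace the temporary-file demo.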
Educational Infrastructure: Designed for teaching ML systems with real data at edge-device scale.
Following Andrej Karpathy's philosophy: "~1000 samples is the sweet spot for educational datasets."
Created from the sklearn digits dataset (1,797 8×8 grayscale images of handwritten digits):
The sklearn digits dataset itself is derived from the UCI ML hand-written digits datasets.
TinyDigits serves as the first step in a scaffolded learning path:
TinyDigits is curated from the sklearn digits dataset for educational use in TinyTorch.
Original Source:
TinyTorch Curation:
@misc{tinydigits2025,
  title={TinyDigits: Curated Educational Dataset for ML Systems Learning},
  author={TinyTorch Project},
  year={2025},
  note={Balanced subset of sklearn digits optimized for edge deployment}
}
To regenerate this dataset from the original sklearn data:
python3 datasets/tinydigits/create_tinydigits.py
This ensures reproducibility and allows customization for specific educational needs.
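For readers who want to customize the subset, the following sketch shows one way a balanced, seeded split could be drawn. It is not the contents of `create_tinydigits.py`, whose selection logic may differ. `balanced_split` is a hypothetical function, and the per-class counts (100 train, 20 test) are an assumption inferred from the 1000/200 split sizes and the "balanced subset" description. The demo uses synthetic arrays. With scikit-learn installed, the real source arrays are available from `sklearn.datasets.load_digits()` as `.images` and `.target`.

```python
import numpy as np


def balanced_split(images, labels, train_per_class, test_per_class, seed=0):
    """Draw disjoint, class-balanced train/test subsets (hypothetical sketch)."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for digit in np.unique(labels):
        # Shuffle the indices of this class, then carve out train and test slices.
        idx = rng.permutation(np.flatnonzero(labels == digit))
        if len(idx) < train_per_class + test_per_class:
            raise ValueError(f"not enough samples for class {digit}")
        train_idx.extend(idx[:train_per_class])
        test_idx.extend(idx[train_per_class:train_per_class + test_per_class])
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)
    train = {"images": images[train_idx], "labels": labels[train_idx]}
    test = {"images": images[test_idx], "labels": labels[test_idx]}
    return train, test


# Synthetic stand-in for the source data: 10 classes with 150 samples each.
rng = np.random.default_rng(1)
source_labels = np.repeat(np.arange(10), 150)
source_images = rng.random((1500, 8, 8))

train, test = balanced_split(
    source_images, source_labels, train_per_class=100, test_per_class=20
)
```

Changing `seed` or the per-class counts yields a different subset with the same dictionary layout as `train.pkl` and `test.pkl`.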
See LICENSE for details. TinyDigits inherits the BSD 3-Clause license from sklearn.