TinyDigits Dataset

A curated subset of the sklearn digits dataset for rapid ML prototyping and educational demonstrations.

It follows Andrej Karpathy's "~1000 samples" philosophy for educational datasets.

Training: 1000 samples (100 per digit, 0-9)
Test: 200 samples (20 per digit, balanced)
Format: 8×8 grayscale images, float32 normalized [0, 1]
Size: ~310 KB total (vs ~10 MB for MNIST, about 32× smaller)
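
The size figure can be sanity-checked from the shapes and dtype listed above. The sketch below is only back-of-the-envelope arithmetic: it assumes labels are stored as 8-byte integers (the label dtype is not stated here) and ignores the small pickle overhead.

```python
# Rough payload size of TinyDigits from the documented shapes.
BYTES_PER_PIXEL = 4   # float32 images, as documented
BYTES_PER_LABEL = 8   # ASSUMPTION: int64 labels; the actual dtype may differ


def split_bytes(n_samples, height=8, width=8):
    """Raw array bytes for one split (images plus labels)."""
    image_bytes = n_samples * height * width * BYTES_PER_PIXEL
    label_bytes = n_samples * BYTES_PER_LABEL
    return image_bytes + label_bytes


total_bytes = split_bytes(1000) + split_bytes(200)
total_kib = total_bytes / 1024
print(f"{total_bytes} bytes = {total_kib:.1f} KiB")
```

Under these assumptions the arrays come to roughly 309 KiB, which is consistent with the ~310 KB quoted above.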

Files

datasets/tinydigits/
├── train.pkl  # {'images': (1000, 8, 8), 'labels': (1000,)}
└── test.pkl   # {'images': (200, 8, 8), 'labels': (200,)}

Usage

```python
import pickle

# Load training data
with open('datasets/tinydigits/train.pkl', 'rb') as f:
    data = pickle.load(f)
    train_images = data['images']  # (1000, 8, 8)
    train_labels = data['labels']  # (1000,)

# Load test data
with open('datasets/tinydigits/test.pkl', 'rb') as f:
    data = pickle.load(f)
    test_images = data['images']   # (200, 8, 8)
    test_labels = data['labels']   # (200,)
```
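
After loading, it can be worth confirming that the labels really are balanced as documented. The helper names below (`class_counts`, `is_balanced`) are illustrative and not part of TinyTorch; this is a minimal sketch that works on any iterable of integer-like labels.

```python
from collections import Counter


def class_counts(labels):
    """Map each digit class to the number of samples carrying that label."""
    return dict(sorted(Counter(int(y) for y in labels).items()))


def is_balanced(labels, per_class, n_classes=10):
    """True if classes 0..n_classes-1 each appear exactly per_class times."""
    expected = {digit: per_class for digit in range(n_classes)}
    return class_counts(labels) == expected


# With the arrays loaded as shown above, the documented balance would read:
#   is_balanced(train_labels, per_class=100)
#   is_balanced(test_labels, per_class=20)
```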

Purpose

Educational Infrastructure: Designed for teaching ML systems with real data at edge-device scale.

The dataset size follows Andrej Karpathy's philosophy: "~1000 samples is the sweet spot for educational datasets."

Decent accuracy: MLPs reach ~80% test accuracy (vs <20% when trained on only 150 samples)
Fast training: <10 sec on CPU, instant feedback loop
Balanced classes: Exactly 100 training samples per digit (0-9)
Offline-capable: Ships with repo, no downloads needed
USB-friendly: At ~310 KB it fits on any device, even a Raspberry Pi Zero
Real learning curve: Model improves visibly across epochs
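
To make the "visible learning curve" point concrete, here is a minimal NumPy sketch of a training loop that records the loss each epoch. It is a linear softmax classifier, not the MLP behind the accuracy figures above, and it is not TinyTorch code; the function name `train_softmax` and the learning rate and epoch defaults are assumptions chosen for illustration.

```python
import numpy as np


def train_softmax(images, labels, epochs=20, lr=0.1, seed=0):
    """Full-batch gradient descent on a linear softmax classifier.

    Returns (weights, bias, losses), where losses[i] is the mean
    cross-entropy measured just before the update in epoch i.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(images, dtype=np.float32).reshape(len(images), -1)  # flatten 8x8 -> 64
    y = np.asarray(labels, dtype=np.int64)
    n_classes = int(y.max()) + 1
    rows = np.arange(len(y))

    w = rng.normal(0.0, 0.01, size=(x.shape[1], n_classes)).astype(np.float32)
    b = np.zeros(n_classes, dtype=np.float32)
    losses = []

    for _ in range(epochs):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        losses.append(float(-np.log(probs[rows, y] + 1e-12).mean()))

        grad = probs.copy()            # d(loss)/d(logits) = probs - one_hot(y)
        grad[rows, y] -= 1.0
        grad /= len(y)
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)

    return w, b, losses


# With the arrays from the Usage section:
#   w, b, losses = train_softmax(train_images, train_labels)
#   print(losses[0], losses[-1])
```

Printing `losses` after a run is a quick way to see whether the model is improving from epoch to epoch.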

Curation Process

Created from the sklearn digits dataset (1,797 8×8 images; a separate dataset from MNIST, not a downsampled copy of it):

Balanced Sampling: 100 training samples per digit class (1000 total)
Test Split: 20 samples per digit (200 total) from remaining examples
Random Seeding: Reproducible selection (seed=42)
Normalization: Pixels normalized to [0, 1] range
Shuffled: Training and test sets randomly shuffled for fair evaluation

The sklearn digits dataset itself is derived from the UCI ML hand-written digits datasets.
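
The sampling steps above can be sketched as follows. This is not the contents of `create_tinydigits.py`; it is a standard-library illustration of seeded, class-balanced selection, and the function name `balanced_split` is hypothetical. It returns index lists, which could then be used to select rows from the arrays returned by `sklearn.datasets.load_digits()`.

```python
import random
from collections import defaultdict


def balanced_split(labels, n_train=100, n_test=20, seed=42):
    """Return (train_idx, test_idx): disjoint, shuffled, class-balanced index lists."""
    rng = random.Random(seed)  # fixed seed makes the selection reproducible

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[int(label)].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < n_train + n_test:
            raise ValueError(f"class {label} has only {len(pool)} samples")
        rng.shuffle(pool)
        train_idx.extend(pool[:n_train])
        test_idx.extend(pool[n_train:n_train + n_test])  # taken from the remainder

    # Shuffle so samples are not ordered by class.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx
```

Pixel values in the sklearn digits data range from 0 to 16, so dividing by 16.0 is one way to reach the [0, 1] range; whether the generation script normalizes exactly this way is an assumption.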

Why TinyDigits vs Full MNIST?

<table>
<thead>
<tr> <th width="25%">Metric</th> <th width="20%">MNIST</th> <th width="20%">TinyDigits</th> <th width="35%">Benefit</th> </tr>
</thead>
<tbody>
<tr> <td><b>Samples</b></td> <td>60,000</td> <td>1,000</td> <td>60× fewer samples</td> </tr>
<tr> <td><b>File size</b></td> <td>10 MB</td> <td>310 KB</td> <td>32× smaller</td> </tr>
<tr> <td><b>Train time</b></td> <td>5-10 min</td> <td>&lt;10 sec</td> <td>30-60× faster</td> </tr>
<tr> <td><b>Test accuracy (MLP)</b></td> <td>~92%</td> <td>~80%</td> <td>Close enough for learning</td> </tr>
<tr> <td><b>Download</b></td> <td>Network required</td> <td>Ships with repo</td> <td>Always available</td> </tr>
<tr> <td><b>Resolution</b></td> <td>28×28 (784 pixels)</td> <td>8×8 (64 pixels)</td> <td>Faster forward pass</td> </tr>
<tr> <td><b>Edge deployment</b></td> <td>Challenging</td> <td>Perfect</td> <td>Works on RasPi0</td> </tr>
</tbody>
</table>

Educational Progression

TinyDigits serves as the first step in a scaffolded learning path:

TinyDigits (8×8) ← Start here: Learn MLP/CNN basics with instant feedback
Full MNIST (28×28) ← Graduate to: Standard benchmark, longer training
CIFAR-10 (32×32 RGB) ← Advanced: Color images, real-world complexity
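
One way to see how the three steps scale is to compare flattened input sizes and the resulting first-layer parameter count of an MLP. The hidden width of 64 below is a hypothetical choice for illustration, not a TinyTorch default.

```python
def dense_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


HIDDEN = 64  # ASSUMPTION: illustrative hidden width

input_sizes = {
    "TinyDigits": 8 * 8,        # 64 grayscale pixels
    "MNIST": 28 * 28,           # 784 grayscale pixels
    "CIFAR-10": 32 * 32 * 3,    # 3072 values (RGB)
}

first_layer = {name: dense_params(n, HIDDEN) for name, n in input_sizes.items()}

for name, n in input_sizes.items():
    print(f"{name}: {n} inputs, {first_layer[name]} first-layer parameters")
```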

Citation

TinyDigits is curated from the sklearn digits dataset for educational use in TinyTorch.

Original Source:

sklearn.datasets.load_digits()
Derived from UCI ML hand-written digits datasets
License: BSD 3-Clause (sklearn)

TinyTorch Curation:

```bibtex
@misc{tinydigits2025,
  title={TinyDigits: Curated Educational Dataset for ML Systems Learning},
  author={TinyTorch Project},
  year={2025},
  note={Balanced subset of sklearn digits optimized for edge deployment}
}
```

Generation

To regenerate this dataset from the original sklearn data:

```bash
python3 datasets/tinydigits/create_tinydigits.py
```

This ensures reproducibility and allows customization for specific educational needs.

License

See LICENSE for details. TinyDigits inherits the BSD 3-Clause license from sklearn.
