Datasets and Benchmarking

Aeon provides loaders for the standard time series archives, readers and writers for common file formats, and utilities for benchmarking time series algorithms against published results.

From aeon 1.4 onward, most classification and regression archives are hosted on Zenodo (including the relaunched Multiverse multivariate classification archive). Loaders download on first use; cache location follows aeon defaults.

Dataset Loading

Task-Specific Loaders

Classification Datasets:

python

from aeon.datasets import load_classification, load_gunpoint

# Load the standard train/test split
X_train, y_train = load_classification("GunPoint", split="train")
X_test, y_test = load_classification("GunPoint", split="test")

# GunPoint is bundled with aeon, so its dedicated loader is an alternative
# X_train, y_train = load_gunpoint(split="train")

# Load the combined train and test data
X, y = load_classification("GunPoint")

Regression Datasets:

python

from aeon.datasets import load_regression

X_train, y_train = load_regression("Covid3Month", split="train")
X_test, y_test = load_regression("Covid3Month", split="test")

# Bulk download
from aeon.datasets import download_all_regression
download_all_regression()  # Downloads Monash TSER archive

Forecasting Datasets:

python

from aeon.datasets import load_forecasting

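# Load from forecastingdata.org; returns a pandas DataFrame of series.
# Names follow the archive (e.g. "m1_yearly_dataset") and there is no
# return_X_y argument. For the classic airline series use load_airline().
df = load_forecasting("m1_yearly_dataset")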

Anomaly Detection Datasets:

python

from aeon.datasets import load_anomaly_detection

# The name is a (collection, dataset) tuple, e.g. from the TimeEval archive
X, y = load_anomaly_detection(("CalIt2", "CalIt2-traffic"))

File Format Loaders

Load from .ts files:

python

from aeon.datasets import load_from_ts_file

X, y = load_from_ts_file("path/to/data.ts")
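
For orientation, a .ts file is plain text: `@`-prefixed header lines, then one case per line after `@data`, with the class label following the final colon. The sketch below is a deliberately simplified reader for the univariate, equal-length case only. `parse_ts_text` is a hypothetical helper written for illustration, not part of aeon, and `load_from_ts_file` covers much more of the format.

```python
def parse_ts_text(text):
    """Parse a minimal univariate .ts document into (series, labels, header)."""
    header, series, labels = {}, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if not in_data:
            if line.lower() == "@data":
                in_data = True  # everything after this tag is a case
            elif line.startswith("@"):
                key, _, value = line[1:].partition(" ")
                header[key.lower()] = value.strip()
            continue
        # Data line: comma-separated values, then ":" and the class label
        values, _, label = line.rpartition(":")
        series.append([float(v) for v in values.split(",")])
        labels.append(label)
    return series, labels, header


example = """\
# toy problem, for illustration only
@problemName Toy
@timestamps false
@univariate true
@equalLength true
@seriesLength 3
@classLabel true a b
@data
1.0,2.0,3.0:a
4.0,5.0,6.0:b
"""

X, y, meta = parse_ts_text(example)
print(X)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(y)  # ['a', 'b']
```

Multivariate, unequal-length, and timestamped files need the real loader.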

Load from .tsf files:

python

from aeon.datasets import load_from_tsf_file

df, metadata = load_from_tsf_file("path/to/data.tsf")

Load from ARFF files:

python

from aeon.datasets import load_from_arff_file

X, y = load_from_arff_file("path/to/data.arff")

Load from TSV files:

python

from aeon.datasets import load_from_tsv_file

X, y = load_from_tsv_file("path/to/data.tsv")

Load TimeEval CSV:

python

from aeon.datasets import load_from_timeeval_csv_file

X, y = load_from_timeeval_csv_file("path/to/timeeval.csv")

Writing Datasets

Write to .ts format:

python

from aeon.datasets import write_to_ts_file

# path is the output directory; the file is written as output_dir/MyDataset.ts
write_to_ts_file(X, "output_dir", y=y, problem_name="MyDataset")

Write to ARFF format:

python

from aeon.datasets import write_to_arff_file

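# Argument order is X, y, output directory (as in aeon 1.x; check your version)
write_to_arff_file(X, y, "output_dir", problem_name="MyDataset")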

Built-in Datasets

Aeon includes several benchmark datasets for quick testing:

Classification

ArrowHead - Shape classification
GunPoint - Gesture recognition
ItalyPowerDemand - Energy demand
BasicMotions - Motion classification
100+ more from the UCR/UEA archives (downloaded on demand rather than bundled)

Regression

Covid3Month - COVID-19 death rate regression
Various datasets from Monash TSER archive

Segmentation

Time series segmentation datasets
Human activity data
Sensor data collections

Special Collections

RehabPile - Rehabilitation data (classification & regression)

Dataset Metadata

Get information about datasets:

python

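from aeon.datasets import get_dataset_meta_data

# Takes a list of dataset names and returns a pandas DataFrame with one row
# per dataset (the metadata table is downloaded, so this needs network access)
metadata = get_dataset_meta_data(["GunPoint"])
print(metadata)
# GunPoint: 50 train cases, 150 test cases, series length 150, 2 classes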

Benchmarking Tools

Loading Published Results

Access pre-computed benchmark results:

python

from aeon.benchmarking.results_loaders import (
    get_available_estimators,
    get_estimator_results,
)

# Results for given estimators on given datasets,
# returned as {estimator: {dataset: score}}
results = get_estimator_results(
    estimators=["ROCKET"],
    datasets=["GunPoint"],
    measure="accuracy",
)

# Estimators with published results for a task
estimators = get_available_estimators(task="classification")

Resampling Strategies

Create reproducible train/test splits:

python

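from aeon.benchmarking.resampling import stratified_resample_data

# Pools the given train and test sets and redraws the split, keeping the
# original split sizes and class distribution (there is no test_size argument).
# Module path and name as in aeon 1.x; check your installed version.
X_train, y_train, X_test, y_test = stratified_resample_data(
    X_train, y_train, X_test, y_test,
    random_state=42,
)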
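
To make the idea concrete, here is a minimal standard-library sketch of a stratified resample. It assumes the cases are pooled and redrawn so that the train set keeps its original per-class counts. `stratified_resample_sketch` is a hypothetical helper that returns index lists and is not aeon's implementation.

```python
import random
from collections import defaultdict


def stratified_resample_sketch(y_train, y_test, seed):
    """Return (train_idx, test_idx) into the pooled labels y_train + y_test."""
    pooled = list(y_train) + list(y_test)

    # Group pooled case indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(pooled):
        by_class[label].append(idx)

    # Number of cases of each class in the original train set
    train_counts = defaultdict(int)
    for label in y_train:
        train_counts[label] += 1

    rng = random.Random(seed)  # seeded generator makes the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_train = train_counts[label]
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


y_train = ["a", "a", "b"]
y_test = ["a", "b", "b", "b"]
pooled = y_train + y_test
train_idx, test_idx = stratified_resample_sketch(y_train, y_test, seed=42)
print(sorted(pooled[i] for i in train_idx))  # ['a', 'a', 'b']
```

Reusing the same seed reproduces the same split, which is what makes resampled experiments comparable across runs.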

Performance Metrics

Specialized metrics for time series tasks:

Anomaly Detection Metrics:

python

from aeon.benchmarking.metrics.anomaly_detection import (
    range_precision,
    range_recall,
    range_f_score,
    range_roc_auc_score
)

# Range-based metrics for window detection
precision = range_precision(y_true, y_pred, alpha=0.5)
recall = range_recall(y_true, y_pred, alpha=0.5)
f1 = range_f_score(y_true, y_pred, alpha=0.5)
auc = range_roc_auc_score(y_true, y_scores)
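
Range-based metrics treat an anomaly as a contiguous window rather than a set of independent points. The sketch below shows only that change of view, under simple assumptions: it converts 0/1 point labels into half-open ranges and contrasts point-wise recall with the number of true ranges that were hit at all. `labels_to_ranges` and `overlaps` are hypothetical helpers, and this is not how aeon computes its range metrics.

```python
def labels_to_ranges(y):
    """Convert a 0/1 label sequence into half-open (start, end) ranges."""
    ranges, start = [], None
    for i, flag in enumerate(y):
        if flag and start is None:
            start = i  # a new anomalous window opens here
        elif not flag and start is not None:
            ranges.append((start, i))  # window closed just before i
            start = None
    if start is not None:
        ranges.append((start, len(y)))  # window runs to the end
    return ranges


def overlaps(a, b):
    """True if two half-open ranges share at least one point."""
    return a[0] < b[1] and b[0] < a[1]


y_true = [0, 1, 1, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1]

true_ranges = labels_to_ranges(y_true)
pred_ranges = labels_to_ranges(y_pred)
print(true_ranges)  # [(1, 4), (6, 8)]
print(pred_ranges)  # [(2, 3), (6, 8)]

# Point-wise view: fraction of anomalous points that were flagged
point_recall = sum(t and p for t, p in zip(y_true, y_pred)) / sum(y_true)
# Range view: number of true windows touched by any predicted window
ranges_hit = sum(any(overlaps(t, p) for p in pred_ranges) for t in true_ranges)
print(point_recall, ranges_hit)  # 0.6 2
```

Here every true window is detected even though only 60% of the anomalous points are flagged, which is the kind of difference range-based scoring is designed to capture.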

Clustering Metrics:

python

from aeon.benchmarking.metrics.clustering import clustering_accuracy_score

# Clustering accuracy with label matching
accuracy = clustering_accuracy_score(y_true, y_pred)
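
Cluster ids are arbitrary, so clustering accuracy first has to match predicted clusters to true classes. The sketch below illustrates that matching by brute force over all one-to-one assignments, which is only practical for a handful of clusters. `clustering_accuracy_sketch` is a hypothetical helper, not aeon's implementation; production code would normally solve the assignment with the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations


def clustering_accuracy_sketch(y_true, y_pred):
    """Best accuracy over all one-to-one mappings of clusters to classes."""
    true_labels = sorted(set(y_true))
    pred_labels = sorted(set(y_pred))

    # counts[(cluster, cls)] = number of cases in `cluster` whose class is `cls`
    counts = {}
    for t, p in zip(y_true, y_pred):
        counts[(p, t)] = counts.get((p, t), 0) + 1

    # Pad with None so surplus clusters can be left unmatched
    k = max(len(true_labels), len(pred_labels))
    targets = true_labels + [None] * (k - len(true_labels))

    best = 0
    for perm in permutations(targets, len(pred_labels)):
        matched = sum(counts.get((p, t), 0) for p, t in zip(pred_labels, perm))
        best = max(best, matched)
    return best / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 1]  # same grouping, different ids, one mistake
acc = clustering_accuracy_sketch(y_true, y_pred)
print(round(acc, 4))  # 0.8333
```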

Segmentation Metrics:

python

from aeon.benchmarking.metrics.segmentation import (
    count_error,
    hausdorff_error
)

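# y_true and y_pred hold change point locations (indices), not per-point labels

# Absolute difference between the numbers of true and predicted change points
count_err = count_error(y_true, y_pred)

# Largest distance from any change point to its nearest counterpart in the other set
hausdorff_err = hausdorff_error(y_true, y_pred)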
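
Both metrics are easy to state directly. The sketch below gives plain-Python versions under the usual definitions, assuming both inputs are non-empty lists of change point indices. The `_sketch` functions are hypothetical and written for illustration; they are not aeon's code.

```python
def count_error_sketch(true_cps, pred_cps):
    """Absolute difference in the number of change points."""
    return abs(len(true_cps) - len(pred_cps))


def hausdorff_error_sketch(true_cps, pred_cps):
    """Symmetric Hausdorff distance between two sets of change points."""

    def directed(a, b):
        # Worst case over a of the distance to the closest point in b
        return max(min(abs(x - y) for y in b) for x in a)

    return max(directed(true_cps, pred_cps), directed(pred_cps, true_cps))


true_cps = [100, 250, 400]
pred_cps = [110, 240]

print(count_error_sketch(true_cps, pred_cps))      # 1
print(hausdorff_error_sketch(true_cps, pred_cps))  # 160
```

The missed change point at 400 dominates the Hausdorff error, while the count error only records that one change point is missing.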

Statistical Testing

Post-hoc analysis for algorithm comparison:

python

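# Signatures as in aeon 1.x (aeon.benchmarking.stats); check your installed version
from aeon.benchmarking.stats import nemenyi_test, wilcoxon_test

# Nemenyi test for multiple algorithms: takes the average ranks sorted in
# ascending order and the number of datasets, and returns the cliques of
# algorithms that are not significantly different
cliques = nemenyi_test(ordered_avg_ranks, n_datasets, alpha=0.05)

# Pairwise Wilcoxon signed-rank tests: scores_matrix has shape
# (n_datasets, n_estimators); returns a matrix of pairwise p-values
p_values = wilcoxon_test(scores_matrix, labels)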
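
Rank-based post-hoc tests such as Nemenyi operate on each algorithm's average rank across datasets rather than on raw scores. The sketch below shows one way to compute those ranks from a score table, assuming higher scores are better and that ties share the mean of the tied ranks. `average_ranks` is a hypothetical helper and not an aeon function.

```python
def average_ranks(scores, higher_better=True):
    """scores: one row per dataset, one column per algorithm."""
    n_algorithms = len(scores[0])
    totals = [0.0] * n_algorithms
    for row in scores:
        # Algorithm indices ordered from best to worst on this dataset
        order = sorted(range(n_algorithms), key=lambda j: row[j],
                       reverse=higher_better)
        i = 0
        while i < n_algorithms:
            j = i
            # Extend j over the run of algorithms tied with position i
            while j + 1 < n_algorithms and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
            for k in range(i, j + 1):
                totals[order[k]] += shared
            i = j + 1
    return [total / len(scores) for total in totals]


labels = ["A", "B", "C"]
scores = [
    [0.90, 0.85, 0.80],  # dataset 1
    [0.70, 0.75, 0.60],  # dataset 2
    [0.88, 0.88, 0.81],  # dataset 3: A and B tie
]
ranks = average_ranks(scores)
print(dict(zip(labels, ranks)))  # {'A': 1.5, 'B': 1.5, 'C': 3.0}
```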

Benchmark Collections

UCR/UEA Time Series Archives

Access to comprehensive benchmark repositories:

python

# Classification: 112 univariate + 30 multivariate datasets
X_train, y_train = load_classification("Chinatown", split="train")

# Automatically downloads from timeseriesclassification.com

Monash Forecasting Archive

python

from aeon.datasets import load_forecasting

# Returns a pandas DataFrame; names follow the archive's file names
df = load_forecasting("nn5_daily_dataset_without_missing_values")

Published Benchmark Results

Pre-computed results from published comparative studies:

2017 Univariate Bake-off
2021 Multivariate Classification
2023 Univariate Bake-off

Workflow Example

Complete benchmarking workflow:

python

from aeon.datasets import load_classification
from aeon.classification.convolution_based import RocketClassifier
from aeon.benchmarking.results_loaders import get_estimator_results
from sklearn.metrics import accuracy_score

# Load dataset
dataset_name = "GunPoint"
X_train, y_train = load_classification(dataset_name, split="train")
X_test, y_test = load_classification(dataset_name, split="test")

# Train model
clf = RocketClassifier(n_kernels=10000, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Compare with published results, returned as {estimator: {dataset: score}}
published = get_estimator_results(
    estimators=["ROCKET"], datasets=[dataset_name], measure="accuracy"
)
print(f"Published ROCKET accuracy: {published['ROCKET'][dataset_name]:.4f}")

Best Practices

1. Use Standard Splits

For reproducibility, use provided train/test splits:

python

# Good: Use standard splits
X_train, y_train = load_classification("GunPoint", split="train")
X_test, y_test = load_classification("GunPoint", split="test")

# Avoid: ad hoc splits, which cannot be compared with published results
from sklearn.model_selection import train_test_split

X, y = load_classification("GunPoint")
X_train, X_test, y_train, y_test = train_test_split(X, y)

2. Set Random Seeds

Ensure reproducibility:

python

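from aeon.benchmarking.resampling import stratified_resample_data

clf = RocketClassifier(random_state=42)
resampled = stratified_resample_data(X_train, y_train, X_test, y_test, random_state=42)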

3. Report Multiple Metrics

Don't rely on a single metric:

python

from sklearn.metrics import accuracy_score, f1_score, precision_score

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')

4. Cross-Validation

For robust evaluation on small datasets:

python

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    clf, X_train, y_train,
    cv=5,
    scoring='accuracy'
)
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

5. Compare Against Baselines

Always compare with simple baselines:

python

from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier

# Simple baseline: 1-NN with Euclidean distance
baseline = KNeighborsTimeSeriesClassifier(n_neighbors=1, distance="euclidean")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

print(f"Baseline: {baseline_acc:.4f}")
print(f"Your model: {accuracy:.4f}")
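
The baseline itself is simple enough to write out, which is part of why it is a useful sanity check. The sketch below is a plain-Python 1-NN classifier with Euclidean distance for equal-length univariate series stored as lists. `one_nn_predict` is a hypothetical helper for illustration and is far slower than aeon's classifier.

```python
import math


def one_nn_predict(X_train, y_train, x):
    """Return the label of the training series closest to x."""
    distances = [math.dist(series, x) for series in X_train]
    return y_train[distances.index(min(distances))]


X_train = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
y_train = ["low", "low", "high"]
X_test = [[0.2, 0.1, 0.0], [4.0, 4.5, 5.0]]
y_test = ["low", "high"]

predictions = [one_nn_predict(X_train, y_train, x) for x in X_test]
baseline_acc = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(predictions, baseline_acc)  # ['low', 'high'] 1.0
```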

6. Statistical Significance

Test if improvements are statistically significant:

python

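# Paired Wilcoxon signed-rank test from SciPy, which takes two score lists
from scipy.stats import wilcoxon

# Run on multiple datasets. Four are shown only for illustration: an exact
# two-sided test needs at least six paired results before p can fall below 0.05.
accuracies_alg1 = [0.85, 0.92, 0.78, 0.88]
accuracies_alg2 = [0.83, 0.90, 0.76, 0.86]

stat, p_value = wilcoxon(accuracies_alg1, accuracies_alg2)
if p_value < 0.05:
    print("Difference is statistically significant")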
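
The number of datasets limits what a signed-rank test can show. Under the null hypothesis each paired difference is equally likely to be positive or negative, so with n non-zero differences the most extreme outcome (one algorithm wins everywhere) has an exact two-sided p-value of 2 / 2^n. The short sketch below tabulates that lower bound. `min_two_sided_p` is a hypothetical helper name.

```python
def min_two_sided_p(n_pairs):
    """Smallest exact two-sided p-value for n_pairs non-zero differences."""
    return 2 / 2 ** n_pairs


for n in (4, 5, 6, 10):
    print(n, min_two_sided_p(n))
# 4 0.125
# 5 0.0625
# 6 0.03125
# 10 0.001953125
```

With four datasets the test can never reach the 0.05 level, so comparisons of this kind are usually run across a full archive.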

Dataset Discovery

Find datasets matching criteria:

python

# List the UCR/UEA classification problem names known to aeon
# (module path as in aeon 1.x; check your installed version)
from aeon.datasets.tsc_datasets import multivariate, univariate

datasets = sorted(set(univariate) | set(multivariate))
print(f"Found {len(datasets)} classification datasets")

# Filter by properties: keep the univariate problems only
univariate_datasets = sorted(univariate)