A long evaluation can be interrupted: Ctrl-C, OOM, a metric raising, a network blip.
`opik.evaluate_resume(experiment_id, ...)` continues from where the original `evaluate(...)`
stopped: it replays only the runs that didn't finish and keeps the runs that did.

```python
import opik
from opik.evaluation.metrics import Equals


def my_task(item):
    # call_my_model is a placeholder for your own model call.
    return {"output": call_my_model(item["input"])}


result = opik.evaluate_resume(
    experiment_id="<id of the experiment to resume>",
    task=my_task,
    scoring_metrics=[Equals()],
)
```

The returned `EvaluationResult` covers the whole experiment, not just the runs this call
executed. You don't pass `dataset`, `nb_samples`, or `experiment_name`; resume reads them
back from the experiment.
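
The keep-or-replay behaviour described above can be pictured with a small, self-contained model. This is only an illustrative sketch, not Opik's implementation: `Run`, `resume_runs`, and `new_task` are hypothetical names, and a run is assumed to count as finished when it has an output.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Run:
    """Hypothetical stand-in for one experiment run (not an Opik type)."""

    item_input: str
    output: Optional[str] = None  # None marks a run that never finished


def resume_runs(runs: List[Run], task: Callable[[str], str]) -> List[Run]:
    """Return every run of the experiment, calling task only where needed."""
    result = []
    for run in runs:
        if run.output is not None:
            result.append(run)  # finished earlier: keep the original output
        else:
            result.append(Run(run.item_input, task(run.item_input)))  # replay
    return result


calls: List[str] = []


def new_task(text: str) -> str:
    calls.append(text)  # record which inputs were actually re-executed
    return text.upper()


interrupted = [Run("a", "old-a"), Run("b"), Run("c", "old-c")]
resumed = resume_runs(interrupted, new_task)
print([r.output for r in resumed])  # ['old-a', 'B', 'old-c']
print(calls)  # ['b']
```

Only the unfinished run `"b"` goes through the new task, while the returned list still covers all three runs. This mirrors how the result of a resume spans the whole experiment.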

- […] `trial_count > 1` all replay.
- `EvaluationResult.test_results` covers both the kept runs and the freshly replayed ones.

`evaluate_resume` is the wrong tool in these cases:

- To score existing runs without re-running the task, use `opik.evaluate_experiment(...)`.
- If the dataset has grown, run `evaluate()` against the larger dataset.
- Resume does not check for changes to the `task` implementation or the metrics between calls. Providing the
  same `task` and `scoring_metrics` you used originally is the caller's responsibility.
  Resume calls your new task and runs your new metrics only for the missing runs;
  already-completed runs keep their original outputs and feedback scores. If the change
  should affect every run, start a fresh `evaluate()`.

To call `evaluate_resume`, the experiment must have been created by:

- […] `evaluate(...)` call against a versioned dataset.

If either condition isn't met, `evaluate_resume` raises
`opik.exceptions.ExperimentNotResumable`.
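
One way to deal with an experiment that cannot be resumed is to catch the exception and fall back to a fresh run. The sketch below is a generic pattern under stated assumptions: `resume_or_rerun` is a hypothetical helper, and the demo uses a stand-in exception so it runs without Opik installed.

```python
from typing import Any, Callable, Tuple, Type


def resume_or_rerun(
    resume: Callable[[], Any],
    rerun: Callable[[], Any],
    not_resumable: Tuple[Type[BaseException], ...],
) -> Any:
    """Try resume(); if it raises one of not_resumable, call rerun() instead."""
    try:
        return resume()
    except not_resumable:
        return rerun()


# Stand-in demo: FakeNotResumable plays the role of the real exception.
class FakeNotResumable(Exception):
    pass


def failing_resume() -> str:
    raise FakeNotResumable("experiment cannot be resumed")


outcome = resume_or_rerun(failing_resume, lambda: "fresh run", (FakeNotResumable,))
print(outcome)  # fresh run
```

With Opik, `resume` would presumably wrap the `opik.evaluate_resume(...)` call and `not_resumable` would be `(opik.exceptions.ExperimentNotResumable,)`. Whether an automatic fresh `evaluate()` is the right fallback depends on your workflow, since it starts a new experiment rather than completing the old one.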

If the original `evaluate(...)` used a custom `dataset_sampler` or explicit
`dataset_item_ids`, resume also needs a local checkpoint that was written next to the
experiment id. Run resume from the same machine that ran the original call; otherwise
`opik.exceptions.LocalCheckpointMissing` is raised. Evaluations without a sampler or
explicit ids do not need to run on the same machine.
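
If you want to spot the same-machine constraint before calling resume, you could keep your own note of where the original evaluation ran. The sketch below is user-side bookkeeping, not an Opik feature: `record_origin`, `can_resume_here`, the JSON layout, and the `"exp-123"` id are all hypothetical. The hostname is only a rough proxy for "same machine".

```python
import json
import socket
import tempfile
from pathlib import Path


def record_origin(path: Path, experiment_id: str, custom_selection: bool) -> None:
    """Save which host ran evaluate() and whether it used a sampler or explicit ids."""
    note = {
        "experiment_id": experiment_id,
        "host": socket.gethostname(),
        "custom_selection": custom_selection,
    }
    path.write_text(json.dumps(note))


def can_resume_here(path: Path) -> bool:
    """Return True if resume is expected to work from the current machine."""
    note = json.loads(path.read_text())
    if not note["custom_selection"]:
        return True  # no sampler or explicit ids: any machine will do
    return note["host"] == socket.gethostname()


with tempfile.TemporaryDirectory() as tmp:
    note_path = Path(tmp) / "origin.json"
    record_origin(note_path, "exp-123", custom_selection=True)
    same_host_ok = can_resume_here(note_path)

    # Simulate a note that was written on a different machine.
    foreign = {
        "experiment_id": "exp-123",
        "host": socket.gethostname() + "-other",
        "custom_selection": True,
    }
    note_path.write_text(json.dumps(foreign))
    other_host_ok = can_resume_here(note_path)

print(same_host_ok, other_host_ok)
```

A matching hostname does not guarantee that the local checkpoint still exists, so handling `opik.exceptions.LocalCheckpointMissing` remains necessary.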