changedetectionio/model/PYDANTIC_MIGRATION.md
Plan for incrementally moving the app's storage dicts behind Pydantic models. Driven by security (CWE-915 mass-assignment, see GHSA-h3x5-5j56-hm2j) and schema enforcement, not just type tidying.
Every form/API endpoint that mutates a stored dict should validate input against a
declared schema before writing. extra='forbid' rejects unknown keys — so an attacker
POSTing extra fields like uuid=…, last_checked=…, history=[…] can't smuggle them
into storage. Per-route allowlists work but rot; one declared schema per stored shape
doesn't.
If you're about to add a compatibility shim, an alias, a backward-compat fallback, or a
"handle both old and new shape" branch — stop and ask whether a one-time update_N
migration solves the same problem by renaming the stored data. A migration runs once
per install; the shim lives in the code forever and every future contributor has to
understand it.
Concrete example from this PR: the original design used Field(alias='llm_X') so
Pydantic could accept both the legacy form-field name (llm_model) and the new
storage name (model). That alias survived every read/write for the life of the app
and introduced a subtle model_dump(by_alias=True) merge bug. The simpler answer was
to rename the form fields to match the storage names (an in-PR rename, no migration
needed since storage was new), drop the aliases entirely, and delete ~25 lines of
plumbing. Pay once with a migration; don't pay forever with complexity.
Same principle applies the moment you find yourself writing dict.get(new_key) or dict.get(old_key). That's a migration in disguise — write the migration instead.
There are two ways to use Pydantic. Pick one per slice — they are not interchangeable.
Pydantic-as-validator (what we do). Storage stays a plain dict. A BaseModel
validates input at the boundary, dumps back to a dict. No call-site changes; the
existing watch['x'] dict access keeps working everywhere.
Pydantic-as-domain-model. Replace dict inheritance with BaseModel. ~190 call
sites switch from watch['x'] to watch.x. Much bigger blast radius, defers the
security win. Not what we're doing right now.
The CWE-915 fix only needs the validator pattern. Domain-model replacement is a separate, later project.
The first migrated slice. Use as the reference for the next one.
Match the WTForms field names to the storage / Pydantic field names so the
form-input dict and the storage dict have the same key shape. No aliases, no
populate_by_name=True, no by_alias=True merge gymnastics. Only reach for
Field(alias=…) if you genuinely cannot rename the form field (rare).
model/LLMSettings.py:
class LLMSettings(BaseModel):
model_config = ConfigDict(extra='forbid')
enabled: bool = True
model: str = ''
...
# System-managed counters
tokens_total_cumulative: int = 0
...
# Field groups
CONNECTION_FIELDS: ClassVar[Tuple[str, ...]] = ('model', 'api_key', ...)
PROTECTED_FIELDS: ClassVar[Tuple[str, ...]] = ('tokens_total_cumulative', ...)
Boundary pattern at the route handler:
# Read
settings = LLMSettings.model_validate(
datastore.data['settings']['application'].get('llm') or {}
)
# Merge form input
form_input = dict(form.data.get('llm') or {})
for protected in LLMSettings.PROTECTED_FIELDS:
form_input.pop(protected, None) # counters never come from form
merged = LLMSettings.model_validate({**settings.model_dump(), **form_input})
# Write — re-validates the schema on every write
datastore.data['settings']['application']['llm'] = merged.model_dump()
Two decisions need answers before the WatchInput slice. They're not blockers for App.py.
Today: docs/api-spec.yaml declares the Watch/Tag shape; model/schema_utils.py reads
it to compute readonly fields; the API layer validates against it; the model layer is a
plain dict that doesn't know about either. When WatchInput lands, that's a third
shape declaration.
Two ways to live:
api-spec.yaml from the model
(e.g. via model_json_schema()). One declaration, multiple consumers. Long-term
right answer; needs tooling.Recommendation: start parallel (keep api-spec.yaml for now), but write Watch's
Pydantic model so it could be the eventual single source. Don't invent a new
field shape — match the spec.
processor_config_restock_diff (and future processor configs) are written by
plugins, not the core. extra='forbid' on a Watch input model would reject them.
Options:
<Processor>Settings Pydantic
model; Watch input validates only core fields, processor configs validate
separately at their own boundary (the per-watch restock_diff.json, etc.).processor_config_* as a
declared dict-typed field. Loses per-key validation but preserves the
plugin-extensibility contract.Recommendation: per-processor sub-models. Matches the file split already done in
update_30 (separate restock_diff.json per watch).
| Target | Difficulty | Value | Status |
|---|---|---|---|
LLMSettings | low | medium | done (this PR) |
App.py → AppSettings (nested) | low | medium | next |
WatchInput (form/API validator) | medium | HIGH — closes GHSA-h3x5-5j56-hm2j | next-next |
TagInput (form/API validator) | medium | medium | after Watch |
watch_base(dict) → BaseModel | very high | high | separate multi-PR project, much later |
Tags.py (TagsDict), persistence.py, schema_utils.py are not data models — leave alone.
App.py. Pure dict tree under settings.{application,requests,headers}. Define
nested BaseModels; LLMSettings slots in as the existing sub-tree. No call-site
churn — just the global settings POST handler. Sets the pattern for nested models.
WatchInput BaseModel for blueprint/ui/edit.py:225 and api/Watch.py. Replace:
datastore.data['watching'][uuid].update(form.data) # CWE-915
with:
validated = WatchInput.model_validate(form.data)
datastore.data['watching'][uuid].update(validated.model_dump())
Closes the unpatched advisory. Should be a security-tagged commit referencing the GHSA.
TagInput BaseModel — same pattern, smaller.
These cost real debugging time in the LLMSettings PR. Worth knowing before the next slice.
extra='forbid' is the right defaultextra='ignore' silently drops unknowns and hides developer mistakes (add a form field,
forget to declare it on the model, your feature appears to work until you reload). forbid
fails loudly. allow defeats the purpose entirely — it's how injection succeeds.
The LLMSettings PR originally used alias='llm_X' to bridge llm_-prefixed WTForms
names to stripped storage names. That created a documented gotcha: with
extra='forbid', having both model and llm_model in the same input dict is a
ValidationError, and merging existing-storage-dump with form input required
by_alias=True to keep both sides on the alias shape. We fixed it by renaming the
form fields to match the storage field names. Match the form to the model
upfront and you avoid the whole class of merge bugs.
If runtime code (e.g. a token accumulator) writes to the storage dict directly, the
schema is bypassed. Load → mutate instance attributes → model_dump() → write back.
This re-validates on every write and prevents drift.
You might be tempted to validate form input strictly but allow extras in storage
hydration. Don't — extra='forbid' everywhere means storage drift is impossible. If
something put unknown keys in storage, you want loud failure, not silent acceptance.
for k in list(d) if k.startswith('llm_') is shorter than an explicit list but
silently catches any future flat llm_* key. Migrations are forever — prefer an
explicit allowlist of keys to move, even if it's verbose.
dump_without_connection(), clear_X()) when stock
model_dump(exclude=set(FIELDS)) works. The standard idiom is more readable and
zero-line.get_X_config() return a Pydantic instance if callers do dict-style access.
Either migrate all call sites (high-touch) or keep returning a dict and let the model
be the validation/dump layer only.model_copy(update=...) without re-validating. It doesn't coerce types or
enforce extra='forbid'. Use model_validate({**old.model_dump(), **updates}) for
strict merges.Each migration PR should ship:
model/<Thing>Settings.py (or input model) — declared schema, extra='forbid',
field aliases if there's a name mismatch between form and storage.store/updates.py:update_N if the storage shape changes. Pure dict-shuffling, no
Pydantic import (migrations should not depend on the model — model evolves
independently).tests/unit/test_<thing>.py — unit coverage of the model itself: defaults,
alias merge, type coercion, extra='forbid' rejection, dump shapes.get_<thing>_settings(datastore) or
equivalent, not raw dict reads.model/LLMSettings.py — the templatetests/unit/test_llm_settings.py — model unit-test templatestore/updates.py:update_31 — schema migration templateblueprint/settings/__init__.py (POST handler) — boundary-validation templatellm/evaluator.py:accumulate_global_tokens — instance-mutation-then-dump-back template