mem0-test-integration

skills/mem0-test-integration/SKILL.md

Verifies what /mem0-integrate produced. Runs in the same workspace, on the same feature branch. Loose coupling — fast, catches compile and runtime bugs, does not catch logical errors.

Canonical sources (use these, not ambient knowledge)

All static checks and smoke-test shapes validate against these URLs. WebFetch each before running step 3.

Read the Delegated skill: field in .mem0-integration/plan.md — if it names a skill URL, fetch that skill and use its example blocks as the reference for both static checks (step 3) and the smoke test (step 5).

Non-invasiveness contract

Every check in this skill assumes the integration is additive and feature-flagged (see /mem0-integrate "Integration principles"). Specifically:

  • product.json must contain a feature_flag field.
  • Steps 4–6 run in two passes:
    • Pass A — flag unset. All pre-existing tests must pass, smoke/E2E skip. The repo must behave like main. Any failure here is a hard fail — do not let the self-heal loop attempt a patch.
    • Pass B — flag set. New tests must pass, smoke and E2E run.
  • If Pass A fails, the scorecard marks non_invasive: false and sets overall: fail with a distinct reason code the integrator's heal loop refuses to touch.
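The two-pass rule above can be sketched as follows. This is a minimal illustration, not the skill's implementation: the helper names (`pass_envs`, `run_pass`, `verdict`) are hypothetical, and setting the flag to `"1"` for Pass B is an assumption, since the real value comes from product.json.

```python
import subprocess


def pass_envs(base_env, flag_name, on_value="1"):
    """Build the two environments: Pass A (flag unset) and Pass B (flag set).

    on_value="1" is an assumption; the actual value comes from product.json.
    """
    pass_a = {k: v for k, v in base_env.items() if k != flag_name}
    pass_b = dict(pass_a)
    pass_b[flag_name] = on_value
    return pass_a, pass_b


def run_pass(cmd, env):
    """Run a test command under the given environment; True when it exits 0."""
    return subprocess.run(cmd, env=env).returncode == 0


def verdict(pass_a_ok, pass_b_ok):
    """Pass A alone decides non_invasive; either pass failing fails the run."""
    return {
        "non_invasive": pass_a_ok,
        "overall": "pass" if (pass_a_ok and pass_b_ok) else "fail",
    }
```

Deriving both environments from one base keeps everything except the flag identical between passes, so any difference in behavior can be attributed to the flag.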

Preconditions

Refuse to start unless ALL of the following are true:

  • .mem0-integration/ directory exists in the repo root.
  • .mem0-integration/product.json, goal.md, and plan.md are readable and internally consistent (JSON parses, docs non-empty).
  • Current branch name begins with mem0-integrate/ (set by the companion skill). Prevents accidental runs on unrelated branches.
  • Working tree is clean. The skill never modifies source files; any dirty state means the integration is mid-edit and not ready to verify.
  • The same API key the integration used is available in the environment (MEM0_API_KEY for Platform, OPENAI_API_KEY for OSS — read which from product.json). Interactive mode asks if missing; CI mode exits 2.

Exit with a written rationale on any precondition failure. Never attempt to "fix up" state.
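A minimal sketch of the file, branch, and working-tree checks, assuming the branch name and `git status --porcelain` output are gathered by the caller. The function names are hypothetical, and the API-key check is left out because it depends on the mode.

```python
import json
import subprocess
from pathlib import Path


def check_preconditions(repo: Path, branch: str, porcelain: str) -> list[str]:
    """Return human-readable failures; an empty list means it is safe to start."""
    failures = []
    meta = repo / ".mem0-integration"
    if not meta.is_dir():
        failures.append(".mem0-integration/ directory is missing")
    else:
        try:
            json.loads((meta / "product.json").read_text())
        except (OSError, json.JSONDecodeError) as exc:
            failures.append(f"product.json unreadable or invalid: {exc}")
        for name in ("goal.md", "plan.md"):
            doc = meta / name
            if not doc.is_file() or not doc.read_text().strip():
                failures.append(f"{name} is missing or empty")
    if not branch.startswith("mem0-integrate/"):
        failures.append(f"branch {branch!r} does not begin with mem0-integrate/")
    if porcelain.strip():
        failures.append("working tree is dirty")
    return failures


def git(repo: Path, *args: str) -> str:
    """Run a git command in the repo and return its stdout."""
    return subprocess.run(
        ["git", *args], cwd=repo, capture_output=True, text=True, check=True
    ).stdout
```

A caller would pass `git(repo, "rev-parse", "--abbrev-ref", "HEAD").strip()` as the branch and `git(repo, "status", "--porcelain")` as the tree state, then print every failure and exit without touching anything.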

Pipeline

1. Read the contract

Load:

  • product.json → which language, which product (Platform vs OSS), which mem0 version, write_site, read_site.
  • plan.md → the mechanical contract (write pattern, read pattern, preserved behavior).
  • goal.md → the intent (displayed in the scorecard only; not tested).

2. Install dependencies

Route by language from product.json:

| Language | Command |
| --- | --- |
| Python | pip install -e . if editable, else pip install -r requirements.txt. Then pip install mem0ai if not already present at the pinned version. |
| TypeScript / JavaScript | npm install (or pnpm install / yarn install if detected by lockfile). |

If install fails → exit code 2 with stderr tail. Never move to testing if dependencies don't resolve.
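The routing above might look like the sketch below. `install_command` is a hypothetical helper, and treating the presence of pyproject.toml or setup.py as "editable" is an assumption, since this skill does not define how editability is detected.

```python
from pathlib import Path


def install_command(repo: Path, language: str) -> list[str]:
    """Pick the dependency-install command for the repo."""
    if language == "python":
        # Assumption: a pyproject.toml or setup.py means the package is
        # installable in editable mode.
        if (repo / "pyproject.toml").exists() or (repo / "setup.py").exists():
            return ["pip", "install", "-e", "."]
        return ["pip", "install", "-r", "requirements.txt"]
    # TypeScript / JavaScript: the lockfile decides the package manager.
    if (repo / "pnpm-lock.yaml").exists():
        return ["pnpm", "install"]
    if (repo / "yarn.lock").exists():
        return ["yarn", "install"]
    return ["npm", "install"]
```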

3. Static sanity checks (fast, local, no API calls)

  • Import check: does the write-site file import the expected Mem0 surface? Authoritative list comes from ## Identify the User's Setup in https://docs.mem0.ai/llms.txt:

    • Platform Python → from mem0 import MemoryClient
    • Platform TS → import MemoryClient from "mem0ai"
    • OSS Python → from mem0 import Memory
    • OSS TS → import { Memory } from "mem0ai/oss"

    If plan.md names a delegated skill (e.g., Vercel AI), use that skill's import signature instead of the list above. Mismatch → fail with line number.

  • Version check: installed mem0ai version falls in the range from this skill's mem0_tested_versions. Out of range → warn but continue.

  • Type check (TS tracks only): run tsc --noEmit or tsup --dts. Non-zero → fail.

  • Lint (if the repo has a linter configured): run the repo's own lint command. Lint failures from this skill's changes → fail; pre-existing lint failures → surface as a warning.

  • Eager-init check: grep the write_site and read_site files (paths from product.json) for MemoryClient( or Memory( at module scope — i.e., not inside a function, method, or class body. MemoryClient() validates the API key in __init__ (network call) and OSS Memory() can eagerly initialize embedding/LLM providers — module-level instantiation hits the wire on import and breaks Pass A's test collection whenever the key is unset. Hit → fail with file:line and the lazy-init guidance from /mem0-integrate step 8 constraint #7.
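For Python files, the eager-init check can be done on the syntax tree instead of with grep, which avoids false hits in comments and strings. The sketch below is one possible approach, not the skill's own code; `find_eager_inits` is a hypothetical name, and it matches on the callee name only, so an unrelated class called `Memory` would also be flagged.

```python
import ast

TARGETS = {"MemoryClient", "Memory"}
# Bodies of these nodes run later, not at import time.
DEFERRED = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Lambda)


def _callee_name(call: ast.Call):
    """Return the called name for `Name(...)` or `module.Name(...)` calls."""
    func = call.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def find_eager_inits(source: str, filename: str = "<source>"):
    """Return (filename, line) for each client construction outside a def or class."""
    hits = []

    def visit(node):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, DEFERRED):
                continue  # inside a function, method, or class body: not eager
            if isinstance(child, ast.Call) and _callee_name(child) in TARGETS:
                hits.append((filename, child.lineno))
            visit(child)

    visit(ast.parse(source, filename))
    return hits
```

Skipping class bodies follows the wording of the check above, even though class-level statements do execute on import; decorators and default arguments are also not inspected in this sketch.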

4. Run the repo's native test suite (two passes)

| Language | Test command (in priority order) |
| --- | --- |
| Python | pytest with the test files from step 5 of the companion skill, else python -m unittest discover. |
| TypeScript / JavaScript | npm test if defined in package.json; else auto-detect vitest or jest. |

Pass A — feature_flag unset. Run the entire pre-existing suite (excluding the new test_mem0_* files). Must be 100% green. Any failure here marks non_invasive: false in the scorecard and is a hard fail — the integrator's self-heal loop refuses to touch it.

Pass B — feature_flag set (value from product.json). Run the full suite including the new tests. All must pass.

Isolate integration-introduced failures using git diff main..HEAD --name-only. A test file that exists on main and fails only under the integration branch (flag set or unset) counts against the scorecard regardless of pass. A test file that already failed on main is surfaced as pre_existing_unrelated and does not count — but is still reported so the user can clean it up.

Capture output to .mem0-integration/test-stdout-flag-off.log and .mem0-integration/test-stdout-flag-on.log. Scorecard reports pass/fail per pass.
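The isolation rule can be expressed as a small classification step. This sketch takes the set of test files that already fail on main as an input, because how that set is obtained is not specified here; `classify_failures` is a hypothetical helper.

```python
def classify_failures(failed_on_branch: set, failed_on_main: set) -> dict:
    """Split failing test files into scorecard-relevant and pre-existing ones."""
    report = {"integration_introduced": [], "pre_existing_unrelated": []}
    for test_file in sorted(failed_on_branch):
        if test_file in failed_on_main:
            # Already red on main: reported, but does not count against the run.
            report["pre_existing_unrelated"].append(test_file)
        else:
            # Green on main (or new on this branch) and red here: counts.
            report["integration_introduced"].append(test_file)
    return report
```

The same function applies to both passes: call it once with Pass A failures and once with Pass B failures.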

5. Smoke test (real API call, shortest round-trip)

Scripted end-to-end flow tailored to product.json. The call shapes below are the minimal ones; if plan.md names a delegated skill, use that skill's minimal example verbatim instead — it is the canonical shape for the detected stack.

Platform (Python):

import os
from mem0 import MemoryClient

c = MemoryClient()                               # uses MEM0_API_KEY
uid = f"mem0-test-integration-{os.urandom(4).hex()}"  # disposable user
c.add([{"role": "user", "content": "I prefer aisle seats"}], user_id=uid)
hits = c.search("seat preference", user_id=uid)
# Some SDK versions wrap hits as {"results": [...]}; accept either shape.
results = hits["results"] if isinstance(hits, dict) else hits
assert any("aisle" in h.get("memory", "") for h in results), hits
c.delete_all(user_id=uid)                        # clean up

Platform (TS): same shape with MemoryClient from "mem0ai".

OSS (Python / TS): uses Memory() / new Memory() with default config (OpenAI LLM via OPENAI_API_KEY, local Qdrant). If the repo ships a docker-compose.yml with a Qdrant service, the skill starts it first and tears it down after. If no backing store is reachable → fail with a clear message naming the fix.

The smoke test always uses a disposable random user_id prefixed with mem0-test-integration- so a failed cleanup doesn't pollute the user's real data. A background tidy step deletes any prefix-matching entries older than 24 hours on the next run.

Capture output to .mem0-integration/smoke-stdout.log.

6. E2E integration test (run the app, exercise the flow)

Unit tests + smoke prove the SDK works in isolation. This step is the real signal: does memory actually appear in the app's user-visible output when the integration runs end-to-end?

Requires plan.md to contain an E2E recipe: section (authored by /mem0-integrate step 5). If absent → status skipped (not fail), note in scorecard that the repo has no runnable entry point.

Recipe fields the skill reads:

  • start — shell command to launch the app using $PORT for any network port. Run in background with stdout/stderr teed to .mem0-integration/e2e-app.log.
  • ready_probe — how to detect readiness. url=... status=... polls an HTTP endpoint; log="..." waits for a substring in e2e-app.log; sleep=N waits N seconds (last resort). 60-second hard timeout.
  • compose_services — optional. If set, bring them up via docker compose up -d <services> before start, tear them down with docker compose down at the end.
  • write_call — triggers the Mem0 write path exactly once. Output is captured and surfaced on failure. 60-second hard timeout.
  • write_async_wait_ms — pause after write_call to let async memory flushes land. Default 0.
  • read_call — triggers the Mem0 read path. Typically a fresh session or new request that should surface the stored memory.
  • read_assert — substring, regex=..., or jsonpath=<expr>=<value> that must appear in read_call's stdout. This is the E2E pass gate.
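As an illustration of the pass gate, the sketch below evaluates the substring and regex= forms of read_assert against captured stdout. `evaluate_read_assert` is a hypothetical helper; the jsonpath= form is deliberately left unimplemented because it needs a JSONPath library and the choice of library is not stated here.

```python
import re


def evaluate_read_assert(spec: str, stdout: str) -> bool:
    """Return True when read_call's stdout satisfies the read_assert spec."""
    if spec.startswith("regex="):
        pattern = spec[len("regex="):]
        return re.search(pattern, stdout) is not None
    if spec.startswith("jsonpath="):
        # Not covered by this sketch: requires a JSONPath implementation.
        raise NotImplementedError("jsonpath assertions are not handled here")
    # Default form: plain substring match.
    return spec in stdout
```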

Execution order:

  1. Allocate an ephemeral TCP port; export as PORT.
  2. Set MEM0_USER_ID to a disposable mem0-test-integration-<rand> value and export it, so the app can use the same scoping the smoke test does if the recipe wants cleanup.
  3. Bring up compose_services if named.
  4. Run start in the background.
  5. Poll ready_probe until success or 60s timeout. Timeout → fail.
  6. Run write_call. Non-zero exit → fail (but continue to cleanup).
  7. Sleep write_async_wait_ms.
  8. Run read_call.
  9. Evaluate read_assert against read_call's stdout. Miss → fail.
  10. Cleanup (always, even on failure): SIGTERM the app, SIGKILL after 5s, docker compose down if services were started, delete_all memories matching mem0-test-integration-* on Platform scenarios.
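Steps 1 and 10 are the parts most easily gotten wrong, so here is a minimal sketch of both, assuming a POSIX host. The helper names are hypothetical. Binding to port 0 lets the OS choose a free port; there is a small window in which another process could claim it before the app binds.

```python
import socket
import subprocess


def allocate_port() -> int:
    """Ask the OS for a free TCP port on localhost and release it for the app."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def stop_app(proc: subprocess.Popen, grace_s: float = 5.0) -> int:
    """Send SIGTERM, wait up to grace_s seconds, then SIGKILL if still running."""
    proc.terminate()
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()
```

Calling `stop_app` from a `finally` block around steps 5 to 9 is one way to guarantee that cleanup runs even when an earlier step fails.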

On any failure, the scorecard includes:

  • Last 40 lines of e2e-app.log
  • Full write_call output
  • Full read_call output
  • The expected vs actual for read_assert

7. Scorecard

Write .mem0-integration/scorecard.md and .mem0-integration/scorecard.json:

{
  "timestamp": "2026-04-20T14:03:11Z",
  "branch": "mem0-integrate/remember-user-preferences",
  "product": "platform",
  "language": "python",
  "mem0_version": "2.0.0",
  "non_invasive": true,
  "feature_flag": "MEM0_ENABLED",
  "results": {
    "install":      {"status": "pass", "duration_ms": 12043},
    "static_checks":{"status": "pass", "duration_ms": 812},
    "unit_tests_flag_off": {"status": "pass", "duration_ms": 3920, "count": 47,
                            "reason": "all pre-existing tests green with flag unset"},
    "unit_tests_flag_on":  {"status": "pass", "duration_ms": 4321, "count": 49},
    "smoke_test":   {"status": "pass", "duration_ms": 2890, "memory_id": "mem_..."},
    "e2e_test":     {"status": "pass", "duration_ms": 14200,
                     "ready_probe_ms": 3100, "write_exit": 0,
                     "read_assert_matched": true}
  },
  "friction": {
    "dependency_install_retries": 0,
    "pre_existing_test_failures": 0,
    "warnings": ["mem0ai 2.0.0 pinned; consider 2.0.1 for fix X"]
  },
  "overall": "pass"
}

The markdown version is human-readable and includes:

  • Goal doc + plan doc reprinted at top (so reviewers don't have to hunt).
  • Each check with pass/fail + log excerpt.
  • Friction summary.
  • Verbatim warnings from mem0 SDK (if any — e.g., deprecated field usage).
  • Explicit "NOT checked" section listing what loose coupling misses: "Whether the stored data is what the user wants stored. Whether search runs at the right moment. Whether user_id matches the actual session scope. Human review required."

8. Report + exit

  • Print the scorecard path + overall pass/fail to stdout.
  • Do not commit the scorecard files. They live in .mem0-integration/, which is gitignored. The user can inspect and optionally pin.
  • On fail: print the first failing step's log tail (last 40 lines) and stop. Do not attempt to fix anything.

Artifacts (all under .mem0-integration/)

| File | Purpose | Retention |
| --- | --- | --- |
| scorecard.md | Human-readable verdict. | Overwritten per run. |
| scorecard.json | Machine-readable verdict. Consumed by the CI scorecard workflow later. | Overwritten per run. |
| test-stdout-flag-off.log | Step 4 Pass A (pre-existing suite, flag unset). | Overwritten per run. |
| test-stdout-flag-on.log | Step 4 Pass B (full suite, flag set). | Overwritten per run. |
| smoke-stdout.log | Full output from step 5. | Overwritten per run. |
| e2e-app.log | Background app stdout/stderr from step 6. | Overwritten per run. |
| e2e-calls.log | write_call + read_call invocations and outputs. | Overwritten per run. |

Modes

| Mode | Trigger | Behavior |
| --- | --- | --- |
| Interactive (default) | TTY present, MEM0_TEST_CI unset | Asks for missing keys, prints friendly summaries. |
| CI | MEM0_TEST_CI=1 | Keys must be in env, no prompts, non-zero exit on any fail. JSON scorecard goes to stdout's tail for workflow parsing. |

Invocation

/mem0-test-integration                       # interactive, all steps
/mem0-test-integration --ci                  # non-interactive
/mem0-test-integration --skip-smoke          # no API calls, no E2E
/mem0-test-integration --skip-e2e            # unit + smoke only (faster CI)
/mem0-test-integration --only-smoke          # just smoke
/mem0-test-integration --only-e2e            # just E2E (assumes deps installed)

Composition: --skip-* can stack (--skip-smoke --skip-e2e = static + unit only, zero API cost). --only-* is mutually exclusive with all other flags.
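The composition rule can be illustrated with argparse. This is only a sketch of the flag logic: the skill is invoked as a slash command, so the parser itself is hypothetical. It follows the literal reading that an --only-* flag cannot be combined with any other flag, including --ci.

```python
import argparse


def parse_args(argv):
    """Parse the flags and enforce the --skip-* / --only-* composition rule."""
    parser = argparse.ArgumentParser(prog="mem0-test-integration")
    for flag in ("--ci", "--skip-smoke", "--skip-e2e", "--only-smoke", "--only-e2e"):
        parser.add_argument(flag, action="store_true")
    args = parser.parse_args(argv)

    only = [args.only_smoke, args.only_e2e]
    others = [args.ci, args.skip_smoke, args.skip_e2e]
    # --skip-* flags may stack; an --only-* flag must stand alone.
    if any(only) and (sum(only) > 1 or any(others)):
        parser.error("--only-* is mutually exclusive with all other flags")
    return args
```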

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All checks passed. |
| 1 | Precondition failed (no .mem0-integration/, wrong branch, dirty tree). |
| 2 | Missing env key (CI mode) or dependency install failure. |
| 3 | Static sanity check failed (wrong import, type error). |
| 4 | Unit tests failed (Pass B — integration itself broken). |
| 5 | Smoke test failed. |
| 6 | E2E test failed (ready_probe timeout, write/read call failed, or read_assert miss). |
| 7 | Non-invasiveness violation: Pass A failed (pre-existing tests broke). Integrator's heal loop refuses to touch this. |
| 8 | Internal error (skill bug — report it). |
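One way to derive the exit code from the scorecard's results object is sketched below. It assumes the first failing step in pipeline order decides the code, which matches the "stop at the first failing step" behavior in step 8. Codes 1 and 8 are raised outside the results object and are not covered; `exit_code` is a hypothetical helper.

```python
# (results key, exit code) in pipeline order.
PIPELINE = [
    ("install", 2),
    ("static_checks", 3),
    ("unit_tests_flag_off", 7),  # Pass A: non-invasiveness violation
    ("unit_tests_flag_on", 4),   # Pass B: integration itself broken
    ("smoke_test", 5),
    ("e2e_test", 6),
]


def exit_code(results: dict) -> int:
    """Map the scorecard's results to an exit code; 0 when nothing failed."""
    for step, code in PIPELINE:
        if results.get(step, {}).get("status") == "fail":
            return code
    return 0
```

Steps with status skipped, or steps missing from the results altogether, do not affect the code in this sketch.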

Explicitly out of scope

  • Modifying source files. The skill is read-only against the repo. If verification exposes a bug, re-run /mem0-integrate on the same goal + plan; do not hand-patch.
  • Fixing broken tests. Failing unit tests are a signal that the integration is wrong, not that the tests are wrong. The skill does not "try a different test."
  • Deep logical correctness. The E2E step proves "something the user said earlier comes back later," which is a useful but shallow signal. It does NOT prove the integration picks the right facts to store, scopes user_id correctly across real users, or handles conflict resolution well. That's human review territory.
  • Self-healing. This skill never modifies source files. The paired /mem0-integrate skill in its default --heal mode consumes the scorecard produced here and drives its own remediation loop. Exit code 7 (non-invasiveness violation) is the explicit signal the heal loop must stop and surface to the user.
  • Cross-branch comparisons. No main baseline diffing. The scorecard reflects this branch only.
  • Running against production data. Every smoke test uses a disposable random user_id and cleans up after. Never touches any other user's data.