Manual Recovery Guide

Overview

Claude-mem's manual recovery system helps you recover observations that get stuck in the processing queue after worker crashes, system restarts, or unexpected shutdowns.

Key Change in v5.x: Automatic recovery on worker startup is now disabled. This gives you explicit control over when reprocessing happens, preventing unexpected duplicate observations.

When Do You Need Manual Recovery?

You should trigger manual recovery when:

Worker crashed or restarted - Observations were queued but worker stopped before processing
No new summaries appearing - Observations are being saved but not processed into summaries
Stuck messages detected - Messages showing as "processing" for >5 minutes
System crashes - Unexpected shutdowns left messages in incomplete states

Quick Start

Using the CLI Tool (Recommended)

The interactive CLI tool is the safest and easiest way to recover stuck observations:

bash

# Check status and prompt for recovery
bun scripts/check-pending-queue.ts

This will:

Check worker health
Show queue summary (pending, processing, failed, stuck counts)
Display sessions with pending work
Prompt you to confirm recovery
Show recently processed messages for feedback

Auto-Process Without Prompts

For scripting or when you're confident recovery is needed:

bash

# Auto-process without prompting
bun scripts/check-pending-queue.ts --process

# Limit to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5

Understanding Queue States

Messages progress through these lifecycle states:

pending → Queued, waiting to process
processing → Currently being processed by SDK agent
processed → Completed successfully
failed → Failed after 3 retry attempts

Stuck Detection

Messages in processing state for >5 minutes are considered stuck:

They're automatically reset to pending on worker startup
They're NOT automatically reprocessed (requires manual trigger)
They appear in the stuckCount field when checking queue status

Recovery Methods

Method 1: Interactive CLI Tool

Best for: Regular users, interactive sessions, when you want visibility into what's happening

bash

bun scripts/check-pending-queue.ts

Example Output:

Checking worker health...
Worker is healthy ✓

Queue Summary:
  Pending: 12 messages
  Processing: 2 messages (1 stuck)
  Failed: 0 messages
  Recently Processed: 5 messages in last 30 minutes

Sessions with pending work: 3
  Session 44: 5 pending, 1 processing (age: 2m)
  Session 45: 4 pending, 1 processing (age: 7m - STUCK)
  Session 46: 2 pending

Would you like to process these pending queues? (y/n)

Features:

✅ Pre-flight health check (verifies worker is running)
✅ Detailed queue breakdown by session
✅ Age tracking for stuck detection
✅ Confirmation prompt (prevents accidental reprocessing)
✅ Non-interactive mode with --process flag
✅ Session limit control with --limit N

Method 2: HTTP API

Best for: Automation, scripting, integration with monitoring systems

Check Queue Status

bash

curl http://localhost:37777/api/pending-queue

Response:

json

{
  "queue": {
    "messages": [
      {
        "id": 123,
        "session_db_id": 45,
        "claude_session_id": "abc123",
        "message_type": "observation",
        "status": "pending",
        "retry_count": 0,
        "created_at_epoch": 1730886600000
      }
    ],
    "totalPending": 12,
    "totalProcessing": 2,
    "totalFailed": 0,
    "stuckCount": 1
  },
  "recentlyProcessed": [...],
  "sessionsWithPendingWork": [44, 45, 46]
}

Key Fields:

totalPending - Messages waiting to process
totalProcessing - Messages currently processing
stuckCount - Processing messages >5 minutes old
sessionsWithPendingWork - Session IDs needing recovery

Trigger Recovery

bash

curl -X POST http://localhost:37777/api/pending-queue/process \
  -H "Content-Type: application/json" \
  -d '{"sessionLimit": 10}'

Response:

json

{
  "success": true,
  "totalPendingSessions": 15,
  "sessionsStarted": 10,
  "sessionsSkipped": 2,
  "startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
}

Response Fields:

totalPendingSessions - Total sessions with pending messages in database
sessionsStarted - Sessions we started processing this request
sessionsSkipped - Sessions already processing (prevents duplicate agents)
startedSessionIds - Database IDs of sessions we started

Best Practices

1. Always Check Before Recovery

bash

# Check queue status first
curl http://localhost:37777/api/pending-queue

# Or use CLI tool which checks automatically
bun scripts/check-pending-queue.ts

2. Start with Low Session Limits

bash

# Process only 5 sessions at a time
bun scripts/check-pending-queue.ts --process --limit 5

This prevents overwhelming the worker with too many concurrent SDK agents.

3. Monitor During Recovery

Watch worker logs while recovery runs:

bash

npm run worker:logs

Look for:

SDK agent starts: Starting SDK agent for session...
Processing completions: Processed observation...
Errors: ERROR or Failed to process...

4. Verify Recovery Success

Check recently processed messages:

bash

curl http://localhost:37777/api/pending-queue | jq '.recentlyProcessed'

Or use the CLI tool which shows this automatically.

5. Handle Failed Messages

Messages that fail 3 times are marked failed and won't auto-retry:

bash

# View failed messages
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT id, session_db_id, message_type, retry_count
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC;
"

You can manually reset them if needed:

bash

sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', retry_count = 0
  WHERE status = 'failed';
"

Troubleshooting

Recovery Not Working

Symptom: Triggered recovery but messages still pending

Solutions:

Verify worker health:
bash
```
curl http://localhost:37777/health
```
Check worker logs for errors:
bash
```
npm run worker:logs | grep -i error
```
Restart worker:
bash
```
npm run worker:restart
```

Check database integrity:

bash

sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"

Messages Stuck Forever

Symptom: Messages show as "processing" for hours

Solution: Force reset stuck messages

bash

# Reset all stuck messages to pending
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', started_processing_at_epoch = NULL
  WHERE status = 'processing';
"

# Then trigger recovery
bun scripts/check-pending-queue.ts --process

Worker Crashes During Recovery

Symptom: Worker stops while processing recovered messages

Solutions:

Check available memory:
bash
```
npm run worker:status
```

Reduce session limit:

bash

bun scripts/check-pending-queue.ts --process --limit 3

Check for SDK errors in logs:
bash
```
npm run worker:logs | grep -i "sdk"
```

Increase worker memory (if using custom runner):

bash

export NODE_OPTIONS="--max-old-space-size=4096"
npm run worker:restart

Advanced Usage

Direct Database Inspection

View all pending messages:

bash

sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    status,
    retry_count,
    datetime(created_at_epoch/1000, 'unixepoch') as created_at,
    datetime(started_processing_at_epoch/1000, 'unixepoch') as started_at,
    CAST((strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 AS INTEGER) as age_minutes
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  ORDER BY created_at_epoch;
"

Count Messages by Status

bash

sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT status, COUNT(*) as count
  FROM pending_messages
  GROUP BY status;
"

Find Sessions with Pending Work

bash

sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    session_db_id,
    COUNT(*) as pending_count,
    GROUP_CONCAT(message_type) as message_types
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  GROUP BY session_db_id;
"

View Recent Failures

bash

sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    retry_count,
    datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC
  LIMIT 10;
"

Integration Examples

Cron Job for Automatic Recovery

bash

#!/bin/bash
# Run every hour to process stuck queues

# Check if worker is healthy
if curl -f http://localhost:37777/health > /dev/null 2>&1; then
  # Auto-process up to 5 sessions
  bun scripts/check-pending-queue.ts --process --limit 5
else
  echo "Worker not healthy, skipping recovery"
  exit 1
fi

Monitoring Script

bash

#!/bin/bash
# Alert if stuck count exceeds threshold

STUCK_COUNT=$(curl -s http://localhost:37777/api/pending-queue | jq '.queue.stuckCount')

if [ "$STUCK_COUNT" -gt 5 ]; then
  echo "WARNING: $STUCK_COUNT stuck messages detected"
  # Send alert (email, Slack, etc.)
fi

Pre-Shutdown Recovery

bash

#!/bin/bash
# Process pending queues before system shutdown

echo "Processing pending queues before shutdown..."
bun scripts/check-pending-queue.ts --process --limit 20

echo "Waiting for processing to complete..."
sleep 10

echo "Stopping worker..."
claude-mem stop

Migration Note

If you're upgrading from v4.x to v5.x:

v4.x Behavior (Automatic Recovery):

Worker automatically recovered stuck messages on startup
No user control over reprocessing timing

v5.x Behavior (Manual Recovery):

Stuck messages detected but NOT automatically reprocessed
User must explicitly trigger recovery via CLI or API
Prevents unexpected duplicate observations
Provides explicit control over when processing happens

Migration Steps:

Upgrade to v5.x
Check for stuck messages: bun scripts/check-pending-queue.ts
Process if needed: bun scripts/check-pending-queue.ts --process
Add recovery to your workflow (cron job, pre-shutdown script, etc.)

Manual Recovery

Manual Recovery Guide

Overview

When Do You Need Manual Recovery?

Quick Start

Using the CLI Tool (Recommended)

Auto-Process Without Prompts

Understanding Queue States

Stuck Detection

Recovery Methods

Method 1: Interactive CLI Tool

Method 2: HTTP API

Check Queue Status

Trigger Recovery

Best Practices

1. Always Check Before Recovery

2. Start with Low Session Limits

3. Monitor During Recovery

4. Verify Recovery Success

5. Handle Failed Messages

Troubleshooting

Recovery Not Working

Messages Stuck Forever

Worker Crashes During Recovery

Advanced Usage

Direct Database Inspection

Count Messages by Status

Find Sessions with Pending Work

View Recent Failures

Integration Examples

Cron Job for Automatic Recovery

Monitoring Script

Pre-Shutdown Recovery

Migration Note

See Also