pkg/engine/headless/TODOS.md
- After clicking on elements there isn't enough wait time to reflect SPA navigation.
- Replace exact DOM hash with perceptual fingerprint
  - ExactHash & FuzzyHash; update graph comparison logic.
- Robust "page ready" detector
  - MutationObserver + requestIdleCallback; wait until location.href, history.length, and `<title>` are stable.
  - Add page.WaitForRouteChange() and replace WaitPageLoadHeuristics.
- Lazy-load / infinite-scroll support
  - scrollBy(0, viewportHeight*0.9) until scrollHeight stops growing.
- Capture all secondary resource navigations
  - Fetch.enable + Network.* events.
- Dynamic form-filling
  - Honor type, pattern, min, max, maxlength, required.
  - ValueProvider interface for site-specific logic.
- Site adapters / hooks
- Concurrent tab execution
  - Multiple rod.Page instances (shared browser); make CrawlGraph concurrency-safe.
- Smart time-out & retry budgets
- Viewport variants
- Memory & process recycling
- Anti-bot hardening
  - hardwareConcurrency.
- Export crawl sessions
- JS coverage tracking
  - Profiler.startPreciseCoverage → know which scripts never executed.
- Metrics & health
  - /debug/pprof enabled by default.
- TLS / proxy flexibility
- Sandboxing & security
- Graceful crash recovery
  - Page.crashed / Browser.disconnected; re-spawn browser, resume queue.

Claude Opus info below:
- Handle iframe content extraction
- WebComponent & Shadow DOM support
- Multi-window/tab detection
  - window.open() calls that bypass current hooks
- Auth state detection
- Multi-step auth flows
- Session persistence
- Complex UI interactions
- Keyboard navigation support
- Touch/mobile gestures
- Performance metrics
- Crawl quality metrics
- Error tracking
- ML-based duplicate detection
- Priority queue optimization
- State space reduction
- CAPTCHA handling
- Rate limiting & politeness
- Privacy compliance
- API extraction
- Export formats
- Workflow recording
- Rendering optimizations
- Caching layer
- Distributed crawling
- Debug tooling
- Configuration management
- Testing infrastructure
state.go is the crawler's state manager.
Everything else in the headless package (browser wrappers, normalizer, graph, diagnostics) either feeds data into it or asks it to restore a known state.
To make the crawler scalable, reliable, and de-dupe friendly, the file should be responsible for exactly three things: fingerprinting page states, collecting page metadata, and navigating back to a known state.
Below is a complete design that meets those goals and leaves room for the future TODOs.
## 1. Fingerprinting

A. Stripped DOM: remove volatile content and event-handler attributes (onclick, onmouseover, …).

B. Two-tier hash
- ExactHash = SHA-256(strippedDOM).
- FuzzyHash = SimHash64(4-word shingles of strippedDOM).
- Treat states as equal if:
  - ExactHash matches, or
  - Hamming(FuzzyHash, other.FuzzyHash) ≤ 3 bits.
- Persist both; the graph layer deduplicates on (ExactHash || close-enough FuzzyHash).
C. Optional visual fallback

- If the comparison is inconclusive (≥ 4-bit distance but DOM length < 1 MiB), take a low-res screenshot and compare pHash/dHash with the same threshold logic.
- Executed lazily to avoid a performance hit.
Resulting struct:

```go
type PageState struct {
	ExactHash        string  // always present
	FuzzyHash        uint64  // present if SimHash computed
	URL              string
	Title            string
	Depth            int
	StrippedDOM      string
	NavigationAction *Action // edge that produced this state
	Timestamp        time.Time
}
```
Advantages:

- SimHash makes minor DOM variations (ads, CSRF tokens) resolve to the same state, reducing graph size.
- The screenshot hash catches SPA view switches that don't change the DOM tree much but look different.
## 2. Metadata collection (page → PageState)

Algorithm newPageState(page, causingAction):

- page.Info(); bail out if URL is empty or about:blank.

Edge cases handled:

- Empty page → custom ErrEmptyPage (already present).
- Non-deterministic DOM-normalizer failure → bubbled up with context.
## 3. Return-to-origin algorithm

(current page, targetOriginID) → (pageID, error)

Keep the existing three-level approach but hard-code the priority and exit conditions.
Step 0: Fast-fail. If currentID == targetID, done.

Step 1: Element re-use

- If action.Element is non-nil, locate it by XPath and ensure it is Visible & Interactable, plus a DOM equality check under the canonicalizer to avoid false positives.
- If it matches, return targetOriginID.

Step 2: Browser history

- page.GetNavigationHistory()
- Walk back until (url == origin.URL && title == origin.Title).
- Limit: max 10 steps to avoid long loops.
- After each back() call, wait with WaitForRouteChange() (the new detector described below).
- Recompute the fingerprint; if equal (exact or fuzzy) → success.

Step 3: Graph shortest path

- crawlerGraph.ShortestPath(currentID, targetID).
- If unreachable, retry from emptyPageHash (fresh tab).
- Execute each Action; after each, WaitForRouteChange().
- After the final step, verify the state (same equality logic as Step 2).
- Failure → ErrNoNavigationPossible.
Enhancements:

- Cache the computed "distance" between two states; the next call can skip the graph search.
- Record statistics (#navigationBackSuccessByMethod) to tune the priority order.
## 4. "Page ready" detector (WaitForRouteChange)

Replace the brittle WaitPageLoadHeuristics with the following.

JS injected once per tab:

```js
const idle = () => new Promise(res => {
  const done = () => { obs.disconnect(); res(); };
  let t;
  const reset = () => { clearTimeout(t); t = setTimeout(done, 300); };
  const obs = new MutationObserver(reset);
  obs.observe(document, { subtree: true, childList: true, attributes: true });
  reset();
});

window.__katanaReady = () => Promise.all([
  idle(),
  new Promise(r => requestIdleCallback(r, { timeout: 5000 }))
]);
```
Go side:

```go
func (p *BrowserPage) WaitForRouteChange() error {
	ctx, cancel := context.WithTimeout(p.ctx, 15*time.Second)
	defer cancel()
	return rod.Try(func() {
		p.Eval(ctx, `await window.__katanaReady()`)
	})
}
```
Detects route changes, SPA navigations, AJAX content, infinite-scroll "settling", etc.
## 5. Extensibility hooks
- A FingerprintStrategy interface so users can plug in custom SimHash/screenshot logic.
- The already-planned ValueProvider & SiteAdapter interfaces can depend on PageState to decide actions.
- The diagnostics sink gets PageState + the serialized Action graph for an offline visualizer.
## 6. Migration plan
This architecture keeps state.go focused, removes hidden coupling, and sets up the crawler for future road-map items (concurrent tabs, adapters, ML dedup, etc.) while remaining incremental enough to merge in small PRs.