For large datasets (followers, posts, search results), intercept and replay network requests rather than scrolling and parsing the DOM. Scrolling is slow and unreliable; the site's own APIs return structured data with pagination built in, so replay is faster and easier to resume. Always prefer API replay.
Don't try to automate everything at once. Work incrementally:

1. Capture one real request and save its URL and headers.
2. Save one raw response and map where the data and the cursor live.
3. Replay a single request and confirm you can parse it.
4. Only then add the pagination loop, deduplication, and export.

This prevents wasting time debugging a complex script when the issue is a simple path mistake like `data.user.timeline` vs `data.user.result.timeline`.
First, intercept a request to understand URL structure and required headers:
```js
import { connect, waitForPageLoad } from "@/client.js";
import * as fs from "node:fs";

const client = await connect();
const page = await client.page("site");

fs.mkdirSync("tmp", { recursive: true }); // ensure the output dir exists

let capturedRequest = null;
page.on("request", (request) => {
  const url = request.url();
  // Look for API endpoints (adjust pattern for your target site)
  if (url.includes("/api/") || url.includes("/graphql/")) {
    capturedRequest = {
      url: url,
      headers: request.headers(),
      method: request.method(),
    };
    fs.writeFileSync("tmp/request-details.json", JSON.stringify(capturedRequest, null, 2));
    console.log("Captured request:", url.substring(0, 80) + "...");
  }
});

await page.goto("https://example.com/profile");
await waitForPageLoad(page);
await page.waitForTimeout(3000);

await client.disconnect();
```
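The capture lands in `tmp/request-details.json` and will look roughly like this (values and header names are illustrative; what actually matters varies by site):

```json
{
  "url": "https://example.com/api/data?params=%7B%22count%22%3A20%7D",
  "method": "GET",
  "headers": {
    "authorization": "Bearer ...",
    "x-csrf-token": "...",
    "cookie": "..."
  }
}
```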
Save a raw response to inspect the data structure:
page.on("response", async (response) => {
const url = response.url();
if (url.includes("UserTweets") || url.includes("/api/data")) {
const json = await response.json();
fs.writeFileSync("tmp/api-response.json", JSON.stringify(json, null, 2));
console.log("Captured response");
}
});
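Responses from these endpoints are often deeply nested. A small helper that walks the saved JSON and prints the path of every array can speed up the mapping (a sketch; it reads the `tmp/api-response.json` file written above):

```js
import * as fs from "node:fs";

// Recursively print the path and length of every array in the response,
// so the entries list and cursor entries stand out at a glance.
function findArrays(node, path = "$") {
  if (Array.isArray(node)) {
    console.log(`${path} (array, length ${node.length})`);
    if (node.length > 0) findArrays(node[0], `${path}[0]`);
  } else if (node && typeof node === "object") {
    for (const [key, value] of Object.entries(node)) {
      findArrays(value, `${path}.${key}`);
    }
  }
}

findArrays(JSON.parse(fs.readFileSync("tmp/api-response.json", "utf8")));
```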
Then analyze the structure to find:

- The path to the data entries (e.g., `data.user.result.timeline.instructions[].entries`)
- How pagination is represented (e.g., `cursor-bottom` entries)

Once you understand the schema, replay requests directly:
```js
import { connect } from "@/client.js";
import * as fs from "node:fs";

const client = await connect();
const page = await client.page("site"); // page must already be on the target origin so fetch inherits auth

const results = new Map(); // Use Map for deduplication
const headers = JSON.parse(fs.readFileSync("tmp/request-details.json", "utf8")).headers;
const baseUrl = "https://example.com/api/data";

let cursor = null;
let hasMore = true;

while (hasMore) {
  // Build URL with pagination cursor
  const params = { count: 20 };
  if (cursor) params.cursor = cursor;
  const url = `${baseUrl}?params=${encodeURIComponent(JSON.stringify(params))}`;

  // Execute fetch in browser context (has auth cookies/headers)
  const response = await page.evaluate(
    async ({ url, headers }) => {
      const res = await fetch(url, { headers });
      return res.json();
    },
    { url, headers }
  );

  // Extract data and cursor (adjust paths for your API)
  const entries = response?.data?.entries || [];
  let nextCursor = null; // reset each page so a missing cursor ends the loop
  for (const entry of entries) {
    if (entry.type === "cursor-bottom") {
      nextCursor = entry.value;
    } else if (entry.id && !results.has(entry.id)) {
      results.set(entry.id, {
        id: entry.id,
        text: entry.content,
        timestamp: entry.created_at,
      });
    }
  }
  console.log(`Fetched page, total: ${results.size}`);

  // Check stop conditions
  cursor = nextCursor;
  if (!cursor || entries.length === 0) hasMore = false;

  // Rate limiting - be respectful
  await new Promise((r) => setTimeout(r, 500));
}

// Export results
const data = Array.from(results.values());
fs.writeFileSync("tmp/results.json", JSON.stringify(data, null, 2));
console.log(`Saved ${data.length} items`);

await client.disconnect();
```
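If a scrape is long enough that an interruption would hurt, checkpoint the cursor and partial results each iteration so the run can resume. A sketch that plugs into the loop above (the checkpoint file name is arbitrary):

```js
// At startup: restore a previous run, if any
if (fs.existsSync("tmp/checkpoint.json")) {
  const saved = JSON.parse(fs.readFileSync("tmp/checkpoint.json", "utf8"));
  cursor = saved.cursor;
  for (const [id, item] of saved.results) results.set(id, item);
}

// Inside the while loop, after processing each page:
fs.writeFileSync(
  "tmp/checkpoint.json",
  JSON.stringify({ cursor, results: Array.from(results.entries()) })
);
```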
| Pattern | Description |
|---|---|
| `page.on('request')` | Capture outgoing request URL + headers |
| `page.on('response')` | Capture response data to understand the schema |
| `page.evaluate(fetch)` | Replay requests in browser context (inherits auth) |
| `Map` for deduplication | APIs often return overlapping data across pages |
| Cursor-based pagination | Look for `cursor`, `next_token`, `offset` in responses |
Gotchas:

- `page.context().cookies()` doesn't work - capture auth headers from intercepted requests instead
- Some endpoints expect `variables` and `features` JSON objects in the URL - capture and reuse them (see the sketch below)
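For the `variables`/`features` case, one approach (a sketch - it assumes the pagination cursor lives inside `variables`, which you should verify against your captured URL) is to parse both objects from the capture once and only swap the cursor per page:

```js
import * as fs from "node:fs";

const captured = JSON.parse(fs.readFileSync("tmp/request-details.json", "utf8"));
const capturedUrl = new URL(captured.url);

// Parse variables once; reuse features verbatim
const variables = JSON.parse(capturedUrl.searchParams.get("variables"));
const features = capturedUrl.searchParams.get("features");

function buildUrl(cursor) {
  const url = new URL(capturedUrl.origin + capturedUrl.pathname);
  url.searchParams.set("variables", JSON.stringify({ ...variables, cursor }));
  if (features !== null) url.searchParams.set("features", features);
  return url.toString();
}
```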