Back to Crawl4ai

Crawl4AI Prospect‑Wizard – step‑by‑step guide

docs/apps/linkdin/README.md

0.8.64.4 KB
Original Source

Crawl4AI Prospect‑Wizard – step‑by‑step guide

A three‑stage demo that goes from LinkedIn scrapingLLM reasoninggraph visualisation.

Try it in Google Colab! Click the badge above to run this demo in a cloud environment with zero setup required.

prospect‑wizard/
├─ c4ai_discover.py         # Stage 1 – scrape companies + people
├─ c4ai_insights.py         # Stage 2 – embeddings, org‑charts, scores
├─ graph_view_template.html # Stage 3 – graph viewer (static HTML)
└─ data/                    # output lands here (*.jsonl / *.json)

1  Install & boot a LinkedIn profile (one‑time)

1.1  Install dependencies

bash
pip install crawl4ai litellm sentence-transformers pandas rich

1.2  Create / warm a LinkedIn browser profile

bash
crwl profiles
  1. The interactive shell shows New profile – hit enter.
  2. Choose a name, e.g. profile_linkedin_uc.
  3. A Chromium window opens – log in to LinkedIn, solve whatever CAPTCHA, then close.

Remember the profile name. All future runs take --profile-name <your_name>.


2  Discovery – scrape companies & people

bash
python c4ai_discover.py full \
  --query "health insurance management" \
  --geo 102713980 \               # Malaysia geoUrn
  --title-filters "" \            # or "Product,Engineering"
  --max-companies 10 \            # default set small for workshops
  --max-people 20 \               # \^ same
  --profile-name profile_linkedin_uc \
  --outdir ./data \
  --concurrency 2 \
  --log-level debug

Outputs in ./data/:

  • companies.jsonl – one JSON per company
  • people.jsonl – one JSON per employee

🛠️ Dry‑run: C4AI_DEMO_DEBUG=1 python c4ai_discover.py full --query coffee uses bundled HTML snippets, no network.

Handy geoUrn cheatsheet

LocationgeoUrn
Singapore103644278
Malaysia102713980
United States103644922
United Kingdom102221843
Australia101452733
See more: https://www.linkedin.com/search/results/companies/?geoUrn=XXX – the number after geoUrn= is what you need.

3  Insights – embeddings, org‑charts, decision makers

bash
python c4ai_insights.py \
  --in ./data \
  --out ./data \
  --embed-model all-MiniLM-L6-v2 \
  --llm-provider gemini/gemini-2.0-flash \
  --llm-api-key "" \
  --top-k 10 \
  --max-llm-tokens 8024 \
  --llm-temperature 1.0 \
  --workers 4

Emits next to the Stage‑1 files:

  • company_graph.json – inter‑company similarity graph
  • org_chart_<handle>.json – one per company
  • decision_makers.csv – hand‑picked ‘who to pitch’ list

Flags reference (straight from build_arg_parser()):

FlagDefaultPurpose
--in.Stage‑1 output dir
--out.Destination dir
--embed_modelall-MiniLM-L6-v2Sentence‑Transformer model
--top_k10Neighbours per company in graph
--openai_modelgpt-4.1LLM for scoring decision makers
--max_llm_tokens8024Token budget per LLM call
--llm_temperature1.0Creativity knob
--stuboffSkip OpenAI and fabricate tiny charts
--workers4Parallel LLM workers

4  Visualise – interactive graph

After Stage 2 completes, simply open the HTML viewer from the project root:

bash
open graph_view_template.html   # or Live Server / Python -http

The page fetches data/company_graph.json and the org_chart_*.json files automatically; keep the data/ folder beside the HTML file.

  • Left pane → list of companies (clans).
  • Click a node to load its org‑chart on the right.
  • Chat drawer lets you ask follow‑up questions; context is pulled from people.jsonl.

5  Common snags

SymptomFix
Infinite CAPTCHAUse a residential proxy: --proxy http://user:pass@ip:port
429 Too Many RequestsLower --concurrency, rotate profile, add delay
Blank graphCheck JSON paths, clear localStorage in browser

TL;DR

crwl profilesc4ai_discover.pyc4ai_insights.py → open graph_view_template.html.
Live long and import crawl4ai.