docs/apps/linkdin/Crawl4ai_Linkedin_Data_Discovery_Part_2.ipynb
In Part 1, we extracted LinkedIn company and people data. Now we'll transform that raw data into actionable insights: an interactive B2B intelligence dashboard.

If you haven't completed Part 1, please start there to generate the companies.jsonl and people.jsonl files needed here, or use our sample data to follow along.

The pipeline: Raw Data → Embeddings → Similarity Graph → Org Charts → Decision Makers → Visualization
This notebook implements an 8-step pipeline to transform raw LinkedIn data into actionable B2B intelligence:
Each step builds upon the previous, creating a complete intelligence system.
Let's begin! 🚀
## Step 1: Environment Setup

In this step, we prepare the Colab environment for our insights pipeline. We clone the Crawl4AI repository to access template files and install the essential libraries: sentence-transformers for creating semantic embeddings, litellm for LLM integration, and data-processing tools. This foundation ensures all subsequent steps have the resources and dependencies they need to run smoothly.
```python
%%capture
# Clone the repository and copy necessary files
!git clone -b next https://github.com/unclecode/crawl4ai.git
!cp -r /content/crawl4ai/docs/apps/linkdin/{templates,samples} /content/
!mkdir -p /content/output

# Install required packages
!pip install -q sentence-transformers litellm pandas numpy scikit-learn
```
Next we import the required libraries and define the pipeline's configuration constants: the minimum similarity for creating a graph edge, bonus weights for matching industries and locations, and the minimum score for flagging someone as a decision maker.
```python
import json
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
from collections import defaultdict
from typing import List, Dict, Any, Optional, Tuple

import warnings
warnings.filterwarnings('ignore')

# For embeddings and similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# For LLM inference
import litellm
from google.colab import userdata

# Configuration
SIMILARITY_THRESHOLD = 0.3    # Minimum score to create a graph edge
INDUSTRY_WEIGHT_BONUS = 0.10  # Bonus for matching industries
GEO_WEIGHT_BONUS = 0.05       # Bonus for matching locations
DECISION_THRESHOLD = 0.5      # Minimum score to count as a decision maker

print("✅ Libraries imported successfully!")
```
## Step 2: Load LinkedIn Data

Here we import the LinkedIn data extracted in Part 1 of the workshop. The pipeline accepts two JSONL files: companies.jsonl (containing company profiles, descriptions, and metadata) and people.jsonl (containing employee information linked to companies). Users can either upload their own data or use the provided samples. This data serves as the raw material for building our knowledge graph and organizational insights.
Upload your companies.jsonl and people.jsonl files from Part 1, or use the sample data.
```python
# Option 1: Upload your own files
from google.colab import files
import shutil

print("📤 Please upload your data files:")
print("1. companies.jsonl")
print("2. people.jsonl")
print("\nOr press Cancel to use sample data...")

try:
    uploaded = files.upload()
    # Move uploaded files to the output directory
    for filename in uploaded.keys():
        shutil.move(filename, f'/content/output/{filename}')
    print("\n✅ Files uploaded successfully!")
except Exception:
    # Option 2: Use sample data
    print("\n📋 Using sample data...")
    !cp /content/samples/*.jsonl /content/output/
    print("✅ Sample data loaded!")
```
```python
# Check the files exist; if not, simply copy the samples into the output directory
if not Path('/content/output/companies.jsonl').exists():
    !cp /content/samples/companies.jsonl /content/output/
if not Path('/content/output/people.jsonl').exists():
    !cp /content/samples/people.jsonl /content/output/

# Load the data
def load_jsonl(path: str) -> List[Dict]:
    """Load a JSONL file into a list of dictionaries"""
    data = []
    with open(path, 'r') as f:
        for line in f:
            if line.strip():  # skip blank lines
                data.append(json.loads(line))
    return data

# Load companies and people
companies = load_jsonl('/content/output/companies.jsonl')
people = load_jsonl('/content/output/people.jsonl')

print(f"📊 Loaded {len(companies)} companies and {len(people)} people")
print(f"\n🏢 Sample company: {companies[0]['name']}")
print(f"👤 Sample person: {people[0]['name'] if people else 'No people data'}")
```
## Step 3: Generate Company Embeddings

This step transforms company descriptions into high-dimensional vectors (embeddings) using sentence transformers. These embeddings capture the semantic meaning of each company's business model, industry focus, and offerings. By converting text to numbers, we enable mathematical operations like similarity calculations. The quality of these embeddings directly impacts how well we can identify related companies and business opportunities.
We'll use sentence transformers to create semantic embeddings from company descriptions.
```python
# Initialize the embedding model
print("🤖 Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model loaded!")

# Create company descriptions for embedding
def create_company_description(company: Dict) -> str:
    """Create a rich text description for embedding"""
    parts = [
        company.get('name', ''),
        company.get('descriptor', ''),
        company.get('about', ''),
        f"{company.get('followers', 0)} followers" if company.get('followers') else ''
    ]
    return ' '.join(filter(None, parts))

# Generate embeddings
print("\n🔄 Generating embeddings...")
descriptions = [create_company_description(c) for c in companies]
embeddings = model.encode(descriptions, show_progress_bar=True)

# Add the embeddings to the company data
for i, company in enumerate(companies):
    company['desc_embed'] = embeddings[i].tolist()

print(f"✅ Generated embeddings for {len(companies)} companies")
```
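Before we use these vectors, it helps to see what cosine similarity (the basis of the next step's scoring) actually computes. A minimal NumPy sketch with toy 3-d vectors standing in for the real 384-d MiniLM embeddings:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: ~1.0 = same direction, ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- two similar companies and one unrelated one
insurer_a = np.array([0.9, 0.1, 0.0])
insurer_b = np.array([0.8, 0.2, 0.0])   # points in nearly the same direction
bakery    = np.array([0.0, 0.1, 0.9])   # points in a very different direction

print(round(cosine(insurer_a, insurer_b), 3))  # close to 1.0
print(round(cosine(insurer_a, bakery), 3))     # close to 0.0
```

The real pipeline does the same thing via scikit-learn's `cosine_similarity`, just over 384-dimensional vectors.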
## Step 4: Build the Company Similarity Graph

We now construct a network graph where companies are nodes and weighted edges represent their relationships. The similarity scoring combines multiple signals: semantic similarity from embeddings, industry alignment bonuses, geographic proximity bonuses, and company size compatibility penalties. This multi-factor approach ensures the graph reflects real-world B2B relationship potential, not just textual similarity. The resulting graph reveals clusters of related companies and potential partnership opportunities.
Calculate similarity scores between companies and apply bonuses for matching industries/locations.
```python
def extract_industry(descriptor: str) -> Optional[str]:
    """Extract industry from descriptor (e.g., 'Insurance • Singapore')"""
    if not descriptor or '•' not in descriptor:
        return None
    return descriptor.split('•')[0].strip()

def extract_location(descriptor: str) -> Optional[str]:
    """Extract location from descriptor"""
    if not descriptor or '•' not in descriptor:
        return None
    return descriptor.split('•')[-1].strip()

def calculate_similarity_score(c1: Dict, c2: Dict, embeddings: np.ndarray,
                               idx1: int, idx2: int) -> float:
    """Calculate the weighted similarity between two companies.

    This function combines multiple signals to determine how similar two companies are:
    1. Semantic similarity (from embeddings)
    2. Industry alignment
    3. Geographic proximity
    4. Company size compatibility
    """
    # Base cosine similarity (0 to 1)
    # This captures semantic similarity from the company descriptions:
    # higher values mean more similar business models/offerings
    base_sim = cosine_similarity([embeddings[idx1]], [embeddings[idx2]])[0][0]

    # Start with the base similarity as our weight
    weight = base_sim

    # Industry bonus (+0.10)
    # Companies in the same industry are more likely to:
    # - Face similar challenges
    # - Need complementary services
    # - Understand each other's business context
    # Example: two "Insurance" companies get a bonus even if their descriptions differ
    ind1 = extract_industry(c1.get('descriptor', ''))
    ind2 = extract_industry(c2.get('descriptor', ''))
    if ind1 and ind2 and ind1.lower() == ind2.lower():
        weight += INDUSTRY_WEIGHT_BONUS  # +0.10

    # Geographic bonus (+0.05)
    # Companies in the same location benefit from:
    # - Easier in-person meetings
    # - A similar regulatory environment
    # - Local partnership opportunities
    # - A shared timezone for collaboration
    loc1 = extract_location(c1.get('descriptor', ''))
    loc2 = extract_location(c2.get('descriptor', ''))
    if loc1 and loc2 and loc1.lower() == loc2.lower():
        weight += GEO_WEIGHT_BONUS  # +0.05

    # Follower-ratio penalty (scales the weight by 0.5 to 1.0)
    # This addresses company size compatibility:
    # - Similar-sized companies often have comparable resources
    # - Prevents unrealistic pairings (e.g., a 10-person startup with Microsoft)
    # - Ratio close to 1.0 = similar size (no penalty)
    # - Ratio close to 0.0 = very different sizes (50% penalty)
    f1 = c1.get('followers', 1) or 1  # Avoid division by zero
    f2 = c2.get('followers', 1) or 1
    ratio = min(f1, f2) / max(f1, f2)  # Always between 0 and 1

    # Scale the penalty: at worst (ratio=0), multiply by 0.5;
    # at best (ratio=1), multiply by 1.0 (no penalty)
    weight *= (0.5 + 0.5 * ratio)

    # Example calculation:
    # - Base similarity: 0.7
    # - Same industry: +0.1 → 0.8
    # - Same location: +0.05 → 0.85
    # - Size ratio 0.2: × 0.6 → 0.51 final score

    # Cap at 1.0 to maintain a valid probability range
    return min(weight, 1.0)
```
```python
# Build the similarity graph
print("🕸️ Building company similarity graph...")
nodes = []
edges = []

# Create nodes
for company in companies:
    nodes.append({
        'id': company['handle'],
        'name': company['name'],
        'industry': extract_industry(company.get('descriptor', '')),
        'location': extract_location(company.get('descriptor', '')),
        'followers': company.get('followers', 0),
        'about': company.get('about', ''),
        'handle': company['handle'],
        'desc_embed': company['desc_embed']
    })

# Create edges (similarities above the threshold)
for i in range(len(companies)):
    for j in range(i + 1, len(companies)):
        score = calculate_similarity_score(
            companies[i], companies[j], embeddings, i, j
        )
        if score >= SIMILARITY_THRESHOLD:
            edges.append({
                'source': companies[i]['handle'],
                'target': companies[j]['handle'],
                'weight': float(score)
            })

# Create the graph data structure
graph_data = {
    'nodes': nodes,
    'edges': edges,
    'metadata': {
        'created_at': datetime.now().isoformat(),
        'total_companies': len(companies),
        'total_connections': len(edges),
        'similarity_threshold': SIMILARITY_THRESHOLD
    }
}

# Save the graph
with open('/content/output/company_graph.json', 'w') as f:
    json.dump(graph_data, f, indent=2)

print(f"✅ Graph built with {len(nodes)} nodes and {len(edges)} edges")
print(f"📊 Average connections per company: {len(edges) * 2 / len(nodes):.1f}")
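Once the graph exists, you can query it directly. A small sketch (using an inline toy graph so it runs standalone; `top_connections` is our own helper, not part of the notebook) that lists a company's strongest connections:

```python
# Toy stand-in for the graph_data dict built above
graph_data = {
    "nodes": [{"id": "acme", "name": "Acme"}, {"id": "globex", "name": "Globex"},
              {"id": "initech", "name": "Initech"}],
    "edges": [{"source": "acme", "target": "globex", "weight": 0.82},
              {"source": "acme", "target": "initech", "weight": 0.41}],
}

def top_connections(graph: dict, handle: str, k: int = 5) -> list:
    """Return up to k (neighbor_handle, weight) pairs, strongest first.

    Edges are undirected, so we look at both ends of each edge.
    """
    neighbors = []
    for edge in graph["edges"]:
        if edge["source"] == handle:
            neighbors.append((edge["target"], edge["weight"]))
        elif edge["target"] == handle:
            neighbors.append((edge["source"], edge["weight"]))
    return sorted(neighbors, key=lambda pair: pair[1], reverse=True)[:k]

print(top_connections(graph_data, "acme"))  # [('globex', 0.82), ('initech', 0.41)]
```

Against the real output you would first load `/content/output/company_graph.json` with `json.load`.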
## Step 5: Infer Organizational Structures

This step leverages Large Language Models to analyze employee titles and infer organizational hierarchies. For each company, we send employee data to the LLM with a structured prompt requesting org chart inference, reporting relationships, and decision-making scores. The LLM uses its training on corporate structures to identify C-level executives, VPs, directors, and their likely reporting chains. This automated inference scales what would otherwise require manual research for each company.
Use an LLM to analyze job titles and infer reporting structures.
```python
# Configure the LLM
try:
    # Try to get the API key from Colab secrets
    provider = "OPENAI"
    model_name = "gpt-4.1"  # model_name, not model: avoid shadowing the SentenceTransformer above
    api_key = userdata.get(f'{provider}_API_KEY')
    litellm.api_key = api_key
    LLM_MODEL = f"{provider.lower()}/{model_name}"
    print(f"✅ Using {LLM_MODEL}")
except Exception:
    print(f"⚠️ No {provider} API key found. Please add {provider}_API_KEY to Colab secrets.")
    api_key = input("Enter your API key (GEMINI, OPENAI, ...): ")
    provider = input("Enter provider (GEMINI, OPENAI, ...): ").lower()
    model_name = input("Enter model (gpt-4.1, gpt-4o-mini): ").lower()
    LLM_MODEL = f"{provider}/{model_name}"
    litellm.api_key = api_key
```
```python
# Org chart inference prompt template
ORG_CHART_PROMPT = """Analyze these LinkedIn profiles and infer the organizational structure.

Company: {company_name}

Employees:
{employees_text}

Create a hierarchical org chart with:
1. Reporting relationships (who reports to whom)
2. Decision-making score (0.0-1.0) based on seniority and title
3. Department classification

Return ONLY valid JSON in this format:
{{
  "nodes": [
    {{
      "id": "profile_url",
      "name": "person name",
      "title": "job title",
      "dept": "department",
      "decision_score": 0.0-1.0,
      "title_level": "C-Level|VP|Director|Manager|IC"
    }}
  ],
  "edges": [
    {{"source": "manager_profile_url", "target": "report_profile_url"}}
  ]
}}
"""

def infer_org_chart(company: Dict, employees: List[Dict]) -> Optional[Dict]:
    """Use the LLM to infer the organizational structure"""
    if not employees:
        return None

    # Format the employee data
    emp_lines = []
    for emp in employees[:50]:  # Limit to 50 for token constraints
        emp_lines.append(
            f"- {emp.get('name', 'Unknown')} | "
            f"{emp.get('headline', 'No title')} | "
            f"URL: {emp.get('profile_url', 'N/A')}"
        )

    prompt = ORG_CHART_PROMPT.format(
        company_name=company['name'],
        employees_text='\n'.join(emp_lines)
    )

    try:
        response = litellm.completion(
            model=LLM_MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)

        # Add metadata
        result['meta'] = {
            'company': company['name'],
            'company_handle': company['handle'],
            'total_analyzed': len(employees),
            'created_at': datetime.now().isoformat()
        }
        return result
    except Exception as e:
        print(f"❌ Error inferring org chart for {company['name']}: {e}")
        return None
```
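LLM JSON output can drift from the requested schema, so it is worth checking the shape before trusting it. A hedged sketch of a minimal validator for the org-chart format above (field names taken from the prompt; `validate_org_chart` is our own helper, not part of the notebook):

```python
def validate_org_chart(chart: dict) -> list:
    """Return a list of problems; an empty list means the chart matches the expected shape."""
    problems = []
    if not isinstance(chart.get("nodes"), list):
        problems.append("missing 'nodes' list")
        return problems
    ids = set()
    for node in chart["nodes"]:
        node_id = node.get("id")
        if not node_id:
            problems.append(f"node without id: {node}")
            continue
        ids.add(node_id)
        score = node.get("decision_score")
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            problems.append(f"bad decision_score for {node_id}: {score!r}")
    for edge in chart.get("edges", []):
        if edge.get("source") not in ids or edge.get("target") not in ids:
            problems.append(f"edge references unknown node: {edge}")
    return problems

good = {"nodes": [{"id": "u1", "decision_score": 0.9}], "edges": []}
bad = {"nodes": [{"id": "u1", "decision_score": 1.7}],
       "edges": [{"source": "u1", "target": "ghost"}]}
print(validate_org_chart(good))       # []
print(len(validate_org_chart(bad)))   # 2
```

You could call this on each `infer_org_chart` result and retry (or skip the company) when problems are reported.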
```python
# Process organizational charts for each company
print("🏢 Inferring organizational structures...\n")

# Group people by company
people_by_company = defaultdict(list)
for person in people:
    company_handle = person.get('company_handle', '')
    if company_handle:
        people_by_company[company_handle].append(person)

# Process each company
org_charts = {}
decision_makers = []

for i, company in enumerate(companies):
    print(f"Processing {i+1}/{len(companies)}: {company['name']}...")
    company_people = people_by_company.get(company['handle'], [])
    if not company_people:
        print("  ⚠️ No employees found")
        continue

    # Infer the org chart
    org_chart = infer_org_chart(company, company_people)
    if org_chart:
        # Save the org chart
        safe_handle = company['handle'].replace('/', '_')
        filename = f'/content/output/org_chart_{safe_handle}.json'
        with open(filename, 'w') as f:
            json.dump(org_chart, f, indent=2)
        org_charts[company['handle']] = org_chart

        # Extract decision makers
        for node in org_chart.get('nodes', []):
            if node.get('decision_score', 0) >= DECISION_THRESHOLD:
                # Find the original person data
                person_data = next(
                    (p for p in company_people if p.get('profile_url') == node['id']),
                    {}
                )
                decision_makers.append({
                    'name': node['name'],
                    'title': node['title'],
                    'company': company['name'],
                    'company_handle': company['handle'],
                    'decision_score': node['decision_score'],
                    'title_level': node.get('title_level', 'Unknown'),
                    'dept': node.get('dept', 'Unknown'),
                    'profile_url': node['id'],
                    'avatar_url': person_data.get('avatar_url', ''),
                    'connection_degree': person_data.get('connection_degree', ''),
                    'followers': person_data.get('followers', ''),
                    'yoe_current': node.get('yoe_current', 0),
                    'connection_count': person_data.get('connection_count', 0)
                })

        found = len([n for n in org_chart.get('nodes', [])
                     if n.get('decision_score', 0) >= DECISION_THRESHOLD])
        print(f"  ✅ Found {found} decision makers")
    else:
        print("  ❌ Failed to generate org chart")

print(f"\n✅ Processed {len(org_charts)} companies")
print(f"🎯 Found {len(decision_makers)} total decision makers")
```
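The inferred `edges` also let you compute each person's level in the reporting tree (CEO at depth 0, their reports at depth 1, and so on). A sketch with a toy chart (`org_depths` is our own helper, not part of the notebook; the cycle guard handles malformed LLM output):

```python
def org_depths(nodes: list, edges: list) -> dict:
    """Depth of each node in the reporting tree: 0 = top (no manager above)."""
    manager_of = {e["target"]: e["source"] for e in edges}
    depths = {}
    for node in nodes:
        depth, current, seen = 0, node["id"], set()
        # Walk up the chain of managers; `seen` guards against accidental cycles
        while current in manager_of and current not in seen:
            seen.add(current)
            current = manager_of[current]
            depth += 1
        depths[node["id"]] = depth
    return depths

chart_nodes = [{"id": "ceo"}, {"id": "vp"}, {"id": "eng"}]
chart_edges = [{"source": "ceo", "target": "vp"}, {"source": "vp", "target": "eng"}]
print(org_depths(chart_nodes, chart_edges))  # {'ceo': 0, 'vp': 1, 'eng': 2}
```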
## Step 6: Export Decision Makers
We extract and rank individuals with high decision-making potential based on their
inferred organizational position. The system filters employees by decision score
(typically 0.5 or higher), enriches their profiles with company context, and exports
them to a CSV file. This creates a prioritized contact list for B2B sales teams,
focusing efforts on individuals most likely to influence purchasing decisions.
```python
# Create a decision makers DataFrame and export it to CSV
if decision_makers:
    df_decision_makers = pd.DataFrame(decision_makers)

    # Sort by decision score
    df_decision_makers = df_decision_makers.sort_values(
        'decision_score', ascending=False
    )

    # Save to CSV
    df_decision_makers.to_csv('/content/output/decision_makers.csv', index=False)

    print("🏆 Top 10 Decision Makers:")
    print("=" * 80)
    for _, person in df_decision_makers.head(10).iterrows():
        print(f"{person['name']:<30} | {person['title']:<40} | Score: {person['decision_score']:.2f}")
        print(f"   Company: {person['company']}")
        print(f"   Level: {person['title_level']} | Dept: {person['dept']}")
        print("-" * 80)
else:
    print("⚠️ No decision makers found")
```
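Beyond the ranked CSV, pandas makes quick roll-ups easy. A sketch with toy rows (same column names as the `decision_makers` records built above):

```python
import pandas as pd

rows = [
    {"name": "A", "company": "Acme", "dept": "Engineering", "decision_score": 0.9},
    {"name": "B", "company": "Acme", "dept": "Sales", "decision_score": 0.6},
    {"name": "C", "company": "Globex", "dept": "Engineering", "decision_score": 0.8},
]
df = pd.DataFrame(rows)

# Highest-scoring contact per company (idxmax picks the row with the max score)
best = df.loc[df.groupby("company")["decision_score"].idxmax(), ["company", "name"]]
print(best.to_string(index=False))

# Headcount of decision makers per department
print(df["dept"].value_counts().to_dict())  # {'Engineering': 2, 'Sales': 1}
```

With the real data, replace `rows` with `df_decision_makers` loaded from `/content/output/decision_makers.csv`.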
## Step 7: Generate Interactive Visualization
This step assembles all components into an interactive web application. We copy the
HTML template and JavaScript files, ensure all generated data files (company graph,
org charts, people data) are in place, and prepare the visualization environment. The
template includes a network graph viewer (vis.js), organizational chart displays, and
an AI chat interface for querying the knowledge base. This creates a complete,
self-contained dashboard.
```python
# Copy the template files and AI.js
!cp /content/templates/graph_view_template.html /content/output/graph_view.html
!cp /content/templates/ai.js /content/output/

# people.jsonl is already in /content/output, where the chat reads it

print("✅ Visualization files prepared!")
print("\n📁 Output directory contents:")
!ls -la /content/output/
```
## Step 8: Launch Web Server and Display
Finally, we start a local HTTP server to serve the visualization files and display
them within Colab using an iframe. The server runs in the background on port 8000, and
Colab's proxy system allows secure access to the dashboard. Users can interact with
the graph, click on companies to view organizational structures, and use the AI chat
to query the data. This step brings together all previous work into a live,
interactive B2B intelligence tool.
```python
# Start a simple HTTP server in the background
import subprocess
import time

# Kill any existing server on port 8000
!kill -9 $(lsof -t -i:8000) 2>/dev/null || true

# Start the server
server_process = subprocess.Popen(
    ['python', '-m', 'http.server', '8000', '--directory', '/content/output'],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL
)

# Wait for the server to start
time.sleep(2)
print("🌐 Web server started on port 8000")
print("📊 Loading visualization...")
```
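The fixed `time.sleep(2)` usually works, but polling the port is more reliable on a slow Colab VM. A sketch (`wait_for_port` is our own helper, not part of the notebook), demonstrated here against a throwaway listener:

```python
import contextlib
import socket
import time

def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 10.0) -> bool:
    """Poll until something accepts TCP connections on host:port, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with contextlib.closing(socket.socket()) as sock:
            sock.settimeout(0.5)
            if sock.connect_ex((host, port)) == 0:  # 0 means the connect succeeded
                return True
        time.sleep(0.2)
    return False

# Demo: bind a throwaway listener (port 0 = let the OS pick a free port)
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

server_up = wait_for_port(port, timeout=2.0)
listener.close()
print(server_up)  # True
```

In the notebook you would call `wait_for_port(8000)` right after `subprocess.Popen` instead of sleeping.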
```python
# Display the visualization in an iframe
from IPython.display import IFrame
# Note: In Colab, we need to use the proxy URL
from google.colab.output import eval_js
proxy_url = eval_js("google.colab.kernel.proxyPort(8000)")
print("🎉 Your LinkedIn Insights Dashboard is ready!")
print("\n📌 Instructions:")
print("1. The graph shows company relationships")
print("2. Click on a company to see its org chart")
print("3. Use the chat button to ask questions about the data")
print("4. Don't forget to set your OpenAI API key in Settings for chat to work")
print("\n⚠️ Note: If the visualization doesn't load, re-run this cell")
# Display the iframe
IFrame(src=proxy_url + "/graph_view.html", width='100%', height='800')
```
```python
# Get public URL for the server
from google.colab.output import eval_js
public_url = eval_js("google.colab.kernel.proxyPort(8000)")
print(f"🌐 Public URL: {public_url}")
print("\n📱 Share this URL to access your dashboard from anywhere!")
print("⚠️ Note: This URL is temporary and will expire when the Colab session ends")
```
## 🎯 Summary & Next Steps
Congratulations! You've built a complete B2B intelligence system that:
- ✅ **Analyzed** company similarities using AI embeddings
- ✅ **Inferred** organizational structures with an LLM
- ✅ **Identified** key decision makers
- ✅ **Visualized** everything in an interactive dashboard
### Download Your Results
Run the cell below to download all generated files:
```python
# Create a zip file with all outputs
!cd /content/output && zip -r linkedin_insights.zip *.json *.csv *.html *.js *.jsonl

# Download the zip file
from google.colab import files
files.download('/content/output/linkedin_insights.zip')
print("📦 All files packaged and ready for download!")
```
## 🚀 What's Next?
### Enhance Your System:
1. **Add more data sources** - Combine with CRM, news, social media
2. **Improve scoring** - Use more sophisticated algorithms
3. **Track changes** - Monitor company/people updates over time
4. **Export to CRM** - Integrate with Salesforce, HubSpot, etc.
### Production Deployment:
1. **Host the dashboard** - Deploy to Vercel, Netlify, or AWS
2. **Add authentication** - Secure your intelligence data
3. **Schedule updates** - Automate data refreshes
4. **Scale the pipeline** - Process thousands of companies
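"Track changes" from the list above can start as simple snapshot diffing: keep yesterday's decision-maker list and compare unique keys. A minimal sketch with toy snapshots (`diff_snapshots` is our own helper; field names assumed to match the exported CSV):

```python
def diff_snapshots(old: list, new: list, key: str = "profile_url") -> dict:
    """Compare two lists of decision-maker dicts by a unique key."""
    old_keys = {p[key] for p in old}
    new_keys = {p[key] for p in new}
    return {
        "added": sorted(new_keys - old_keys),    # new decision makers to reach out to
        "removed": sorted(old_keys - new_keys),  # contacts who dropped off
    }

yesterday = [{"profile_url": "in/alice"}, {"profile_url": "in/bob"}]
today = [{"profile_url": "in/bob"}, {"profile_url": "in/carol"}]
print(diff_snapshots(yesterday, today))  # {'added': ['in/carol'], 'removed': ['in/alice']}
```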
## Connect & Learn More
- 🔗 **GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- 🐦 **Follow on X**: [@unclecode](https://twitter.com/unclecode)
- 💬 **Join our Discord**: [discord.gg/gpPZZgzRAP](https://discord.gg/gpPZZgzRAP)

Thank you for joining this workshop! 🙏

---

Live Long and Build Intelligent Systems 🖖