# 🧠 Advanced LinkedIn Insights with Crawl4AI: Knowledge Graph & AI Analysis

Source notebook: `docs/apps/linkdin/Crawl4ai_Linkedin_Data_Discovery_Part_2.ipynb`

Welcome to Part 2!

In Part 1, we extracted LinkedIn company and people data. Now, we'll transform that raw data into actionable insights:

  • πŸ•ΈοΈ Build a company similarity graph using semantic embeddings
  • 🏒 Infer organizational structures with LLM analysis
  • 🎯 Identify key decision makers through influence scoring
  • 🌐 Create an interactive knowledge graph visualization
  • πŸ’¬ Enable AI-powered queries on your data

If you haven't completed Part 1, please start there to generate the required data files, or use our sample data to follow along.

## What You'll Build

An interactive B2B intelligence dashboard that:

  1. Shows companies as nodes in a network graph
  2. Visualizes organizational hierarchies
  3. Highlights decision makers with influence scores
  4. Enables chat-based exploration of the data

## Prerequisites

  • Google Colab (free tier sufficient)
  • Complete Part 1 first: πŸ“Š LinkedIn Data Extraction Workshop
    • Part 1 generates the companies.jsonl and people.jsonl files needed here
    • If you haven't completed Part 1, you can still follow along using sample data
  • LLM API key for org chart inference (Gemini recommended)
  • OpenAI API key for chat functionality (optional)

## Pipeline Overview

Raw Data → Embeddings → Similarity Graph → Org Charts → Decision Makers → Visualization

This notebook implements an 8-step pipeline to transform raw LinkedIn data into actionable B2B intelligence:

  1. Setup & Dependencies β†’ Install required libraries and prepare environment
  2. Data Loading β†’ Import companies and people data from Part 1
  3. Semantic Embeddings β†’ Convert company descriptions into mathematical representations
  4. Similarity Graph β†’ Build a network showing company relationships
  5. Organizational Inference β†’ Use AI to understand company hierarchies
  6. Decision Maker Identification β†’ Score and extract key contacts
  7. Visualization Generation β†’ Create interactive dashboard files
  8. Interactive Display β†’ Launch and view the knowledge graph

Each step builds upon the previous, creating a complete intelligence system.

Let's begin! πŸš€

## Step 0: Setup and Dependencies

In this step, we prepare the Colab environment for our insights pipeline. We clone the Crawl4AI repository to access template files and install essential libraries: sentence-transformers for creating semantic embeddings, litellm for LLM integration, and data processing tools. This foundation ensures all subsequent steps have the necessary resources and dependencies to execute smoothly.

```python
%%capture
# Clone the repository and copy necessary files
!git clone -b next https://github.com/unclecode/crawl4ai.git
!cp -r /content/crawl4ai/docs/apps/linkdin/{templates,samples} /content/
!mkdir -p /content/output
```

```python
# Install required packages
!pip install -q sentence-transformers litellm pandas numpy scikit-learn
```

## Step 1: Import Libraries and Configuration

Here we import the Python libraries used throughout the pipeline: pandas and numpy for data handling, sentence-transformers for semantic embeddings, scikit-learn for similarity math, and litellm for LLM calls. We also define the configuration constants that govern the pipeline: the similarity threshold for creating graph edges, the industry and geography bonuses applied to edge weights, and the decision-score cutoff used to flag decision makers.

```python
import json
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
from collections import defaultdict
from typing import List, Dict, Any, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# For embeddings and similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# For LLM inference
import litellm
from google.colab import userdata

# Configuration
SIMILARITY_THRESHOLD = 0.3    # Minimum weight for a graph edge (Step 4)
INDUSTRY_WEIGHT_BONUS = 0.10  # Bonus for companies in the same industry
GEO_WEIGHT_BONUS = 0.05       # Bonus for companies in the same location
DECISION_THRESHOLD = 0.5      # Minimum score to count as a decision maker (Step 5)

print("✅ Libraries imported successfully!")
```

## Step 2: Load Data

Here we import the LinkedIn data extracted in Part 1 of the workshop. The pipeline accepts two JSONL files: companies.jsonl (containing company profiles, descriptions, and metadata) and people.jsonl (containing employee information linked to companies). Users can either upload their own data or use provided samples. This data serves as the raw material for building our knowledge graph and organizational insights.

Upload your companies.jsonl and people.jsonl files from Part 1, or use the sample data.

```python
# Option 1: Upload your own files
from google.colab import files
import shutil

print("📤 Please upload your data files:")
print("1. companies.jsonl")
print("2. people.jsonl")
print("\nOr press Cancel to use sample data...")

try:
    uploaded = files.upload()

    # Move each uploaded file into the output directory
    for filename in uploaded.keys():
        shutil.move(filename, f'/content/output/{filename}')
    print("\n✅ Files uploaded successfully!")
except Exception:
    # Option 2: Use sample data
    print("\n📁 Using sample data...")
    !cp /content/samples/*.jsonl /content/output/
    print("✅ Sample data loaded!")

# If either file is still missing, fall back to the samples
if not Path('/content/output/companies.jsonl').exists():
    !cp /content/samples/companies.jsonl /content/output/

if not Path('/content/output/people.jsonl').exists():
    !cp /content/samples/people.jsonl /content/output/
```
```python
# Load the data
def load_jsonl(path: str) -> List[Dict]:
    """Load a JSONL file into a list of dictionaries"""
    data = []
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:  # Skip blank lines (e.g., a trailing newline at EOF)
                data.append(json.loads(line))
    return data

# Load companies and people
companies = load_jsonl('/content/output/companies.jsonl')
people = load_jsonl('/content/output/people.jsonl')

print(f"📊 Loaded {len(companies)} companies and {len(people)} people")
print(f"\n🏢 Sample company: {companies[0]['name']}")
print(f"👤 Sample person: {people[0]['name'] if people else 'No people data'}")
```

## Step 3: Generate Company Embeddings

This step transforms company descriptions into high-dimensional vectors (embeddings) using sentence transformers. These embeddings capture the semantic meaning of each company's business model, industry focus, and offerings. By converting text to numbers, we enable mathematical operations like similarity calculations. The quality of these embeddings directly impacts how well we can identify related companies and business opportunities.

We'll use sentence transformers to create semantic embeddings from company descriptions.

```python
# Initialize the embedding model
print("🤖 Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model loaded!")

# Create company descriptions for embedding
def create_company_description(company: Dict) -> str:
    """Create a rich text description for embedding"""
    parts = [
        company.get('name', ''),
        company.get('descriptor', ''),
        company.get('about', ''),
        f"{company.get('followers', 0)} followers" if company.get('followers') else ''
    ]
    return ' '.join(filter(None, parts))

# Generate embeddings
print("\n🔄 Generating embeddings...")
descriptions = [create_company_description(c) for c in companies]
embeddings = model.encode(descriptions, show_progress_bar=True)

# Add embeddings to company data
for i, company in enumerate(companies):
    company['desc_embed'] = embeddings[i].tolist()

print(f"✅ Generated embeddings for {len(companies)} companies")
```

## Step 4: Build Company Similarity Graph

We now construct a network graph where companies are nodes and weighted edges represent their relationships. The similarity scoring combines multiple signals: semantic similarity from embeddings, industry alignment bonuses, geographic proximity bonuses, and company size compatibility penalties. This multi-factor approach ensures the graph reflects real-world B2B relationship potential, not just textual similarity. The resulting graph reveals clusters of related companies and potential partnership opportunities.

Calculate similarity scores between companies and apply bonuses for matching industries/locations.

```python
def extract_industry(descriptor: str) -> Optional[str]:
    """Extract industry from descriptor (e.g., 'Insurance • Singapore')"""
    if not descriptor or '•' not in descriptor:
        return None
    return descriptor.split('•')[0].strip()

def extract_location(descriptor: str) -> Optional[str]:
    """Extract location from descriptor"""
    if not descriptor or '•' not in descriptor:
        return None
    return descriptor.split('•')[-1].strip()

def calculate_similarity_score(c1: Dict, c2: Dict, embeddings: np.ndarray,
                               idx1: int, idx2: int) -> float:
    """Calculate weighted similarity between two companies

    This function combines multiple signals to determine how similar two companies are:
    1. Semantic similarity (from embeddings)
    2. Industry alignment
    3. Geographic proximity
    4. Company size compatibility
    """

    # Base cosine similarity (0 to 1)
    # This captures semantic similarity from company descriptions
    # Higher values mean more similar business models/offerings
    base_sim = cosine_similarity([embeddings[idx1]], [embeddings[idx2]])[0][0]

    # Start with base similarity as our weight
    weight = base_sim

    # Industry bonus (+0.10)
    # Companies in the same industry are more likely to:
    # - Face similar challenges
    # - Need complementary services
    # - Understand each other's business context
    # Example: Two "Insurance" companies get a bonus even if their descriptions differ
    ind1 = extract_industry(c1.get('descriptor', ''))
    ind2 = extract_industry(c2.get('descriptor', ''))
    if ind1 and ind2 and ind1.lower() == ind2.lower():
        weight += INDUSTRY_WEIGHT_BONUS  # +0.10

    # Geographic bonus (+0.05)
    # Companies in the same location benefit from:
    # - Easier in-person meetings
    # - Similar regulatory environment
    # - Local partnership opportunities
    # - Shared timezone for collaboration
    loc1 = extract_location(c1.get('descriptor', ''))
    loc2 = extract_location(c2.get('descriptor', ''))
    if loc1 and loc2 and loc1.lower() == loc2.lower():
        weight += GEO_WEIGHT_BONUS  # +0.05

    # Follower ratio penalty (scales weight by 0.5 to 1.0)
    # This addresses company size compatibility:
    # - Similar-sized companies often have comparable resources
    # - Prevents unrealistic pairings (e.g., 10-person startup with Microsoft)
    # - Ratio close to 1.0 = similar size (no penalty)
    # - Ratio close to 0.0 = very different sizes (50% penalty)
    f1 = c1.get('followers', 1) or 1  # Avoid division by zero
    f2 = c2.get('followers', 1) or 1
    ratio = min(f1, f2) / max(f1, f2)  # Always between 0 and 1

    # Scale the penalty: at worst (ratio=0), multiply by 0.5
    # at best (ratio=1), multiply by 1.0 (no penalty)
    weight *= (0.5 + 0.5 * ratio)

    # Example calculation:
    # - Base similarity: 0.7
    # - Same industry: +0.1 → 0.8
    # - Same location: +0.05 → 0.85
    # - Size ratio 0.2: × 0.6 → 0.51 final score

    # Cap at 1.0 to maintain valid probability range
    return min(weight, 1.0)
```
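The worked example in the comments above can be checked with a standalone sketch of the same weighting scheme (the base similarity and follower counts are made up for illustration):

```python
def weighted_score(base_sim: float, same_industry: bool, same_location: bool,
                   followers_a: int, followers_b: int) -> float:
    """Standalone version of the edge-weight formula used above."""
    weight = base_sim
    if same_industry:
        weight += 0.10   # INDUSTRY_WEIGHT_BONUS
    if same_location:
        weight += 0.05   # GEO_WEIGHT_BONUS
    ratio = min(followers_a, followers_b) / max(followers_a, followers_b)
    weight *= (0.5 + 0.5 * ratio)  # size-compatibility penalty
    return min(weight, 1.0)

# Reproduces the comment example: 0.7 base, same industry and location,
# follower ratio 0.2 (e.g. 2,000 vs 10,000 followers)
print(round(weighted_score(0.7, True, True, 2_000, 10_000), 2))  # → 0.51
```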
```python
# Build the similarity graph
print("🕸️ Building company similarity graph...")

nodes = []
edges = []

# Create nodes
for company in companies:
    nodes.append({
        'id': company['handle'],
        'name': company['name'],
        'industry': extract_industry(company.get('descriptor', '')),
        'location': extract_location(company.get('descriptor', '')),
        'followers': company.get('followers', 0),
        'about': company.get('about', ''),
        'handle': company['handle'],
        'desc_embed': company['desc_embed']
    })

# Create edges (similarities above threshold)
for i in range(len(companies)):
    for j in range(i + 1, len(companies)):
        score = calculate_similarity_score(
            companies[i], companies[j], embeddings, i, j
        )

        if score >= SIMILARITY_THRESHOLD:
            edges.append({
                'source': companies[i]['handle'],
                'target': companies[j]['handle'],
                'weight': float(score)
            })

# Create graph data structure
graph_data = {
    'nodes': nodes,
    'edges': edges,
    'metadata': {
        'created_at': datetime.now().isoformat(),
        'total_companies': len(companies),
        'total_connections': len(edges),
        'similarity_threshold': SIMILARITY_THRESHOLD
    }
}

# Save graph
with open('/content/output/company_graph.json', 'w') as f:
    json.dump(graph_data, f, indent=2)

print(f"✅ Graph built with {len(nodes)} nodes and {len(edges)} edges")
print(f"📊 Average connections per company: {len(edges) * 2 / len(nodes):.1f}")
```

## Step 5: Infer Organizational Charts with LLM

This step leverages Large Language Models to analyze employee titles and infer organizational hierarchies. For each company, we send employee data to the LLM with a structured prompt requesting org chart inference, reporting relationships, and decision-making scores. The LLM uses its training on corporate structures to identify C-level executives, VPs, directors, and their likely reporting chains. This automated inference scales what would otherwise require manual research for each company.

Use LLM to analyze job titles and infer reporting structures.

```python
# Configure the LLM
# Note: we use a separate variable name (llm_name) so we don't overwrite the
# SentenceTransformer `model` loaded in Step 3.
try:
    # Try to get the API key from Colab secrets
    provider = "OPENAI"
    llm_name = "gpt-4.1"
    api_key = userdata.get(f'{provider}_API_KEY')
    litellm.api_key = api_key
    LLM_MODEL = f"{provider.lower()}/{llm_name}"
    print("✅ Using " + LLM_MODEL)
except Exception:
    print(f"⚠️ No {provider} API key found. Please add {provider}_API_KEY to Colab secrets.")
    api_key = input("Enter your API key: ")
    provider = input("Enter provider (GEMINI, OPENAI, ...): ")
    llm_name = input("Enter model (gpt-4.1, gpt-4o-mini, ...): ")
    LLM_MODEL = f"{provider.lower()}/{llm_name}"
    litellm.api_key = api_key
```
```python
# Org chart inference prompt template
ORG_CHART_PROMPT = """Analyze these LinkedIn profiles and infer the organizational structure.

Company: {company_name}
Employees:

{employees_text}


Create a hierarchical org chart with:
1. Reporting relationships (who reports to whom)
2. Decision-making score (0.0-1.0) based on seniority and title
3. Department classification

Return ONLY valid JSON in this format:
{{
  "nodes": [
    {{
      "id": "profile_url",
      "name": "person name",
      "title": "job title",
      "dept": "department",
      "decision_score": 0.0-1.0,
      "title_level": "C-Level|VP|Director|Manager|IC"
    }}
  ],
  "edges": [
    {{"source": "manager_profile_url", "target": "report_profile_url"}}
  ]
}}
"""

def infer_org_chart(company: Dict, employees: List[Dict]) -> Optional[Dict]:
    """Use LLM to infer organizational structure"""
    if not employees:
        return None

    # Format employee data
    emp_lines = []
    for emp in employees[:50]:  # Limit to 50 for token constraints
        emp_lines.append(
            f"- {emp.get('name', 'Unknown')} | "
            f"{emp.get('headline', 'No title')} | "
            f"URL: {emp.get('profile_url', 'N/A')}"
        )

    prompt = ORG_CHART_PROMPT.format(
        company_name=company['name'],
        employees_text='\n'.join(emp_lines)
    )

    try:
        response = litellm.completion(
            model=LLM_MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            response_format={"type": "json_object"}
        )

        result = json.loads(response.choices[0].message.content)

        # Add metadata
        result['meta'] = {
            'company': company['name'],
            'company_handle': company['handle'],
            'total_analyzed': len(employees),
            'created_at': datetime.now().isoformat()
        }

        return result

    except Exception as e:
        print(f"❌ Error inferring org chart for {company['name']}: {e}")
        return None
```

```python
# Process organizational charts for each company
print("🏒 Inferring organizational structures...\n")

# Group people by company
people_by_company = defaultdict(list)
for person in people:
    company_handle = person.get('company_handle', '')
    if company_handle:
        people_by_company[company_handle].append(person)

# Process each company
org_charts = {}
decision_makers = []

for i, company in enumerate(companies):
    print(f"Processing {i+1}/{len(companies)}: {company['name']}...")

    company_people = people_by_company.get(company['handle'], [])

    if not company_people:
        print(f"  ⚠️ No employees found")
        continue

    # Infer org chart
    org_chart = infer_org_chart(company, company_people)

    if org_chart:
        # Save org chart
        safe_handle = company['handle'].replace('/', '_')
        filename = f'/content/output/org_chart_{safe_handle}.json'

        with open(filename, 'w') as f:
            json.dump(org_chart, f, indent=2)

        org_charts[company['handle']] = org_chart

        # Extract decision makers
        for node in org_chart.get('nodes', []):
            if node.get('decision_score', 0) >= DECISION_THRESHOLD:
                # Find original person data
                person_data = next(
                    (p for p in company_people if p.get('profile_url') == node['id']),
                    {}
                )

                decision_makers.append({
                    'name': node['name'],
                    'title': node['title'],
                    'company': company['name'],
                    'company_handle': company['handle'],
                    'decision_score': node['decision_score'],
                    'title_level': node.get('title_level', 'Unknown'),
                    'dept': node.get('dept', 'Unknown'),
                    'profile_url': node['id'],
                    'avatar_url': person_data.get('avatar_url', ''),
                    'connection_degree': person_data.get('connection_degree', ''),
                    'followers': person_data.get('followers', ''),
                    'yoe_current': node.get('yoe_current', 0),
                    'connection_count': person_data.get('connection_count', 0)
                })

        print(f"  βœ… Found {len([n for n in org_chart.get('nodes', []) if n.get('decision_score', 0) >= DECISION_THRESHOLD])} decision makers")
    else:
        print(f"  ❌ Failed to generate org chart")

print(f"\nβœ… Processed {len(org_charts)} companies")
print(f"🎯 Found {len(decision_makers)} total decision makers")
```
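Because the LLM's JSON can come back malformed or incomplete, it is worth validating the structure before trusting it downstream. A minimal checker for the schema requested by the prompt above (this helper is an addition, not part of the original pipeline):

```python
def validate_org_chart(chart: dict) -> list:
    """Return a list of problems found in an org-chart dict; empty means usable."""
    problems = []
    nodes = chart.get("nodes")
    if not isinstance(nodes, list) or not nodes:
        problems.append("missing or empty 'nodes' list")
        return problems
    ids = set()
    for n in nodes:
        if "id" not in n or "name" not in n:
            problems.append(f"node missing id/name: {n}")
            continue
        ids.add(n["id"])
        score = n.get("decision_score")
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            problems.append(f"bad decision_score for {n['name']}")
    for e in chart.get("edges", []):
        if e.get("source") not in ids or e.get("target") not in ids:
            problems.append(f"edge references unknown node: {e}")
    return problems

good = {"nodes": [{"id": "u1", "name": "Ana", "decision_score": 0.9}], "edges": []}
print(validate_org_chart(good))  # → []
```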

## Step 6: Export Decision Makers

We extract and rank individuals with high decision-making potential based on their inferred organizational position. The system filters employees by decision score (0.5 or higher, the `DECISION_THRESHOLD` set in Step 1), enriches their profiles with company context, and exports them to a CSV file. This creates a prioritized contact list for B2B sales teams, focusing effort on the individuals most likely to influence purchasing decisions.

```python
# Create decision makers DataFrame and export to CSV
if decision_makers:
    df_decision_makers = pd.DataFrame(decision_makers)

    # Sort by decision score
    df_decision_makers = df_decision_makers.sort_values(
        'decision_score', ascending=False
    )

    # Save to CSV
    df_decision_makers.to_csv('/content/output/decision_makers.csv', index=False)

    print("πŸ“Š Top 10 Decision Makers:")
    print("=" * 80)

    for _, person in df_decision_makers.head(10).iterrows():
        print(f"{person['name']:<30} | {person['title']:<40} | Score: {person['decision_score']:.2f}")
        print(f"  Company: {person['company']}")
        print(f"  Level: {person['title_level']} | Dept: {person['dept']}")
        print("-" * 80)
else:
    print("⚠️ No decision makers found")
```

## Step 7: Generate Interactive Visualization

This step assembles all components into an interactive web application. We copy the HTML template and JavaScript files, ensure all generated data files (company graph, org charts, people data) are in place, and prepare the visualization environment. The template includes a network graph viewer (vis.js), organizational chart displays, and an AI chat interface for querying the knowledge base. This creates a complete, self-contained dashboard.

```python
# Copy template files and AI.js
!cp /content/templates/graph_view_template.html /content/output/graph_view.html
!cp /content/templates/ai.js /content/output/

# people.jsonl is already in /content/output, where the chat interface reads it

print("βœ… Visualization files prepared!")
print("\nπŸ“ Output directory contents:")
!ls -la /content/output/
```
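Before launching the server, a quick sanity check that everything the dashboard needs is actually on disk helps avoid a blank iframe later. The exact file list the template loads is an assumption here, based on the files this notebook writes:

```python
from pathlib import Path

# Assumed checklist of files the dashboard template is expected to load
REQUIRED_FILES = ["graph_view.html", "ai.js", "company_graph.json",
                  "decision_makers.csv", "people.jsonl"]

def missing_files(output_dir: str) -> list:
    """Return the names from REQUIRED_FILES not present in output_dir."""
    out = Path(output_dir)
    return [name for name in REQUIRED_FILES if not (out / name).exists()]

missing = missing_files("/content/output")
print("✅ All files present" if not missing else f"⚠️ Missing: {missing}")
```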

## Step 8: Launch Web Server and Display

Finally, we start a local HTTP server to serve the visualization files and display them within Colab using an iframe. The server runs in the background on port 8000, and Colab's proxy system allows secure access to the dashboard. Users can interact with the graph, click on companies to view organizational structures, and use the AI chat to query the data. This step brings together all previous work into a live, interactive B2B intelligence tool.

```python
# Start a simple HTTP server in the background
import subprocess
import time

# Kill any existing server on port 8000
!kill -9 $(lsof -t -i:8000) 2>/dev/null || true

# Start the server
server_process = subprocess.Popen(
    ['python', '-m', 'http.server', '8000', '--directory', '/content/output'],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL
)

# Wait for server to start
time.sleep(2)

print("🌐 Web server started on port 8000")
print("πŸ“Š Loading visualization...")
```
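A fixed `time.sleep(2)` usually suffices, but polling until the port actually accepts connections is more robust on slow runtimes. A small helper sketch (an addition to the notebook; port 8000 matches the server started above):

```python
import socket
import time

def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 10.0) -> bool:
    """Poll until something accepts TCP connections on host:port, or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.2)
    return False

# Usage after starting the server:
#     if wait_for_port(8000):
#         print("🌐 Server is up")
```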

```python
# Display the visualization in an iframe
from IPython.display import IFrame

# Note: In Colab, we need to use the proxy URL
from google.colab.output import eval_js
proxy_url = eval_js("google.colab.kernel.proxyPort(8000)")

print("πŸŽ‰ Your LinkedIn Insights Dashboard is ready!")
print("\nπŸ“Œ Instructions:")
print("1. The graph shows company relationships")
print("2. Click on a company to see its org chart")
print("3. Use the chat button to ask questions about the data")
print("4. Don't forget to set your OpenAI API key in Settings for chat to work")
print("\n⚠️ Note: If the visualization doesn't load, re-run this cell")

# Display the iframe
IFrame(src=proxy_url + "/graph_view.html", width='100%', height='800')
```

```python
# Get public URL for the server
from google.colab.output import eval_js
public_url = eval_js("google.colab.kernel.proxyPort(8000)")

print(f"🌐 Public URL: {public_url}")
print("\nπŸ“± Share this URL to access your dashboard from anywhere!")
print("⚠️ Note: This URL is temporary and will expire when the Colab session ends")
```

## 🎯 Summary & Next Steps

Congratulations! You've built a complete B2B intelligence system that:

βœ… **Analyzed** company similarities using AI embeddings  
βœ… **Inferred** organizational structures with LLM  
βœ… **Identified** key decision makers  
βœ… **Visualized** everything in an interactive dashboard  

### Download Your Results

Run the cell below to download all generated files:

```python
# Create a zip file with all outputs
!cd /content/output && zip -r linkedin_insights.zip *.json *.csv *.html *.js *.jsonl

# Download the zip file
from google.colab import files
files.download('/content/output/linkedin_insights.zip')

print("πŸ“¦ All files packaged and ready for download!")
```

## πŸš€ What's Next?

### Enhance Your System:
1. **Add more data sources** - Combine with CRM, news, social media
2. **Improve scoring** - Use more sophisticated algorithms
3. **Track changes** - Monitor company/people updates over time
4. **Export to CRM** - Integrate with Salesforce, HubSpot, etc.

### Production Deployment:
1. **Host the dashboard** - Deploy to Vercel, Netlify, or AWS
2. **Add authentication** - Secure your intelligence data
3. **Schedule updates** - Automate data refreshes
4. **Scale the pipeline** - Process thousands of companies

## Connect & Learn More

- πŸ™ **GitHub**: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- 🐦 **Follow on X**: [@unclecode](https://twitter.com/unclecode)
- πŸ’¬ **Join our Discord**: [discord.gg/gpPZZgzRAP](https://discord.gg/gpPZZgzRAP)

Thank you for joining this workshop! πŸ™

---

Live Long and Build Intelligent Systems πŸ––