tutorials/pygwalker_complete_tutorial.ipynb
Transform your DataFrame into a Tableau-style interface with just one line of code!
Documentation | GitHub | Discord Community
Tutorial Info:
Track your learning journey:
Mark them as you go! ✅
<a id="welcome"></a>
Hey there, data explorer!
If you've ever wanted the power of tools like Tableau or Power BI but inside your Python environment, you're in the right place. PyGWalker (Python binding of Graphic Walker) is here to make your data exploration journey smooth, intuitive, and honestly... pretty fun!
This tutorial will walk you through everything you need to know to become a PyGWalker pro. Whether you're a beginner or an experienced data scientist, you'll find something valuable here.
Required:
Helpful (but not required):
Don't worry if you're new! We explain everything as we go.
Let's dive in!
PyGWalker (pronounced "Pig Walker" 🐷) is a Python library that turns your pandas DataFrame into an interactive, Tableau-style user interface for visual exploration.
One-Line Magic
Drag-and-Drop Interface
Lightning Fast
Rich Visualization Options
Exploratory Data Analysis (EDA) Supercharged
| Feature | PyGWalker | Matplotlib/Seaborn | Tableau/Power BI |
|---|---|---|---|
| Code Required | 1 line | Many lines | No code |
| Interactive | ✅ Yes | ❌ No | ✅ Yes |
| Python Integration | ✅ Native | ✅ Native | ⚠️ Limited |
| Learning Curve | 🟢 Easy | 🟡 Medium | 🟡 Medium |
| Cost | 🟢 Free | 🟢 Free | 🔴 Paid (mostly) |
| Jupyter/Colab | ✅ Perfect | ✅ Good | ❌ No |
| Drag-and-Drop | ✅ Yes | ❌ No | ✅ Yes |
✅ Perfect for:
⚠️ Maybe not ideal for:
Throughout this tutorial, we'll explore PyGWalker using real-world datasets. Here's a sneak peek:
By the end, you'll be able to:
Let's get PyGWalker installed and ready to roll! This should take less than a minute.
# Install PyGWalker
!pip install pygwalker -q
print("โ
PyGWalker installed successfully!")
# Import necessary libraries
import pandas as pd
import pygwalker as pyg
import warnings
warnings.filterwarnings('ignore')
# Check PyGWalker version
print(f"๐ท PyGWalker version: {pyg.__version__}")
print("โ
All imports successful!")
PyGWalker comes with everything you need:
Compatibility:
Note: PyGWalker works best with pandas DataFrames, so make sure your data is in that format!
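If your data isn't a DataFrame yet, a quick conversion is usually all you need. A minimal sketch (the array and record values below are made-up examples, not part of this tutorial's datasets):

# Minimal sketch: turning common structures into DataFrames before walking
import numpy as np
import pandas as pd

# From a NumPy array (column names are our own example labels)
arr = np.random.rand(100, 3)
df_from_array = pd.DataFrame(arr, columns=['feature_a', 'feature_b', 'feature_c'])

# From a list of records (e.g., parsed JSON from an API)
records = [{'species': 'Adelie', 'mass': 3700}, {'species': 'Gentoo', 'mass': 5076}]
df_from_records = pd.DataFrame.from_records(records)

# Both are now ready for pyg.walk(...)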
Let's jump right in with a fun dataset: Palmer Penguins 🐧
This dataset contains measurements of 3 penguin species from islands in Antarctica. It's perfect for learning because:
Let's load it and create our first interactive visualization!
# Load the Palmer Penguins dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url)
print("๐ง Dataset loaded successfully!")
print(f"๐ Shape: {df.shape[0]} rows ร {df.shape[1]} columns")
print("\n" + "="*50)
print("First look at our penguin friends:")
print("="*50)
df.head()
Let's see what we're working with:
# Dataset overview
print("Dataset Information:")
print("="*50)
df.info()

print("\nBasic Statistics:")
print("="*50)
print(df.describe())  # wrap in print() so it displays mid-cell

print("\n🐧 Penguin Species in our dataset:")
print("="*50)
print(df['species'].value_counts())
Our penguin dataset includes:
Now, let's see the magic! ✨
Here it comes... the moment you've been waiting for!
Watch how ONE single line transforms our DataFrame into a full-fledged interactive visualization tool:
# THE MAGIC LINE
pyg.walk(df)
What just happened?
With that single line of code, you now have:
Left Panel - Fields:
Main Canvas - Visualization Area:
Top Bar - Controls:
Scatter Plot:
- bill_length_mm to X-axis
- bill_depth_mm to Y-axis
- species to Color

Bar Chart:
- species to X-axis
- body_mass_g to Y-axis

Distribution:
- flipper_length_mm to X-axis
- species to Color

Take a few minutes to play around! There's no wrong way to explore. 💪
PyGWalker's interface is divided into key areas:
Encoding Shelves (where you drag fields):
Marks Shelf:
Filters:
The interface automatically suggests the best visualization based on what you drag. Smart, right?
Now that you've seen the magic, let's dive deeper into what makes PyGWalker so powerful!
We'll explore:
PyGWalker is flexible! You can feed it data from multiple sources. Let's see how:
# Method 1: From a CSV file (what we just did!)
df_from_url = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
print("โ
Method 1: Loaded from URL")
# Method 2: From a local CSV (if you have one)
# df_from_local = pd.read_csv('your_file.csv')
# Method 3: From a dictionary
data_dict = {
'species': ['Adelie', 'Gentoo', 'Chinstrap'],
'avg_mass': [3700, 5076, 3733],
'count': [152, 124, 68]
}
df_from_dict = pd.DataFrame(data_dict)
print("โ
Method 2: Created from dictionary")
# Method 4: From Excel (requires openpyxl)
# df_from_excel = pd.read_excel('your_file.xlsx')
# Method 5: From SQL, APIs, web scraping... anything that becomes a DataFrame!
print("\n๐ฏ The key point: If it's a pandas DataFrame, PyGWalker can visualize it!")
PyGWalker works best when your data is clean! Before using pyg.walk(), consider:
✅ Good practices:
⚠️ PyGWalker will still work with messy data, but clean data = better insights!
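Here's a rough sketch of those practices as a reusable helper. The function name and the thresholds are our own illustration, not part of PyGWalker:

import pandas as pd

def prepare_for_walk(df):
    """Light cleanup pass before handing a DataFrame to pyg.walk()."""
    out = df.copy()
    # Parse date-like columns so PyGWalker treats them as temporal fields
    for col in out.columns:
        if 'date' in col.lower():
            out[col] = pd.to_datetime(out[col], errors='coerce')
    # Low-cardinality strings behave nicely as categories
    for col in out.select_dtypes(include='object').columns:
        if out[col].nunique() < 50:
            out[col] = out[col].astype('category')
    return out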
PyGWalker supports a wide variety of chart types. Let's explore the most useful ones with our penguin friends! 🐧
Best for: Exploring relationships between two numerical variables
Let's investigate: Do penguins with longer bills also have deeper bills?
# Let's create a clean version of our dataset for visualization
df_clean = df.dropna() # Remove rows with missing values
print(f"๐งน Cleaned dataset: {df_clean.shape[0]} rows (removed {df.shape[0] - df_clean.shape[0]} rows with missing data)")
print("\n๐จ Now let's visualize! Run the cell below:")
# Scatter plot exploration
pyg.walk(df_clean, hide_data_source_config=True)
Step-by-step instructions:

1. Create the basic scatter:
   - bill_length_mm to X-axis
   - bill_depth_mm to Y-axis

2. Add species distinction:
   - species to Color in the Marks shelf

3. Add more context:
   - sex to Shape
   - body_mass_g to Size

Insights to look for:

Pro move: Try changing the chart type using the dropdown at the top. PyGWalker suggests the best type automatically!
Best for: Comparing values across different groups
Let's investigate: Which penguin species is the heaviest on average?
# Bar chart exploration
pyg.walk(df_clean, hide_data_source_config=True)
Creating a comparative bar chart:

1. Basic bar chart:
   - species to X-axis
   - body_mass_g to Y-axis

2. Compare by island:
   - island to Color

3. Change aggregation:
   - Click body_mass_g in the Y-axis shelf and switch the aggregation

4. Flip it:
   - Swap the X and Y fields for a horizontal bar chart

Did you notice? Gentoo penguins are significantly heavier! 💪🐧
Interesting discovery: Our penguin dataset doesn't have a time dimension!
But that's okay - let's create one to demonstrate line charts:
# Let's create a time-series dataset for demonstration
import numpy as np
# Simulate penguin population monitoring over months
months = pd.date_range('2023-01-01', periods=12, freq='M')
penguin_trends = pd.DataFrame({
'month': months,
'Adelie_count': np.random.randint(45, 55, 12),
'Gentoo_count': np.random.randint(35, 45, 12),
'Chinstrap_count': np.random.randint(20, 30, 12)
})
# Reshape for PyGWalker
penguin_trends_long = penguin_trends.melt(
id_vars=['month'],
var_name='species',
value_name='count'
)
print("๐ Time-series data created!")
penguin_trends_long.head(10)
# Line chart exploration
pyg.walk(penguin_trends_long, hide_data_source_config=True)
1. Create the trend line:
   - month to X-axis
   - count to Y-axis

2. Compare species:
   - species to Color

3. Add markers:
   - Experiment with the mark types to show points along the line

Use case: Line charts are perfect for time-series data, trends, and sequential patterns!
Best for: Seeing the distribution and frequency of values
Let's investigate: How are flipper lengths distributed across our penguins?
# Back to our main penguin dataset
pyg.walk(df_clean, hide_data_source_config=True)
1. Create a histogram:
   - flipper_length_mm to X-axis

2. See by species:
   - species to Color

3. Adjust bins:
   - Tweak the binning to show more or less detail

Insight: You'll notice three distinct peaks - one for each species! This is called a "trimodal distribution." 🎯
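If you want to confirm that insight numerically, one line of pandas summarizes flipper length per species:

# Sanity-check the histogram insight: one clear cluster per species?
print(df_clean.groupby('species')['flipper_length_mm'].describe().round(1))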
Best for: Showing values across two categorical dimensions
# Create a summary table for heatmap
summary_df = df_clean.groupby(['species', 'island']).agg({
'body_mass_g': 'mean',
'bill_length_mm': 'mean',
'flipper_length_mm': 'mean'
}).reset_index().round(1)
print("๐ Summary statistics by species and island:")
summary_df
# Heatmap exploration
pyg.walk(summary_df, hide_data_source_config=True)
Create a heatmap:
- species to X-axis
- island to Y-axis
- body_mass_g to Color

Insights at a glance:
Use case: Heatmaps are perfect for correlation matrices, confusion matrices, or any 2D categorical comparison!
Now let's level up! These features will make you a PyGWalker power user.
Filters help you drill down into specific subsets of your data.
Let's explore: Male vs Female penguins by species
# Full dataset for filtering demo
pyg.walk(df_clean, hide_data_source_config=True)
1. Add a filter:
   - Drag sex to the Filters shelf and choose only "Male"

2. Multiple filters:
   - Add a second filter on island

3. Dynamic filtering:
   - Toggle filter values and watch the chart update

Pro tip: Filters don't change your DataFrame - they just change what's displayed! Your original data stays safe. 🛡️
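If you ever need the same subset outside the UI, the pandas equivalent is a boolean mask. A minimal sketch (the island value is just one example from the dataset):

# Pandas equivalent of the UI filters - df_clean itself stays untouched
male_torgersen = df_clean[(df_clean['sex'] == 'Male') & (df_clean['island'] == 'Torgersen')]
pyg.walk(male_torgersen, hide_data_source_config=True)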
PyGWalker automatically aggregates data when needed. Let's master this!
pyg.walk(df_clean, hide_data_source_config=True)
Understanding auto-aggregation:
- species to X-axis
- body_mass_g to Y-axis

Change aggregation type:
- Click body_mass_g in the Y-axis shelf

Multiple measures:
- Add both body_mass_g and flipper_length_mm

Real-world use: Aggregations are crucial for sales reports, KPI dashboards, and summary statistics!
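It's good practice to cross-check the numbers the UI shows you. The same aggregation in plain pandas:

# Cross-check PyGWalker's aggregation with a pandas groupby
agg_check = df_clean.groupby('species').agg(
    mean_mass=('body_mass_g', 'mean'),
    mean_flipper=('flipper_length_mm', 'mean'),
    n_rows=('body_mass_g', 'count'),
).round(1)
print(agg_check)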
Make patterns jump out by ordering your data!
Sort a bar chart:
- species on X, body_mass_g (mean) on Y

Sort by multiple fields:

Use case: Rankings, top N analysis, identifying outliers
Create new metrics on-the-fly without modifying your DataFrame!
Example: Let's calculate Body Mass Index (sort of) for penguins
While PyGWalker has calculated field capabilities, the exact implementation varies by version.
Alternative approach - Create in pandas first:
# Add calculated fields to our DataFrame
df_enhanced = df_clean.copy()
# Bill ratio: length to depth
df_enhanced['bill_ratio'] = (df_enhanced['bill_length_mm'] / df_enhanced['bill_depth_mm']).round(2)
# Mass category
df_enhanced['size_category'] = pd.cut(
df_enhanced['body_mass_g'],
bins=[0, 3500, 4500, 6500],
labels=['Small', 'Medium', 'Large']
)
# Flipper to mass ratio (efficiency!)
df_enhanced['flipper_mass_ratio'] = (df_enhanced['flipper_length_mm'] / df_enhanced['body_mass_g'] * 1000).round(2)
print("โจ Enhanced dataset with calculated fields!")
df_enhanced[['species', 'bill_ratio', 'size_category', 'flipper_mass_ratio']].head()
# Explore the enhanced dataset
pyg.walk(df_enhanced, hide_data_source_config=True)
Explore bill ratio:
- bill_ratio to X-axis
- species to Color

Use size categories:
- size_category to X-axis
- species to Color

Efficiency analysis:
- body_mass_g vs flipper_mass_ratio
- Color by species

Real-world use: Calculated fields are essential for:
Make your visualizations beautiful and professional! ✨
# PyGWalker with custom configuration
pyg.walk(
df_enhanced,
hide_data_source_config=True,
spec="./config.json", # Save/load your chart configurations (optional)
kernel_computation=True # Better performance for large datasets
)
Visual Styling:
Interface Options:
- hide_data_source_config=True: cleaner interface, hides the data source panel
- appearance='light' or appearance='dark': theme preference (older releases used a dark parameter instead)
- kernel_computation=True: offload calculations to the Python kernel (faster!)

Saving Your Work:

Pro tip: You can save your PyGWalker configuration and load it later for consistent visualizations! 🎯
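One more sharing option: recent PyGWalker releases can render the explorer to a standalone HTML file via pyg.to_html (check your version's docs - the call may not exist in older releases):

# Sketch: export the explorer as a self-contained HTML page you can share
html_str = pyg.to_html(df_enhanced)
with open('penguin_explorer.html', 'w', encoding='utf-8') as f:
    f.write(html_str)
print("Saved penguin_explorer.html - open it in any browser")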
You've learned A LOT! Let's recap:
✅ Data Loading: URLs, files, dictionaries - any DataFrame works!
✅ Chart Types:
✅ Advanced Features:
✅ Customization:
You're now a PyGWalker intermediate user! 🎉
Next up, we'll dive into real-world use cases and best practices. Ready?
Time to see PyGWalker in action with scenarios you'll actually face!
We'll explore:
Each example includes a realistic dataset and step-by-step analysis. Let's go!
Scenario: You're a data analyst at an e-commerce company. Your manager wants insights on:
Let's create a realistic sales dataset and analyze it!
# Create a realistic sales dataset
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Set seed for reproducibility
np.random.seed(42)
# Generate dates for the last 12 months
date_range = pd.date_range(end=datetime.now(), periods=365, freq='D')
# Product categories
products = ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Smartwatch', 'Camera']
regions = ['North America', 'Europe', 'Asia', 'South America']
channels = ['Online', 'Retail']
# Generate sales data
n_records = 1000
sales_data = pd.DataFrame({
'date': np.random.choice(date_range, n_records),
'product': np.random.choice(products, n_records),
'region': np.random.choice(regions, n_records),
'channel': np.random.choice(channels, n_records),
'units_sold': np.random.randint(1, 50, n_records),
'unit_price': np.random.uniform(50, 2000, n_records).round(2),
})
# Calculate revenue
sales_data['revenue'] = (sales_data['units_sold'] * sales_data['unit_price']).round(2)
# Add some seasonality (higher sales in Nov-Dec)
sales_data.loc[sales_data['date'].dt.month.isin([11, 12]), 'revenue'] *= 1.5
sales_data['revenue'] = sales_data['revenue'].round(2)
# Sort by date
sales_data = sales_data.sort_values('date').reset_index(drop=True)
print("๐ฐ Sales dataset created!")
print(f"๐ Records: {len(sales_data):,}")
print(f"๐ต Total Revenue: ${sales_data['revenue'].sum():,.2f}")
print(f"๐
Date Range: {sales_data['date'].min().date()} to {sales_data['date'].max().date()}")
print("\n" + "="*60)
sales_data.head(10)
# Quick overview of sales data
print("๐ Sales Summary Statistics:\n")
print(sales_data.describe())
print("\n" + "="*60)
print("๐ฏ Sales by Product:\n")
print(sales_data.groupby('product')['revenue'].agg(['sum', 'mean', 'count']).sort_values('sum', ascending=False))
# Let's analyze! ๐
pyg.walk(sales_data, hide_data_source_config=True)
Exercise 1: Revenue by Product
- product to X-axis
- revenue to Y-axis (will auto-aggregate to SUM)

Exercise 2: Sales Trends Over Time
- date to X-axis
- revenue to Y-axis
- product to Color

Exercise 3: Regional Performance
- region vs revenue
- channel to Color

Exercise 4: Profitability Analysis
- units_sold (X) vs revenue (Y)
- product to Color
- unit_price to Size

Exercise 5: Monthly Trends
- Group revenue by month (see the sketch after this list)

Pro tip: Save these visualizations and share them with your team! Export as PNG for presentations.
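For Exercise 5, it helps to derive an explicit month column first. A small sketch using the sales_data frame from above:

# Sketch for Exercise 5: group revenue by calendar month
sales_monthly = sales_data.copy()
sales_monthly['month'] = sales_monthly['date'].dt.to_period('M').astype(str)
monthly_revenue = sales_monthly.groupby(['month', 'product'], as_index=False)['revenue'].sum()
pyg.walk(monthly_revenue, hide_data_source_config=True)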
Scenario: You work for a subscription service. Marketing wants to understand customer segments for targeted campaigns.
Goals:
Let's create customer data and segment it!
# Create customer dataset
np.random.seed(123)
n_customers = 500
customer_data = pd.DataFrame({
'customer_id': range(1, n_customers + 1),
'age': np.random.randint(18, 70, n_customers),
'subscription_months': np.random.randint(1, 36, n_customers),
'monthly_spend': np.random.uniform(10, 200, n_customers).round(2),
'login_frequency': np.random.randint(1, 30, n_customers),
'support_tickets': np.random.randint(0, 10, n_customers),
'referrals': np.random.randint(0, 5, n_customers),
})
# Calculate lifetime value
customer_data['lifetime_value'] = (
customer_data['subscription_months'] * customer_data['monthly_spend']
).round(2)
# Create engagement score
customer_data['engagement_score'] = (
(customer_data['login_frequency'] * 2) +
(customer_data['referrals'] * 10) -
(customer_data['support_tickets'] * 3)
)
# Create segments based on lifetime value
customer_data['value_segment'] = pd.cut(
customer_data['lifetime_value'],
bins=[0, 500, 2000, 10000],
labels=['Low Value', 'Medium Value', 'High Value']
)
# Create age groups
customer_data['age_group'] = pd.cut(
customer_data['age'],
bins=[0, 25, 35, 50, 100],
labels=['18-25', '26-35', '36-50', '50+']
)
# Churn prediction (synthetic)
customer_data['churn_risk'] = np.where(
(customer_data['engagement_score'] < 20) & (customer_data['subscription_months'] < 6),
'High Risk',
np.where(
(customer_data['engagement_score'] < 40) & (customer_data['subscription_months'] < 12),
'Medium Risk',
'Low Risk'
)
)
print("๐ฅ Customer dataset created!")
print(f"๐ Total Customers: {len(customer_data):,}")
print(f"๐ฐ Average Lifetime Value: ${customer_data['lifetime_value'].mean():,.2f}")
print(f"โ ๏ธ High Churn Risk: {(customer_data['churn_risk'] == 'High Risk').sum()} customers")
print("\n" + "="*60)
customer_data.head(10)
# Customer segments overview
print("๐ Customer Segmentation Overview:\n")
print("By Value Segment:")
print(customer_data['value_segment'].value_counts())
print("\nBy Churn Risk:")
print(customer_data['churn_risk'].value_counts())
print("\nBy Age Group:")
print(customer_data['age_group'].value_counts())
# Analyze customer segments!
pyg.walk(customer_data, hide_data_source_config=True)
Exercise 1: Value Segment Distribution
- value_segment (X) vs customer_id count (Y)
- churn_risk to Color

Exercise 2: Engagement Analysis
- login_frequency (X) vs monthly_spend (Y)
- value_segment to Color
- subscription_months to Size

Exercise 3: Age Demographics
- age_group (X) vs lifetime_value average (Y)

Exercise 4: Churn Risk Factors
- churn_risk (X) vs engagement_score (Y)
- churn_risk vs support_tickets

Exercise 5: Referral Champions
- Filter to referrals >= 2
- subscription_months vs lifetime_value
- Color by age_group

Segments to watch for:
- High-Value + High Churn Risk
- Medium Value + High Engagement
- Low Value + Young Demographic
- Low Engagement + Any Value

Real-world impact: Customer segmentation can increase ROI by 200%+ on marketing campaigns! 🎯
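To put numbers on those segments outside the UI, a quick cross-tab works well:

# How many customers sit in each value-segment / churn-risk cell?
segment_matrix = pd.crosstab(customer_data['value_segment'], customer_data['churn_risk'])
print(segment_matrix)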
Scenario: You've received a new dataset from a vendor. Before analysis, you need to check data quality!
What to look for:
PyGWalker is PERFECT for visual data quality checks!
# Create a "messy" dataset for quality checking
np.random.seed(456)
n_records = 300
messy_data = pd.DataFrame({
'transaction_id': range(1, n_records + 1),
'amount': np.random.uniform(10, 1000, n_records),
'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Other', None], n_records),
'quantity': np.random.randint(1, 20, n_records),
'customer_age': np.random.randint(15, 80, n_records),
'rating': np.random.choice([1, 2, 3, 4, 5, None], n_records),
})
# Introduce data quality issues
# 1. Missing values (10% in category, 15% in rating)
messy_data.loc[np.random.choice(messy_data.index, 30), 'category'] = None
messy_data.loc[np.random.choice(messy_data.index, 45), 'rating'] = None
# 2. Outliers in amount (some crazy high values)
messy_data.loc[np.random.choice(messy_data.index, 5), 'amount'] = np.random.uniform(5000, 10000, 5)
# 3. Impossible values (negative amounts, ages > 100)
messy_data.loc[np.random.choice(messy_data.index, 3), 'amount'] = -np.random.uniform(10, 100, 3)
messy_data.loc[np.random.choice(messy_data.index, 4), 'customer_age'] = np.random.randint(100, 150, 4)
# 4. Duplicates
duplicate_rows = messy_data.sample(10)
messy_data = pd.concat([messy_data, duplicate_rows], ignore_index=True)
print("๐ Messy dataset created (intentionally flawed!):")
print(f"๐ Total Records: {len(messy_data):,}")
print(f"โ Missing values: {messy_data.isnull().sum().sum()}")
print(f"๐ Duplicate rows: {messy_data.duplicated().sum()}")
print(f"โ ๏ธ Negative amounts: {(messy_data['amount'] < 0).sum()}")
print(f"โ ๏ธ Invalid ages: {(messy_data['customer_age'] > 100).sum()}")
print("\n" + "="*60)
messy_data.head(15)
# Check missing values
print("๐ Missing Values Report:\n")
missing_report = pd.DataFrame({
'Column': messy_data.columns,
'Missing': messy_data.isnull().sum(),
'Percentage': (messy_data.isnull().sum() / len(messy_data) * 100).round(2)
})
print(missing_report)
# Visual data quality check!
pyg.walk(messy_data, hide_data_source_config=True)
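The exercises below will ask you to flag problem rows (Exercise 5). Here's a minimal pandas sketch - the has_issues column name is our own:

# Sketch: flag rows with known quality problems, then color by the flag in PyGWalker
flagged = messy_data.copy()
flagged['has_issues'] = (flagged['customer_age'] > 100) | (flagged['amount'] < 0)
print(f"Rows flagged: {flagged['has_issues'].sum()}")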
Exercise 1: Spot Outliers in Amount
- amount to X-axis

Exercise 2: Find Missing Value Patterns
- Create is_missing_category (or filter by null)
- Check category counts - see how many nulls exist

Exercise 3: Detect Impossible Values
- transaction_id (X) vs customer_age (Y)
- Check amount - look for negative values

Exercise 4: Check Distributions
- rating distribution
- Compare by category

Exercise 5: Identify Patterns in Quality Issues
- has_issues = TRUE if (age > 100 OR amount < 0) - see the sketch above

After visual inspection, here's what to fix:
# Clean the messy data based on insights
messy_data_cleaned = messy_data.copy()
# 1. Remove duplicates
messy_data_cleaned = messy_data_cleaned.drop_duplicates()
# 2. Fix impossible values
messy_data_cleaned = messy_data_cleaned[
(messy_data_cleaned['amount'] >= 0) &
(messy_data_cleaned['customer_age'] <= 100)
]
# 3. Handle outliers (cap at 99th percentile)
amount_99th = messy_data_cleaned['amount'].quantile(0.99)
messy_data_cleaned.loc[messy_data_cleaned['amount'] > amount_99th, 'amount'] = amount_99th
# 4. Fill missing categories
messy_data_cleaned['category'] = messy_data_cleaned['category'].fillna('Unknown')
print("โ
Data cleaned!")
print(f"๐ Records before: {len(messy_data):,} โ after: {len(messy_data_cleaned):,}")
print(f"โจ Removed: {len(messy_data) - len(messy_data_cleaned):,} problematic records")
print(f"โ Missing values: {messy_data_cleaned.isnull().sum().sum()}")
# Compare before and after! ๐
print("Let's visualize the cleaned data:")
pyg.walk(messy_data_cleaned, hide_data_source_config=True)
Always Check Before Analysis: ✅
PyGWalker for QA:
Real-world impact: Catching data quality issues early saves hours (or days!) of debugging later! 🎯
Scenario: Your product team ran an A/B test on a new feature. You need to analyze if variant B performs better than variant A.
Metrics to compare:
Let's analyze test results!
# Create A/B test dataset
np.random.seed(789)
n_users = 1000
# Variant B performs slightly better (simulate this)
variant_a_users = n_users // 2
variant_b_users = n_users - variant_a_users
ab_test_data = pd.DataFrame({
'user_id': range(1, n_users + 1),
'variant': ['A'] * variant_a_users + ['B'] * variant_b_users,
'converted': (
list(np.random.choice([0, 1], variant_a_users, p=[0.75, 0.25])) + # A: 25% conversion
list(np.random.choice([0, 1], variant_b_users, p=[0.65, 0.35])) # B: 35% conversion
),
'time_on_page': np.concatenate([
np.random.uniform(30, 180, variant_a_users), # A: average 105 seconds
np.random.uniform(45, 210, variant_b_users) # B: average 127 seconds
]),
'pages_viewed': np.concatenate([
np.random.randint(1, 8, variant_a_users), # A: fewer pages
np.random.randint(2, 10, variant_b_users) # B: more pages
]),
})
# Add order value (only for converted users)
ab_test_data['order_value'] = 0.0  # start as float to avoid dtype upcasting issues
ab_test_data.loc[ab_test_data['converted'] == 1, 'order_value'] = np.random.uniform(20, 200, ab_test_data['converted'].sum())
# Round numeric columns
ab_test_data['time_on_page'] = ab_test_data['time_on_page'].round(1)
ab_test_data['order_value'] = ab_test_data['order_value'].round(2)
# Add day of test
ab_test_data['test_day'] = np.random.randint(1, 15, n_users)
print("๐งช A/B Test dataset created!")
print(f"๐ฅ Total Users: {len(ab_test_data):,}")
print(f"๐ Variant A: {variant_a_users:,} users")
print(f"๐ Variant B: {variant_b_users:,} users")
print("\n" + "="*60)
ab_test_data.head(10)
# Quick A/B test summary
print("๐ A/B Test Results Summary:\n")
summary = ab_test_data.groupby('variant').agg({
'converted': ['sum', 'mean'],
'time_on_page': 'mean',
'pages_viewed': 'mean',
'order_value': 'mean'
}).round(3)
summary.columns = ['Total Conversions', 'Conversion Rate', 'Avg Time (sec)', 'Avg Pages', 'Avg Order Value']
print(summary)
# Calculate lift
conv_rate_a = ab_test_data[ab_test_data['variant'] == 'A']['converted'].mean()
conv_rate_b = ab_test_data[ab_test_data['variant'] == 'B']['converted'].mean()
lift = ((conv_rate_b - conv_rate_a) / conv_rate_a * 100)
print(f"\n๐ Lift (B vs A): {lift:.1f}%")
print(f"{'๐ Variant B wins!' if lift > 0 else '๐ Variant A is better'}")
# Analyze the A/B test! ๐ฏ
pyg.walk(ab_test_data, hide_data_source_config=True)
Exercise 1: Conversion Rate Comparison
- variant (X) vs converted (Y, mean aggregation)

Exercise 2: Engagement Metrics
- variant (X) vs time_on_page (Y)
- Repeat for pages_viewed

Exercise 3: Order Value Analysis
- Filter to converted = 1 (only converted users)
- variant vs order_value

Exercise 4: Time-Series Check
- test_day (X) vs converted mean (Y)
- Color by variant

Exercise 5: Segment Analysis
- Bin time_on_page (low, medium, high engagement) - see the sketch after the pro tip below

What PyGWalker Shows:
What You Still Need:
Pro tip: Use PyGWalker for exploratory analysis, then confirm with statistical tests in Python!
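And for Exercise 5 above, the engagement bins can be built with pd.cut before walking. A sketch - the bin edges here are arbitrary:

# Sketch for Exercise 5: bin time_on_page into engagement segments
ab_segmented = ab_test_data.copy()
ab_segmented['engagement'] = pd.cut(
    ab_segmented['time_on_page'],
    bins=[0, 90, 150, float('inf')],
    labels=['low', 'medium', 'high']
)
pyg.walk(ab_segmented, hide_data_source_config=True)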
# Quick statistical test (bonus!)
from scipy import stats
# Chi-square test for conversion rate
contingency_table = pd.crosstab(ab_test_data['variant'], ab_test_data['converted'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print("๐ Statistical Significance Test (Chi-Square):")
print("="*60)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"\n{'โ
Statistically significant (p < 0.05)!' if p_value < 0.05 else 'โ Not statistically significant (p >= 0.05)'}")
print("\nConclusion:")
if p_value < 0.05 and conv_rate_b > conv_rate_a:
print("๐ Variant B is significantly better! Ship it! ๐")
elif p_value < 0.05:
print("๐ Variant A is significantly better. Keep the original.")
else:
print("๐คท No significant difference. Need more data or run longer.")
Before the Test:
During Analysis with PyGWalker:
Making the Decision: ✅
Real-world impact: Proper A/B analysis can increase revenue by 10-30% through optimized features!
You've completed all 4 real-world use cases! 🎉
Next up: Best practices and pro tips!
You've learned the fundamentals and seen real-world applications. Now let's level up with advanced techniques and best practices!
In this section:
PyGWalker is fast, but with large datasets, a few tweaks can make it even faster! ⚡
Performance Guidelines:
| Dataset Size | Performance | Recommendations |
|---|---|---|
| < 10K rows | 🟢 Excellent | Use as-is, no optimization needed |
| 10K - 100K | 🟡 Good | Consider sampling for exploration |
| 100K - 1M | 🟠 Moderate | Use sampling + kernel computation |
| > 1M rows | 🔴 Slow | Aggregate first or use a database |
Rule of thumb: If your DataFrame takes >2 seconds to display, it's time to optimize!
For initial exploration, you don't always need ALL the data!
# Create a large dataset for demonstration
import pandas as pd
import numpy as np
np.random.seed(42)
large_dataset = pd.DataFrame({
'date': pd.date_range('2020-01-01', periods=500000, freq='min'),
'user_id': np.random.randint(1, 10000, 500000),
'event_type': np.random.choice(['click', 'view', 'purchase', 'cart'], 500000),
'value': np.random.uniform(0, 100, 500000),
'session_duration': np.random.randint(10, 3600, 500000)
})
print(f"๐ Large dataset created: {len(large_dataset):,} rows")
print(f"๐พ Memory usage: {large_dataset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# โ Bad Practice: Using entire large dataset
# pyg.walk(large_dataset) # This will be slow!
# โ
Good Practice: Sample for exploration
sample_size = 10000
df_sample = large_dataset.sample(n=sample_size, random_state=42)
print(f"โ
Sampled {sample_size:,} rows for exploration")
print(f"๐ That's {(sample_size/len(large_dataset)*100):.1f}% of the data")
print(f"โก Speed improvement: ~{len(large_dataset)//sample_size}x faster!")
# Fast exploration with sampled data
pyg.walk(df_sample, hide_data_source_config=True, kernel_computation=True)
If you're analyzing trends, aggregate BEFORE visualizing!
# ❌ Bad practice: visualizing 500K raw records for time trends
# ✅ Good practice: aggregate first!
daily_summary = large_dataset.groupby([
large_dataset['date'].dt.date,
'event_type'
]).agg({
'user_id': 'nunique', # Unique users
'value': ['sum', 'mean'],
'session_duration': 'mean'
}).reset_index()
daily_summary.columns = ['date', 'event_type', 'unique_users', 'total_value', 'avg_value', 'avg_duration']
print(f"โ
Aggregated from {len(large_dataset):,} โ {len(daily_summary):,} rows")
print(f"โก That's a {len(large_dataset)//len(daily_summary)}x reduction!")
print("\nNow this will be lightning fast! โก")
daily_summary.head()
# Super fast visualization of aggregated data
pyg.walk(daily_summary, hide_data_source_config=True, kernel_computation=True)
Smaller data types = less memory = faster performance!
# Check current memory usage
print("๐ Memory Usage by Column (BEFORE optimization):")
print("="*60)
memory_before = large_dataset.memory_usage(deep=True)
print(memory_before)
print(f"\n๐พ Total: {memory_before.sum() / 1024**2:.2f} MB")
# โ
Optimize data types
large_dataset_optimized = large_dataset.copy()
# Convert object to category (huge savings!)
large_dataset_optimized['event_type'] = large_dataset_optimized['event_type'].astype('category')
# Use smaller int types
large_dataset_optimized['user_id'] = large_dataset_optimized['user_id'].astype('int32')
large_dataset_optimized['session_duration'] = large_dataset_optimized['session_duration'].astype('int16')
# Use float32 instead of float64
large_dataset_optimized['value'] = large_dataset_optimized['value'].astype('float32')
print("๐ Memory Usage by Column (AFTER optimization):")
print("="*60)
memory_after = large_dataset_optimized.memory_usage(deep=True)
print(memory_after)
print(f"\n๐พ Total: {memory_after.sum() / 1024**2:.2f} MB")
savings = (1 - memory_after.sum() / memory_before.sum()) * 100
print(f"\n๐ Memory savings: {savings:.1f}%!")
PyGWalker can offload calculations to your Python kernel for better performance!
# ✅ Best practice: enable kernel computation
pyg.walk(
    df_sample,
    kernel_computation=True,  # This is the magic parameter!
    hide_data_source_config=True
)
Before using PyGWalker on large datasets:
✅ Step 1: Check dataset size

✅ Step 2: Sample for exploration
- Use .sample() for initial analysis

✅ Step 3: Aggregate when possible

✅ Step 4: Optimize data types
- Use category for text with few unique values

✅ Step 5: Enable kernel computation
- Pass kernel_computation=True

✅ Step 6: Clean up first

Pro tip: Profile your code with %%time to measure improvements!
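If you prefer plain Python over cell magics, time.perf_counter works anywhere. A quick sketch comparing the full vs. sampled data on a representative operation:

import time

# Time the same groupby on the full dataset and on the sample
start = time.perf_counter()
large_dataset.groupby('event_type')['value'].mean()
full_time = time.perf_counter() - start

start = time.perf_counter()
df_sample.groupby('event_type')['value'].mean()
sample_time = time.perf_counter() - start

print(f"Full dataset: {full_time:.3f}s | Sample: {sample_time:.3f}s")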
How to integrate PyGWalker into your data science workflow efficiently!
Phase 1: Initial Exploration
# 1. Load your data
df = pd.read_csv("your_data.csv")
# 2. Quick overview
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
# 3. Basic statistics
df.describe()
# 4. PyGWalker for visual exploration ⭐
# Spend 10-15 minutes exploring interactively
pyg.walk(df, hide_data_source_config=True)
Phase 2: Deep Dive Analysis 🎯
After initial exploration, you'll have questions. Answer them systematically:
# Example: Based on PyGWalker exploration, you noticed something interesting
# Now create focused analysis
# 5. Clean and prepare data based on insights
df_clean = df.dropna(subset=['important_column'])
df_clean = df_clean[df_clean['value'] > 0]
# 6. Create calculated fields you identified as useful
df_clean['new_metric'] = df_clean['a'] / df_clean['b']
# 7. Explore the refined dataset
pyg.walk(df_clean, hide_data_source_config=True)
Phase 3: Documentation & Sharing
# 8. Export key visualizations
# Use PyGWalker's export button to save charts as PNG/SVG
# 9. Document insights in markdown cells
"""
Key Findings:
- Insight 1: [description]
- Insight 2: [description]
- Recommendation: [action items]
"""
# 10. Create final summary statistics
final_summary = df_clean.groupby('category').agg({
'metric1': 'mean',
'metric2': 'sum'
})
print(final_summary)
For Data Scientists:
For Analysts:
For Data Engineers:
For Business Users:
Running into problems? Here are solutions to the most common issues! 🔧
Symptoms:
<pygwalker.walker.Walker object at 0x...>Solutions:
# ✅ Solution 1: Make sure you're in a supported environment
import sys
print(f"Python version: {sys.version}")
print(f"Environment: {'Google Colab' if 'google.colab' in sys.modules else 'Other'}")

# ✅ Solution 2: Update PyGWalker to the latest version
# !pip install --upgrade pygwalker

# ✅ Solution 3: Restart the runtime and try again
# In Colab: Runtime > Restart runtime
Symptoms:
Solutions:
# ✅ Solution 1: Sample your data
df_sample = df.sample(min(10000, len(df)))
pyg.walk(df_sample)

# ✅ Solution 2: Use kernel computation
pyg.walk(df, kernel_computation=True)

# ✅ Solution 3: Drop unnecessary columns
df_slim = df[['col1', 'col2', 'col3']] # Only columns you need
pyg.walk(df_slim)
Symptoms:
Solutions:
💡 Understand auto-aggregation:
💡 Check data types:
# ✅ Fix data types
df['date_column'] = pd.to_datetime(df['date_column'])
df['category_column'] = df['category_column'].astype('category')
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
Symptoms:
Solutions:
# ✅ Option 1: Drop missing values
df_clean = df.dropna()

# ✅ Option 2: Fill missing values
df['column'] = df['column'].fillna(0)  # or mean, median, etc.

# ✅ Option 3: Create a "Missing" category
df['column'] = df['column'].fillna('Unknown')
Symptoms:
Solutions:
💡 Export as image:
💡 Save configuration:
# ✅ Save your chart setup as JSON
pyg.walk(df, spec="./my_chart_config.json")

# ✅ Load it later
pyg.walk(df, spec="./my_chart_config.json")
Symptoms:
Solutions:
# ✅ Explicitly set the theme
pyg.walk(df, appearance='light')  # or 'dark'

# ✅ For custom styling, modify after rendering
# (Advanced: requires CSS knowledge)
Debug Checklist: ✅
- Data looks correct (df.head())
- Data types are right (df.dtypes)

Get Help:
Advanced techniques that will make you a PyGWalker master! 🎯
Create once, reuse everywhere!
# ✅ Save your perfect chart setup
pyg.walk(df, spec="./sales_dashboard.json")
# Later, load it with new data (same structure)
df_new = pd.read_csv("next_month_data.csv")
pyg.walk(df_new, spec="./sales_dashboard.json")
# Instant dashboard with new data!
PyGWalker plays nicely with the Python ecosystem!
# Example: Use pandas for heavy preprocessing, PyGWalker for visualization
import pandas as pd
# Complex aggregation in pandas
summary = df.groupby(['category', 'region']).agg({
'revenue': ['sum', 'mean'],
'units': 'sum',
'customers': 'nunique'
}).reset_index()
summary.columns = ['category', 'region', 'total_revenue', 'avg_revenue', 'total_units', 'unique_customers']
# Beautiful visualization in PyGWalker
pyg.walk(summary)
Create interactive presentations with RISE + PyGWalker!
# Install RISE for slideshows
# !pip install RISE
# Then use PyGWalker in your slides for interactive demos
# Your audience can explore data in real-time!
Create a reusable data quality checker!
def data_quality_report(df):
    """
    Create a comprehensive data quality report with PyGWalker
    """
    import pandas as pd

    # Create quality metrics DataFrame
    quality_df = pd.DataFrame({
        'column': df.columns,
        'dtype': df.dtypes.astype(str),
        'missing_count': df.isnull().sum(),
        'missing_pct': (df.isnull().sum() / len(df) * 100).round(2),
        'unique_values': [df[col].nunique() for col in df.columns],
        'sample_value': [str(df[col].iloc[0]) if len(df) > 0 else '' for col in df.columns]
    })

    print("Data Quality Report:")
    print("="*60)
    print(quality_df.to_string())

    # Visualize with PyGWalker
    return pyg.walk(quality_df, hide_data_source_config=True)
# Use it on any DataFrame!
# data_quality_report(your_df)
Build reusable analysis workflows!
def customer_analysis(df, customer_col, value_col, date_col):
    """
    Standardized customer analysis with PyGWalker
    """
    # Create summary
    summary = df.groupby(customer_col).agg({
        value_col: ['sum', 'mean', 'count'],
        date_col: ['min', 'max']
    }).reset_index()
    summary.columns = [customer_col, 'total_value', 'avg_value', 'transactions', 'first_purchase', 'last_purchase']

    # Calculate additional metrics
    summary['customer_tenure_days'] = (summary['last_purchase'] - summary['first_purchase']).dt.days
    summary['value_segment'] = pd.qcut(summary['total_value'], q=3, labels=['Low', 'Medium', 'High'])

    return pyg.walk(summary, hide_data_source_config=True)
# One function, works with any customer dataset!
Speed up your workflow! ⚡
Common shortcuts (may vary by version):
- Ctrl/Cmd + Z: Undo last action
- Ctrl/Cmd + C: Copy chart
- ESC: Clear selection
- Shift: Duplicate field

Pro move: Hover over buttons for tooltips! 💡
PyGWalker visualizations work on mobile browsers!
# ✅ For a better mobile experience
pyg.walk(df, hide_data_source_config=True)  # Cleaner interface

# Share your Colab notebook link with stakeholders
# They can view (and interact!) on their phones! 📱
Let's see how PyGWalker compares to popular alternatives:
Matplotlib/Seaborn:
# Traditional approach - multiple lines of code
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Plot 1: Scatter
axes[0, 0].scatter(df['x'], df['y'])
axes[0, 0].set_title('X vs Y')
# Plot 2: Histogram
axes[0, 1].hist(df['value'], bins=20)
axes[0, 1].set_title('Value Distribution')
# Plot 3: Box plot
sns.boxplot(data=df, x='category', y='value', ax=axes[1, 0])
axes[1, 0].set_title('Value by Category')
# Plot 4: Line chart
df.groupby('date')['value'].mean().plot(ax=axes[1, 1])
axes[1, 1].set_title('Trend Over Time')
plt.tight_layout()
plt.show()
PyGWalker approach:
# One line!
pyg.walk(df)
# Then drag and drop to create all 4 visualizations interactively!
Verdict:

| Aspect | Matplotlib/Seaborn | PyGWalker |
|---|---|---|
| Code Required | Many lines | 1 line |
| Flexibility | 🟢 Extreme | 🟡 High |
| Speed (to insight) | 🔴 Slow | 🟢 Fast |
| Interactivity | 🔴 None | 🟢 Full |
| Learning Curve | 🔴 Steep | 🟢 Easy |
| Publication Quality | 🟢 Excellent | 🟡 Good |
| Best For | Final charts | Exploration |
Use Matplotlib/Seaborn when: You need pixel-perfect, publication-ready static charts
Use PyGWalker when: You're exploring data and want insights fast! ⚡
Plotly:
# Plotly - still requires code for each chart
import plotly.express as px
fig = px.scatter(df, x='x', y='y', color='category', size='value')
fig.show()
# Different chart? New code!
fig = px.bar(df, x='category', y='value')
fig.show()
PyGWalker:
# Switch between chart types with clicks!
pyg.walk(df)
Verdict:

| Aspect | Plotly | PyGWalker |
|---|---|---|
| Interactivity | 🟢 Excellent | 🟢 Excellent |
| Code Required | 🟡 Moderate | 🟢 Minimal |
| Chart Types | 🟢 Extensive | 🟡 Good |
| Ease of Use | 🟡 Medium | 🟢 Easy |
| Customization | 🟢 Very High | 🟡 Moderate |
| Dashboard Building | 🟢 Dash/Streamlit | 🟡 Notebook only |
Use Plotly when: Building production dashboards or need specific chart types
Use PyGWalker when: Rapid exploration in notebooks!
Tableau/Power BI:
PyGWalker:
Verdict:
Use Tableau/Power BI when you need an enterprise-wide BI solution
Use PyGWalker when you want "Tableau-like" exploration IN Python! 🎯
pandas.plot():
# Quick but limited
df['value'].plot(kind='hist')
df.groupby('category')['value'].mean().plot(kind='bar')
PyGWalker:
# Quick AND powerful
pyg.walk(df)
Verdict:
pandas.plot() is great for quick checks, but PyGWalker is better for serious exploration!
Pro tip: Use both! pandas.plot() for ultra-quick checks, PyGWalker for deeper dives. 🎯
Keep learning and stay updated!
Essential Links:
Community:
PyGWalker is part of the Kanaries ecosystem:
Beginner (You are here!):
Intermediate:
Advanced:
Want more practice? Try these datasets:
Built-in (via seaborn-data):
External:
# Quick access to seaborn datasets
datasets = ['penguins', 'diamonds', 'titanic', 'taxis', 'iris', 'tips', 'flights']
print("๐ Available seaborn datasets:")
for dataset in datasets:
url = f"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/{dataset}.csv"
print(f" โข {dataset}: {url}")
# Try any of them!
# df = pd.read_csv(url)
# pyg.walk(df)
Want to give back? Here's how! ❤️
Even if you're not a developer:
⭐ Star the Repository
Share Your Use Cases
Report Bugs
💡 Suggest Features
Improve Documentation
If you ARE a developer:
Contribute Code
Add Tests
Before submitting (like this tutorial!):
✅ 1. Check Existing Issues/PRs
✅ 2. Open an Issue First
✅ 3. Follow the Style
✅ 4. Test Thoroughly
✅ 5. Write Clear PR Description
Contributors get:
This tutorial is an example of community contribution! 🎉
Congratulations! 🎉 You've completed the comprehensive PyGWalker tutorial!
Fundamentals:
Advanced Techniques:
Real-World Applications:
Best Practices:
Immediate Actions:
This Week:
This Month:
"The best way to learn data analysis is to analyze data!"
PyGWalker makes that process:
Don't wait for the perfect dataset - start exploring now! Every dataset has a story to tell.
Thank you for completing this tutorial! We hope PyGWalker becomes an essential part of your data toolkit.
Questions? Ideas? Feedback?
Happy exploring! 🐧
Made with ❤️ by the PyGWalker Community
This tutorial was created as a community contribution. Star the repo and contribute your own examples!
Version: PyGWalker 0.4.x+
Last Updated: 2024
License: Apache-2.0
Share this tutorial: Help others discover PyGWalker!
#DataScience #Python #DataVisualization #PyGWalker #OpenSource
Basic Usage:
import pygwalker as pyg
pyg.walk(df)
With Options:
pyg.walk(
df,
hide_data_source_config=True, # Cleaner UI
appearance='light', # or 'dark'
kernel_computation=True, # Better performance
spec="./config.json" # Save/load config
)
Performance Optimization:
# Sample large datasets
df_sample = df.sample(10000)
pyg.walk(df_sample, kernel_computation=True)
| Use Case | Chart Type | Fields |
|---|---|---|
| Correlation | Scatter | X: numeric, Y: numeric, Color: category |
| Comparison | Bar | X: category, Y: numeric (aggregated) |
| Trend | Line | X: date/time, Y: numeric, Color: category |
| Distribution | Histogram | X: numeric (binned) |
| Part-to-whole | Pie | Angle: category, Value: numeric |
| Relationship | Heatmap | X: category, Y: category, Color: numeric |
- Ctrl/Cmd + Z: Undo
- ESC: Clear selection
- Delete: Remove field from shelf

End of Tutorial 🎉
Happy Data Exploring! 🐧
This tutorial was created by Leonardo Braga as a community contribution to PyGWalker.
Data Science & AI Student | 🇧🇷 Brasília, Brazil
GitHub | [email protected]
Passionate about open-source, data science, and making analytics accessible to everyone.
Questions or feedback? Open an issue or reach out directly!
License: This tutorial follows PyGWalker's Apache-2.0 License
Last Updated: 11/06/2025
PyGWalker Version: 0.4.x+
Made with ❤️ for the data community