tutorials/pygwalker_complete_tutorial.ipynb
Transform your DataFrame into a Tableau-style interface with just one line of code!
Documentation | GitHub | Discord Community
Tutorial Info:
Track your learning journey:
Mark them as you go! ✅
<a id="welcome"></a>
Hey there, data explorer!
If you've ever wanted the power of tools like Tableau or Power BI but inside your Python environment, you're in the right place. PyGWalker (Python binding of Graphic Walker) is here to make your data exploration journey smooth, intuitive, and honestly... pretty fun!
This tutorial will walk you through everything you need to know to become a PyGWalker pro. Whether you're a beginner or an experienced data scientist, you'll find something valuable here.
Required:
Helpful (but not required):
Don't worry if you're new! We explain everything as we go.
Let's dive in!
PyGWalker (pronounced "Pig Walker" 🐷) is a Python library that turns your pandas DataFrame into an interactive, Tableau-style user interface for visual exploration.
One-Line Magic
Drag-and-Drop Interface
Lightning Fast
Rich Visualization Options
Exploratory Data Analysis (EDA) Supercharged
| Feature | PyGWalker | Matplotlib/Seaborn | Tableau/Power BI |
|---|---|---|---|
| Code Required | 1 line | Many lines | No code |
| Interactive | ✅ Yes | ❌ No | ✅ Yes |
| Python Integration | ✅ Native | ✅ Native | ⚠️ Limited |
| Learning Curve | 🟢 Easy | 🟡 Medium | 🟡 Medium |
| Cost | 🟢 Free | 🟢 Free | 🔴 Paid (mostly) |
| Jupyter/Colab | ✅ Perfect | ✅ Good | ❌ No |
| Drag-and-Drop | ✅ Yes | ❌ No | ✅ Yes |
✅ Perfect for:
⚠️ Maybe not ideal for:
Throughout this tutorial, we'll explore PyGWalker using real-world datasets. Here's a sneak peek:
By the end, you'll be able to:
Let's get PyGWalker installed and ready to roll! This should take less than a minute.
# Install PyGWalker
!pip install pygwalker -q
print("โ
PyGWalker installed successfully!")
# Import necessary libraries
import pandas as pd
import pygwalker as pyg
import warnings
warnings.filterwarnings('ignore')
# Check PyGWalker version
print(f"๐ท PyGWalker version: {pyg.__version__}")
print("โ
All imports successful!")
PyGWalker comes with everything you need:
Compatibility:
Note: PyGWalker works best with pandas DataFrames, so make sure your data is in that format!
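If your data isn't a DataFrame yet, a quick conversion is usually all you need. A minimal sketch (the array and record values below are made-up examples, not part of this tutorial's datasets):

# Minimal sketch: turning common structures into DataFrames before walking
import numpy as np
import pandas as pd

# From a NumPy array (column names are our own example labels)
arr = np.random.rand(100, 3)
df_from_array = pd.DataFrame(arr, columns=['feature_a', 'feature_b', 'feature_c'])

# From a list of records (e.g., parsed JSON from an API)
records = [{'species': 'Adelie', 'mass': 3700}, {'species': 'Gentoo', 'mass': 5076}]
df_from_records = pd.DataFrame.from_records(records)

# Both are now ready for pyg.walk(...)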
Let's jump right in with a fun dataset: Palmer Penguins 🐧
This dataset contains measurements of 3 penguin species from islands in Antarctica. It's perfect for learning because:
Let's load it and create our first interactive visualization!
# Load the Palmer Penguins dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url)
print("๐ง Dataset loaded successfully!")
print(f"๐ Shape: {df.shape[0]} rows ร {df.shape[1]} columns")
print("\n" + "="*50)
print("First look at our penguin friends:")
print("="*50)
df.head()
Let's see what we're working with:
# Dataset overview
print("Dataset Information:")
print("="*50)
df.info()

print("\nBasic Statistics:")
print("="*50)
print(df.describe())  # wrap in print() so it displays mid-cell

print("\n🐧 Penguin Species in our dataset:")
print("="*50)
print(df['species'].value_counts())
Our penguin dataset includes:
Now, let's see the magic! ✨
Here it comes... the moment you've been waiting for!
Watch how ONE single line transforms our DataFrame into a full-fledged interactive visualization tool:
# THE MAGIC LINE
pyg.walk(df)
What just happened?
With that single line of code, you now have:
Left Panel - Fields:
Main Canvas - Visualization Area:
Top Bar - Controls:
Scatter Plot:
- bill_length_mm to X-axis
- bill_depth_mm to Y-axis
- species to Color

Bar Chart:
- species to X-axis
- body_mass_g to Y-axis

Distribution:
- flipper_length_mm to X-axis
- species to Color

Take a few minutes to play around! There's no wrong way to explore. 💪
PyGWalker's interface is divided into key areas:
Encoding Shelves (where you drag fields):
Marks Shelf:
Filters:
The interface automatically suggests the best visualization based on what you drag. Smart, right?
Now that you've seen the magic, let's dive deeper into what makes PyGWalker so powerful!
We'll explore:
PyGWalker is flexible! You can feed it data from multiple sources. Let's see how:
# Method 1: From a CSV file (what we just did!)
df_from_url = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
print("โ
Method 1: Loaded from URL")
# Method 2: From a local CSV (if you have one)
# df_from_local = pd.read_csv('your_file.csv')
# Method 3: From a dictionary
data_dict = {
'species': ['Adelie', 'Gentoo', 'Chinstrap'],
'avg_mass': [3700, 5076, 3733],
'count': [152, 124, 68]
}
df_from_dict = pd.DataFrame(data_dict)
print("โ
Method 2: Created from dictionary")
# Method 4: From Excel (requires openpyxl)
# df_from_excel = pd.read_excel('your_file.xlsx')
# Method 5: From SQL, APIs, web scraping... anything that becomes a DataFrame!
print("\n๐ฏ The key point: If it's a pandas DataFrame, PyGWalker can visualize it!")
PyGWalker works best when your data is clean! Before using pyg.walk(), consider:
✅ Good practices:
⚠️ PyGWalker will still work with messy data, but clean data = better insights!
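Here's a rough sketch of those practices as a reusable helper. The function name and the thresholds are our own illustration, not part of PyGWalker:

import pandas as pd

def prepare_for_walk(df):
    """Light cleanup pass before handing a DataFrame to pyg.walk()."""
    out = df.copy()
    # Parse date-like columns so PyGWalker treats them as temporal fields
    for col in out.columns:
        if 'date' in col.lower():
            out[col] = pd.to_datetime(out[col], errors='coerce')
    # Low-cardinality strings behave nicely as categories
    for col in out.select_dtypes(include='object').columns:
        if out[col].nunique() < 50:
            out[col] = out[col].astype('category')
    return out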
PyGWalker supports a wide variety of chart types. Let's explore the most useful ones with our penguin friends! 🐧
Best for: Exploring relationships between two numerical variables
Let's investigate: Do penguins with longer bills also have deeper bills?
# Let's create a clean version of our dataset for visualization
df_clean = df.dropna() # Remove rows with missing values
print(f"๐งน Cleaned dataset: {df_clean.shape[0]} rows (removed {df.shape[0] - df_clean.shape[0]} rows with missing data)")
print("\n๐จ Now let's visualize! Run the cell below:")
# Scatter plot exploration
pyg.walk(df_clean, hide_data_source_config=True)
Step-by-step instructions:

1. Create the basic scatter:
   - bill_length_mm to X-axis
   - bill_depth_mm to Y-axis

2. Add species distinction:
   - species to Color in the Marks shelf

3. Add more context:
   - sex to Shape
   - body_mass_g to Size

Insights to look for:

Pro move: Try changing the chart type using the dropdown at the top. PyGWalker suggests the best type automatically!
Best for: Comparing values across different groups
Let's investigate: Which penguin species is the heaviest on average?
# Bar chart exploration
pyg.walk(df_clean, hide_data_source_config=True)
Creating a comparative bar chart:

1. Basic bar chart:
   - species to X-axis
   - body_mass_g to Y-axis

2. Compare by island:
   - island to Color

3. Change aggregation:
   - Click body_mass_g in the Y-axis shelf and switch the aggregation

4. Flip it:
   - Swap the X and Y fields for a horizontal bar chart

Did you notice? Gentoo penguins are significantly heavier! 💪🐧
Interesting discovery: Our penguin dataset doesn't have a time dimension!
But that's okay - let's create one to demonstrate line charts:
# Let's create a time-series dataset for demonstration
import numpy as np
# Simulate penguin population monitoring over months
months = pd.date_range('2023-01-01', periods=12, freq='M')
penguin_trends = pd.DataFrame({
'month': months,
'Adelie_count': np.random.randint(45, 55, 12),
'Gentoo_count': np.random.randint(35, 45, 12),
'Chinstrap_count': np.random.randint(20, 30, 12)
})
# Reshape for PyGWalker
penguin_trends_long = penguin_trends.melt(
id_vars=['month'],
var_name='species',
value_name='count'
)
print("๐ Time-series data created!")
penguin_trends_long.head(10)
# Line chart exploration
pyg.walk(penguin_trends_long, hide_data_source_config=True)
1. Create the trend line:
   - month to X-axis
   - count to Y-axis

2. Compare species:
   - species to Color

3. Add markers:
   - Experiment with the mark types to show points along the line

Use case: Line charts are perfect for time-series data, trends, and sequential patterns!
Best for: Seeing the distribution and frequency of values
Let's investigate: How are flipper lengths distributed across our penguins?
# Back to our main penguin dataset
pyg.walk(df_clean, hide_data_source_config=True)
1. Create a histogram:
   - flipper_length_mm to X-axis

2. See by species:
   - species to Color

3. Adjust bins:
   - Tweak the binning to show more or less detail

Insight: You'll notice three distinct peaks - one for each species! This is called a "trimodal distribution." 🎯
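If you want to confirm that insight numerically, one line of pandas summarizes flipper length per species:

# Sanity-check the histogram insight: one clear cluster per species?
print(df_clean.groupby('species')['flipper_length_mm'].describe().round(1))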
Best for: Showing values across two categorical dimensions
# Create a summary table for heatmap
summary_df = df_clean.groupby(['species', 'island']).agg({
'body_mass_g': 'mean',
'bill_length_mm': 'mean',
'flipper_length_mm': 'mean'
}).reset_index().round(1)
print("๐ Summary statistics by species and island:")
summary_df
# Heatmap exploration
pyg.walk(summary_df, hide_data_source_config=True)
Create a heatmap:
- species to X-axis
- island to Y-axis
- body_mass_g to Color

Insights at a glance:
Use case: Heatmaps are perfect for correlation matrices, confusion matrices, or any 2D categorical comparison!
Now let's level up! These features will make you a PyGWalker power user.
Filters help you drill down into specific subsets of your data.
Let's explore: Male vs Female penguins by species
# Full dataset for filtering demo
pyg.walk(df_clean, hide_data_source_config=True)
1. Add a filter:
   - Drag sex to the Filters shelf and choose only "Male"

2. Multiple filters:
   - Add a second filter on island

3. Dynamic filtering:
   - Toggle filter values and watch the chart update

Pro tip: Filters don't change your DataFrame - they just change what's displayed! Your original data stays safe. 🛡️
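If you ever need the same subset outside the UI, the pandas equivalent is a boolean mask. A minimal sketch (the island value is just one example from the dataset):

# Pandas equivalent of the UI filters - df_clean itself stays untouched
male_torgersen = df_clean[(df_clean['sex'] == 'Male') & (df_clean['island'] == 'Torgersen')]
pyg.walk(male_torgersen, hide_data_source_config=True)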
PyGWalker automatically aggregates data when needed. Let's master this!
pyg.walk(df_clean, hide_data_source_config=True)
Understanding auto-aggregation:
- species to X-axis
- body_mass_g to Y-axis

Change aggregation type:
- Click body_mass_g in the Y-axis shelf

Multiple measures:
- Add both body_mass_g and flipper_length_mm

Real-world use: Aggregations are crucial for sales reports, KPI dashboards, and summary statistics!
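It's good practice to cross-check the numbers the UI shows you. The same aggregation in plain pandas:

# Cross-check PyGWalker's aggregation with a pandas groupby
agg_check = df_clean.groupby('species').agg(
    mean_mass=('body_mass_g', 'mean'),
    mean_flipper=('flipper_length_mm', 'mean'),
    n_rows=('body_mass_g', 'count'),
).round(1)
print(agg_check)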
Make patterns jump out by ordering your data!
Sort a bar chart:
- species on X, body_mass_g (mean) on Y

Sort by multiple fields:

Use case: Rankings, top N analysis, identifying outliers
Create new metrics on-the-fly without modifying your DataFrame!
Example: Let's calculate Body Mass Index (sort of) for penguins
While PyGWalker has calculated field capabilities, the exact implementation varies by version.
Alternative approach - Create in pandas first:
# Add calculated fields to our DataFrame
df_enhanced = df_clean.copy()
# Bill ratio: length to depth
df_enhanced['bill_ratio'] = (df_enhanced['bill_length_mm'] / df_enhanced['bill_depth_mm']).round(2)
# Mass category
df_enhanced['size_category'] = pd.cut(
df_enhanced['body_mass_g'],
bins=[0, 3500, 4500, 6500],
labels=['Small', 'Medium', 'Large']
)
# Flipper to mass ratio (efficiency!)
df_enhanced['flipper_mass_ratio'] = (df_enhanced['flipper_length_mm'] / df_enhanced['body_mass_g'] * 1000).round(2)
print("โจ Enhanced dataset with calculated fields!")
df_enhanced[['species', 'bill_ratio', 'size_category', 'flipper_mass_ratio']].head()
# Explore the enhanced dataset
pyg.walk(df_enhanced, hide_data_source_config=True)
Explore bill ratio:
- bill_ratio to X-axis
- species to Color

Use size categories:
- size_category to X-axis
- species to Color

Efficiency analysis:
- body_mass_g vs flipper_mass_ratio
- Color by species

Real-world use: Calculated fields are essential for:
Make your visualizations beautiful and professional! ✨
# PyGWalker with custom configuration
pyg.walk(
df_enhanced,
hide_data_source_config=True,
spec="./config.json", # Save/load your chart configurations (optional)
kernel_computation=True # Better performance for large datasets
)
Visual Styling:
Interface Options:
- hide_data_source_config=True: cleaner interface, hides the data source panel
- appearance='light' or appearance='dark': theme preference (older releases used a dark parameter instead)
- kernel_computation=True: offload calculations to the Python kernel (faster!)

Saving Your Work:

Pro tip: You can save your PyGWalker configuration and load it later for consistent visualizations! 🎯
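One more sharing option: recent PyGWalker releases can render the explorer to a standalone HTML file via pyg.to_html (check your version's docs - the call may not exist in older releases):

# Sketch: export the explorer as a self-contained HTML page you can share
html_str = pyg.to_html(df_enhanced)
with open('penguin_explorer.html', 'w', encoding='utf-8') as f:
    f.write(html_str)
print("Saved penguin_explorer.html - open it in any browser")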
You've learned A LOT! Let's recap:
✅ Data Loading: URLs, files, dictionaries - any DataFrame works!
✅ Chart Types:
✅ Advanced Features:
✅ Customization:
You're now a PyGWalker intermediate user! 🎉
Next up, we'll dive into real-world use cases and best practices. Ready?
Time to see PyGWalker in action with scenarios you'll actually face!
We'll explore:
Each example includes a realistic dataset and step-by-step analysis. Let's go!
Scenario: You're a data analyst at an e-commerce company. Your manager wants insights on:
Let's create a realistic sales dataset and analyze it!
# Create a realistic sales dataset
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Set seed for reproducibility
np.random.seed(42)
# Generate dates for the last 12 months
date_range = pd.date_range(end=datetime.now(), periods=365, freq='D')
# Product categories
products = ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Smartwatch', 'Camera']
regions = ['North America', 'Europe', 'Asia', 'South America']
channels = ['Online', 'Retail']
# Generate sales data
n_records = 1000
sales_data = pd.DataFrame({
'date': np.random.choice(date_range, n_records),
'product': np.random.choice(products, n_records),
'region': np.random.choice(regions, n_records),
'channel': np.random.choice(channels, n_records),
'units_sold': np.random.randint(1, 50, n_records),
'unit_price': np.random.uniform(50, 2000, n_records).round(2),
})
# Calculate revenue
sales_data['revenue'] = (sales_data['units_sold'] * sales_data['unit_price']).round(2)
# Add some seasonality (higher sales in Nov-Dec)
sales_data.loc[sales_data['date'].dt.month.isin([11, 12]), 'revenue'] *= 1.5
sales_data['revenue'] = sales_data['revenue'].round(2)
# Sort by date
sales_data = sales_data.sort_values('date').reset_index(drop=True)
print("๐ฐ Sales dataset created!")
print(f"๐ Records: {len(sales_data):,}")
print(f"๐ต Total Revenue: ${sales_data['revenue'].sum():,.2f}")
print(f"๐
Date Range: {sales_data['date'].min().date()} to {sales_data['date'].max().date()}")
print("\n" + "="*60)
sales_data.head(10)
# Quick overview of sales data
print("๐ Sales Summary Statistics:\n")
print(sales_data.describe())
print("\n" + "="*60)
print("๐ฏ Sales by Product:\n")
print(sales_data.groupby('product')['revenue'].agg(['sum', 'mean', 'count']).sort_values('sum', ascending=False))
# Let's analyze! ๐
pyg.walk(sales_data, hide_data_source_config=True)
Exercise 1: Revenue by Product
- product to X-axis
- revenue to Y-axis (will auto-aggregate to SUM)

Exercise 2: Sales Trends Over Time
- date to X-axis
- revenue to Y-axis
- product to Color

Exercise 3: Regional Performance
- region vs revenue
- channel to Color

Exercise 4: Profitability Analysis
- units_sold (X) vs revenue (Y)
- product to Color
- unit_price to Size

Exercise 5: Monthly Trends
- Group revenue by month (see the sketch after this list)

Pro tip: Save these visualizations and share them with your team! Export as PNG for presentations.
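For Exercise 5, it helps to derive an explicit month column first. A small sketch using the sales_data frame from above:

# Sketch for Exercise 5: group revenue by calendar month
sales_monthly = sales_data.copy()
sales_monthly['month'] = sales_monthly['date'].dt.to_period('M').astype(str)
monthly_revenue = sales_monthly.groupby(['month', 'product'], as_index=False)['revenue'].sum()
pyg.walk(monthly_revenue, hide_data_source_config=True)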
Scenario: You work for a subscription service. Marketing wants to understand customer segments for targeted campaigns.
Goals:
Let's create customer data and segment it!
# Create customer dataset
np.random.seed(123)
n_customers = 500
customer_data = pd.DataFrame({
'customer_id': range(1, n_customers + 1),
'age': np.random.randint(18, 70, n_customers),
'subscription_months': np.random.randint(1, 36, n_customers),
'monthly_spend': np.random.uniform(10, 200, n_customers).round(2),
'login_frequency': np.random.randint(1, 30, n_customers),
'support_tickets': np.random.randint(0, 10, n_customers),
'referrals': np.random.randint(0, 5, n_customers),
})
# Calculate lifetime value
customer_data['lifetime_value'] = (
customer_data['subscription_months'] * customer_data['monthly_spend']
).round(2)
# Create engagement score
customer_data['engagement_score'] = (
(customer_data['login_frequency'] * 2) +
(customer_data['referrals'] * 10) -
(customer_data['support_tickets'] * 3)
)
# Create segments based on lifetime value
customer_data['value_segment'] = pd.cut(
customer_data['lifetime_value'],
bins=[0, 500, 2000, 10000],
labels=['Low Value', 'Medium Value', 'High Value']
)
# Create age groups
customer_data['age_group'] = pd.cut(
customer_data['age'],
bins=[0, 25, 35, 50, 100],
labels=['18-25', '26-35', '36-50', '50+']
)
# Churn prediction (synthetic)
customer_data['churn_risk'] = np.where(
(customer_data['engagement_score'] < 20) & (customer_data['subscription_months'] < 6),
'High Risk',
np.where(
(customer_data['engagement_score'] < 40) & (customer_data['subscription_months'] < 12),
'Medium Risk',
'Low Risk'
)
)
print("๐ฅ Customer dataset created!")
print(f"๐ Total Customers: {len(customer_data):,}")
print(f"๐ฐ Average Lifetime Value: ${customer_data['lifetime_value'].mean():,.2f}")
print(f"โ ๏ธ High Churn Risk: {(customer_data['churn_risk'] == 'High Risk').sum()} customers")
print("\n" + "="*60)
customer_data.head(10)
# Customer segments overview
print("๐ Customer Segmentation Overview:\n")
print("By Value Segment:")
print(customer_data['value_segment'].value_counts())
print("\nBy Churn Risk:")
print(customer_data['churn_risk'].value_counts())
print("\nBy Age Group:")
print(customer_data['age_group'].value_counts())
# Analyze customer segments!
pyg.walk(customer_data, hide_data_source_config=True)
Exercise 1: Value Segment Distribution
- value_segment (X) vs customer_id count (Y)
- churn_risk to Color

Exercise 2: Engagement Analysis
- login_frequency (X) vs monthly_spend (Y)
- value_segment to Color
- subscription_months to Size

Exercise 3: Age Demographics
- age_group (X) vs lifetime_value average (Y)

Exercise 4: Churn Risk Factors
- churn_risk (X) vs engagement_score (Y)
- churn_risk vs support_tickets

Exercise 5: Referral Champions
- Filter to referrals >= 2
- subscription_months vs lifetime_value
- Color by age_group

Segments to watch for:
- High-Value + High Churn Risk
- Medium Value + High Engagement
- Low Value + Young Demographic
- Low Engagement + Any Value

Real-world impact: Customer segmentation can increase ROI by 200%+ on marketing campaigns! 🎯
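To put numbers on those segments outside the UI, a quick cross-tab works well:

# How many customers sit in each value-segment / churn-risk cell?
segment_matrix = pd.crosstab(customer_data['value_segment'], customer_data['churn_risk'])
print(segment_matrix)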
Scenario: You've received a new dataset from a vendor. Before analysis, you need to check data quality!
What to look for:
PyGWalker is PERFECT for visual data quality checks!
# Create a "messy" dataset for quality checking
np.random.seed(456)
n_records = 300
messy_data = pd.DataFrame({
'transaction_id': range(1, n_records + 1),
'amount': np.random.uniform(10, 1000, n_records),
'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Other', None], n_records),
'quantity': np.random.randint(1, 20, n_records),
'customer_age': np.random.randint(15, 80, n_records),
'rating': np.random.choice([1, 2, 3, 4, 5, None], n_records),
})
# Introduce data quality issues
# 1. Missing values (10% in category, 15% in rating)
messy_data.loc[np.random.choice(messy_data.index, 30), 'category'] = None
messy_data.loc[np.random.choice(messy_data.index, 45), 'rating'] = None
# 2. Outliers in amount (some crazy high values)
messy_data.loc[np.random.choice(messy_data.index, 5), 'amount'] = np.random.uniform(5000, 10000, 5)
# 3. Impossible values (negative amounts, ages > 100)
messy_data.loc[np.random.choice(messy_data.index, 3), 'amount'] = -np.random.uniform(10, 100, 3)
messy_data.loc[np.random.choice(messy_data.index, 4), 'customer_age'] = np.random.randint(100, 150, 4)
# 4. Duplicates
duplicate_rows = messy_data.sample(10)
messy_data = pd.concat([messy_data, duplicate_rows], ignore_index=True)
print("๐ Messy dataset created (intentionally flawed!):")
print(f"๐ Total Records: {len(messy_data):,}")
print(f"โ Missing values: {messy_data.isnull().sum().sum()}")
print(f"๐ Duplicate rows: {messy_data.duplicated().sum()}")
print(f"โ ๏ธ Negative amounts: {(messy_data['amount'] < 0).sum()}")
print(f"โ ๏ธ Invalid ages: {(messy_data['customer_age'] > 100).sum()}")
print("\n" + "="*60)
messy_data.head(15)
# Check missing values
print("๐ Missing Values Report:\n")
missing_report = pd.DataFrame({
'Column': messy_data.columns,
'Missing': messy_data.isnull().sum(),
'Percentage': (messy_data.isnull().sum() / len(messy_data) * 100).round(2)
})
print(missing_report)
# Visual data quality check!
pyg.walk(messy_data, hide_data_source_config=True)
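The exercises below will ask you to flag problem rows (Exercise 5). Here's a minimal pandas sketch - the has_issues column name is our own:

# Sketch: flag rows with known quality problems, then color by the flag in PyGWalker
flagged = messy_data.copy()
flagged['has_issues'] = (flagged['customer_age'] > 100) | (flagged['amount'] < 0)
print(f"Rows flagged: {flagged['has_issues'].sum()}")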
Exercise 1: Spot Outliers in Amount
- amount to X-axis

Exercise 2: Find Missing Value Patterns
- Create is_missing_category (or filter by null)
- Check category counts - see how many nulls exist

Exercise 3: Detect Impossible Values
- transaction_id (X) vs customer_age (Y)
- Check amount - look for negative values

Exercise 4: Check Distributions
- rating distribution
- Compare by category

Exercise 5: Identify Patterns in Quality Issues
- has_issues = TRUE if (age > 100 OR amount < 0) - see the sketch above

After visual inspection, here's what to fix:
# Clean the messy data based on insights
messy_data_cleaned = messy_data.copy()
# 1. Remove duplicates
messy_data_cleaned = messy_data_cleaned.drop_duplicates()
# 2. Fix impossible values
messy_data_cleaned = messy_data_cleaned[
(messy_data_cleaned['amount'] >= 0) &
(messy_data_cleaned['customer_age'] <= 100)
]
# 3. Handle outliers (cap at 99th percentile)
amount_99th = messy_data_cleaned['amount'].quantile(0.99)
messy_data_cleaned.loc[messy_data_cleaned['amount'] > amount_99th, 'amount'] = amount_99th
# 4. Fill missing categories
messy_data_cleaned['category'] = messy_data_cleaned['category'].fillna('Unknown')
print("โ
Data cleaned!")
print(f"๐ Records before: {len(messy_data):,} โ after: {len(messy_data_cleaned):,}")
print(f"โจ Removed: {len(messy_data) - len(messy_data_cleaned):,} problematic records")
print(f"โ Missing values: {messy_data_cleaned.isnull().sum().sum()}")
# Compare before and after! ๐
print("Let's visualize the cleaned data:")
pyg.walk(messy_data_cleaned, hide_data_source_config=True)
Always Check Before Analysis: ✅
PyGWalker for QA:
Real-world impact: Catching data quality issues early saves hours (or days!) of debugging later! 🎯
Scenario: Your product team ran an A/B test on a new feature. You need to analyze if variant B performs better than variant A.
Metrics to compare:
Let's analyze test results!
# Create A/B test dataset
np.random.seed(789)
n_users = 1000
# Variant B performs slightly better (simulate this)
variant_a_users = n_users // 2
variant_b_users = n_users - variant_a_users
ab_test_data = pd.DataFrame({
'user_id': range(1, n_users + 1),
'variant': ['A'] * variant_a_users + ['B'] * variant_b_users,
'converted': (
list(np.random.choice([0, 1], variant_a_users, p=[0.75, 0.25])) + # A: 25% conversion
list(np.random.choice([0, 1], variant_b_users, p=[0.65, 0.35])) # B: 35% conversion
),
'time_on_page': np.concatenate([
np.random.uniform(30, 180, variant_a_users), # A: average 105 seconds
np.random.uniform(45, 210, variant_b_users) # B: average 127 seconds
]),
'pages_viewed': np.concatenate([
np.random.randint(1, 8, variant_a_users), # A: fewer pages
np.random.randint(2, 10, variant_b_users) # B: more pages
]),
})
# Add order value (only for converted users)
ab_test_data['order_value'] = 0.0  # start as float to avoid dtype upcasting issues
ab_test_data.loc[ab_test_data['converted'] == 1, 'order_value'] = np.random.uniform(20, 200, ab_test_data['converted'].sum())
# Round numeric columns
ab_test_data['time_on_page'] = ab_test_data['time_on_page'].round(1)
ab_test_data['order_value'] = ab_test_data['order_value'].round(2)
# Add day of test
ab_test_data['test_day'] = np.random.randint(1, 15, n_users)
print("๐งช A/B Test dataset created!")
print(f"๐ฅ Total Users: {len(ab_test_data):,}")
print(f"๐ Variant A: {variant_a_users:,} users")
print(f"๐ Variant B: {variant_b_users:,} users")
print("\n" + "="*60)
ab_test_data.head(10)
# Quick A/B test summary
print("๐ A/B Test Results Summary:\n")
summary = ab_test_data.groupby('variant').agg({
'converted': ['sum', 'mean'],
'time_on_page': 'mean',
'pages_viewed': 'mean',
'order_value': 'mean'
}).round(3)
summary.columns = ['Total Conversions', 'Conversion Rate', 'Avg Time (sec)', 'Avg Pages', 'Avg Order Value']
print(summary)
# Calculate lift
conv_rate_a = ab_test_data[ab_test_data['variant'] == 'A']['converted'].mean()
conv_rate_b = ab_test_data[ab_test_data['variant'] == 'B']['converted'].mean()
lift = ((conv_rate_b - conv_rate_a) / conv_rate_a * 100)
print(f"\n๐ Lift (B vs A): {lift:.1f}%")
print(f"{'๐ Variant B wins!' if lift > 0 else '๐ Variant A is better'}")
# Analyze the A/B test! ๐ฏ
pyg.walk(ab_test_data, hide_data_source_config=True)
Exercise 1: Conversion Rate Comparison
- variant (X) vs converted (Y, mean aggregation)

Exercise 2: Engagement Metrics
- variant (X) vs time_on_page (Y)
- Repeat for pages_viewed

Exercise 3: Order Value Analysis
- Filter to converted = 1 (only converted users)
- variant vs order_value

Exercise 4: Time-Series Check
- test_day (X) vs converted mean (Y)
- Color by variant

Exercise 5: Segment Analysis
- Bin time_on_page (low, medium, high engagement) - see the sketch after the pro tip below

What PyGWalker Shows:
What You Still Need:
Pro tip: Use PyGWalker for exploratory analysis, then confirm with statistical tests in Python!
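And for Exercise 5 above, the engagement bins can be built with pd.cut before walking. A sketch - the bin edges here are arbitrary:

# Sketch for Exercise 5: bin time_on_page into engagement segments
ab_segmented = ab_test_data.copy()
ab_segmented['engagement'] = pd.cut(
    ab_segmented['time_on_page'],
    bins=[0, 90, 150, float('inf')],
    labels=['low', 'medium', 'high']
)
pyg.walk(ab_segmented, hide_data_source_config=True)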
# Quick statistical test (bonus!)
from scipy import stats
# Chi-square test for conversion rate
contingency_table = pd.crosstab(ab_test_data['variant'], ab_test_data['converted'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print("๐ Statistical Significance Test (Chi-Square):")
print("="*60)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"\n{'โ
Statistically significant (p < 0.05)!' if p_value < 0.05 else 'โ Not statistically significant (p >= 0.05)'}")
print("\nConclusion:")
if p_value < 0.05 and conv_rate_b > conv_rate_a:
print("๐ Variant B is significantly better! Ship it! ๐")
elif p_value < 0.05:
print("๐ Variant A is significantly better. Keep the original.")
else:
print("๐คท No significant difference. Need more data or run longer.")
Before the Test:
During Analysis with PyGWalker:
Making the Decision: ✅
Real-world impact: Proper A/B analysis can increase revenue by 10-30% through optimized features!
You've completed all 4 real-world use cases! 🎉
Next up: Best practices and pro tips!
You've learned the fundamentals and seen real-world applications. Now let's level up with advanced techniques and best practices!
In this section:
PyGWalker is fast, but with large datasets, a few tweaks can make it even faster! ⚡
Performance Guidelines:
| Dataset Size | Performance | Recommendations |
|---|---|---|
| < 10K rows | 🟢 Excellent | Use as-is, no optimization needed |
| 10K - 100K | 🟡 Good | Consider sampling for exploration |
| 100K - 1M | 🟠 Moderate | Use sampling + kernel computation |
| > 1M rows | 🔴 Slow | Aggregate first or use a database |
Rule of thumb: If your DataFrame takes >2 seconds to display, it's time to optimize!
For initial exploration, you don't always need ALL the data!
# Create a large dataset for demonstration
import pandas as pd
import numpy as np
np.random.seed(42)
large_dataset = pd.DataFrame({
'date': pd.date_range('2020-01-01', periods=500000, freq='min'),
'user_id': np.random.randint(1, 10000, 500000),
'event_type': np.random.choice(['click', 'view', 'purchase', 'cart'], 500000),
'value': np.random.uniform(0, 100, 500000),
'session_duration': np.random.randint(10, 3600, 500000)
})
print(f"๐ Large dataset created: {len(large_dataset):,} rows")
print(f"๐พ Memory usage: {large_dataset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# โ Bad Practice: Using entire large dataset
# pyg.walk(large_dataset) # This will be slow!
# โ
Good Practice: Sample for exploration
sample_size = 10000
df_sample = large_dataset.sample(n=sample_size, random_state=42)
print(f"โ
Sampled {sample_size:,} rows for exploration")
print(f"๐ That's {(sample_size/len(large_dataset)*100):.1f}% of the data")
print(f"โก Speed improvement: ~{len(large_dataset)//sample_size}x faster!")
# Fast exploration with sampled data
pyg.walk(df_sample, hide_data_source_config=True, kernel_computation=True)
If you're analyzing trends, aggregate BEFORE visualizing!
# ❌ Bad practice: visualizing 500K raw records for time trends
# ✅ Good practice: aggregate first!
daily_summary = large_dataset.groupby([
large_dataset['date'].dt.date,
'event_type'
]).agg({
'user_id': 'nunique', # Unique users
'value': ['sum', 'mean'],
'session_duration': 'mean'
}).reset_index()
daily_summary.columns = ['date', 'event_type', 'unique_users', 'total_value', 'avg_value', 'avg_duration']
print(f"โ
Aggregated from {len(large_dataset):,} โ {len(daily_summary):,} rows")
print(f"โก That's a {len(large_dataset)//len(daily_summary)}x reduction!")
print("\nNow this will be lightning fast! โก")
daily_summary.head()
# Super fast visualization of aggregated data
pyg.walk(daily_summary, hide_data_source_config=True, kernel_computation=True)
Smaller data types = less memory = faster performance!
# Check current memory usage
print("๐ Memory Usage by Column (BEFORE optimization):")
print("="*60)
memory_before = large_dataset.memory_usage(deep=True)
print(memory_before)
print(f"\n๐พ Total: {memory_before.sum() / 1024**2:.2f} MB")
# โ
Optimize data types
large_dataset_optimized = large_dataset.copy()
# Convert object to category (huge savings!)
large_dataset_optimized['event_type'] = large_dataset_optimized['event_type'].astype('category')
# Use smaller int types
large_dataset_optimized['user_id'] = large_dataset_optimized['user_id'].astype('int32')
large_dataset_optimized['session_duration'] = large_dataset_optimized['session_duration'].astype('int16')
# Use float32 instead of float64
large_dataset_optimized['value'] = large_dataset_optimized['value'].astype('float32')
print("๐ Memory Usage by Column (AFTER optimization):")
print("="*60)
memory_after = large_dataset_optimized.memory_usage(deep=True)
print(memory_after)
print(f"\n๐พ Total: {memory_after.sum() / 1024**2:.2f} MB")
savings = (1 - memory_after.sum() / memory_before.sum()) * 100
print(f"\n๐ Memory savings: {savings:.1f}%!")
PyGWalker can offload calculations to your Python kernel for better performance!
# ✅ Best practice: enable kernel computation
pyg.walk(
    df_sample,
    kernel_computation=True,  # This is the magic parameter!
    hide_data_source_config=True
)
Before using PyGWalker on large datasets:
✅ Step 1: Check dataset size

✅ Step 2: Sample for exploration
- Use .sample() for initial analysis

✅ Step 3: Aggregate when possible

✅ Step 4: Optimize data types
- Use category for text with few unique values

✅ Step 5: Enable kernel computation
- Pass kernel_computation=True

✅ Step 6: Clean up first

Pro tip: Profile your code with %%time to measure improvements!
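If you prefer plain Python over cell magics, time.perf_counter works anywhere. A quick sketch comparing the full vs. sampled data on a representative operation:

import time

# Time the same groupby on the full dataset and on the sample
start = time.perf_counter()
large_dataset.groupby('event_type')['value'].mean()
full_time = time.perf_counter() - start

start = time.perf_counter()
df_sample.groupby('event_type')['value'].mean()
sample_time = time.perf_counter() - start

print(f"Full dataset: {full_time:.3f}s | Sample: {sample_time:.3f}s")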
How to integrate PyGWalker into your data science workflow efficiently!
Phase 1: Initial Exploration
# 1. Load your data
df = pd.read_csv("your_data.csv")
# 2. Quick overview
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
# 3. Basic statistics
df.describe()
# 4. PyGWalker for visual exploration ⭐
# Spend 10-15 minutes exploring interactively
pyg.walk(df, hide_data_source_config=True)
Phase 2: Deep Dive Analysis 🎯
After initial exploration, you'll have questions. Answer them systematically:
# Example: Based on PyGWalker exploration, you noticed something interesting
# Now create focused analysis
# 5. Clean and prepare data based on insights
df_clean = df.dropna(subset=['important_column'])
df_clean = df_clean[df_clean['value'] > 0]
# 6. Create calculated fields you identified as useful
df_clean['new_metric'] = df_clean['a'] / df_clean['b']
# 7. Explore the refined dataset
pyg.walk(df_clean, hide_data_source_config=True)
Phase 3: Documentation & Sharing
# 8. Export key visualizations
# Use PyGWalker's export button to save charts as PNG/SVG
# 9. Document insights in markdown cells
"""
Key Findings:
- Insight 1: [description]
- Insight 2: [description]
- Recommendation: [action items]
"""
# 10. Create final summary statistics
final_summary = df_clean.groupby('category').agg({
'metric1': 'mean',
'metric2': 'sum'
})
print(final_summary)
For Data Scientists:
For Analysts:
For Data Engineers:
For Business Users:
Running into problems? Here are solutions to the most common issues! 🔧
Symptoms:
<pygwalker.walker.Walker object at 0x...>Solutions:
# ✅ Solution 1: Make sure you're in a supported environment
import sys
print(f"Python version: {sys.version}")
print(f"Environment: {'Google Colab' if 'google.colab' in sys.modules else 'Other'}")

# ✅ Solution 2: Update PyGWalker to the latest version
# !pip install --upgrade pygwalker

# ✅ Solution 3: Restart the runtime and try again
# In Colab: Runtime > Restart runtime
Symptoms:
Solutions:
# ✅ Solution 1: Sample your data
df_sample = df.sample(min(10000, len(df)))
pyg.walk(df_sample)

# ✅ Solution 2: Use kernel computation
pyg.walk(df, kernel_computation=True)

# ✅ Solution 3: Drop unnecessary columns
df_slim = df[['col1', 'col2', 'col3']] # Only columns you need
pyg.walk(df_slim)
Symptoms:
Solutions:
💡 Understand auto-aggregation:
💡 Check data types:
# ✅ Fix data types
df['date_column'] = pd.to_datetime(df['date_column'])
df['category_column'] = df['category_column'].astype('category')
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
Symptoms:
Solutions:
# ✅ Option 1: Drop missing values
df_clean = df.dropna()

# ✅ Option 2: Fill missing values
df['column'] = df['column'].fillna(0)  # or mean, median, etc.

# ✅ Option 3: Create a "Missing" category
df['column'] = df['column'].fillna('Unknown')
Symptoms:
Solutions:
💡 Export as image:
💡 Save configuration:
# ✅ Save your chart setup as JSON
pyg.walk(df, spec="./my_chart_config.json")

# ✅ Load it later
pyg.walk(df, spec="./my_chart_config.json")
Symptoms:
Solutions:
# ✅ Explicitly set the theme
pyg.walk(df, appearance='light')  # or 'dark'

# ✅ For custom styling, modify after rendering
# (Advanced: requires CSS knowledge)
Debug Checklist: ✅
- Data looks correct (df.head())
- Data types are right (df.dtypes)

Get Help:
Advanced techniques that will make you a PyGWalker master! 🎯
Create once, reuse everywhere!
# ✅ Save your perfect chart setup
pyg.walk(df, spec="./sales_dashboard.json")
# Later, load it with new data (same structure)
df_new = pd.read_csv("next_month_data.csv")
pyg.walk(df_new, spec="./sales_dashboard.json")
# Instant dashboard with new data!
PyGWalker plays nicely with the Python ecosystem!
# Example: Use pandas for heavy preprocessing, PyGWalker for visualization
import pandas as pd
# Complex aggregation in pandas
summary = df.groupby(['category', 'region']).agg({
'revenue': ['sum', 'mean'],
'units': 'sum',
'customers': 'nunique'
}).reset_index()
summary.columns = ['category', 'region', 'total_revenue', 'avg_revenue', 'total_units', 'unique_customers']
# Beautiful visualization in PyGWalker
pyg.walk(summary)
Create interactive presentations with RISE + PyGWalker!
# Install RISE for slideshows
# !pip install RISE
# Then use PyGWalker in your slides for interactive demos
# Your audience can explore data in real-time!
Create a reusable data quality checker!
def data_quality_report(df):
    """
    Create a comprehensive data quality report with PyGWalker
    """
    import pandas as pd

    # Create quality metrics DataFrame
    quality_df = pd.DataFrame({
        'column': df.columns,
        'dtype': df.dtypes.astype(str),
        'missing_count': df.isnull().sum(),
        'missing_pct': (df.isnull().sum() / len(df) * 100).round(2),
        'unique_values': [df[col].nunique() for col in df.columns],
        'sample_value': [str(df[col].iloc[0]) if len(df) > 0 else '' for col in df.columns]
    })

    print("Data Quality Report:")
    print("="*60)
    print(quality_df.to_string())

    # Visualize with PyGWalker
    return pyg.walk(quality_df, hide_data_source_config=True)
# Use it on any DataFrame!
# data_quality_report(your_df)
Build reusable analysis workflows!
def customer_analysis(df, customer_col, value_col, date_col):
    """
    Standardized customer analysis with PyGWalker
    """
    # Create summary
    summary = df.groupby(customer_col).agg({
        value_col: ['sum', 'mean', 'count'],
        date_col: ['min', 'max']
    }).reset_index()
    summary.columns = [customer_col, 'total_value', 'avg_value', 'transactions', 'first_purchase', 'last_purchase']

    # Calculate additional metrics
    summary['customer_tenure_days'] = (summary['last_purchase'] - summary['first_purchase']).dt.days
    summary['value_segment'] = pd.qcut(summary['total_value'], q=3, labels=['Low', 'Medium', 'High'])

    return pyg.walk(summary, hide_data_source_config=True)
# One function, works with any customer dataset!
Speed up your workflow! ⚡
Common shortcuts (may vary by version):
- Ctrl/Cmd + Z: Undo last action
- Ctrl/Cmd + C: Copy chart
- ESC: Clear selection
- Shift: Duplicate field

Pro move: Hover over buttons for tooltips! 💡
PyGWalker visualizations work on mobile browsers!
# ✅ For a better mobile experience
pyg.walk(df, hide_data_source_config=True)  # Cleaner interface

# Share your Colab notebook link with stakeholders
# They can view (and interact!) on their phones! 📱
Let's see how PyGWalker compares to popular alternatives:
Matplotlib/Seaborn:
# Traditional approach - multiple lines of code
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Plot 1: Scatter
axes[0, 0].scatter(df['x'], df['y'])
axes[0, 0].set_title('X vs Y')
# Plot 2: Histogram
axes[0, 1].hist(df['value'], bins=20)
axes[0, 1].set_title('Value Distribution')
# Plot 3: Box plot
sns.boxplot(data=df, x='category', y='value', ax=axes[1, 0])
axes[1, 0].set_title('Value by Category')
# Plot 4: Line chart
df.groupby('date')['value'].mean().plot(ax=axes[1, 1])
axes[1, 1].set_title('Trend Over Time')
plt.tight_layout()
plt.show()
PyGWalker approach:
# One line!
pyg.walk(df)
# Then drag and drop to create all 4 visualizations interactively!
Verdict:

| Aspect | Matplotlib/Seaborn | PyGWalker |
|---|---|---|
| Code Required | Many lines | 1 line |
| Flexibility | 🟢 Extreme | 🟡 High |
| Speed (to insight) | 🔴 Slow | 🟢 Fast |
| Interactivity | 🔴 None | 🟢 Full |
| Learning Curve | 🔴 Steep | 🟢 Easy |
| Publication Quality | 🟢 Excellent | 🟡 Good |
| Best For | Final charts | Exploration |
Use Matplotlib/Seaborn when: You need pixel-perfect, publication-ready static charts
Use PyGWalker when: You're exploring data and want insights fast! ⚡
Plotly:
# Plotly - still requires code for each chart
import plotly.express as px
fig = px.scatter(df, x='x', y='y', color='category', size='value')
fig.show()
# Different chart? New code!
fig = px.bar(df, x='category', y='value')
fig.show()
PyGWalker:
# Switch between chart types with clicks!
pyg.walk(df)
Verdict:

| Aspect | Plotly | PyGWalker |
|---|---|---|
| Interactivity | 🟢 Excellent | 🟢 Excellent |
| Code Required | 🟡 Moderate | 🟢 Minimal |
| Chart Types | 🟢 Extensive | 🟡 Good |
| Ease of Use | 🟡 Medium | 🟢 Easy |
| Customization | 🟢 Very High | 🟡 Moderate |
| Dashboard Building | 🟢 Dash/Streamlit | 🟡 Notebook only |
Use Plotly when: Building production dashboards or need specific chart types
Use PyGWalker when: Rapid exploration in notebooks!
Tableau/Power BI:
PyGWalker:
Verdict:
Use Tableau/Power BI when you need an enterprise-wide BI solution
Use PyGWalker when you want "Tableau-like" exploration IN Python! 🎯
pandas.plot():
# Quick but limited
df['value'].plot(kind='hist')
df.groupby('category')['value'].mean().plot(kind='bar')
PyGWalker:
# Quick AND powerful
pyg.walk(df)
Verdict:
pandas.plot() is great for quick checks, but PyGWalker is better for serious exploration!
Pro tip: Use both! pandas.plot() for ultra-quick checks, PyGWalker for deeper dives. 🎯
Keep learning and stay updated!
Essential Links:
Community:
PyGWalker is part of the Kanaries ecosystem:
Beginner (You are here!):
Intermediate:
Advanced:
Want more practice? Try these datasets:
Built-in (via seaborn-data):
External:
# Quick access to seaborn datasets
datasets = ['penguins', 'diamonds', 'titanic', 'taxis', 'iris', 'tips', 'flights']
print("๐ Available seaborn datasets:")
for dataset in datasets:
url = f"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/{dataset}.csv"
print(f" โข {dataset}: {url}")
# Try any of them!
# df = pd.read_csv(url)
# pyg.walk(df)
Want to give back? Here's how! ❤️
Even if you're not a developer:
⭐ Star the Repository
Share Your Use Cases
Report Bugs
💡 Suggest Features
Improve Documentation
If you ARE a developer:
Contribute Code
Add Tests
Before submitting (like this tutorial!):
✅ 1. Check Existing Issues/PRs
✅ 2. Open an Issue First
✅ 3. Follow the Style
✅ 4. Test Thoroughly
✅ 5. Write Clear PR Description
Contributors get:
This tutorial is an example of community contribution! 🎉
Congratulations! 🎉 You've completed the comprehensive PyGWalker tutorial!
Fundamentals:
Advanced Techniques:
Real-World Applications:
Best Practices:
Immediate Actions:
This Week:
This Month:
"The best way to learn data analysis is to analyze data!"
PyGWalker makes that process:
Don't wait for the perfect dataset - start exploring now! Every dataset has a story to tell.
Thank you for completing this tutorial! We hope PyGWalker becomes an essential part of your data toolkit.
Questions? Ideas? Feedback?
Happy exploring! 🐧
Made with ❤️ by the PyGWalker Community
This tutorial was created as a community contribution. Star the repo and contribute your own examples!
Version: PyGWalker 0.4.x+
Last Updated: 2024
License: Apache-2.0
Share this tutorial: Help others discover PyGWalker!
#DataScience #Python #DataVisualization #PyGWalker #OpenSource
Basic Usage:
import pygwalker as pyg
pyg.walk(df)
With Options:
pyg.walk(
df,
hide_data_source_config=True, # Cleaner UI
appearance='light', # or 'dark'
kernel_computation=True, # Better performance
spec="./config.json" # Save/load config
)
Performance Optimization:
# Sample large datasets
df_sample = df.sample(10000)
pyg.walk(df_sample, kernel_computation=True)
| Use Case | Chart Type | Fields |
|---|---|---|
| Correlation | Scatter | X: numeric, Y: numeric, Color: category |
| Comparison | Bar | X: category, Y: numeric (aggregated) |
| Trend | Line | X: date/time, Y: numeric, Color: category |
| Distribution | Histogram | X: numeric (binned) |
| Part-to-whole | Pie | Angle: category, Value: numeric |
| Relationship | Heatmap | X: category, Y: category, Color: numeric |
- Ctrl/Cmd + Z: Undo
- ESC: Clear selection
- Delete: Remove field from shelf

End of Tutorial 🎉
Happy Data Exploring! 🐧
This tutorial was created by Leonardo Braga as a community contribution to PyGWalker.
Data Science & AI Student | 🇧🇷 Brasília, Brazil
GitHub | [email protected]
Passionate about open-source, data science, and making analytics accessible to everyone.
Questions or feedback? Open an issue or reach out directly!
License: This tutorial follows PyGWalker's Apache-2.0 License
Last Updated: 11/06/2025
PyGWalker Version: 0.4.x+
Made with ❤️ for the data community