🎨 PyGWalker: Turn Your Pandas DataFrame into an Interactive UI for Visual Analysis

<div align="center">

Transform your DataFrame into a Tableau-style interface with just one line of code!

Documentation | GitHub | Discord Community

</div>

📊 Tutorial Info:

  • ⏱️ Estimated Time: 30-45 minutes
  • 📈 Level: Beginner to Intermediate
  • 🐍 Prerequisites: Basic Python and pandas knowledge

📋 Table of Contents

  1. 👋 Welcome & Introduction
  2. 🛠️ Setup & Installation
  3. 🚀 Quick Start
  4. 🎯 Core Features
    • Loading Data
    • Visualization Types
    • Advanced Features
  5. 💡 Practical Use Cases
    • Sales Analysis
    • Customer Segmentation
    • Data Quality Checks
    • A/B Testing
  6. 💎 Best Practices
  7. 🐛 Troubleshooting
  8. 📚 Additional Resources

✅ Your Progress Tracker

Track your learning journey:

  • Completed Setup & Installation
  • Completed Quick Start
  • Completed Core Features
  • Completed Practical Use Cases
  • Completed Best Practices

Mark them as you go! 🎉


<a id="welcome"></a>

👋 Welcome!

Hey there, data explorer! 🚀

If you've ever wanted the power of tools like Tableau or Power BI but inside your Python environment, you're in the right place. PyGWalker (a Python binding of Graphic Walker) is here to make your data exploration journey smooth, intuitive, and honestly... pretty fun!

This tutorial will walk you through everything you need to know to become a PyGWalker pro. Whether you're a beginner or an experienced data scientist, you'll find something valuable here.

🎯 What You'll Learn

  • ✅ How to get started in minutes
  • ✅ Creating stunning visualizations with drag-and-drop
  • ✅ Advanced features and customization
  • ✅ Real-world use cases and best practices
  • ✅ Performance optimization tips
  • ✅ Troubleshooting common issues

📚 Prerequisites

Required:

  • 🐍 Basic Python knowledge
  • 📊 Familiarity with pandas DataFrames

Helpful (but not required):

  • 📈 Basic data visualization concepts
  • 📊 Understanding of statistical terms (mean, median, etc.)

Don't worry if you're new! We explain everything as we go. 🎓

Let's dive in! 🏊‍♂️


🤔 What is PyGWalker?

PyGWalker (pronounced "Pig Walker" 🐷) is a Python library that turns your pandas DataFrame into an interactive, Tableau-style user interface for visual exploration.

Why PyGWalker is Awesome

🎯 One-Line Magic

  • Seriously. One line of code. That's all it takes to get a full visual analysis interface.

🖱️ Drag-and-Drop Interface

  • No need to memorize complex plotting syntax
  • Drag columns to create charts instantly
  • Switch between visualization types with a click

🚀 Lightning Fast

  • Built for performance with large datasets
  • Real-time interaction with your data
  • Smooth experience in Jupyter and Google Colab

🎨 Rich Visualization Options

  • Scatter plots, bar charts, line charts, heatmaps, and more
  • Automatic chart recommendations based on your data
  • Customizable colors, scales, and styling

🔍 Exploratory Data Analysis (EDA) Supercharged

  • Filter, aggregate, and drill down into your data
  • Discover patterns and insights visually
  • Perfect for the initial data exploration phase
🆚 PyGWalker vs Traditional Tools

| Feature | PyGWalker | Matplotlib/Seaborn | Tableau/Power BI |
|---|---|---|---|
| Code Required | 1 line | Many lines | No code |
| Interactive | ✅ Yes | ❌ No | ✅ Yes |
| Python Integration | ✅ Native | ✅ Native | ❌ Limited |
| Learning Curve | 🟢 Easy | 🟡 Medium | 🟡 Medium |
| Cost | 🟢 Free | 🟢 Free | 🔴 Paid (mostly) |
| Jupyter/Colab | ✅ Perfect | ✅ Good | ❌ No |
| Drag-and-Drop | ✅ Yes | ❌ No | ✅ Yes |

When to Use PyGWalker:

✅ Perfect for:

  • Quick exploratory data analysis (EDA)
  • Interactive presentations in notebooks
  • When you want visual insights without writing plotting code
  • Sharing interactive analysis with non-technical stakeholders
  • Teaching data analysis concepts

⚠️ Maybe not ideal for:

  • Production dashboards (use Plotly Dash or Streamlit)
  • Highly customized, publication-ready static plots (use Matplotlib)
  • Automated reporting pipelines

🎯 What We'll Build Today

Throughout this tutorial, we'll explore PyGWalker using real-world datasets. Here's a sneak peek:

  1. Quick Start: Get your first visualization in under 2 minutes ⏱️
  2. Core Features: Master the drag-and-drop interface 🎨
  3. Advanced Techniques: Calculated fields, filters, and aggregations 🔧
  4. Real Use Cases: Sales analysis, customer segmentation, and more 📊
  5. Pro Tips: Best practices and performance optimization 💡

By the end, you'll be able to:

  • ✅ Quickly explore any dataset visually
  • ✅ Create insightful visualizations without complex code
  • ✅ Impress your team with interactive data stories
  • ✅ Speed up your data analysis workflow significantly

Ready? Let's get started! 🎉


๐Ÿ› ๏ธ Setup & Installation

Let's get PyGWalker installed and ready to roll! This should take less than a minute. โฑ๏ธ

python
# Install PyGWalker
!pip install pygwalker -q

print("โœ… PyGWalker installed successfully!")
python
# Import necessary libraries
import pandas as pd
import pygwalker as pyg
import warnings
warnings.filterwarnings('ignore')

# Check PyGWalker version
print(f"๐Ÿท PyGWalker version: {pyg.__version__}")
print("โœ… All imports successful!")

๐Ÿ“ฆ What We Just Installed

PyGWalker comes with everything you need:

  • Interactive visualization interface
  • Automatic chart type recommendations
  • Data processing capabilities
  • Export functionality

Compatibility:

  • โœ… Google Colab (where we are now!)
  • โœ… Jupyter Notebook
  • โœ… Jupyter Lab
  • โœ… VS Code notebooks
  • โœ… Any IPython environment

Note: PyGWalker works best with pandas DataFrames, so make sure your data is in that format!


🚀 Quick Start: Your First Visualization in 30 Seconds!

Let's jump right in with a fun dataset: Palmer Penguins 🐧

This dataset contains measurements of 3 penguin species from islands in Antarctica. It's perfect for learning because:

  • 🎯 Small and easy to understand
  • 📊 Mix of numerical and categorical data
  • 🐧 Penguins are adorable!

Let's load it and create our first interactive visualization!

python
# Load the Palmer Penguins dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url)

print("🐧 Dataset loaded successfully!")
print(f"📊 Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print("\n" + "="*50)
print("First look at our penguin friends:")
print("="*50)
df.head()

🔍 Understanding Our Dataset

Let's see what we're working with:

python
# Dataset overview
print("📋 Dataset Information:")
print("="*50)
df.info()

print("\n📊 Basic Statistics:")
print("="*50)
print(df.describe())

print("\n🐧 Penguin Species in our dataset:")
print("="*50)
print(df['species'].value_counts())

Our penguin dataset includes:

  • 🏝️ species: Penguin species (Adelie, Chinstrap, Gentoo)
  • 🏝️ island: Island where observed (Torgersen, Biscoe, Dream)
  • 📏 bill_length_mm: Length of the bill in millimeters
  • 📏 bill_depth_mm: Depth of the bill in millimeters
  • 🦅 flipper_length_mm: Length of the flipper in millimeters
  • ⚖️ body_mass_g: Body mass in grams
  • ⚧️ sex: Penguin sex (Male, Female)

Now, let's see the magic! ✨


🎨 The Magic Moment: One Line of Code!

Here it comes... the moment you've been waiting for!

Watch how ONE single line transforms our DataFrame into a full-fledged interactive visualization tool:

python
# 🪄 THE MAGIC LINE 🪄
pyg.walk(df)

🎉 Congratulations! You Did It!

What just happened?

With that single line of code, you now have:

  • ✅ An interactive drag-and-drop interface
  • ✅ Multiple chart types at your fingertips
  • ✅ Automatic data type detection
  • ✅ Real-time filtering and aggregation
  • ✅ The power of Tableau... in your notebook!

🎮 How to Use the Interface:

Left Panel - Fields:

  • Drag any field (column) to the shelves on the right

Main Canvas - Visualization Area:

  • See your charts come to life in real-time

Top Bar - Controls:

  • 📊 Change chart types (bar, line, scatter, etc.)
  • 🎨 Customize colors and styling
  • 💾 Export your visualizations
  • ⚙️ Access advanced settings

💡 Try These Quick Experiments:

  1. Scatter Plot:

    • Drag bill_length_mm to X-axis
    • Drag bill_depth_mm to Y-axis
    • Drag species to Color
    • 🎉 See how different species cluster!
  2. Bar Chart:

    • Drag species to X-axis
    • Drag body_mass_g to Y-axis
    • The interface will auto-aggregate (mean by default)
    • 🐧 Compare penguin sizes!
  3. Distribution:

    • Drag flipper_length_mm to X-axis
    • Change chart type to histogram
    • Drag species to Color
    • 📊 See the distribution patterns!

Take a few minutes to play around! There's no wrong way to explore. 🎪

🤓 Pro Tip: Understanding the Interface

PyGWalker's interface is divided into key areas:

Encoding Shelves (where you drag fields):

  • Dimensions 📐: Categorical/discrete data (species, island, sex)
  • Measures 📊: Numerical/continuous data (bill_length, body_mass)

Marks Shelf:

  • Color: Add visual distinction
  • Size: Vary point/bar sizes
  • Opacity: Control transparency
  • Shape: Different marker shapes

Filters:

  • Click the filter icon on any field to narrow down your data

The interface automatically suggests the best visualization based on what you drag. Smart, right? 🧠
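PyGWalker decides whether a column is a dimension or a measure from its pandas dtype, so fixing dtypes before calling pyg.walk pays off. A minimal illustrative sketch (the tiny frame below is made-up sample data, not the penguins dataset):

```python
import pandas as pd

# Hypothetical frame where numbers were loaded as strings
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "body_mass_g": ["3700", "5076", "3650"],  # numeric values stored as text!
})

# Text/category columns appear as dimensions; numeric columns as measures
df["body_mass_g"] = pd.to_numeric(df["body_mass_g"])  # now treated as a measure
df["species"] = df["species"].astype("category")       # clearly a dimension

print(df.dtypes)
```

With the dtypes corrected, dragging body_mass_g onto an axis offers aggregations (mean, sum, ...) instead of treating each value as a label.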


🎯 Core Features: Mastering PyGWalker

Now that you've seen the magic, let's dive deeper into what makes PyGWalker so powerful!

We'll explore:

  • 📁 Loading different data sources
  • 📊 Creating various visualization types
  • 🔧 Advanced features and customization
  • 🎨 Making your visualizations pop!

📁 Part 1: Loading Different Data Sources

PyGWalker is flexible! You can feed it data from multiple sources. Let's see how:

python
# Method 1: From a CSV file at a URL (what we just did!)
df_from_url = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
print("✅ Method 1: Loaded from URL")

# Method 2: From a local CSV (if you have one)
# df_from_local = pd.read_csv('your_file.csv')

# Method 3: From a dictionary
data_dict = {
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'avg_mass': [3700, 5076, 3733],
    'count': [152, 124, 68]
}
df_from_dict = pd.DataFrame(data_dict)
print("✅ Method 3: Created from dictionary")

# Method 4: From Excel (requires openpyxl)
# df_from_excel = pd.read_excel('your_file.xlsx')

# Method 5: From SQL, APIs, web scraping... anything that becomes a DataFrame!

print("\n🎯 The key point: If it's a pandas DataFrame, PyGWalker can visualize it!")

💡 Quick Tip: Data Preparation

PyGWalker works best when your data is clean! Before using pyg.walk(), consider:

✅ Good practices:

  • Handle missing values (we'll see this in action!)
  • Use meaningful column names
  • Ensure proper data types (dates as datetime, numbers as numeric)
  • Keep reasonable dataset sizes (< 100k rows for best performance)

⚠️ PyGWalker will still work with messy data, but clean data = better insights!


📊 Part 2: Visualization Types & When to Use Them

PyGWalker supports a wide variety of chart types. Let's explore the most useful ones with our penguin friends! 🐧

1️⃣ Scatter Plots: Finding Relationships

Best for: Exploring relationships between two numerical variables

Let's investigate: Do penguins with longer bills also have deeper bills?

python
# Let's create a clean version of our dataset for visualization
df_clean = df.dropna()  # Remove rows with missing values

print(f"🧹 Cleaned dataset: {df_clean.shape[0]} rows (removed {df.shape[0] - df_clean.shape[0]} rows with missing data)")
print("\n🎨 Now let's visualize! Run the cell below:")
python
# Scatter plot exploration
pyg.walk(df_clean, hide_data_source_config=True)

🎯 Try This - Scatter Plot Exercise:

Step-by-step instructions:

  • 💡 If "Aggregation" is enabled, disable it (the cube symbol in the top bar).
  1. Create the basic scatter:

    • Drag bill_length_mm to X-axis
    • Drag bill_depth_mm to Y-axis
    • You should see a scatter plot!
  2. Add species distinction:

    • Drag species to Color in the Marks shelf
    • Wow! See how each species forms its own cluster? 🎨
  3. Add more context:

    • Drag sex to Shape
    • Drag body_mass_g to Size
    • Now you can see size, species, and sex all at once!
  4. Insights to look for:

    • 🔍 Gentoo penguins have longer but shallower bills
    • 🔍 Adelie penguins cluster in the upper-left
    • 🔍 Each species has distinct bill characteristics!

Pro move: Try changing the chart type using the dropdown at the top. PyGWalker suggests the best type automatically! 🤖


2๏ธโƒฃ Bar Charts: Comparing Categories

Best for: Comparing values across different groups

Let's investigate: Which penguin species is the heaviest on average?

python
# Bar chart exploration
pyg.walk(df_clean, hide_data_source_config=True)

๐ŸŽฏ Try This - Bar Chart Exercise:

Creating a comparative bar chart:

  1. Basic bar chart:

    • Drag species to X-axis
    • Drag body_mass_g to Y-axis
    • PyGWalker automatically calculates the average (mean)!
  2. Compare by island:

    • Drag island to Color
    • Now you can see how species weights vary by location ๐Ÿ๏ธ
  3. Change aggregation:

    • Click on body_mass_g in the Y-axis
    • Try different aggregations: Sum, Count, Median, Min, Max
    • Each tells a different story!
  4. Flip it:

    • Try swapping X and Y axes (drag them to opposite positions)
    • Horizontal bars can be easier to read with long labels

Did you notice? Gentoo penguins are significantly heavier! ๐Ÿ’ช๐Ÿง
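You can double-check what the bar chart computes with a plain groupby. The mini DataFrame below is illustrative stand-in data (run the same groupby on df_clean for the real numbers):

```python
import pandas as pd

# Illustrative sample; masses roughly in the range of the real species averages
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"],
    "body_mass_g": [3700, 3650, 3733, 5076, 5050],
})

# The bar chart's default: mean body mass per species, heaviest first
avg_mass = sample.groupby("species")["body_mass_g"].mean().sort_values(ascending=False)
print(avg_mass)
```

The heaviest species lands at the top of the sorted Series, mirroring a descending-sorted bar chart.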


3️⃣ Line Charts: Tracking Trends Over Time

Best for: Showing trends, time-series data, and sequential patterns

Interesting discovery: Our penguin dataset doesn't have a time dimension!

But that's okay - let's create one to demonstrate line charts:

python
# Let's create a time-series dataset for demonstration
import numpy as np

# Simulate penguin population monitoring over months
months = pd.date_range('2023-01-01', periods=12, freq='M')
penguin_trends = pd.DataFrame({
    'month': months,
    'Adelie_count': np.random.randint(45, 55, 12),
    'Gentoo_count': np.random.randint(35, 45, 12),
    'Chinstrap_count': np.random.randint(20, 30, 12)
})

# Reshape to long format for PyGWalker
penguin_trends_long = penguin_trends.melt(
    id_vars=['month'],
    var_name='species',
    value_name='count'
)

print("📈 Time-series data created!")
penguin_trends_long.head(10)
python
# Line chart exploration
pyg.walk(penguin_trends_long, hide_data_source_config=True)

🎯 Try This - Line Chart Exercise:

  1. Create the trend line:

    • Drag month to X-axis
    • Drag count to Y-axis
    • Change chart type to Line 📈
  2. Compare species:

    • Drag species to Color
    • Now you see all three species trends!
  3. Add markers:

    • In the marks settings, enable data points
    • Makes it easier to see individual observations

Use case: Line charts are perfect for time-series data, trends, and sequential patterns!


4๏ธโƒฃ Histograms & Distributions: Understanding Spread

Best for: Seeing the distribution and frequency of values

Let's investigate: How are flipper lengths distributed across our penguins?

python
# Back to our main penguin dataset
pyg.walk(df_clean, hide_data_source_config=True)

๐ŸŽฏ Try This - Histogram Exercise:

  1. Create a histogram:

    • Drag flipper_length_mm to X-axis
    • Change chart type to Histogram or Bar
    • PyGWalker automatically bins the data!
  2. See by species:

    • Drag species to Color
    • Choose "Stack" or "Dodge" layout
    • See how each species has different flipper sizes! ๐Ÿฆ…
  3. Adjust bins:

    • Click on the X-axis settings
    • Try different bin sizes (narrower = more detail)

Insight: You'll notice three distinct peaks - one for each species! This is called a "trimodal distribution." ๐ŸŽฏ
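The three peaks correspond to three well-separated per-species means. A quick numeric check, using illustrative values close to the real per-species flipper lengths (run the same agg on df_clean for exact numbers):

```python
import pandas as pd

# Illustrative sample, three observations per species
sample = pd.DataFrame({
    "species": ["Adelie"] * 3 + ["Chinstrap"] * 3 + ["Gentoo"] * 3,
    "flipper_length_mm": [190, 189, 191, 196, 195, 197, 217, 216, 218],
})

# Well-separated group means + small spreads => distinct histogram peaks
stats = sample.groupby("species")["flipper_length_mm"].agg(["mean", "std"])
print(stats)
```

When the gap between group means is much larger than each group's standard deviation, the overall histogram shows one peak per group.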


5๏ธโƒฃ Heatmaps & Tables: Dense Data Views

Best for: Showing values across two categorical dimensions

python
# Create a summary table for heatmap
summary_df = df_clean.groupby(['species', 'island']).agg({
    'body_mass_g': 'mean',
    'bill_length_mm': 'mean',
    'flipper_length_mm': 'mean'
}).reset_index().round(1)

print("๐Ÿ“Š Summary statistics by species and island:")
summary_df
python
# Heatmap exploration
pyg.walk(summary_df, hide_data_source_config=True)

๐ŸŽฏ Try This - Heatmap Exercise:

  1. Create a heatmap:

    • Drag species to X-axis
    • Drag island to Y-axis
    • Drag body_mass_g to Color
    • Change chart type to Heatmap or Square
  2. Insights at a glance:

    • Dark colors = higher values
    • Light colors = lower values
    • Empty cells = no data (e.g., no Chinstrap on Biscoe)

Use case: Heatmaps are perfect for correlation matrices, confusion matrices, or any 2D categorical comparison!
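To show a correlation matrix as a PyGWalker heatmap, reshape it to long format first (one row per variable pair). A sketch with a tiny stand-in for df_clean's numeric columns; the column names var_1/var_2/corr are our own choices:

```python
import pandas as pd

# Stand-in numeric frame; any numeric columns from your DataFrame work
num = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 47.6, 38.8],
    "flipper_length_mm": [181, 195, 217, 190],
    "body_mass_g": [3750, 3733, 5076, 3700],
})

# Long-form correlations: drag var_1 to X, var_2 to Y, corr to Color
corr_long = (
    num.corr()
       .rename_axis("var_1")          # name the index so reset_index keeps it
       .reset_index()
       .melt(id_vars="var_1", var_name="var_2", value_name="corr")
)
print(corr_long)
```

Every variable's correlation with itself is 1.0, so the heatmap's diagonal is always the darkest band.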


🔧 Part 3: Advanced Features

Now let's level up! 🚀 These features will make you a PyGWalker power user.

⚡ Feature 1: Filters - Focus on What Matters

Filters help you drill down into specific subsets of your data.

Let's explore: Male vs Female penguins by species

python
# Full dataset for filtering demo
pyg.walk(df_clean, hide_data_source_config=True)

🎯 Try This - Filtering Exercise:

  1. Add a filter:

    • Find the Filters section (usually top-right)
    • Click "Add Filter"
    • Select sex and choose only "Male"
    • Watch your visualization update instantly! ⚡
  2. Multiple filters:

    • Add another filter for island
    • Select only "Biscoe"
    • Now you're looking at male penguins from Biscoe only!
  3. Dynamic filtering:

    • Try selecting different values
    • Remove filters to go back to full data
    • Use range filters for numeric fields (e.g., body_mass_g > 4000)

Pro tip: Filters don't change your DataFrame - they just change what's displayed! Your original data stays safe. 🛡️
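The same three-filter view can be reproduced in pandas, which also makes the "original data stays safe" point concrete: query returns a new frame and leaves the source untouched. Illustrative mini-data below:

```python
import pandas as pd

# Mini stand-in for df_clean
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male"],
    "island": ["Biscoe", "Biscoe", "Dream", "Biscoe"],
    "body_mass_g": [5000, 3600, 3900, 4200],
})

# Category filters + a range filter, exactly like the UI exercise
subset = df.query("sex == 'Male' and island == 'Biscoe' and body_mass_g > 4000")

print(subset)
print(f"Original still has {len(df)} rows")  # untouched, like PyGWalker filters
```

This is handy when you want to export the filtered subset for further work outside the widget.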


⚡ Feature 2: Aggregations - Summarize Like a Pro

PyGWalker automatically aggregates data when needed. Let's master this!

python
pyg.walk(df_clean, hide_data_source_config=True)

🎯 Try This - Aggregation Exercise:

  1. Understanding auto-aggregation:

    • Drag species to X-axis
    • Drag body_mass_g to Y-axis
    • Notice it shows the average by default
  2. Change aggregation type:

    • Click on body_mass_g in the Y-axis shelf
    • Try these:
      • Count: How many penguins per species?
      • Sum: Total mass of all penguins per species
      • Median: Middle value (less affected by outliers)
      • Min/Max: Smallest/largest penguin per species
      • Std Dev: How varied are the weights?
  3. Multiple measures:

    • You can add multiple fields to the Y-axis!
    • Try adding both body_mass_g and flipper_length_mm
    • Compare two metrics side by side

Real-world use: Aggregations are crucial for sales reports, KPI dashboards, and summary statistics! 📊
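Each UI aggregation maps to a pandas groupby reduction, and multiple measures map to multiple named aggregations. A sketch on illustrative mini-data (run the same agg on df_clean for real values):

```python
import pandas as pd

# Illustrative stand-in for df_clean
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3700, 3650, 5076, 5050],
    "flipper_length_mm": [190, 189, 217, 216],
})

# Named aggregations: one output column per (source column, reduction) pair
summary = df.groupby("species").agg(
    mean_mass=("body_mass_g", "mean"),      # the UI's default
    median_mass=("body_mass_g", "median"),
    count=("body_mass_g", "count"),
    mean_flipper=("flipper_length_mm", "mean"),  # a second measure, side by side
)
print(summary)
```

Swap "mean" for "sum", "min", "max", or "std" to mirror the other options in the aggregation dropdown.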


⚡ Feature 3: Sorting & Ranking

Make patterns jump out by ordering your data!

🎯 Try This - Sorting Exercise:

  1. Sort a bar chart:

    • Create: species on X, body_mass_g (mean) on Y
    • Click on the axis or bar
    • Look for sort options (ascending/descending)
    • Watch the bars rearrange! 📊
  2. Sort by multiple fields:

    • Some chart types allow nested sorting
    • Great for complex categorical data

Use case: Rankings, top-N analysis, identifying outliers
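The pandas twin of the sort button is sort_values, and top-N analysis is just a head() on the sorted frame. Illustrative values below (approximate species averages, not computed from the real data):

```python
import pandas as pd

# Pre-aggregated stand-in data: one row per category
ranked = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo"],
    "avg_mass": [3700, 3733, 5076],
})

# Descending sort, then keep the top 2 - a typical "top-N" ranking
top2 = ranked.sort_values("avg_mass", ascending=False).head(2)
print(top2)
```

For nested sorting across several fields, pass a list, e.g. sort_values(["island", "avg_mass"], ascending=[True, False]).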


⚡ Feature 4: Calculated Fields (Power User Feature! 🔥)

Create new metrics on the fly without modifying your DataFrame!

Example: Let's calculate a Body Mass Index (sort of) for penguins

💡 Calculated Field Example:

While PyGWalker has calculated field capabilities, the exact implementation varies by version.

Alternative approach - Create in pandas first:

python
# Add calculated fields to our DataFrame
df_enhanced = df_clean.copy()

# Bill ratio: length to depth
df_enhanced['bill_ratio'] = (df_enhanced['bill_length_mm'] / df_enhanced['bill_depth_mm']).round(2)

# Mass category
df_enhanced['size_category'] = pd.cut(
    df_enhanced['body_mass_g'],
    bins=[0, 3500, 4500, 6500],
    labels=['Small', 'Medium', 'Large']
)

# Flipper-to-mass ratio (efficiency!)
df_enhanced['flipper_mass_ratio'] = (df_enhanced['flipper_length_mm'] / df_enhanced['body_mass_g'] * 1000).round(2)

print("✨ Enhanced dataset with calculated fields!")
df_enhanced[['species', 'bill_ratio', 'size_category', 'flipper_mass_ratio']].head()
python
# Explore the enhanced dataset
pyg.walk(df_enhanced, hide_data_source_config=True)

🎯 Try This - Calculated Fields Exercise:

  1. Explore bill ratio:

    • Drag bill_ratio to X-axis
    • Drag species to Color
    • Which species has the longest bills relative to depth?
  2. Use size categories:

    • Drag size_category to X-axis
    • Drag species to Color
    • Create a stacked bar chart
    • See the size distribution per species!
  3. Efficiency analysis:

    • Create scatter: body_mass_g vs flipper_mass_ratio
    • Color by species
    • Higher ratio = more flipper per unit of body mass (more "efficient" flippers!)

Real-world use: Calculated fields are essential for:

  • KPIs (conversion rates, profit margins)
  • Normalized metrics (per capita, percentages)
  • Custom business logic

🎨 Part 4: Customization & Styling

Make your visualizations beautiful and professional! ✨

python
# PyGWalker with custom configuration
pyg.walk(
    df_enhanced,
    hide_data_source_config=True,
    spec="./config.json",  # Save/load your chart configurations (optional)
    kernel_computation=True  # Better performance for large datasets
)

🎨 Customization Options:

Visual Styling:

  • 🌗 Theme: Switch between light and dark modes
  • 🎨 Colors: Click on color legends to customize palettes
  • 📏 Axes: Customize labels, scales (linear, log), and ranges
  • 📊 Titles: Add descriptive titles to your charts

Interface Options:

  • hide_data_source_config=True: Cleaner interface, hides the data source panel
  • appearance='dark' (the parameter is called dark in older versions): Theme preference
  • kernel_computation=True: Offload calculations to the Python kernel (faster!)

Saving Your Work:

  • 💾 Export charts: Use the export button for PNG/SVG
  • 📋 Save configuration: Export your chart setup as JSON
  • 🔄 Load configuration: Reuse your favorite chart setups

Pro tip: You can save your PyGWalker configuration and load it later for consistent visualizations! 🎯


🎓 Quick Recap: Core Features

You've learned A LOT! Let's recap: 🎉

✅ Data Loading: URLs, files, dictionaries → any DataFrame works!

✅ Chart Types:

  • 📊 Scatter plots → relationships
  • 📊 Bar charts → comparisons
  • 📈 Line charts → trends
  • 📊 Histograms → distributions
  • 🔥 Heatmaps → 2D patterns

✅ Advanced Features:

  • 🔍 Filters → focus on subsets
  • 📊 Aggregations → summarize data
  • 🔄 Sorting → reveal patterns
  • ⚡ Calculated fields → custom metrics

✅ Customization:

  • 🎨 Themes and colors
  • 💾 Export and save
  • ⚙️ Performance options

You're now a PyGWalker intermediate user! 🎊

Next up, we'll dive into real-world use cases and best practices. Ready? 🚀


💡 Practical Use Cases: Real-World Applications

Time to see PyGWalker in action with scenarios you'll actually face! 🌍

We'll explore:

  1. 📈 Sales Analysis: Revenue trends and performance
  2. 👥 Customer Segmentation: Understanding your audience
  3. 🔍 Data Quality Checks: Finding problems in your data
  4. 🧪 A/B Testing Results: Comparing experiments

Each example includes a realistic dataset and step-by-step analysis. Let's go! 🚀


📈 Use Case 1: Sales Analysis

Scenario: You're a data analyst at an e-commerce company. Your manager wants insights on:

  • Which products are selling best?
  • Are sales trending up or down?
  • Which regions are performing well?

Let's create a realistic sales dataset and analyze it!

python
# Create a realistic sales dataset
import pandas as pd
import numpy as np
from datetime import datetime

# Set seed for reproducibility
np.random.seed(42)

# Generate dates for the last 12 months
date_range = pd.date_range(end=datetime.now(), periods=365, freq='D')

# Product categories
products = ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Smartwatch', 'Camera']
regions = ['North America', 'Europe', 'Asia', 'South America']
channels = ['Online', 'Retail']

# Generate sales data
n_records = 1000
sales_data = pd.DataFrame({
    'date': np.random.choice(date_range, n_records),
    'product': np.random.choice(products, n_records),
    'region': np.random.choice(regions, n_records),
    'channel': np.random.choice(channels, n_records),
    'units_sold': np.random.randint(1, 50, n_records),
    'unit_price': np.random.uniform(50, 2000, n_records).round(2),
})

# Calculate revenue
sales_data['revenue'] = (sales_data['units_sold'] * sales_data['unit_price']).round(2)

# Add some seasonality (higher sales in Nov-Dec)
sales_data.loc[sales_data['date'].dt.month.isin([11, 12]), 'revenue'] *= 1.5
sales_data['revenue'] = sales_data['revenue'].round(2)

# Sort by date
sales_data = sales_data.sort_values('date').reset_index(drop=True)

print("💰 Sales dataset created!")
print(f"📊 Records: {len(sales_data):,}")
print(f"💵 Total Revenue: ${sales_data['revenue'].sum():,.2f}")
print(f"📅 Date Range: {sales_data['date'].min().date()} to {sales_data['date'].max().date()}")
print("\n" + "="*60)
sales_data.head(10)
python
# Quick overview of sales data
print("📊 Sales Summary Statistics:\n")
print(sales_data.describe())
print("\n" + "="*60)
print("🎯 Sales by Product:\n")
print(sales_data.groupby('product')['revenue'].agg(['sum', 'mean', 'count']).sort_values('sum', ascending=False))
python
# Let's analyze! 🚀
pyg.walk(sales_data, hide_data_source_config=True)

🎯 Sales Analysis Exercises:

Exercise 1: Revenue by Product 📊

  1. Drag product to X-axis
  2. Drag revenue to Y-axis (will auto-aggregate to SUM)
  3. Sort descending to see top performers
  4. Question: Which product generates the most revenue?

Exercise 2: Sales Trends Over Time 📈

  1. Drag date to X-axis
  2. Drag revenue to Y-axis
  3. Change to Line chart
  4. Drag product to Color
  5. Question: Do you see the holiday season spike? (Nov-Dec)

Exercise 3: Regional Performance 🌍

  1. Create a bar chart: region vs revenue
  2. Drag channel to Color
  3. Use "Dodge" layout to compare side-by-side
  4. Question: Which region prefers online vs retail?

Exercise 4: Profitability Analysis 💰

  1. Scatter plot: units_sold (X) vs revenue (Y)
  2. Add product to Color
  3. Add unit_price to Size
  4. Question: Which products are high-volume vs high-value?

Exercise 5: Monthly Trends 📅

  1. Create a calculated field for month (or use date aggregation)
  2. Line chart: month vs revenue
  3. Insight: Identify seasonal patterns for inventory planning!
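Exercise 5's "calculated field for month" is one line of pandas on the date column. A self-contained sketch using a tiny stand-in for sales_data (only the columns this step needs):

```python
import pandas as pd

# Stand-in for sales_data
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-11-30"]),
    "revenue": [100.0, 250.0, 900.0],
})

# Derive a month column so the line chart can aggregate per month
sales["month"] = sales["date"].dt.to_period("M").astype(str)
monthly = sales.groupby("month", as_index=False)["revenue"].sum()
print(monthly)
```

Casting the period to str keeps the month sortable as "YYYY-MM" and avoids a dtype PyGWalker may not recognize.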

💡 Business Insights You Might Discover:

  • 🏆 Top performers: Identify which products to stock more
  • 📉 Declining products: Spot products needing promotion
  • 🌍 Regional preferences: Tailor marketing by region
  • 📅 Seasonality: Plan inventory and campaigns
  • 💻 Channel effectiveness: Optimize sales channels

Pro tip: Save these visualizations and share them with your team! Export as PNG for presentations. 📸


👥 Use Case 2: Customer Segmentation

Scenario: You work for a subscription service. Marketing wants to understand customer segments for targeted campaigns.

Goals:

  • Who are our most valuable customers?
  • What behaviors distinguish different segments?
  • How can we reduce churn?

Let's create customer data and segment it!

python
# Create customer dataset
np.random.seed(123)

n_customers = 500

customer_data = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'age': np.random.randint(18, 70, n_customers),
    'subscription_months': np.random.randint(1, 36, n_customers),
    'monthly_spend': np.random.uniform(10, 200, n_customers).round(2),
    'login_frequency': np.random.randint(1, 30, n_customers),
    'support_tickets': np.random.randint(0, 10, n_customers),
    'referrals': np.random.randint(0, 5, n_customers),
})

# Calculate lifetime value
customer_data['lifetime_value'] = (
    customer_data['subscription_months'] * customer_data['monthly_spend']
).round(2)

# Create an engagement score
customer_data['engagement_score'] = (
    (customer_data['login_frequency'] * 2) +
    (customer_data['referrals'] * 10) -
    (customer_data['support_tickets'] * 3)
)

# Create segments based on lifetime value
customer_data['value_segment'] = pd.cut(
    customer_data['lifetime_value'],
    bins=[0, 500, 2000, 10000],
    labels=['Low Value', 'Medium Value', 'High Value']
)

# Create age groups
customer_data['age_group'] = pd.cut(
    customer_data['age'],
    bins=[0, 25, 35, 50, 100],
    labels=['18-25', '26-35', '36-50', '50+']
)

# Churn prediction (synthetic)
customer_data['churn_risk'] = np.where(
    (customer_data['engagement_score'] < 20) & (customer_data['subscription_months'] < 6),
    'High Risk',
    np.where(
        (customer_data['engagement_score'] < 40) & (customer_data['subscription_months'] < 12),
        'Medium Risk',
        'Low Risk'
    )
)

print("👥 Customer dataset created!")
print(f"📊 Total Customers: {len(customer_data):,}")
print(f"💰 Average Lifetime Value: ${customer_data['lifetime_value'].mean():,.2f}")
print(f"⚠️ High Churn Risk: {(customer_data['churn_risk'] == 'High Risk').sum()} customers")
print("\n" + "="*60)
customer_data.head(10)
python
# Customer segments overview
print("📊 Customer Segmentation Overview:\n")
print("By Value Segment:")
print(customer_data['value_segment'].value_counts())
print("\nBy Churn Risk:")
print(customer_data['churn_risk'].value_counts())
print("\nBy Age Group:")
print(customer_data['age_group'].value_counts())
python
# Analyze customer segments! 🎯
pyg.walk(customer_data, hide_data_source_config=True)

🎯 Customer Segmentation Exercises:

Exercise 1: Value Segment Distribution 💎

  1. Bar chart: value_segment (X) vs customer_id count (Y)
  2. Drag churn_risk to Color
  3. Use stacked bars
  4. Question: Are high-value customers at risk of churning?

Exercise 2: Engagement Analysis 📊

  1. Scatter plot: login_frequency (X) vs monthly_spend (Y)
  2. Add value_segment to Color
  3. Add subscription_months to Size
  4. Question: Do engaged users spend more?

Exercise 3: Age Demographics 👤

  1. Bar chart: age_group (X) vs lifetime_value average (Y)
  2. Which age group has the highest LTV?
  3. Add filter: show only "High Value" customers
  4. Insight: Target similar demographics in marketing!

Exercise 4: Churn Risk Factors ⚠️

  1. Box plot or violin plot: churn_risk (X) vs engagement_score (Y)
  2. Add another view: churn_risk vs support_tickets
  3. Question: What predicts churn? Low engagement? Many issues?

Exercise 5: Referral Champions 🏆

  1. Filter: referrals >= 2
  2. Scatter: subscription_months vs lifetime_value
  3. Color by age_group
  4. Insight: Identify your brand advocates for referral programs!

💡 Marketing Actions Based on Insights:

High-Value + High Churn Risk 🚨

  • Immediate personal outreach
  • Exclusive perks or discounts
  • Address support issues proactively

Medium Value + High Engagement 🌟

  • Upsell opportunities
  • Premium features
  • Referral incentives

Low Value + Young Demographic 🎯

  • Growth potential
  • Educational content
  • Community building

Low Engagement + Any Value 📧

  • Re-engagement campaigns
  • Product education
  • Feature highlights

Real-world impact: Customer segmentation can increase ROI by 200%+ on marketing campaigns! 🎯


๐Ÿ” Use Case 3: Data Quality Checks

Scenario: You've received a new dataset from a vendor. Before analysis, you need to check data quality!

What to look for:

  • Missing values
  • Outliers and anomalies
  • Inconsistent data
  • Distribution issues

PyGWalker is PERFECT for visual data quality checks! ๐Ÿ”

python
# Create a "messy" dataset for quality checking
np.random.seed(456)

n_records = 300

messy_data = pd.DataFrame({
    'transaction_id': range(1, n_records + 1),
    'amount': np.random.uniform(10, 1000, n_records),
    'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Other', None], n_records),
    'quantity': np.random.randint(1, 20, n_records),
    'customer_age': np.random.randint(15, 80, n_records),
    'rating': np.random.choice([1, 2, 3, 4, 5, None], n_records),
})

# Introduce data quality issues

# 1. Extra missing values in category and rating (replace=False makes the counts exact)
messy_data.loc[np.random.choice(messy_data.index, 30, replace=False), 'category'] = None
messy_data.loc[np.random.choice(messy_data.index, 45, replace=False), 'rating'] = None

# 2. Outliers in amount (some crazy high values)
messy_data.loc[np.random.choice(messy_data.index, 5, replace=False), 'amount'] = np.random.uniform(5000, 10000, 5)

# 3. Impossible values (negative amounts, ages strictly over 100)
messy_data.loc[np.random.choice(messy_data.index, 3, replace=False), 'amount'] = -np.random.uniform(10, 100, 3)
messy_data.loc[np.random.choice(messy_data.index, 4, replace=False), 'customer_age'] = np.random.randint(101, 151, 4)

# 4. Duplicates
duplicate_rows = messy_data.sample(10)
messy_data = pd.concat([messy_data, duplicate_rows], ignore_index=True)

print("๐Ÿ” Messy dataset created (intentionally flawed!):")
print(f"๐Ÿ“Š Total Records: {len(messy_data):,}")
print(f"โŒ Missing values: {messy_data.isnull().sum().sum()}")
print(f"๐Ÿ”„ Duplicate rows: {messy_data.duplicated().sum()}")
print(f"โš ๏ธ Negative amounts: {(messy_data['amount'] < 0).sum()}")
print(f"โš ๏ธ Invalid ages: {(messy_data['customer_age'] > 100).sum()}")
print("\n" + "="*60)
messy_data.head(15)
python
# Check missing values
print("๐Ÿ“Š Missing Values Report:\n")
missing_report = pd.DataFrame({
    'Column': messy_data.columns,
    'Missing': messy_data.isnull().sum(),
    'Percentage': (messy_data.isnull().sum() / len(messy_data) * 100).round(2)
})
print(missing_report)
python
# Visual data quality check! ๐Ÿ”
pyg.walk(messy_data, hide_data_source_config=True)

๐ŸŽฏ Data Quality Check Exercises:

Exercise 1: Spot Outliers in Amount ๐Ÿ’ฐ

  1. Histogram: Drag amount to X-axis
  2. Question: See those bars way out on the right? Those are outliers!
  3. Box plot: Change chart type to see whiskers and outliers clearly
  4. Action: Investigate transactions > $5,000

Exercise 2: Find Missing Values Patterns โŒ

  1. Create a calculated field: is_missing_category (or filter by null)
  2. Compare missing vs non-missing records
  3. Question: Is missingness random or systematic?
  4. Bar chart: category counts - see how many nulls exist

Exercise 3: Detect Impossible Values ๐Ÿšจ

  1. Scatter plot: transaction_id (X) vs customer_age (Y)
  2. Question: See any points above 100? Those are errors!
  3. Repeat for amount - look for negative values
  4. Action: Create filters to isolate problematic records

Exercise 4: Check Distributions ๐Ÿ“Š

  1. Histogram: rating distribution
  2. Question: Is the distribution reasonable? Too many nulls?
  3. Compare across category
  4. Insight: Some categories might have more missing ratings

Exercise 5: Identify Patterns in Quality Issues ๐Ÿ”

  1. Create a flag: has_issues = TRUE if (age > 100 OR amount < 0)
  2. Analyze: Do issues cluster in certain categories?
  3. Insight: Data quality issues might be systematic!
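Exercise 5's flag can be created in pandas before handing the frame to PyGWalker. A minimal sketch on toy rows with the same columns as `messy_data`:

```python
import pandas as pd

# Toy stand-in for messy_data; the real frame has the same columns
df = pd.DataFrame({
    "amount": [120.0, -15.0, 80.0, 6000.0],
    "customer_age": [34, 41, 120, 29],
    "category": ["Food", "Other", None, "Electronics"],
})

# Flag rows with impossible values, then see where issues cluster by category
df["has_issues"] = (df["customer_age"] > 100) | (df["amount"] < 0)
print(df.groupby("category", dropna=False)["has_issues"].sum())
```

Drag `has_issues` to Color in PyGWalker and the problematic records jump out immediately.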

๐Ÿ› ๏ธ Data Cleaning Actions:

After visual inspection, here's what to fix:

python
# Clean the messy data based on insights
messy_data_cleaned = messy_data.copy()

# 1. Remove duplicates
messy_data_cleaned = messy_data_cleaned.drop_duplicates()

# 2. Fix impossible values
messy_data_cleaned = messy_data_cleaned[
    (messy_data_cleaned['amount'] >= 0) &
    (messy_data_cleaned['customer_age'] <= 100)
]

# 3. Handle outliers (cap at 99th percentile)
amount_99th = messy_data_cleaned['amount'].quantile(0.99)
messy_data_cleaned.loc[messy_data_cleaned['amount'] > amount_99th, 'amount'] = amount_99th

# 4. Fill missing categories
messy_data_cleaned['category'] = messy_data_cleaned['category'].fillna('Unknown')

print("โœ… Data cleaned!")
print(f"๐Ÿ“Š Records before: {len(messy_data):,} โ†’ after: {len(messy_data_cleaned):,}")
print(f"โœจ Removed: {len(messy_data) - len(messy_data_cleaned):,} problematic records")
print(f"โŒ Missing values: {messy_data_cleaned.isnull().sum().sum()}")
python
# Compare before and after! ๐Ÿ“Š
print("Let's visualize the cleaned data:")
pyg.walk(messy_data_cleaned, hide_data_source_config=True)

๐Ÿ’ก Data Quality Best Practices:

Always Check Before Analysis: โœ…

  • ๐Ÿ“Š Distributions: Histograms reveal outliers and skewness
  • โŒ Missing values: Identify patterns, not just counts
  • ๐Ÿ”ข Range checks: Min/max should make business sense
  • ๐Ÿ”„ Duplicates: Visual patterns can reveal duplicate records
  • ๐Ÿ“ˆ Trends: Unexpected spikes might indicate data issues

PyGWalker for QA:

  • โšก Faster than writing multiple plotting commands
  • ๐Ÿ‘๏ธ Interactive exploration helps spot subtle issues
  • ๐ŸŽฏ Visual patterns are easier to spot than statistics
  • ๐Ÿ“ธ Export problematic charts for documentation

Real-world impact: Catching data quality issues early saves hours (or days!) of debugging later! ๐ŸŽฏ


๐Ÿงช Use Case 4: A/B Testing Results

Scenario: Your product team ran an A/B test on a new feature. You need to analyze if variant B performs better than variant A.

Metrics to compare:

  • Conversion rate
  • Average order value
  • User engagement
  • Statistical significance

Let's analyze test results! ๐Ÿงช

python
# Create A/B test dataset
np.random.seed(789)

n_users = 1000

# Variant B performs slightly better (simulate this)
variant_a_users = n_users // 2
variant_b_users = n_users - variant_a_users

ab_test_data = pd.DataFrame({
    'user_id': range(1, n_users + 1),
    'variant': ['A'] * variant_a_users + ['B'] * variant_b_users,
    'converted': (
        list(np.random.choice([0, 1], variant_a_users, p=[0.75, 0.25])) +  # A: 25% conversion
        list(np.random.choice([0, 1], variant_b_users, p=[0.65, 0.35]))    # B: 35% conversion
    ),
    'time_on_page': np.concatenate([
        np.random.uniform(30, 180, variant_a_users),   # A: average ~105 seconds
        np.random.uniform(45, 210, variant_b_users)    # B: average ~128 seconds
    ]),
    'pages_viewed': np.concatenate([
        np.random.randint(1, 8, variant_a_users),      # A: fewer pages
        np.random.randint(2, 10, variant_b_users)       # B: more pages
    ]),
})

# Add order value (only for converted users); start at 0.0 so the column is float from the outset
ab_test_data['order_value'] = 0.0
ab_test_data.loc[ab_test_data['converted'] == 1, 'order_value'] = np.random.uniform(20, 200, ab_test_data['converted'].sum())

# Round numeric columns
ab_test_data['time_on_page'] = ab_test_data['time_on_page'].round(1)
ab_test_data['order_value'] = ab_test_data['order_value'].round(2)

# Add day of test
ab_test_data['test_day'] = np.random.randint(1, 15, n_users)

print("๐Ÿงช A/B Test dataset created!")
print(f"๐Ÿ‘ฅ Total Users: {len(ab_test_data):,}")
print(f"๐Ÿ“Š Variant A: {variant_a_users:,} users")
print(f"๐Ÿ“Š Variant B: {variant_b_users:,} users")
print("\n" + "="*60)
ab_test_data.head(10)
python
# Quick A/B test summary
print("๐Ÿ“Š A/B Test Results Summary:\n")
summary = ab_test_data.groupby('variant').agg({
    'converted': ['sum', 'mean'],
    'time_on_page': 'mean',
    'pages_viewed': 'mean',
    'order_value': 'mean'
}).round(3)

summary.columns = ['Total Conversions', 'Conversion Rate', 'Avg Time (sec)', 'Avg Pages', 'Avg Order Value']
print(summary)

# Calculate lift
conv_rate_a = ab_test_data[ab_test_data['variant'] == 'A']['converted'].mean()
conv_rate_b = ab_test_data[ab_test_data['variant'] == 'B']['converted'].mean()
lift = ((conv_rate_b - conv_rate_a) / conv_rate_a * 100)

print(f"\n๐Ÿš€ Lift (B vs A): {lift:.1f}%")
print(f"{'๐ŸŽ‰ Variant B wins!' if lift > 0 else '๐Ÿ“‰ Variant A is better'}")
python
# Analyze the A/B test! ๐ŸŽฏ
pyg.walk(ab_test_data, hide_data_source_config=True)

๐ŸŽฏ A/B Testing Analysis Exercises:

Exercise 1: Conversion Rate Comparison ๐Ÿ“Š

  1. Bar chart: variant (X) vs converted (Y, mean aggregation)
  2. Question: Which variant has higher conversion rate?
  3. Change to actual counts (sum) to see volume
  4. Insight: B converts at ~35% vs A at ~25% = 40% lift! ๐Ÿš€

Exercise 2: Engagement Metrics โฑ๏ธ

  1. Box plot: variant (X) vs time_on_page (Y)
  2. See the distribution differences
  3. Repeat for pages_viewed
  4. Question: Is B more engaging overall?

Exercise 3: Order Value Analysis ๐Ÿ’ฐ

  1. Filter: converted = 1 (only converted users)
  2. Box plot: variant vs order_value
  3. Question: Do B users spend more per order?
  4. Important: If order values are similar across variants, B's extra revenue comes from MORE conversions, not bigger orders

Exercise 4: Time-Series Check ๐Ÿ“…

  1. Line chart: test_day (X) vs converted mean (Y)
  2. Color by variant
  3. Question: Is performance consistent across days?
  4. Watch for: Novelty effects or contamination

Exercise 5: Segment Analysis ๐ŸŽฏ

  1. Create bins for time_on_page (low, medium, high engagement)
  2. Compare conversion by engagement level and variant
  3. Question: Does B work better for certain user types?
  4. Advanced: Look for interaction effects
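Exercise 5's engagement bins can be prepared with `pd.cut` before walking the frame. A sketch on toy data mirroring `ab_test_data`'s columns; the bin edges here are illustrative, not prescribed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "variant": rng.choice(["A", "B"], 200),
    "time_on_page": rng.uniform(30, 210, 200),
    "converted": rng.integers(0, 2, 200),
})

# Bin engagement into low/medium/high, then compare conversion per bin and variant
demo["engagement_level"] = pd.cut(
    demo["time_on_page"], bins=[0, 90, 150, np.inf],
    labels=["low", "medium", "high"],
)
pivot = demo.pivot_table(index="engagement_level", columns="variant",
                         values="converted", aggfunc="mean", observed=False)
print(pivot.round(3))
```

With the binned column in place, the interaction-effect check becomes a two-field drag in PyGWalker.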

๐Ÿ“ˆ Statistical Considerations:

What PyGWalker Shows:

  • โœ… Descriptive statistics visually
  • โœ… Distribution shapes and outliers
  • โœ… Trends over time
  • โœ… Segment-level differences

What You Still Need:

  • ๐Ÿ“Š Statistical significance tests (t-test, chi-square)
  • ๐Ÿ“Š Confidence intervals
  • ๐Ÿ“Š Power analysis

Pro tip: Use PyGWalker for exploratory analysis, then confirm with statistical tests in Python!

python
# Quick statistical test (bonus!)
from scipy import stats

# Chi-square test for conversion rate
contingency_table = pd.crosstab(ab_test_data['variant'], ab_test_data['converted'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print("๐Ÿ“Š Statistical Significance Test (Chi-Square):")
print("="*60)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"\n{'โœ… Statistically significant (p < 0.05)!' if p_value < 0.05 else 'โŒ Not statistically significant (p >= 0.05)'}")
print("\nConclusion:")
if p_value < 0.05 and conv_rate_b > conv_rate_a:
    print("๐ŸŽ‰ Variant B is significantly better! Ship it! ๐Ÿš€")
elif p_value < 0.05:
    print("๐Ÿ“‰ Variant A is significantly better. Keep the original.")
else:
    print("๐Ÿคท No significant difference. Need more data or run longer.")

๐Ÿ’ก A/B Testing Best Practices:

Before the Test: ๐Ÿ“‹

  • Define success metrics clearly
  • Calculate required sample size
  • Ensure random assignment
  • Set test duration
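"Calculate required sample size" from the list above can be done with the standard two-proportion formula. A stdlib-only sketch, assuming a 25% baseline and hoping to detect 35% at α = 0.05 with 80% power:

```python
from statistics import NormalDist
import math

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Users needed in EACH variant to detect p1 -> p2 (two-sided z-test approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_group(0.25, 0.35), "users per variant")
```

Libraries like statsmodels offer more exact power calculations, but this approximation is usually close enough for planning.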

During Analysis with PyGWalker: ๐Ÿ”

  • โœ… Check for outliers (can skew results)
  • โœ… Verify randomization (variants should look similar demographically)
  • โœ… Look for time-based patterns (novelty effects)
  • โœ… Segment analysis (does it work for everyone?)

Making the Decision: โœ…

  1. Visual exploration (PyGWalker) โœจ
  2. Statistical tests (scipy/statsmodels)
  3. Business context (cost, feasibility)
  4. Segment analysis (any negative impacts?)

Real-world impact: Proper A/B analysis can increase revenue by 10-30% through optimized features! ๐Ÿ“ˆ


You've completed all 4 real-world use cases! ๐ŸŽŠ

  • โœ… Sales analysis for business insights
  • โœ… Customer segmentation for targeted marketing
  • โœ… Data quality checks for reliable analysis
  • โœ… A/B testing for product decisions

Next up: Best practices and pro tips! ๐Ÿš€


๐Ÿ’Ž Best Practices & Pro Tips

You've learned the fundamentals and seen real-world applications. Now let's level up with advanced techniques and best practices! ๐Ÿš€

In this section:

  • โšก Performance optimization for large datasets
  • ๐ŸŽฏ Workflow recommendations
  • ๐Ÿ› Troubleshooting common issues
  • ๐Ÿ”ฅ Pro tips from power users
  • ๐Ÿ“š Additional resources

โšก Performance Optimization

PyGWalker is fast, but with large datasets, a few tweaks can make it even faster! โšก

๐Ÿ“Š How Big is Too Big?

Performance Guidelines:

| Dataset Size | Performance | Recommendations |
|---|---|---|
| < 10K rows | 🟢 Excellent | Use as-is, no optimization needed |
| 10K - 100K | 🟡 Good | Consider sampling for exploration |
| 100K - 1M | 🟠 Moderate | Use sampling + kernel calc |
| > 1M rows | 🔴 Slow | Aggregate first or use database |

Rule of thumb: If your DataFrame takes >2 seconds to display, it's time to optimize!
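The guidelines above can be captured as a tiny helper that names the strategy to reach for. The thresholds simply mirror the table and are worth tuning to your own machine:

```python
def sizing_advice(n_rows):
    """Suggest a PyGWalker strategy based on row count (per the table above)."""
    if n_rows < 10_000:
        return "use as-is"
    if n_rows < 100_000:
        return "sample for exploration"
    if n_rows <= 1_000_000:
        return "sample + kernel_computation=True"
    return "aggregate first or use a database"

print(sizing_advice(500_000))
```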

๐ŸŽฏ Optimization Technique 1: Smart Sampling

For initial exploration, you don't always need ALL the data!

python
# Create a large dataset for demonstration
import pandas as pd
import numpy as np

np.random.seed(42)
large_dataset = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=500000, freq='min'),
    'user_id': np.random.randint(1, 10000, 500000),
    'event_type': np.random.choice(['click', 'view', 'purchase', 'cart'], 500000),
    'value': np.random.uniform(0, 100, 500000),
    'session_duration': np.random.randint(10, 3600, 500000)
})

print(f"๐Ÿ“Š Large dataset created: {len(large_dataset):,} rows")
print(f"๐Ÿ’พ Memory usage: {large_dataset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
python
# โŒ Bad Practice: Using entire large dataset
# pyg.walk(large_dataset)  # This will be slow!

# โœ… Good Practice: Sample for exploration
sample_size = 10000
df_sample = large_dataset.sample(n=sample_size, random_state=42)

print(f"โœ… Sampled {sample_size:,} rows for exploration")
print(f"๐Ÿ“Š That's {(sample_size/len(large_dataset)*100):.1f}% of the data")
print(f"โšก Speed improvement: ~{len(large_dataset)//sample_size}x faster!")
python
# Fast exploration with sampled data
pyg.walk(df_sample, hide_data_source_config=True, kernel_computation=True)
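One caveat: plain `.sample()` can under-represent rare categories. If category balance matters (like `event_type` above, where purchases are rarer than views), a stratified sample keeps proportions intact. A sketch on a toy imbalanced frame:

```python
import pandas as pd

# Toy stand-in with an imbalanced category, like event_type above
events = pd.DataFrame({
    "event_type": ["view"] * 900 + ["purchase"] * 100,
    "value": range(1000),
})

# Take 10% of EACH category so rare events survive the sampling
strat = events.groupby("event_type", group_keys=False).sample(frac=0.1, random_state=42)
print(strat["event_type"].value_counts())
```

The per-group `.sample()` preserves the 9:1 view/purchase ratio, so charts built on the sample stay representative.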

๐ŸŽฏ Optimization Technique 2: Pre-Aggregation

If you're analyzing trends, aggregate BEFORE visualizing!

python
# โŒ Bad Practice: Visualizing 500K raw records for time trends

# โœ… Good Practice: Aggregate first!
daily_summary = large_dataset.groupby([
    large_dataset['date'].dt.date,
    'event_type'
]).agg({
    'user_id': 'nunique',  # Unique users
    'value': ['sum', 'mean'],
    'session_duration': 'mean'
}).reset_index()

daily_summary.columns = ['date', 'event_type', 'unique_users', 'total_value', 'avg_value', 'avg_duration']

print(f"โœ… Aggregated from {len(large_dataset):,} โ†’ {len(daily_summary):,} rows")
print(f"โšก That's a {len(large_dataset)//len(daily_summary)}x reduction!")
print("\nNow this will be lightning fast! โšก")
daily_summary.head()
python
# Super fast visualization of aggregated data
pyg.walk(daily_summary, hide_data_source_config=True, kernel_computation=True)

๐ŸŽฏ Optimization Technique 3: Data Type Optimization

Smaller data types = less memory = faster performance!

python
# Check current memory usage
print("๐Ÿ“Š Memory Usage by Column (BEFORE optimization):")
print("="*60)
memory_before = large_dataset.memory_usage(deep=True)
print(memory_before)
print(f"\n๐Ÿ’พ Total: {memory_before.sum() / 1024**2:.2f} MB")
python
# โœ… Optimize data types
large_dataset_optimized = large_dataset.copy()

# Convert object to category (huge savings!)
large_dataset_optimized['event_type'] = large_dataset_optimized['event_type'].astype('category')

# Use smaller int types
large_dataset_optimized['user_id'] = large_dataset_optimized['user_id'].astype('int32')
large_dataset_optimized['session_duration'] = large_dataset_optimized['session_duration'].astype('int16')

# Use float32 instead of float64
large_dataset_optimized['value'] = large_dataset_optimized['value'].astype('float32')

print("๐Ÿ“Š Memory Usage by Column (AFTER optimization):")
print("="*60)
memory_after = large_dataset_optimized.memory_usage(deep=True)
print(memory_after)
print(f"\n๐Ÿ’พ Total: {memory_after.sum() / 1024**2:.2f} MB")

savings = (1 - memory_after.sum() / memory_before.sum()) * 100
print(f"\n๐ŸŽ‰ Memory savings: {savings:.1f}%!")

๐ŸŽฏ Optimization Technique 4: Use Kernel Calculation

PyGWalker can offload calculations to your Python kernel for better performance!

python
# โœ… Best Practice: Enable kernel calculations
pyg.walk(
    df_sample,
    kernel_computation=True,  # ๐Ÿ”ฅ This is the magic parameter!
    hide_data_source_config=True
)

๐Ÿ“‹ Performance Optimization Checklist

Before using PyGWalker on large datasets:

โœ… Step 1: Check dataset size

  • If > 100K rows, consider optimization

โœ… Step 2: Sample for exploration

  • Use .sample() for initial analysis
  • 10K-50K rows is usually plenty

โœ… Step 3: Aggregate when possible

  • Daily/weekly summaries for time-series
  • Group by categories for comparisons

โœ… Step 4: Optimize data types

  • Use category for text with few unique values
  • Use smaller numeric types (int32, float32)

โœ… Step 5: Enable kernel calculation

  • Set kernel_computation=True

โœ… Step 6: Clean up first

  • Remove unnecessary columns
  • Drop duplicates
  • Handle missing values

Pro tip: Profile your code with %%time to measure improvements! โฑ๏ธ
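The dtype step of the checklist can be wrapped in a reusable helper. A hedged sketch using pandas' `to_numeric(downcast=...)` plus category conversion for low-cardinality text columns (the 50% cardinality threshold is an assumption, tune it):

```python
import pandas as pd

def shrink(df, cat_threshold=0.5):
    """Downcast numeric columns and convert low-cardinality object columns to category."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif out[col].dtype == object and out[col].nunique() / len(out) < cat_threshold:
            out[col] = out[col].astype("category")
    return out

demo = pd.DataFrame({"n": range(1000), "x": [0.5] * 1000, "tag": ["a", "b"] * 500})
small = shrink(demo)
print(small.dtypes)
```

Run `shrink()` once before `pyg.walk()` and you get most of Technique 3's savings for free on any DataFrame.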


๐ŸŽฏ Workflow Best Practices

How to integrate PyGWalker into your data science workflow efficiently!

Phase 1: Initial Exploration ๐Ÿ”

python
# 1. Load your data
df = pd.read_csv("your_data.csv")

# 2. Quick overview
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")

# 3. Basic statistics
df.describe()
python
# 4. PyGWalker for visual exploration โญ
# Spend 10-15 minutes exploring interactively
pyg.walk(df, hide_data_source_config=True)

Phase 2: Deep Dive Analysis ๐ŸŽฏ

After initial exploration, you'll have questions. Answer them systematically:

python
# Example: Based on PyGWalker exploration, you noticed something interesting
# Now create focused analysis

# 5. Clean and prepare data based on insights
df_clean = df.dropna(subset=['important_column'])
df_clean = df_clean[df_clean['value'] > 0]

# 6. Create calculated fields you identified as useful
df_clean['new_metric'] = df_clean['a'] / df_clean['b']

# 7. Explore the refined dataset
pyg.walk(df_clean, hide_data_source_config=True)

Phase 3: Documentation & Sharing ๐Ÿ“

python
# 8. Export key visualizations
# Use PyGWalker's export button to save charts as PNG/SVG

# 9. Document insights in markdown cells
"""
Key Findings:
- Insight 1: [description]
- Insight 2: [description]
- Recommendation: [action items]
"""

# 10. Create final summary statistics
final_summary = df_clean.groupby('category').agg({
    'metric1': 'mean',
    'metric2': 'sum'
})
print(final_summary)

๐Ÿ’ก PyGWalker in Different Workflows

For Data Scientists: ๐Ÿงช

  • โœ… Use PyGWalker for EDA before modeling
  • โœ… Visualize feature distributions
  • โœ… Spot outliers that might affect models
  • โœ… Understand feature relationships

For Analysts: ๐Ÿ“Š

  • โœ… Quick ad-hoc analysis
  • โœ… Create presentation-ready charts
  • โœ… Interactive dashboards in notebooks
  • โœ… Self-service analytics

For Data Engineers: ๐Ÿ”ง

  • โœ… Data quality validation
  • โœ… Pipeline monitoring
  • โœ… Quick sanity checks
  • โœ… Distribution verification

For Business Users: ๐Ÿ’ผ

  • โœ… Explore data without coding (mostly!)
  • โœ… Answer business questions quickly
  • โœ… Drag-and-drop simplicity
  • โœ… Share insights with stakeholders

๐Ÿ› Troubleshooting & Common Issues

Running into problems? Here are solutions to the most common issues! ๐Ÿ”ง

โŒ Issue 1: PyGWalker Widget Not Displaying

Symptoms:

  • Blank output cell
  • No interactive interface appears
  • Just see <pygwalker.walker.Walker object at 0x...>

Solutions:

python
# โœ… Solution 1: Make sure you're in a supported environment
import sys
print(f"Python version: {sys.version}")
print(f"Environment: {'Google Colab' if 'google.colab' in sys.modules else 'Other'}")

# โœ… Solution 2: Update PyGWalker to latest version
# !pip install --upgrade pygwalker

# โœ… Solution 3: Restart runtime and try again
# In Colab: Runtime > Restart runtime

โŒ Issue 2: Slow Performance / Browser Freezing

Symptoms:

  • Interface takes forever to load
  • Browser becomes unresponsive
  • Lag when dragging fields

Solutions:

python
# โœ… Solution 1: Sample your data
df_sample = df.sample(min(10000, len(df)))
pyg.walk(df_sample)

# โœ… Solution 2: Use kernel calculation
pyg.walk(df, kernel_computation=True)

# โœ… Solution 3: Drop unnecessary columns
df_slim = df[['col1', 'col2', 'col3']]  # Only columns you need
pyg.walk(df_slim)

โŒ Issue 3: Charts Look Wrong / Unexpected Aggregations

Symptoms:

  • Numbers don't match expectations
  • Chart shows "sum" when you want "count"
  • Weird groupings

Solutions:

๐Ÿ’ก Understand auto-aggregation:

  • When you drag a measure (numeric) to an axis with dimensions, PyGWalker aggregates!
  • Default is usually SUM or MEAN
  • Click on the field in the shelf to change aggregation type
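What the UI does when it auto-aggregates is equivalent to a pandas `groupby`; seeing the two side by side makes the SUM vs MEAN distinction concrete (toy data; the shelf click just switches the aggregation function):

```python
import pandas as pd

demo = pd.DataFrame({"category": ["A", "A", "B"], "value": [10, 20, 30]})

# Dragging `value` (measure) onto an axis that already holds `category` (dimension):
print(demo.groupby("category")["value"].sum())   # SUM-style aggregation
print(demo.groupby("category")["value"].mean())  # after switching to MEAN
```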

๐Ÿ’ก Check data types:

  • Text fields stored as numbers? Convert them!
  • Dates recognized as strings? Parse them!
python
# โœ… Fix data types
df['date_column'] = pd.to_datetime(df['date_column'])
df['category_column'] = df['category_column'].astype('category')
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')

โŒ Issue 4: Missing Values Causing Problems

Symptoms:

  • Filters not working as expected
  • Aggregations returning NaN
  • Charts missing data points

Solutions:

python
# โœ… Option 1: Drop missing values
df_clean = df.dropna()

# โœ… Option 2: Fill missing values
df['column'] = df['column'].fillna(0)  # or mean, median, etc.

# โœ… Option 3: Create "Missing" category
df['column'] = df['column'].fillna('Unknown')

โŒ Issue 5: Can't Export or Save Visualizations

Symptoms:

  • Export button not working
  • Can't save chart configurations

Solutions:

๐Ÿ’ก Export as image:

  1. Look for the download/export icon (usually top-right)
  2. Choose PNG or SVG format
  3. Save to your local machine

๐Ÿ’ก Save configuration:

python
# โœ… Save your chart setup as JSON
pyg.walk(df, spec="./my_chart_config.json")

# โœ… Load it later
pyg.walk(df, spec="./my_chart_config.json")

โŒ Issue 6: Colors/Themes Not Applying

Symptoms:

  • Dark mode not working
  • Custom colors not showing

Solutions:

python
# โœ… Explicitly set theme
pyg.walk(df, appearance='light')  # or 'dark'

# โœ… For custom styling, modify after rendering
# (Advanced: requires CSS knowledge)

๐Ÿ†˜ Still Having Issues?

Debug Checklist: โœ…

  1. โœ… Updated to latest PyGWalker version?
  2. โœ… Restarted your notebook kernel?
  3. โœ… Checked DataFrame has data? (df.head())
  4. โœ… Verified data types? (df.dtypes)
  5. โœ… Tried with a simple example first?
  6. โœ… Checked GitHub issues for similar problems?

Get Help:


๐Ÿ”ฅ Pro Tips from Power Users

Advanced techniques that will make you a PyGWalker master! ๐ŸŽฏ

๐Ÿ’Ž Pro Tip 1: Save & Reuse Chart Configurations

Create once, reuse everywhere!

python
# โœ… Save your perfect chart setup
pyg.walk(df, spec="./sales_dashboard.json")

# Later, load it with new data (same structure)
df_new = pd.read_csv("next_month_data.csv")
pyg.walk(df_new, spec="./sales_dashboard.json")

# ๐ŸŽ‰ Instant dashboard with new data!

๐Ÿ’Ž Pro Tip 2: Combine with Other Libraries

PyGWalker plays nicely with the Python ecosystem!

python
# Example: Use pandas for heavy preprocessing, PyGWalker for visualization
import pandas as pd

# Complex aggregation in pandas
summary = df.groupby(['category', 'region']).agg({
    'revenue': ['sum', 'mean'],
    'units': 'sum',
    'customers': 'nunique'
}).reset_index()

summary.columns = ['category', 'region', 'total_revenue', 'avg_revenue', 'total_units', 'unique_customers']

# Beautiful visualization in PyGWalker
pyg.walk(summary)

๐Ÿ’Ž Pro Tip 3: Use for Jupyter Presentations

Create interactive presentations with RISE + PyGWalker!

python
# Install RISE for slideshows
# !pip install RISE

# Then use PyGWalker in your slides for interactive demos
# Your audience can explore data in real-time! ๐ŸŽช

๐Ÿ’Ž Pro Tip 4: Quick Data Quality Dashboard

Create a reusable data quality checker!

python
def data_quality_report(df):
    """
    Create a comprehensive data quality report with PyGWalker
    """
    import pandas as pd

    # Create quality metrics DataFrame
    quality_df = pd.DataFrame({
        'column': df.columns,
        'dtype': df.dtypes.astype(str),
        'missing_count': df.isnull().sum(),
        'missing_pct': (df.isnull().sum() / len(df) * 100).round(2),
        'unique_values': [df[col].nunique() for col in df.columns],
        'sample_value': [str(df[col].iloc[0]) if len(df) > 0 else '' for col in df.columns]
    })

    print("๐Ÿ“Š Data Quality Report:")
    print("="*60)
    print(quality_df.to_string())

    # Visualize with PyGWalker
    return pyg.walk(quality_df, hide_data_source_config=True)

# Use it on any DataFrame!
# data_quality_report(your_df)

๐Ÿ’Ž Pro Tip 5: Create Custom Analysis Templates

Build reusable analysis workflows!

python
def customer_analysis(df, customer_col, value_col, date_col):
    """
    Standardized customer analysis with PyGWalker
    """
    # Ensure the date column is datetime so the tenure math below works
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col])

    # Create summary
    summary = df.groupby(customer_col).agg({
        value_col: ['sum', 'mean', 'count'],
        date_col: ['min', 'max']
    }).reset_index()

    summary.columns = [customer_col, 'total_value', 'avg_value', 'transactions', 'first_purchase', 'last_purchase']

    # Calculate additional metrics
    summary['customer_tenure_days'] = (summary['last_purchase'] - summary['first_purchase']).dt.days
    summary['value_segment'] = pd.qcut(summary['total_value'], q=3, labels=['Low', 'Medium', 'High'])

    return pyg.walk(summary, hide_data_source_config=True)

# One function, works with any customer dataset! ๐ŸŽฏ

๐Ÿ’Ž Pro Tip 6: Keyboard Shortcuts

Speed up your workflow! โšก

Common shortcuts (may vary by version):

  • Ctrl/Cmd + Z: Undo last action
  • Ctrl/Cmd + C: Copy chart
  • ESC: Clear selection
  • Drag field while holding Shift: Duplicate field

Pro move: Hover over buttons for tooltips! ๐Ÿ’ก

๐Ÿ’Ž Pro Tip 7: Mobile-Friendly Dashboards

PyGWalker visualizations work on mobile browsers!

python
# โœ… For better mobile experience
pyg.walk(df, hide_data_source_config=True)  # Cleaner interface

# Share your Colab notebook link with stakeholders
# They can view (and interact!) on their phones! ๐Ÿ“ฑ

๐Ÿ“Š PyGWalker vs Other Tools: Deep Dive

Let's see how PyGWalker compares to popular alternatives:

๐Ÿ†š PyGWalker vs Matplotlib/Seaborn

Matplotlib/Seaborn:

python
# Traditional approach - multiple lines of code
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Scatter
axes[0, 0].scatter(df['x'], df['y'])
axes[0, 0].set_title('X vs Y')

# Plot 2: Histogram
axes[0, 1].hist(df['value'], bins=20)
axes[0, 1].set_title('Value Distribution')

# Plot 3: Box plot
sns.boxplot(data=df, x='category', y='value', ax=axes[1, 0])
axes[1, 0].set_title('Value by Category')

# Plot 4: Line chart
df.groupby('date')['value'].mean().plot(ax=axes[1, 1])
axes[1, 1].set_title('Trend Over Time')

plt.tight_layout()
plt.show()

PyGWalker approach:

python
# One line! ๐ŸŽ‰
pyg.walk(df)
# Then drag and drop to create all 4 visualizations interactively!

Verdict: ๐Ÿ†

| Aspect | Matplotlib/Seaborn | PyGWalker |
|---|---|---|
| Code Required | Many lines | 1 line |
| Flexibility | 🟢 Extreme | 🟡 High |
| Speed (to insight) | 🔴 Slow | 🟢 Fast |
| Interactivity | 🔴 None | 🟢 Full |
| Learning Curve | 🔴 Steep | 🟢 Easy |
| Publication Quality | 🟢 Excellent | 🟡 Good |
| Best For | Final charts | Exploration |

Use Matplotlib/Seaborn when: You need pixel-perfect, publication-ready static charts

Use PyGWalker when: You're exploring data and want insights fast! โšก

๐Ÿ†š PyGWalker vs Plotly

Plotly:

python
# Plotly - still requires code for each chart
import plotly.express as px

fig = px.scatter(df, x='x', y='y', color='category', size='value')
fig.show()

# Different chart? New code!
fig = px.bar(df, x='category', y='value')
fig.show()

PyGWalker:

python
# Switch between chart types with clicks!
pyg.walk(df)

Verdict: ๐Ÿ†

| Aspect | Plotly | PyGWalker |
|---|---|---|
| Interactivity | 🟢 Excellent | 🟢 Excellent |
| Code Required | 🟡 Moderate | 🟢 Minimal |
| Chart Types | 🟢 Extensive | 🟡 Good |
| Ease of Use | 🟡 Medium | 🟢 Easy |
| Customization | 🟢 Very High | 🟡 Moderate |
| Dashboard Building | 🟢 Dash/Streamlit | 🟡 Notebook only |

Use Plotly when: Building production dashboards or need specific chart types

Use PyGWalker when: Rapid exploration in notebooks! ๐Ÿš€

๐Ÿ†š PyGWalker vs Tableau/Power BI

Tableau/Power BI:

  • ๐Ÿ’ฐ Expensive (hundreds/thousands per year)
  • ๐Ÿ–ฅ๏ธ Separate desktop application
  • โŒ Not integrated with Python
  • โœ… Enterprise features (collaboration, permissions)
  • โœ… Polished UI

PyGWalker:

  • ๐Ÿ†“ Free and open source
  • ๐Ÿ““ Lives in your notebook
  • ๐Ÿ Native Python integration
  • โŒ Limited collaboration features
  • โœ… Simple and effective

Verdict: ๐Ÿ†

Use Tableau/Power BI when you need enterprise-wide BI solution

Use PyGWalker when you want "Tableau-like" exploration IN Python! ๐ŸŽฏ

๐Ÿ†š PyGWalker vs pandas.plot()

pandas.plot():

python
# Quick but limited
df['value'].plot(kind='hist')
df.groupby('category')['value'].mean().plot(kind='bar')

PyGWalker:

python
# Quick AND powerful
pyg.walk(df)

Verdict: ๐Ÿ†

pandas.plot() is great for quick checks, but PyGWalker is better for serious exploration!

Pro tip: Use both! pandas.plot() for ultra-quick checks, PyGWalker for deeper dives. ๐ŸŽฏ


๐Ÿ“š Additional Resources

Keep learning and stay updated! ๐Ÿ“–

๐Ÿ“– Official Documentation & Learning

Essential Links:

Community:

PyGWalker is part of the Kanaries ecosystem:

  • ๐ŸŽจ Graphic Walker - Web-based version (JavaScript)
  • ๐Ÿš€ RATH - Automated data analysis & insights
  • ๐Ÿ“Š Kanaries Cloud - Hosted analytics platform

Beginner (You are here! ๐ŸŽ‰):

  1. โœ… Complete this tutorial
  2. โœ… Practice with your own datasets
  3. โœ… Join the Discord community

Intermediate:

  1. ๐Ÿ“Š Explore advanced configurations
  2. ๐Ÿ”ง Integrate into your workflow
  3. ๐Ÿ’ก Contribute examples to the community

Advanced:

  1. ๐Ÿš€ Optimize for large datasets
  2. ๐ŸŽจ Customize with themes
  3. ๐Ÿค Contribute to the project!

๐ŸŽ“ Practice Datasets

Want more practice? Try these datasets:

Built-in (via seaborn-data):

  • ๐Ÿง Penguins (what we used!)
  • ๐Ÿ’Ž Diamonds
  • ๐Ÿšข Titanic
  • ๐Ÿš• Taxis
  • ๐ŸŒธ Iris

Loading them directly:

```python
# Quick access to seaborn datasets
datasets = ['penguins', 'diamonds', 'titanic', 'taxis', 'iris', 'tips', 'flights']

print("📊 Available seaborn datasets:")
for dataset in datasets:
    url = f"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/{dataset}.csv"
    print(f"  • {dataset}: {url}")

# Try any of them!
# df = pd.read_csv(url)
# pyg.walk(df)
```

๐Ÿค Contributing to PyGWalker

Want to give back? Here's how! โค๏ธ

๐ŸŽฏ Ways to Contribute

Even if you're not a developer:

  1. โญ Star the Repository

    • Go to GitHub
    • Click the โญ star button
    • Helps the project grow!
  2. ๐Ÿ“ Share Your Use Cases

    • Write blog posts
    • Create video tutorials
    • Share on social media
    • Tag @kanaries_data
  3. ๐Ÿ› Report Bugs

    • Found an issue? Report it!
    • Include: Python version, code snippet, error message
    • Screenshots help a lot!
  4. ๐Ÿ’ก Suggest Features

  5. ๐Ÿ“š Improve Documentation

    • Fix typos
    • Add examples
    • Clarify confusing sections
    • Translate to other languages
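For the bug-reporting item above, here is a minimal sketch of gathering the environment details worth pasting into an issue. The import is guarded so the snippet runs anywhere, and `pygwalker.__version__` is assumed to follow the usual packaging convention:

```python
import sys
import platform

import pandas as pd

# Collect the version details a good bug report should include
report = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "pandas": pd.__version__,
}

# PyGWalker may not be installed in every environment, so guard the import
try:
    import pygwalker
    report["pygwalker"] = pygwalker.__version__
except ImportError:
    report["pygwalker"] = "not installed"

for name, version in report.items():
    print(f"{name}: {version}")
```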

If you ARE a developer:

  1. ๐Ÿ’ป Contribute Code

  2. ๐Ÿงช Add Tests

    • Improve test coverage
    • Add edge case tests
    • Performance benchmarks

๐Ÿ“‹ Contributing Guidelines

Before submitting a contribution (like this tutorial!):

โœ… 1. Check Existing Issues/PRs

  • Avoid duplicates!

โœ… 2. Open an Issue First

  • Describe what you want to add
  • Get feedback before spending time

โœ… 3. Follow the Style

  • Match existing code/doc style
  • Use clear, descriptive names
  • Add comments where helpful

โœ… 4. Test Thoroughly

  • Test in different environments
  • Check for edge cases
  • Include examples

โœ… 5. Write Clear PR Description

  • What does it do?
  • Why is it useful?
  • How to test it?
  • Screenshots if visual

๐ŸŽ‰ Recognition

Contributors get:

  • โœ… Name in contributors list
  • โœ… GitHub profile contribution
  • โœ… Satisfaction of helping thousands! ๐ŸŒ
  • โœ… Resume/portfolio material
  • โœ… Experience with open source

This tutorial is an example of community contribution! ๐Ÿ™Œ


๐ŸŽŠ Conclusion: You're Now a PyGWalker Pro!

Congratulations! ๐ŸŽ‰ You've completed the comprehensive PyGWalker tutorial!

โœ… What You've Mastered:

Fundamentals:

  • โœ… Installation and setup
  • โœ… Basic visualizations with drag-and-drop
  • โœ… Understanding the interface
  • โœ… Chart types and when to use them

Advanced Techniques:

  • โœ… Filters and aggregations
  • โœ… Calculated fields
  • โœ… Customization and styling
  • โœ… Performance optimization

Real-World Applications:

  • โœ… Sales analysis
  • โœ… Customer segmentation
  • โœ… Data quality checks
  • โœ… A/B testing

Best Practices:

  • โœ… Workflow integration
  • โœ… Troubleshooting common issues
  • โœ… Pro tips and tricks
  • โœ… Tool comparison

๐Ÿš€ What's Next?

Immediate Actions:

  1. ๐ŸŽฏ Practice - Use PyGWalker on your own datasets
  2. โญ Star the repo - Show your support!
  3. ๐Ÿ’ฌ Join Discord - Connect with the community
  4. ๐Ÿ“ Share - Teach others what you learned

This Week:

  1. ๐Ÿ“Š Integrate PyGWalker into your workflow
  2. ๐Ÿ” Explore at least 3 different datasets
  3. ๐Ÿ’ก Share one insight you discovered

This Month:

  1. ๐Ÿค Help someone learn PyGWalker
  2. ๐Ÿ› Report a bug or suggest a feature
  3. ๐Ÿ“ Write a blog post or create a video

๐Ÿ’ก Remember:

"The best way to learn data analysis is to analyze data!"

PyGWalker makes that process:

  • โšก Faster
  • ๐ŸŽจ More intuitive
  • ๐Ÿ˜Š More enjoyable

Don't wait for the perfect dataset - start exploring now! Every dataset has a story to tell. ๐Ÿ“–

๐Ÿ™ Thank You!

Thank you for completing this tutorial! We hope PyGWalker becomes an essential part of your data toolkit.

Questions? Ideas? Feedback?

  • ๐Ÿ’ฌ Discord: Join here
  • ๐Ÿ› Issues: GitHub
  • ๐Ÿ“ง Email: Check official docs

Happy exploring! ๐ŸŽ‰๐Ÿง๐Ÿ“Š


Made with โค๏ธ by the PyGWalker Community

This tutorial was created as a community contribution. Star the repo and contribute your own examples!

Version: PyGWalker 0.4.x+
Last Updated: 2024
License: Apache-2.0

๐Ÿ”— Share this tutorial: Help others discover PyGWalker!

#DataScience #Python #DataVisualization #PyGWalker #OpenSource


๐Ÿ“Ž Appendix: Quick Reference

๐ŸŽฏ Common PyGWalker Patterns

Basic Usage:

```python
import pygwalker as pyg
pyg.walk(df)
```

With Options:

```python
pyg.walk(
    df,
    hide_data_source_config=True,  # Cleaner UI
    appearance='light',            # or 'dark'
    kernel_computation=True,       # Better performance
    spec="./config.json",          # Save/load config
)
```

Performance Optimization:

```python
# Sample large datasets before walking them
df_sample = df.sample(10000)
pyg.walk(df_sample, kernel_computation=True)
```
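One caveat: plain `df.sample()` can drop rare categories entirely. When that matters, a stratified (per-group) sample keeps every category represented. A sketch with synthetic data; the column names are illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic data: 99% "common", 1% "rare"
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "category": rng.choice(["common", "rare"], size=100_000, p=[0.99, 0.01]),
    "value": rng.normal(size=100_000),
})

# Proportional 10% sample per category, so rare groups survive
df_sample = df.groupby("category").sample(frac=0.1, random_state=0)

print(df_sample["category"].value_counts())
# Ready for: pyg.walk(df_sample, kernel_computation=True)
```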

๐ŸŽจ Chart Type Quick Guide

| Use Case | Chart Type | Fields |
|---|---|---|
| Correlation | Scatter | X: numeric, Y: numeric, Color: category |
| Comparison | Bar | X: category, Y: numeric (aggregated) |
| Trend | Line | X: date/time, Y: numeric, Color: category |
| Distribution | Histogram | X: numeric (binned) |
| Part-to-whole | Pie | Color: category, Angle: numeric |
| Relationship | Heatmap | X: category, Y: category, Color: numeric |
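One gotcha with the Trend row: a column typically only shows up as a temporal axis if pandas already stores it as datetime, so string dates usually need converting first. A minimal sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-02-01", "2024-03-01"],
    "value": [100, 120, 90],
})

# String dates won't be offered as a date/time axis; convert the dtype first
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)
```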

โŒจ๏ธ Keyboard Shortcuts

  • Ctrl/Cmd + Z: Undo
  • ESC: Clear selection
  • Delete: Remove field from shelf

End of Tutorial ๐ŸŽ“

Happy Data Exploring! ๐Ÿš€๐Ÿ“Š๐Ÿง


๐Ÿ™ Acknowledgments

This tutorial was created by Leonardo Braga as a community contribution to PyGWalker.

๐Ÿ’ผ Data Science & AI Student | ๐Ÿ‡ง๐Ÿ‡ท Brasรญlia, Brazil
๐Ÿ’ป GitHub | ๐Ÿ“ง [email protected]

Passionate about open-source, data science, and making analytics accessible to everyone.


๐ŸŒŸ Found this helpful?

Questions or feedback? Open an issue or reach out directly!


License: This tutorial follows PyGWalker's Apache-2.0 License
Last Updated: 11/06/2025
PyGWalker Version: 0.4.x+

Made with โค๏ธ for the data community