<center><h2>Scale your pandas workflows by changing one line of code</h2>

Exercise 2: Speed improvements

GOAL: Learn about common functionality that Modin speeds up by using all of your machine's cores.

Concept for Exercise: `read_csv` speedups

Reading CSV files is the most common way to ingest data into pandas (link to pandas survey). This concept is designed to give an idea of the kinds of speedups possible, even on a non-distributed filesystem. Modin also supports parallel and distributed reads for other file formats, which are listed in the documentation.

We will import both Modin and pandas so that the speedups are evident.

Note: Rerunning the `read_csv` cells many times may degrade performance, depending on the machine's memory.

python

import modin.pandas as pd
import pandas
import time
from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))

Dataset: 2015 NYC taxi trip data

We will be using a version of this data already in S3, originally posted in this blog post: https://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes

Size: ~1.8GB

python

path = "s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv"

Optional: The dataset takes a while to download. If you prefer to download the file once and read it locally, you can run the following code in the notebook:

python

# [Optional] Download data locally. This may take a few minutes to download.
# import urllib.request
# url_path = "https://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-01.csv"
# urllib.request.urlretrieve(url_path, "taxi.csv")
# path = "taxi.csv"

`pandas.read_csv`

python

start = time.time()

pandas_df = pandas.read_csv(path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)

end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))

Expect pandas to take >3 minutes on EC2, longer locally
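The `start`/`end` timing pattern above is repeated in every cell of this exercise. As a minimal sketch, it could be wrapped in a small helper. The names `timed` and `speedup` are hypothetical and not part of Modin or pandas, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals. The `time.sleep` calls merely stand in for the real `read_csv` calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print("Time for {}: {} seconds".format(label, round(results[label], 3)))


def speedup(results, baseline, candidate):
    """Return how many times faster `candidate` ran than `baseline`."""
    return results[baseline] / results[candidate]


# Stand-in workloads; in the notebook these would be the two read_csv calls.
durations = {}
with timed("slow task", durations):
    time.sleep(0.05)
with timed("fast task", durations):
    time.sleep(0.01)

print("Speedup: {}x".format(round(speedup(durations, "slow task", "fast task"), 2)))
```

In the notebook, the body of each `with` block would be the pandas or Modin call being measured, and `speedup` reproduces the `pandas_duration / modin_duration` ratio.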

This is a good time to chat with your neighbor.

Discussion topics

Do you work with a large amount of data daily?
How big is your data?
What is the most common use case for your data?
Do you use any big data analytics tools?
Do you use any interactive analytics tools?
What are some drawbacks of your current interactive analytics tools?

`modin.pandas.read_csv`

python

start = time.time()

modin_df = pd.read_csv(path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)

end = time.time()
modin_duration = end - start
print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))

Are they equal?

python

pandas_df

python

modin_df
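Displaying the two frames only allows a visual comparison. A programmatic check is sketched below using `pandas.testing.assert_frame_equal`. The helper names `to_plain_pandas` and `frames_match` are hypothetical. The conversion relies on the assumption that a Modin DataFrame exposes a private `_to_pandas()` method, which may change between Modin versions. The small frames here are made-up stand-ins for the taxi data.

```python
import pandas
from pandas.testing import assert_frame_equal


def to_plain_pandas(df):
    """Return a plain pandas DataFrame.

    Assumption: Modin DataFrames expose a private `_to_pandas()` method;
    plain pandas DataFrames do not and are returned unchanged.
    """
    if hasattr(df, "_to_pandas"):
        return df._to_pandas()
    return df


def frames_match(left, right):
    """Return True if both frames hold the same labels, dtypes, and values."""
    try:
        assert_frame_equal(to_plain_pandas(left), to_plain_pandas(right))
    except AssertionError as err:
        print("Frames differ:", err)
        return False
    return True


# Made-up stand-in data; NaN values in matching positions count as equal.
a = pandas.DataFrame({"x": [1, 2, 3], "y": [0.5, None, 1.5]})
b = a.copy()
c = a.assign(x=[1, 2, 4])

print(frames_match(a, b))
print(frames_match(a, c))
```

In the notebook this would be called as `frames_match(pandas_df, modin_df)`. Note that the conversion materializes the whole Modin frame as a pandas frame, so it needs enough memory to hold a second copy of the dataset.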

Concept for exercise: Reduces

In pandas, a reduce is an operation such as `sum` or `count` that computes a summary statistic for each row or column. We will be using `count`.
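To make the idea concrete, here is a small sketch on a made-up frame (the column names `fare` and `tip` are illustrative and not taken from the taxi dataset). `count` collapses one axis and returns the number of non-null values along it.

```python
import pandas

# Made-up data with some missing values.
df = pandas.DataFrame({
    "fare": [7.5, None, 12.0, 3.0],
    "tip": [1.0, 2.0, None, None],
})

counts = df.count()            # reduce over rows: one value per column
row_counts = df.count(axis=1)  # reduce over columns: one value per row

print(counts)
print(row_counts)
```

The result has one fewer dimension than the input: a four-row, two-column frame reduces to a two-element Series by default, or to a four-element Series with `axis=1`.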

python

start = time.time()

pandas_count = pandas_df.count()

end = time.time()
pandas_duration = end - start

print("Time to count with pandas: {} seconds".format(round(pandas_duration, 3)))

python

start = time.time()

modin_count = modin_df.count()

end = time.time()
modin_duration = end - start
print("Time to count with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `count`!".format(round(pandas_duration / modin_duration, 2)))

Are they equal?

python

pandas_count

python

modin_count

Concept for exercise: Map operations

In pandas, map operations make a single pass over the data and do not change its shape. Operations such as `isnull` and `applymap` (renamed to `map` in pandas 2.1) fall into this category. We will be using `isnull`.
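As a small illustration on made-up data (the column names are hypothetical), the sketch below shows that `isnull` returns a frame with the same shape as its input, with each cell replaced by a boolean.

```python
import pandas

# Made-up data with some missing values.
df = pandas.DataFrame({"fare": [7.5, None, 12.0], "tip": [1.0, 2.0, None]})

mask = df.isnull()  # element-wise: True where a value is missing

print(mask.shape == df.shape)
print(mask)

# A map can feed a reduce: count the missing values in each column.
missing_per_column = mask.sum()
print(missing_per_column)
```

Because each output cell depends only on the corresponding input cell, the work can be split across partitions of the frame without any coordination between them.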

python

start = time.time()

pandas_isnull = pandas_df.isnull()

end = time.time()
pandas_duration = end - start

print("Time to isnull with pandas: {} seconds".format(round(pandas_duration, 3)))

python

start = time.time()

modin_isnull = modin_df.isnull()

end = time.time()
modin_duration = end - start
print("Time to isnull with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `isnull`!".format(round(pandas_duration / modin_duration, 2)))

Are they equal?

python

pandas_isnull

python

modin_isnull

Concept for exercise: Apply over a single column

Sometimes we want to apply a function to every value in a single column of our dataset. Here we round each trip distance.

python

start = time.time()
rounded_trip_distance_pandas = pandas_df["trip_distance"].apply(round)

end = time.time()
pandas_duration = end - start
print("Time to apply on one column with pandas: {} seconds".format(round(pandas_duration, 3)))

python

start = time.time()

rounded_trip_distance_modin = modin_df["trip_distance"].apply(round)

end = time.time()
modin_duration = end - start
print("Time to apply on one column with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `apply` on one column!".format(round(pandas_duration / modin_duration, 2)))

Are they equal?

python

rounded_trip_distance_pandas

python

rounded_trip_distance_modin

Concept for exercise: Add a column

It is common to need to add a new column to an existing dataframe. Here we show that this is significantly faster in Modin due to metadata management and an efficient zero-copy implementation.

python

start = time.time()
pandas_df["rounded_trip_distance"] = rounded_trip_distance_pandas

end = time.time()
pandas_duration = end - start
print("Time to add a column with pandas: {} seconds".format(round(pandas_duration, 3)))

python

start = time.time()

modin_df["rounded_trip_distance"] = rounded_trip_distance_modin

end = time.time()
modin_duration = end - start
print("Time to add a column with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at adding a column!".format(round(pandas_duration / modin_duration, 2)))

Are they equal?

python

pandas_df

python

modin_df

Please move on to Exercise 3 when you are ready.
