examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_2.ipynb
GOAL: Learn about common functionality that Modin speeds up by using all of your machine's cores.
read_csv speedups
The most commonly used data ingestion format in pandas is CSV (link to pandas survey). This section is designed to give an idea of the kinds of speedups possible, even on a non-distributed filesystem. Modin also supports other file formats for parallel and distributed reads, which can be found in the documentation.
We will import both Modin and pandas so that the speedups are evident.
Note: Rerunning the read_csv cells many times may degrade performance, depending on the memory of the machine.
import modin.pandas as pd
import pandas
import time
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
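The cells in this exercise repeat the same `start = time.time()` / `end = time.time()` pattern around each operation. As a minimal sketch, that pattern could be wrapped in a small context manager. The names `timed` and `durations` are hypothetical helpers introduced here for illustration, not part of Modin or pandas, and the workload is a stand-in.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


durations = {}  # hypothetical container for the measured times

# Stand-in workloads; in the notebook these would be the pandas and Modin calls.
with timed("first", durations):
    sum(range(1_000_000))
with timed("second", durations):
    sum(range(1_000_000))

for label, seconds in durations.items():
    print("Time for {}: {} seconds".format(label, round(seconds, 3)))
```

`time.perf_counter` is used here because it is intended for measuring short intervals; the notebook's `time.time()` works as well for operations that take seconds.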
We will be using a version of this data already in S3, originally posted in this blog post: https://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes
Size: ~1.8GB
path = "s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv"
Optional: The dataset takes a while to download. If you prefer to download the file once and read it locally, you can run the following code in the notebook:
# [Optional] Download data locally. This may take a few minutes to download.
# import urllib.request
# url_path = "https://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-01.csv"
# urllib.request.urlretrieve(url_path, "taxi.csv")
# path = "taxi.csv"
pandas.read_csv
start = time.time()
pandas_df = pandas.read_csv(path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)
end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))
This is a good time to chat with your neighbor.
Discussion topics
modin.pandas.read_csv
start = time.time()
modin_df = pd.read_csv(path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)
end = time.time()
modin_duration = end - start
print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))
printmd("### Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))
pandas_df
modin_df
In pandas, a reduce is an operation such as sum or count: it computes a summary statistic over the rows or columns. We will be using count.
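To make the idea of a reduce concrete, here is a small sketch on a made-up three-row frame (the column names `fare` and `tip` are illustrative, not taken from the taxi dataset). It uses plain pandas; the same calls are expected to behave the same way with `import modin.pandas as pd`.

```python
import pandas

# Tiny made-up frame with one missing value in each column.
df = pandas.DataFrame({"fare": [10.0, None, 7.5], "tip": [1.0, 2.0, None]})

# count() reduces each column to its number of non-null entries.
col_counts = df.count()

# With axis=1 the reduction runs across each row instead.
row_counts = df.count(axis=1)

# sum() is another reduce; missing values are skipped by default.
totals = df.sum()

print(col_counts)
print(row_counts)
print(totals)
```

In each case the result has one entry per column (or per row), which is what distinguishes a reduce from the shape-preserving operations later in this exercise.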
start = time.time()
pandas_count = pandas_df.count()
end = time.time()
pandas_duration = end - start
print("Time to count with pandas: {} seconds".format(round(pandas_duration, 3)))
start = time.time()
modin_count = modin_df.count()
end = time.time()
modin_duration = end - start
print("Time to count with Modin: {} seconds".format(round(modin_duration, 3)))
printmd("### Modin is {}x faster than pandas at `count`!".format(round(pandas_duration / modin_duration, 2)))
pandas_count
modin_count
In pandas, map operations make a single pass over the data and do not change its shape. Operations like isnull and applymap fall into this category. We will be using isnull.
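As a quick illustration of "does not change its shape", the sketch below runs isnull on a made-up frame and checks that the result has the same dimensions as the input. The frame and its column names are assumptions for illustration only; plain pandas is used, and the same call is expected to work on a Modin dataframe.

```python
import pandas

# Tiny made-up frame with one missing value in each column.
df = pandas.DataFrame({"fare": [10.0, None, 7.5], "tip": [1.0, 2.0, None]})

# isnull() visits every cell once and returns a boolean frame of the same shape.
mask = df.isnull()

print(mask)
print("Same shape as input:", mask.shape == df.shape)
```

Because each output cell depends only on the matching input cell, an operation like this can be applied to separate partitions of the data independently.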
start = time.time()
pandas_isnull = pandas_df.isnull()
end = time.time()
pandas_duration = end - start
print("Time to isnull with pandas: {} seconds".format(round(pandas_duration, 3)))
start = time.time()
modin_isnull = modin_df.isnull()
end = time.time()
modin_duration = end - start
print("Time to isnull with Modin: {} seconds".format(round(modin_duration, 3)))
printmd("### Modin is {}x faster than pandas at `isnull`!".format(round(pandas_duration / modin_duration, 2)))
pandas_isnull
modin_isnull
Sometimes we want to apply a function to a single column of our dataset. Here we round every value in the trip_distance column.
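The following sketch shows what `apply(round)` does on a single column, using a made-up Series rather than the taxi data. It calls Python's built-in `round` once per element; plain pandas is used here, and the same call is expected to work on a Modin Series.

```python
import pandas

# Made-up distances standing in for the trip_distance column.
distances = pandas.Series([0.4, 1.6, 2.2], name="trip_distance")

# apply() calls the given function on each element and returns a new Series
# with the same index and length as the input.
rounded = distances.apply(round)

print(rounded)
```

For plain rounding, the vectorized `Series.round()` method is usually the faster choice in pandas; `apply` is used in this exercise because it stands in for any per-element Python function.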
start = time.time()
rounded_trip_distance_pandas = pandas_df["trip_distance"].apply(round)
end = time.time()
pandas_duration = end - start
print("Time to apply on one column with pandas: {} seconds".format(round(pandas_duration, 3)))
start = time.time()
rounded_trip_distance_modin = modin_df["trip_distance"].apply(round)
end = time.time()
modin_duration = end - start
print("Time to apply on one column with Modin: {} seconds".format(round(modin_duration, 3)))
printmd("### Modin is {}x faster than pandas at `apply` on one column!".format(round(pandas_duration / modin_duration, 2)))
rounded_trip_distance_pandas
rounded_trip_distance_modin
It is common to need to add a new column to an existing dataframe. Here we show that this is significantly faster in Modin due to metadata management and an efficient zero-copy implementation.
start = time.time()
pandas_df["rounded_trip_distance"] = rounded_trip_distance_pandas
end = time.time()
pandas_duration = end - start
print("Time to add a column with pandas: {} seconds".format(round(pandas_duration, 3)))
start = time.time()
modin_df["rounded_trip_distance"] = rounded_trip_distance_modin
end = time.time()
modin_duration = end - start
print("Time to add a column with Modin: {} seconds".format(round(modin_duration, 3)))
printmd("### Modin is {}x faster than pandas at adding a column!".format(round(pandas_duration / modin_duration, 2)))
pandas_df
modin_df
Please move on to Exercise 3 when you are ready.