Back to Modin

For all ways to install Modin see official documentation at:

examples/spreadsheet/tutorial.ipynb

0.37.16.1 KB
Original Source

modin.spreadsheet

modin.spreadsheet is a Jupyter notebook widget that allows users to interact with Modin DataFrames in a spreadsheet-like fashion while taking advantage of the underlying capabilities of Modin. The widget makes it quick and easy to explore, sort, filter, edit data and export reproducible code.

This tutorial will showcase how to use modin.spreadsheet. Before starting, please install the required packages using pip install -r requirements.txt in the current directory. Then just run the cells; no editing required!

python
# Please install the required packages using `pip install -r requirements.txt` in the current directory
# For all ways to install Modin see official documentation at:
# https://modin.readthedocs.io/en/latest/installation.html
import modin.pandas as pd
import modin.spreadsheet as mss

Create a Modin DataFrame

The following cells creates a DataFrame using a NYC taxi dataset.

python
columns_names = [
        "trip_id", "vendor_id", "pickup_datetime", "dropoff_datetime", "store_and_fwd_flag",
        "rate_code_id", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude",
        "passenger_count", "trip_distance", "fare_amount", "extra", "mta_tax", "tip_amount",
        "tolls_amount", "ehail_fee", "improvement_surcharge", "total_amount", "payment_type",
        "trip_type", "pickup", "dropoff", "cab_type", "precipitation", "snow_depth", "snowfall",
        "max_temperature", "min_temperature", "average_wind_speed", "pickup_nyct2010_gid",
        "pickup_ctlabel", "pickup_borocode", "pickup_boroname", "pickup_ct2010",
        "pickup_boroct2010", "pickup_cdeligibil", "pickup_ntacode", "pickup_ntaname", "pickup_puma",
        "dropoff_nyct2010_gid", "dropoff_ctlabel", "dropoff_borocode", "dropoff_boroname",
        "dropoff_ct2010", "dropoff_boroct2010", "dropoff_cdeligibil", "dropoff_ntacode",
        "dropoff_ntaname", "dropoff_puma",
    ]
parse_dates=["pickup_datetime", "dropoff_datetime"]
python
df = pd.read_csv('s3://modin-datasets/trips_data.csv', names=columns_names,
                header=None, parse_dates=parse_dates)
python
df

Generate a spreadsheet widget with the DataFrame

mss.from_dataframe takes in a DataFrame, optional configuration options, and returns a SpreadsheetWidget, which contains all the logic for displaying the spreadsheet view of the DataFrame. The object returned will not be rendered unless displayed.

python
spreadsheet = mss.from_dataframe(df)

Displaying the Spreadsheet

The widget is displayed when the widget is returned by an input cell or passed to the display function e.g. display(spreadsheet). When displayed, the SpreadsheetWidget will generate a transformation history cell that contains a record of the transformations applied to the DataFrame unless the cell already exists or the feature is disabled.

Basic Usage

from_dataframe creates a copy of the input DataFrame, so changes do not alter the original DataFrame.

Filter - Each column can be filtered according to its datatype using the filter button to the right of the column header. Any number of columns can be filtered simultaneously.
Sort - Each column can be sorted by clicking on the column header. Assumptions on the order of the data should only be made according to the latest sort i.e. the 2nd last sort may not be in order even if grouped by the duplicates in the last sorted column.
Cell Edit - Double click on a cell to edit its value.
Add Row(toolbar) - Click on the Add Row button in the toolbar to duplicate the last row in the DataFrame.
Remove Row(toolbar) - Select row(s) on the spreadsheet and click the Remove Row button in the toolbar to remove them.
Reset Filters(toolbar) - Click on the Reset Filters button in the toolbar to remove all filters on the data.
Reset Sort(toolbar) - Click on the Reset Sort button in the toolbar to remove any sorting on the data.

Transformation History and Reproducible Code

The widget records the history of transformations, such as filtering, that occur on the spreadsheet. These transformations are updated in the spreadsheet transformation history cell as they happen and can be easily copied for reproducibility. The history can be cleared using the Clear History button in the toolbar.

Try making some changes to the spreadsheet!

python
spreadsheet

Exporting Changes

to_dataframe takes in a SpreadsheetWidget and returns a copy of the DataFrame reflecting the current state of the UI on the widget. Specifically, any filters, edits, or sorts will be applied on the returned Dataframe.

Export a DataFrame after making some changes on the spreadsheet UI

python
changed_df = mss.to_dataframe(spreadsheet)
python
changed_df

SpreadsheetWidget API

The API on SpreadsheetWidget allows users to replicate some of the functionality on the GUI, but also provides other functionality such as applying the transformation history on another DataFrame or getting the DataFrame that matches the spreadsheet state like to_dataframe.

python
# Duplicates the `Reset Filters` button
spreadsheet.reset_filters()
python
# Duplicates the `Reset Sort` button
spreadsheet.reset_sort()
python
# Duplicates the `Clear History` button
spreadsheet.clear_history()
python
# Gets the modified DataFrame that matches the changes to the spreadsheet
# This is the same functionality as `mss.to_dataframe`
spreadsheet.get_changed_df()

Retrieving and Applying Transformation History

The transformation history can be retrieved as a list of code snippets using the get_history API. The apply_history API will apply the transformations on the input DataFrame and return the resultant DataFrame.

python
spreadsheet.get_history()
python
another_df = df.copy()
spreadsheet.apply_history(another_df)

Additional Example

Here is another example of how to use from_dataframe with configuration options.

python
mss.from_dataframe(df, show_toolbar=False, grid_options={'forceFitColumns': False, 'editable': False, 'highlightSelectedCell': True})