Exercise 3: Not Implemented

examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_3.ipynb

<center><h2>Scale your pandas workflows by changing one line of code</h2></center>

GOAL: Learn what happens when a function is not yet supported in Modin as well as how to extend Modin's functionality using the DataFrame Algebra.

When functionality has not yet been implemented, we default to pandas

We convert the Modin dataframe to pandas to do the operation, then convert it back once it is finished. These operations carry a high overhead because of the communication involved, and they will take longer than the same operation in pandas.

When this happens, Modin issues a warning to tell the user that the operation will take longer than usual. For example, DataFrame.mask is not yet implemented, so a user who calls it will see this warning:

UserWarning: `DataFrame.mask` defaulting to pandas implementation.
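
The fallback described above can be pictured with a toy model. The sketch below is not Modin's implementation: `ToyPartitionedFrame`, its `_default_to_single_node` helper, and the list-of-values representation are hypothetical names chosen for illustration. It gathers all partitions into one object, runs the operation there, splits the result again, and emits a `UserWarning` in the same style as the message above.

```python
import warnings


class ToyPartitionedFrame:
    """Hypothetical stand-in for a partitioned dataframe (not Modin's API)."""

    def __init__(self, rows, num_partitions=4):
        # Split the values into roughly equal contiguous chunks.
        size = max(1, -(-len(rows) // num_partitions))  # ceiling division
        self.partitions = [rows[i:i + size] for i in range(0, len(rows), size)]
        self.num_partitions = num_partitions

    def to_single_node(self):
        # Gather every partition into one list: the costly communication step.
        return [value for part in self.partitions for value in part]

    def _default_to_single_node(self, op_name, func):
        # Warn, run the operation on the gathered data, then re-partition.
        warnings.warn(
            f"`{op_name}` defaulting to single-node implementation.", UserWarning
        )
        result = func(self.to_single_node())
        return ToyPartitionedFrame(result, self.num_partitions)

    def mask(self, cond):
        # Replace values where cond is true with None, via the fallback path.
        return self._default_to_single_node(
            "ToyPartitionedFrame.mask",
            lambda rows: [None if cond(v) else v for v in rows],
        )


frame = ToyPartitionedFrame(list(range(10)), num_partitions=3)

# Record the warning instead of printing it, so user code can react to it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    masked = frame.mask(lambda v: v < 5)

print(masked.to_single_node())
print(caught[0].message)
```

The same `warnings.catch_warnings(record=True)` pattern from the standard library can be used around a real call to check whether an operation fell back to pandas.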

Concept for exercise: Default to pandas

In this section of the exercise we will see first-hand how the runtime is affected by operations that are not implemented.

python
import modin.pandas as pd
import pandas
import numpy as np
import time

frame_data = np.random.randint(0, 100, size=(2**18, 2**8))
df = pd.DataFrame(frame_data).add_prefix("col")
python
pandas_df = pandas.DataFrame(frame_data).add_prefix("col")
python
modin_start = time.time()

print(df.mask(df < 50))

modin_end = time.time()
print("Modin mask took {} seconds.".format(round(modin_end - modin_start, 4)))
python
pandas_start = time.time()

print(pandas_df.mask(pandas_df < 50))

pandas_end = time.time()
print("pandas mask took {} seconds.".format(round(pandas_end - pandas_start, 4)))

Concept for exercise: Register custom functions

Modin's user-facing API is pandas, but it is possible that we do not yet support your favorite or most-needed functionality. A user-defined function may also run more efficiently if you declare in advance what type of function it is (e.g. map, reduce, etc.) using the DataFrame Algebra. In either case, you can register a custom function to be applied to your data.

Registering a custom function for all query compilers

To register a custom function for a query compiler, we first need to import it:

python
from modin.core.storage_formats.pandas.query_compiler import PandasQueryCompiler

The PandasQueryCompiler is responsible for defining and compiling the queries that Modin can operate on, and it is specific to the pandas storage format. Any query defined here must both be compatible with and result in a pandas.DataFrame. Many functionalities are implemented very simply, as you can see in the current code: Link.

If we want to register a new function, we need to understand what kind of function it is. In our example, we will implement kurtosis on the unary negation of the values in the dataframe, which is a map (unary negation of each cell) followed by a reduce. So we next import the function type so that we can use it in our definition:

python
from modin.core.dataframe.algebra import TreeReduce

Then we can just use the TreeReduce.register classmethod and assign it to the PandasQueryCompiler:

python
PandasQueryCompiler.neg_kurtosis = TreeReduce.register(lambda cell_value, **kwargs: ~cell_value, pandas.DataFrame.kurtosis)

We include **kwargs in the lambda function because the query compiler passes all keyword arguments to both the map and reduce functions.
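
The map-then-reduce pattern, and the reason both functions need to accept `**kwargs`, can be illustrated without Modin. The following is a minimal sketch under stated assumptions: `register_tree_reduce`, `partial_sum_of_squares`, and `combine` are hypothetical names, partitions are plain Python lists, and the example computes a sum of squares instead of kurtosis. It is not how `TreeReduce.register` is implemented.

```python
def register_tree_reduce(map_func, reduce_func):
    """Hypothetical helper: build a function that maps each partition, then reduces."""

    def apply(partitions, **kwargs):
        # Map phase: each partition is processed independently (parallelizable).
        mapped = [map_func(part, **kwargs) for part in partitions]
        # Reduce phase: combine the per-partition results into a single value.
        # The same keyword arguments are forwarded to both phases.
        return reduce_func(mapped, **kwargs)

    return apply


def partial_sum_of_squares(partition, skip_none=True, **kwargs):
    # Map function: the sum of squares within one partition.
    values = [v for v in partition if v is not None] if skip_none else partition
    return sum(v * v for v in values)


def combine(partials, **kwargs):
    # Reduce function: it ignores skip_none, but must still accept it.
    return sum(partials)


sum_of_squares = register_tree_reduce(partial_sum_of_squares, combine)

partitions = [[1, 2, None], [3, 4], [5]]
print(sum_of_squares(partitions, skip_none=True))  # 1 + 4 + 9 + 16 + 25 = 55
```

If `combine` did not accept `**kwargs`, the call would fail with a `TypeError` because `skip_none` is forwarded to it as well.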

Finally, we want to be able to call it from the DataFrame, so we define a method that forwards to the query compiler:

python
def neg_kurtosis_func(self, **kwargs):
    # The constructor allows you to pass in a query compiler as a keyword argument
    return self.__constructor__(query_compiler=self._query_compiler.neg_kurtosis(**kwargs))

pd.DataFrame.neg_kurtosis_custom = neg_kurtosis_func

And then you can use it like you usually would:

python
df.neg_kurtosis_custom()
python
from modin.core.storage_formats.pandas.query_compiler import PandasQueryCompiler
from modin.core.dataframe.algebra import TreeReduce
python
PandasQueryCompiler.neg_kurtosis_custom = TreeReduce.register(lambda cell_value, **kwargs: ~cell_value,
                                                             pandas.DataFrame.kurtosis)
python
from pandas._libs import lib
# The function signature came from the pandas documentation:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.kurtosis.html
def neg_kurtosis_func(self, axis=lib.no_default, skipna=True, level=None, numeric_only=None, **kwargs):
    # We need to specify the axis for the query compiler
    if axis in [None, lib.no_default]:
        axis = 0
    # The constructor allows you to pass in a query compiler as a keyword argument
    # Reduce dimension is used for reduces
    # We also pass all keyword arguments here to ensure correctness
    return self._reduce_dimension(
        self._query_compiler.neg_kurtosis_custom(
            axis=axis, skipna=skipna, level=level, numeric_only=numeric_only, **kwargs
        )
    )

pd.DataFrame.neg_kurtosis_custom = neg_kurtosis_func
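
Assigning a function to a class attribute, as in the cell above, works because Python looks up methods on the class at call time. Instances created before the assignment gain the method too. A minimal sketch, using a hypothetical `Frame` class in place of a real dataframe type:

```python
class Frame:
    """Hypothetical minimal class standing in for a dataframe type."""

    def __init__(self, values):
        self.values = values


# This instance exists before the new method is attached.
existing = Frame([1, 2, 3])


def negated_total(self, **kwargs):
    # `self` is bound automatically when the function is called on an instance.
    return -sum(self.values)


# Attach the function to the class under a new name.
Frame.negated_total_custom = negated_total

print(existing.negated_total_custom())  # -6
```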

Speed improvements

If we were to replicate this functionality using the pandas API, we would need to call df.applymap with our unary negation function and then call df.kurtosis on the result. Let's see how this compares with our new, custom function!

python
start = time.time()

print(pandas_df.applymap(lambda cell_value: ~cell_value).kurtosis())

end = time.time()
pandas_duration = end - start
print("pandas unary negation kurtosis took {} seconds.".format(pandas_duration))
python
start = time.time()

print(df.applymap(lambda x: ~x).kurtosis())

end = time.time()
modin_duration = end - start
print("Modin unary negation kurtosis took {} seconds.".format(modin_duration))
python
custom_start = time.time()

print(df.neg_kurtosis_custom())

custom_end = time.time()
modin_custom_duration = custom_end - custom_start
print("Modin neg_kurtosis_custom took {} seconds.".format(modin_custom_duration))
python
from IPython.display import Markdown, display

display(Markdown("### As expected, Modin is {}x faster than pandas when chaining the functions; however we see that our custom function is even faster than that - beating pandas by {}x, and Modin (when chaining the functions) by {}x!".format(round(pandas_duration / modin_duration, 2), round(pandas_duration / modin_custom_duration, 2), round(modin_duration / modin_custom_duration, 2))))
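
The timing cells above repeat the same start/stop pattern. As a sketch, a small helper could wrap it; `time_call` and `speedup` are hypothetical names, and the two lambdas are stand-in workloads, not the notebook's dataframe calls. The helper uses `time.perf_counter`, which is a monotonic clock intended for measuring short intervals.

```python
import time


def time_call(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return round(baseline_seconds / candidate_seconds, 2)


# Stand-in workloads (hypothetical); in the notebook these would be the
# pandas, chained Modin, and custom-function calls.
slow_result, slow_time = time_call(lambda: sum(i * i for i in range(200_000)))
fast_result, fast_time = time_call(lambda: sum(range(200_000)))

print(f"slow: {slow_time:.4f}s, fast: {fast_time:.4f}s")
print(f"speedup: {speedup(slow_time, fast_time)}x")
```

A single measurement like this is noisy, so repeating each call a few times and comparing the best or median time gives a more reliable ratio.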

Congratulations! You have just implemented new DataFrame functionality!

Consider opening a pull request: https://github.com/modin-project/modin/pulls

For a complete list of what is implemented, see the Supported APIs section.

Test your knowledge: add a custom function for another tree reduce that computes DataFrame.mad after squaring all of the values.

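See the pandas documentation for the correct signature: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mad.html. Note that DataFrame.mad was deprecated in pandas 1.5 and removed in pandas 2.0, so you may need the documentation for an older pandas release.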
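
For reference, the mean absolute deviation that `mad` computes is the average distance of each value from the mean. A plain-Python sketch of the quantity the exercise asks for (square first, then take the mean absolute deviation) may help when checking a result by hand. `squared_mad` is a hypothetical helper name operating on a single list of numbers, and it is not the Modin solution to the exercise.

```python
def squared_mad(values):
    """Mean absolute deviation of the squared values of one column."""
    # Map step: square every value.
    squared = [v * v for v in values]
    # Reduce step: average absolute distance from the mean.
    mean = sum(squared) / len(squared)
    return sum(abs(s - mean) for s in squared) / len(squared)


# Squares are 1, 4, 9; their mean is 14/3; the result is 26/9.
print(squared_mad([1, 2, 3]))
```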

python
modin_mad_custom_start = time.time()

# Implement your function here! Put the result of your custom squared `mad` in the variable `modin_mad_custom`
# Hint: Look at the kurtosis walkthrough above

modin_mad_custom = ...
print(modin_mad_custom)

modin_mad_custom_end = time.time()
python
# Evaluation code, do not change!
modin_mad_start = time.time()
modin_mad = df.applymap(lambda x: x**2).mad()
print(modin_mad)
modin_mad_end = time.time()

assert modin_mad_end - modin_mad_start > modin_mad_custom_end - modin_mad_custom_start, \
    "Your implementation was too slow, or you used the chaining functions approach. Try again"
assert modin_mad._to_pandas().equals(modin_mad_custom._to_pandas()), "Your result did not match the result of chaining the functions, try again"

Now that you are able to create custom functions, you know enough to contribute to Modin!

Please move on to Exercise 4 when you are ready.