Performance notes

docs/source/guides/performance.ipynb

In most cases, minimizing memory usage is Vaex's first priority, and performance comes second. This allows Vaex to work with very large datasets, without you shooting yourself in the foot.

However, this sometimes comes at the cost of performance.

Virtual columns

When we add a new column to a dataframe based on existing columns, Vaex will create a virtual column, e.g.:

```python
import vaex
import numpy as np
x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()
```

In this dataframe, x uses memory, while y does not: it will be evaluated in chunks when needed. To demonstrate the performance implications, let us compute with both columns, to force the evaluation.

```python
%%time
df.x.mean()
```

```python
%%time
df.y.mean()
```

From this, we can see that a similar computation (the mean) with a virtual column can be much slower, a penalty we pay for saving memory.
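To see where the slowdown comes from, here is a minimal plain-NumPy sketch of chunked evaluation (an illustration of the idea, not Vaex's actual implementation): the derived values never exist in memory all at once, but the expression has to be recomputed on every pass over the data.

```python
import numpy as np

def chunked_mean(x, expression, chunk_size=1_000_000):
    # Evaluate expression(chunk) per chunk and accumulate a running sum,
    # so only one chunk of the derived column is in memory at a time.
    total = 0.0
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        total += expression(chunk).sum()
    return total / len(x)

x = np.arange(10_000_000, dtype='float64')
y_expression = lambda c: np.log(c + 1) - np.log(np.abs(c**2 + 1))
mean_y = chunked_mean(x, y_expression)
```

The memory saving is the point: the full derived column (here 80 MB per 10 million float64 values) is never allocated, at the price of re-running the expression for every reduction.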

Materializing the columns

We can ask Vaex to materialize a column, or all virtual columns, using df.materialize:

```python
df_mat = df.materialize()
```

```python
%%time
df_mat.x.mean()
```

```python
%%time
df_mat.y.mean()
```

We now get equal performance for both columns.
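Conceptually, materialization trades memory back for speed: the expression is evaluated once and the result is kept as a real array. A plain-NumPy sketch of the idea (not Vaex's internals):

```python
import numpy as np

x = np.arange(10_000_000, dtype='float64')

# evaluate the expression once and keep the result in memory,
# at the cost of one extra float64 array the size of x
y = np.log(x + 1) - np.log(np.abs(x**2 + 1))

# subsequent reductions read the stored values directly; nothing is recomputed
mean_y = y.mean()
```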

Considerations for backends with multiple workers

As is often the case with web frameworks in Python, multiple workers are used, e.g. with gunicorn. If all workers were to materialize, a lot of memory would be wasted. There are two solutions to this issue:

Save to disk

Export the dataframe to disk in HDF5 or Arrow format as a pre-processing step, and let all workers access the same file. Due to memory mapping, the workers will share the same memory.

e.g.

```python
df.export('materialized-data.hdf5', progress=True)
```
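The sharing relies on the operating system's page cache: when several processes map the same file read-only, they reference the same physical pages. A small illustration of that mechanism with np.memmap (plain NumPy; the file name is hypothetical, not the Vaex file above):

```python
import numpy as np
import os
import tempfile

# write a plain binary file once, as the "pre-processing step"
path = os.path.join(tempfile.gettempdir(), "shared-example.dat")
np.arange(1_000_000, dtype='float64').tofile(path)

# each worker would run something like this; no private copy of the data
# is made, the OS maps the same file pages into every process
shared = np.memmap(path, dtype='float64', mode='r')
total = shared.sum()
```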

Materialize a single time

Gunicorn has the following command line flag:

  --preload             Load application code before the worker processes are forked. [False]

This lets gunicorn first run your app (a single time), allowing you to do the materialization step. After your script runs, gunicorn will fork, and all workers will share the same memory.

Tip:

A good idea could be to mix the two, and use Vaex's df.fingerprint method to cache the file to disk.

E.g.

```python
import vaex
import numpy as np
import os

x = np.arange(100_000_000, dtype='float64')
df = vaex.from_arrays(x=x)
df['y'] = (df['x'] + 1).log() - np.abs(df['x']**2 + 1).log()

filename = "vaex-cache-" + df.fingerprint() + ".hdf5"
if not os.path.exists(filename):
    df.export(filename, progress=True)
# always open the exported file, so even the first run uses
# the memory-mapped, materialized data
df = vaex.open(filename)
```

If the virtual columns change, rerunning will create a new cache file, and changing them back will reuse the previously generated cache file. This is especially useful during development.

In this case, it is still important to let gunicorn load the app a single time first (using the --preload flag), to avoid multiple workers doing the same work.