Caching

docs/source/guides/caching.ipynb


Vaex can cache task results, such as aggregations or the internal hashmaps used for groupby operations, to make recurring calculations much faster, at the cost of computing cache keys and storing/retrieving the cached values.
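The trade-off described above (pay once to compute and store a result under a key, then reuse it) is the standard memoization pattern. A minimal illustration of the idea using the standard library, not Vaex's own cache:

```python
import functools


@functools.cache
def expensive_aggregation(n):
    # Stand-in for an expensive task, e.g. a mean over many rows.
    return sum(range(n)) / n


expensive_aggregation(1_000_000)  # computed on the first call
expensive_aggregation(1_000_000)  # served from the cache on the second call
```

Vaex applies the same idea, but keys the cache on fingerprints of the data and the expression rather than on function arguments.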

Internally, Vaex computes fingerprints (e.g. hashes of the data, or of file paths and mtimes) to create cache keys that are stable across processes, so that restarting a process will most likely produce the same cache keys.
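A rough sketch of why path-and-mtime fingerprints survive a restart (this is an illustration of the idea only, not Vaex's actual fingerprinting code):

```python
import hashlib
import os


def file_fingerprint(path):
    """Derive a cache key from a file's absolute path and mtime.

    As long as the file is unchanged, any process computes the
    same key, so cached results remain valid across restarts.
    """
    stat = os.stat(path)
    key = f"{os.path.abspath(path)}:{stat.st_mtime_ns}"
    return hashlib.sha256(key.encode()).hexdigest()
```

Modifying the file updates its mtime, which changes the fingerprint and naturally invalidates any cached results derived from it.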

See the cache configuration documentation for the available settings.

Caches can be turned on globally like this:

```python
import vaex

df = vaex.datasets.titanic()
vaex.cache.memory();  # turn the cache on globally
```

One can verify that the cache is turned on via:

```python
vaex.cache.is_on()
```

The cache can be globally turned off again:

```python
vaex.cache.off()
vaex.cache.is_on()
```

The cache can also be turned on with a context manager, after which it will be turned off again. Here we use a disk cache. Disk cache is shared among processes, and is ideal for processes that restart, or when using Vaex in a web service with multiple workers. Consider the following example:

```python
with vaex.cache.disk(clear=True):
    print(df.age.mean())  # The very first time, the mean is actually computed
```

```python
# Outside of the context manager, the cache is turned off again
vaex.cache.is_on()
```

```python
with vaex.cache.disk():
    print(df.age.mean())  # The second time, the result is read from the cache
```

```python
vaex.cache.is_on()
```
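The enable-inside-a-block, restore-on-exit behavior shown above can be sketched with a plain context manager (a toy illustration of the pattern, not Vaex's implementation):

```python
import contextlib

_cache = None  # module-level cache backend; None means caching is off


def is_on():
    return _cache is not None


@contextlib.contextmanager
def memory_cache():
    """Enable a dict-backed cache for the duration of the block."""
    global _cache
    previous = _cache
    _cache = {}
    try:
        yield _cache
    finally:
        _cache = previous  # restore the prior state, even on error


with memory_cache():
    assert is_on()
assert not is_on()
```

Restoring the previous backend in `finally` is what makes nesting safe: an inner block can temporarily switch backends without clobbering the outer one.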