Caching

docs/source/guides/caching.ipynb


Vaex can cache task results, such as aggregations or the internal hashmaps used for groupby operations, to make recurring calculations much faster, at the cost of computing cache keys and storing/retrieving the cached values.
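The trade-off described above (pay once to compute and store a result under a key, then reuse it) is the standard memoization pattern. A minimal illustration of the idea using the standard library, not Vaex's own cache:

```python
import functools


@functools.cache
def expensive_aggregation(n):
    # Stand-in for an expensive task, e.g. a mean over many rows.
    return sum(range(n)) / n


expensive_aggregation(1_000_000)  # computed on the first call
expensive_aggregation(1_000_000)  # served from the cache on the second call
```

Vaex applies the same idea, but keys the cache on fingerprints of the data and the expression rather than on function arguments.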

Internally, Vaex computes fingerprints (e.g. hashes of the data, or of file paths and mtimes) to create cache keys that are stable across processes, so that restarting a process will most likely produce the same cache keys.
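A rough sketch of why path-and-mtime fingerprints survive a restart (this is an illustration of the idea only, not Vaex's actual fingerprinting code):

```python
import hashlib
import os


def file_fingerprint(path):
    """Derive a cache key from a file's absolute path and mtime.

    As long as the file is unchanged, any process computes the
    same key, so cached results remain valid across restarts.
    """
    stat = os.stat(path)
    key = f"{os.path.abspath(path)}:{stat.st_mtime_ns}"
    return hashlib.sha256(key.encode()).hexdigest()
```

Modifying the file updates its mtime, which changes the fingerprint and naturally invalidates any cached results derived from it.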

See the cache configuration documentation for the available settings.

Caches can be turned on globally like this:

```python
import vaex

df = vaex.datasets.titanic()
vaex.cache.memory();  # turn the cache on globally
```

One can verify that the cache is turned on via:

```python
vaex.cache.is_on()
```

The cache can be globally turned off again:

```python
vaex.cache.off()
vaex.cache.is_on()
```

The cache can also be turned on with a context manager, after which it will be turned off again. Here we use a disk cache. Disk cache is shared among processes, and is ideal for processes that restart, or when using Vaex in a web service with multiple workers. Consider the following example:

```python
with vaex.cache.disk(clear=True):
    print(df.age.mean())  # The very first time, the mean is actually computed
```

```python
# Outside of the context manager, the cache is turned off again
vaex.cache.is_on()
```

```python
with vaex.cache.disk():
    print(df.age.mean())  # The second time, the result is read from the cache
```

```python
vaex.cache.is_on()
```
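The enable-inside-a-block, restore-on-exit behavior shown above can be sketched with a plain context manager (a toy illustration of the pattern, not Vaex's implementation):

```python
import contextlib

_cache = None  # module-level cache backend; None means caching is off


def is_on():
    return _cache is not None


@contextlib.contextmanager
def memory_cache():
    """Enable a dict-backed cache for the duration of the block."""
    global _cache
    previous = _cache
    _cache = {}
    try:
        yield _cache
    finally:
        _cache = previous  # restore the prior state, even on error


with memory_cache():
    assert is_on()
assert not is_on()
```

Restoring the previous backend in `finally` is what makes nesting safe: an inner block can temporarily switch backends without clobbering the outer one.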