docs/source/guides/arrow.ipynb
Vaex supports Arrow. We will demonstrate Vaex with Arrow by taking a quick look at a large dataset that does not fit into memory. The NYC taxi dataset for the year 2015 contains about 150 million rows of information about taxi trips in New York, and is about 23GB in size. You can download it here:
If you want to convert it to the Arrow format, use the code below:
import vaex
ds_hdf5 = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.hdf5')
# this may take a while to export
ds_hdf5.export('./nyc_taxi2015.arrow')
!ls -alh /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
import vaex
Opening the file is instantaneous, since nothing is copied into memory. The data is only memory-mapped, a technique that reads data from disk only when it is needed.
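To make the idea of memory mapping concrete, here is a minimal sketch using only the Python standard library. It is not how Vaex opens Arrow files internally; the file name, the size `N`, and the helper `read_value` are made up for illustration. The point is that a single value can be read from a file on disk without loading the rest of the file.

```python
import mmap
import os
import struct
import tempfile
from array import array

# Illustrative data: one million float64 values (8 MB) in a temporary file.
N = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "values.bin")
with open(path, "wb") as f:
    array("d", range(N)).tofile(f)  # native byte order, 8 bytes per value


def read_value(path, index):
    """Return the float64 at position `index` via a memory map.

    Hypothetical helper: only the pages that are actually touched
    are read from disk by the operating system.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (value,) = struct.unpack_from("d", mm, index * 8)
            return value


print(read_value(path, 123_456))
```

Mapping the file is cheap regardless of its size, because the operating system only fetches the parts of the file that are accessed.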
%time df = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow')
df
As can be seen, this dataset contains 146 million rows. Using a heatmap, we can get a quick overview of what the data contains. The pickup locations nicely outline Manhattan.
df.viz.heatmap(df.pickup_longitude, df.pickup_latitude, f='log')
df.total_amount.minmax()
As can be seen from the total_amount column (how much people paid), this dataset contains outliers. From a quick 1d plot, we can see reasonable ways to filter the data.
df.plot1d(df.total_amount, shape=100, limits=[0, 100])
# filter the dataset
dff = df[(df.total_amount >= 0) & (df.total_amount < 100)]
This filtered dataset did not copy any data (otherwise it would have cost us about 23GB of RAM). Shallow copies of the data are made instead, and a boolean mask tracks which rows should be used.
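The idea of filtering through a mask rather than a copy can be sketched in plain Python. This is a conceptual illustration only, not Vaex's implementation; the class `FilteredView` and the sample values are hypothetical.

```python
from array import array


class FilteredView:
    """Hypothetical helper: a filtered view that shares columns with its parent."""

    def __init__(self, columns, mask):
        self.columns = columns  # the same dict of arrays, not a copy
        self.mask = mask        # one byte per row: 1 = keep, 0 = skip

    def __len__(self):
        # Number of rows that pass the filter.
        return sum(self.mask)

    def values(self, name):
        # Rows are selected only when they are actually requested.
        column = self.columns[name]
        return [v for v, keep in zip(column, self.mask) if keep]


# Illustrative data with two outliers (a negative and a very large amount).
columns = {"total_amount": array("d", [12.5, -3.0, 250.0, 8.0, 99.9])}
mask = bytearray(0 <= v < 100 for v in columns["total_amount"])
view = FilteredView(columns, mask)

print(len(view), view.values("total_amount"))
```

In this sketch the only extra memory is the mask itself (one byte per row), while the column data is shared between the original and the filtered view.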
dff['ratio'] = dff.tip_amount/dff.total_amount
The new column ratio does not trigger any computation yet; it only stores the expression and does not waste any memory. However, the new (virtual) column can be used in calculations as if it were a normal column.
dff.ratio.mean()
Our final result, the average tip as a fraction of the total amount, can be easily calculated for this large dataset without requiring an excessive amount of memory.
Since the data lives as Arrow arrays, we can pass it to other libraries such as pandas, or even to other processes.
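Why a virtual column needs so little memory can be illustrated with a small pure-Python sketch. It is an assumption-laden simplification, not Vaex's actual evaluation engine: the class `VirtualColumn`, its `chunk_size` parameter, and the sample values are hypothetical. The column stores only an expression, and the mean is accumulated chunk by chunk, so the full result column never exists in memory.

```python
from array import array


class VirtualColumn:
    """Hypothetical stand-in for a lazily evaluated column."""

    def __init__(self, columns, inputs, func):
        self.columns = columns  # shared source columns
        self.inputs = inputs    # names of the columns the expression uses
        self.func = func        # the stored expression, not yet evaluated

    def chunks(self, chunk_size):
        # Evaluate the expression one chunk of rows at a time.
        n = len(self.columns[self.inputs[0]])
        for start in range(0, n, chunk_size):
            stop = min(start + chunk_size, n)
            slices = [self.columns[name][start:stop] for name in self.inputs]
            yield [self.func(*row) for row in zip(*slices)]

    def mean(self, chunk_size=2):
        # Keep only a running sum and count between chunks.
        total, count = 0.0, 0
        for chunk in self.chunks(chunk_size):
            total += sum(chunk)
            count += len(chunk)
        return total / count


# Illustrative data, not taken from the taxi dataset.
columns = {
    "tip_amount": array("d", [2.0, 0.0, 5.0, 1.0]),
    "total_amount": array("d", [10.0, 8.0, 25.0, 4.0]),
}
ratio = VirtualColumn(columns, ["tip_amount", "total_amount"], lambda tip, total: tip / total)

print(ratio.mean())
```

Only one chunk of intermediate values is alive at any time, which is the property that keeps memory use flat as the number of rows grows.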
arrow_table = df.to_arrow_table()
arrow_table
# Although you can 'convert' (pass the data) into pandas,
# some memory will be wasted (at least an index will be created by pandas),
# so here we just pass a subset of the data
df_pandas = df[:10000].to_pandas_df()
df_pandas
If you want to learn more about Vaex, take a look at the tutorials to see what is possible.