Back to Vaex

Arrow

docs/source/guides/arrow.ipynb

4.19.02.9 KB
Original Source

Arrow

Vaex supports Arrow. We will demonstrate vaex+arrow by giving a quick look at a large dataset that does not fit into memory. The NYC taxi dataset for the year 2015 contains about 150 million rows containing information about taxi trips in New York, and is about 23GB in size. You can download it here:

In case you want to convert it to the arrow format, use the code below:

python
ds_hdf5 = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.hdf5')
# this may take a while to export
ds_hdf5.export('./nyc_taxi2015.arrow')
python
!ls -alh /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
python
import vaex

Opens instantly

Opening the file goes instantly, since nothing is being copied to memory. The data is only memory mapped, a technique that will only read the data when needed.

python
%time
df = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow')
python
df

Quick viz of 146 million rows

As can be seen, this dataset contains 146 million rows. Using plot, we can generate a quick overview what the data contains. The pickup locations nicely outline Manhattan.

python
df.viz.heatmap(df.pickup_longitude, df.pickup_latitude, f='log')
python
df.total_amount.minmax()

Data cleansing: outliers

As can be seen from the total_amount columns (how much people payed), this dataset contains outliers. From a quick 1d plot, we can see reasonable ways to filter the data

python
df.plot1d(df.total_amount, shape=100, limits=[0, 100])
python
# filter the dataset
dff = df[(df.total_amount >= 0) & (df.total_amount < 100)]

Shallow copies

This filtered dataset did not copy any data (otherwise it would have costed us about ~23GB of RAM). Shallow copies of the data are made instead and a booleans mask tracks which rows should be used.

python
dff['ratio'] = dff.tip_amount/dff.total_amount

Virtual column

The new column ratio does not do any computation yet, it only stored the expression and does not waste any memory. However, the new (virtual) column can be used in calculations as if it were a normal column.

python
dff.ratio.mean()

Result

Our final result, the percentage of the tip, can be easily calcualted for this large dataset, it did not require any excessive amount of memory.

Interoperability

Since the data lives as Arrow arrays, we can pass them around to other libraries such as pandas, or even pass it to other processes.

python
arrow_table = df.to_arrow_table()
arrow_table
python
# Although you can 'convert' (pass the data) in to pandas,
# some memory will be wasted (at least an index will be created by pandas)
# here we just pass a subset of the data
df_pandas = df[:10000].to_pandas_df()
df_pandas

Tutorial

If you want to learn more on vaex, take a look at the tutorials to see what is possible.