Gentle Intro to Vaex for Big Data
Vaex is a rising big data library for Python. We explore its strengths and limitations, and walk through some starter code to use it.
General Description of Vaex
As of now, there are various big data libraries available for Python. Dask and PySpark are well-known examples, and H2O is relatively more nascent. Vaex is another of these newer big data libraries; it developed solid functionality and started garnering more attention from users in 2019.
This is what the official documentation of Vaex says about the library:
“Vaex is a python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10⁹) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy, and lazy computations for best performance (no memory wasted).”[1]
Simply put, Vaex can be understood as the “Pandas for Big Data”. This leads me to Vaex’s biggest strength: a user-friendly API and syntax that are the same as, or very similar to, those of Pandas. This gives users a less steep learning curve for Vaex.
Its underlying principles are similar to those of other big data libraries: minimizing RAM usage and using lazy computations wherever possible. What does “lazy” mean in this context? Lazy means performing operations only when their results are actually needed, instead of the moment they are defined. This trait of Vaex is especially useful for feature engineering when the data is enormous.
Say there is data with two variables, “v1” and “v2”, and you want to create a new feature using those two variables. Simple operations, like creating a new variable equal to the sum of v1 and v2, would not be computationally too expensive. However, imagine you want to create a more convoluted feature, v3, defined by a computationally expensive formula involving v1 and v2. Then the “laziness” of Vaex comes into play. The actual computation to create v3 would not happen when the line that defines v3 is run. Only when v3 is actually used later in the script will that computation happen. This saves a lot of time and memory!
Comparing time it takes to read in data
Installation
pip install --upgrade vaex
Using Conda, you can install Vaex via the following:
conda install -c conda-forge vaex
Making artificial data for tutorial
import numpy as np
import pandas as pd

n_rows = 30000000  # 30 million rows
n_cols = 10  # 10 variables
df = pd.DataFrame(np.random.randint(100000000, 1000000000, size=(n_rows, n_cols)),
                  columns=['c%d' % i for i in range(n_cols)])
df.info(memory_usage='deep')
## 2.2GB of artificial data
2.2GB worth of artificial data was created, and we save it to compare the read-in time of a CSV file in pandas vs. Vaex!
import pandas as pd

%%time
# Save the artificial data to read in using both pandas and vaex later
df.to_csv('data.csv', index=False)
Comparing Read in / Load time of CSV File
%%time
pandas_df = pd.read_csv("data.csv", low_memory=False)
import vaex

%%time
vaex_df = vaex.from_csv("data.csv", copy_index=False)
%%time
vaex_chunk_df = vaex.from_csv("data.csv", copy_index=False, convert=True, chunk_size=5_000)
From the above results, we see that reading in the 2.2GB of artificial data with Vaex takes slightly less time than with pandas (38.3s vs. 42s). It actually takes far longer (4min 13s vs. 42s) if you read in the data in chunks. The gains don’t seem to be that big. What is going on here?
There are good articles that analyze the performance speed of Vaex in comparison to Pandas or other big data libraries (e.g. Dask, PySpark).
- https://towardsdatascience.com/how-to-analyse-100s-of-gbs-of-data-on-your-laptop-with-python-f83363dda94
- https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13
- https://www.kdnuggets.com/2021/05/vaex-pandas-1000x-faster.html
These articles show how efficient and fast Vaex can be at various operations, from reading in big data and merging, sorting, and joining, to calculating basic summary statistics like the mean.
It seems that Vaex’s performance as a library for “big data” backfires, or yields smaller gains than users expect, when it deals with smaller data for which pandas is sufficient. Here our artificial data was 2.2GB in memory, and maybe it wasn’t big enough to justify the use of Vaex. Let’s keep in mind that for datasets small enough for pandas to read in comfortably, just using pandas instead of other big data libraries may actually be more efficient.
Another article [5] compares the performance of read_csv across various data sizes and shows that Vaex’s speed might not beat Pandas when the dataset is not that big.
Basic Operations
Basic operations users can perform using pandas are equally possible using Vaex, often with the EXACT SAME SYNTAX. Functions like:
vaex_df.head(n)
vaex_df.head_and_tail_print(n)  # doesn't exist in pandas, but straightforward: prints the first n and last n lines
vaex_df.describe()
vaex_df = vaex_df.drop("c9")
vaex_df.groupby(by='gender').agg({'IQ':'mean'})
....
Within the umbrella library of Vaex, there are sub-libraries that are geared towards specific tasks of data science including visualizations and supervised learning.
For instance, there exists the vaex-ml library, one of the sub-libraries of Vaex, for Machine Learning and other related tasks such as dimensionality reduction and clustering.
The following is an implementation example of Vaex’s GradientBoostingClassifier using vaex.ml from the official documentation.
%%time
from vaex.ml.sklearn import Predictor
from sklearn.ensemble import GradientBoostingClassifier
import vaex.ml.datasets

iris_df = vaex.ml.datasets.load_iris()
features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
target = 'class_'
model = GradientBoostingClassifier(random_state=42)
vaex_model = Predictor(features=features, target=target, model=model, prediction_name='prediction')
vaex_model.fit(df=iris_df)
iris_df = vaex_model.transform(iris_df)
Other examples and operations of Vaex can be found in this full Kaggle tutorial of mine!
Vaex is a growing big data library for Python. The bigger the data you are dealing with, the bigger the gains when reading it in and performing computationally expensive operations such as merge, sort, join, and feature engineering. However, it still has limitations. One good example is Vaex’s inability to perform “multi-index merging”, which is common in Pandas. Still, I personally think it has a lot of potential with its simple and straightforward syntax. In addition, the fact that Vaex does not require knowledge of MapReduce or parallel computing, which is often necessary for other big data libraries like PySpark, makes it very attractive for users without such knowledge!
References
[1] What is Vaex? (2014), Vaex 4.1.0 documentation
[2] J. Veljanoski, How to analyse 100 GB of data on your laptop with Python (2019), Towards Data Science
[3] J. Alexander, Beyond Pandas: Spark, Dask, Vaex and other big data technologies battling head to head (2020), Towards Data Science
[4] A. Anis, Vaex: Pandas but 1000x faster (2021), Kdnuggets
[5] V. Dekanovsky, Is something better than pandas when the dataset fits the memory? (2021), Towards Data Science