# Vowpal Wabbit - Overview
Vowpal Wabbit (VW) is a machine learning system that pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning. VW is a popular choice in ad-tech due to its speed and cost efficacy. Furthermore, it includes many advances in the area of reinforcement learning (for instance, contextual bandits).
In PySpark, you can run the `VowpalWabbitClassifier` via:
```python
from synapse.ml.vw import VowpalWabbitClassifier

model = (VowpalWabbitClassifier(numPasses=5, args="--holdout_off --loss_function logistic")
         .fit(train))
```
Similarly, you can run the `VowpalWabbitRegressor`:
```python
from synapse.ml.vw import VowpalWabbitRegressor

model = (VowpalWabbitRegressor(args="--holdout_off --loss_function quantile -q :: -l 0.1")
         .fit(train))
```
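Once fitted, either model is a standard Spark ML transformer, so scoring follows the usual pattern. A minimal sketch, assuming a held-out `test` DataFrame (a hypothetical name) with the same schema as `train`:

```python
# Score a held-out DataFrame; `test` is assumed to contain the same
# features column the model was trained on.
predictions = model.transform(test)
predictions.select("prediction").show(5)
```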
You can pass command line parameters to VW via the `args` parameter, as documented in the VW Wiki.
For an end-to-end application, check out the VowpalWabbit notebook example.
VowpalWabbit on Spark uses an optimized JNI layer for efficient integration with Spark. The Java bindings can be found in the VW GitHub repo.
VW's command line tool uses a two-thread architecture (1x parsing/hashing, 1x learning) for learning and inference. To fluently embed VW into the Spark ML ecosystem, the following adaptations were made:
- The VW classifier/regressor operates on Spark's dense/sparse vectors.
- VW hashing is separated out into the `VowpalWabbitFeaturizer` transformer. It supports mapping a Spark DataFrame schema into VW's namespaces and sparse features.
- VW multi-pass training can be enabled using the `--passes 4` argument or the `setNumPasses` method. The cache file is automatically named.
- VW distributed training is set up transparently and can be controlled through the input DataFrame's number of partitions. Similar to LightGBM, all training instances must be running at the same time, so the maximum parallelism is restricted by the number of executors available in the cluster. Under the hood, VW's built-in spanning tree functionality is used to coordinate allreduce. Required parameters are determined and supplied to VW automatically. The spanning tree coordination process runs on the driver node.
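The featurizer and partition-based parallelism above can be sketched together as follows. This is a hedged example, not a prescription: the column names (`age`, `income`, `label`) and the partition count are assumptions for illustration.

```python
from synapse.ml.vw import VowpalWabbitFeaturizer, VowpalWabbitClassifier

# Hash the raw DataFrame columns into a single VW-style sparse features
# vector; "age" and "income" are assumed columns in `train`.
featurizer = VowpalWabbitFeaturizer(inputCols=["age", "income"],
                                    outputCol="features")
featurized = featurizer.transform(train)

# The partition count bounds distributed-training parallelism (capped by
# the number of available executors); 2 is an arbitrary illustrative value.
model = (VowpalWabbitClassifier(labelCol="label",
                                featuresCol="features",
                                args="--holdout_off --loss_function logistic")
         .fit(featurized.repartition(2)))
```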