Simple pandas to vowpalwabbit conversion tutorial


python
import pandas as pd
from vowpalwabbit.dftovw import DFtoVW
from vowpalwabbit import Workspace

Building simple examples using DFtoVW.from_column_names

Let's create the following pandas dataframe:

python
df = pd.DataFrame(
    [
        {
            "income": 0,
            "age": 27,
            "marital-status": "Separated",
            "education": "HS-grad",
            "occupation": "Handlers-cleaners",
            "hours-per-week": 25,
        },
        {
            "income": 1,
            "age": 34,
            "marital-status": "Married-civ-spouse",
            "education": "Bachelors",
            "occupation": "Prof-specialty",
            "hours-per-week": 40,
        },
        {
            "income": 0,
            "age": 44,
            "marital-status": "Never-married",
            "education": "Assoc-voc",
            "occupation": "Priv-house-serv",
            "hours-per-week": 25,
        },
        {
            "income": 1,
            "age": 38,
            "marital-status": "Married-civ-spouse",
            "education": "Bachelors",
            "occupation": "Prof-specialty",
            "hours-per-week": 60,
        },
        {
            "income": 0,
            "age": 34,
            "marital-status": "Married-civ-spouse",
            "education": "HS-grad",
            "occupation": "Other-service",
            "hours-per-week": 36,
        },
    ]
)

The examples are built with the class method DFtoVW.from_column_names, which takes the dataframe object (df) and the names of the columns to use as label (y) and features (x). Calling the convert_df method then performs the conversion to Vowpal Wabbit examples:

python
converter = DFtoVW.from_column_names(
    df=df,
    y="income",
    x=["age", "marital-status", "education", "occupation", "hours-per-week"],
)
examples = converter.convert_df()
examples

Note that the Vowpal Wabbit format for categorical features is feature_name=feature_value, whereas for numerical features the format is feature_name:feature_value. When using the DFtoVW class, the appropriate format is inferred from the types of the dataframe columns.
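To make the two formats concrete, here is a minimal pure-Python sketch that builds a VW-style text line from a dictionary. The helper to_vw_line is hypothetical and only mimics the naming convention described above. It is not how DFtoVW is implemented, and it ignores details such as namespaces, tags, and escaping.

```python
from numbers import Number


def to_vw_line(row, label, features):
    """Format one record as a VW-style text line.

    Illustrative helper only; it is not part of the vowpalwabbit package.
    """
    parts = []
    for name in features:
        value = row[name]
        if isinstance(value, Number):
            # Numerical feature: feature_name:feature_value
            parts.append(f"{name}:{value}")
        else:
            # Categorical feature: feature_name=feature_value
            parts.append(f"{name}={value}")
    return f"{row[label]} | " + " ".join(parts)


row = {"income": 0, "age": 27, "marital-status": "Separated", "hours-per-week": 25}
line = to_vw_line(row, label="income", features=["age", "marital-status", "hours-per-week"])
print(line)
```

The numerical columns (age, hours-per-week) are joined to their values with a colon, while the string column (marital-status) is joined with an equals sign.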

We then train the model on these examples:

python
model = Workspace(P=1, enable_logging=True)

for ex in examples:
    model.learn(ex)
model.finish()

Building more complex examples

The class method DFtoVW.from_column_names is a quick and simple way to build examples. For more control over how the examples are created, use the Feature class or the Namespace class to build the features, together with whichever of the available label classes (listed below) suits the task.

  • With the Namespace class (see https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Namespaces for what a namespace is), the name field sets the name of the namespace, and the features field takes a single Feature object or a list of Feature objects.

  • The Feature class has a value field, which is the name of the dataframe column. The feature can be renamed with the rename_feature field, and a specific type ("numerical" or "categorical") can be enforced with the as_type field.

Several label classes are available:

  • SimpleLabel for regression
  • MulticlassLabel and MultiLabel for classification
  • ContextualbanditLabel for contextual bandit problems

In the following example, we build two namespaces: one for the socio-demographic features and one for the job features.

python
from vowpalwabbit.dftovw import SimpleLabel, Namespace, Feature

ns_sociodemo = Namespace(
    features=[Feature(col) for col in ["age", "marital-status", "education"]],
    name="ns_sociodemo",
)
ns_job = Namespace(
    features=[Feature(col) for col in ["occupation", "hours-per-week"]], name="ns_job"
)
label = SimpleLabel("income")

converter_advanced = DFtoVW(df=df, namespaces=[ns_sociodemo, ns_job], label=label)
examples_advanced = converter_advanced.convert_df()
examples_advanced[:5]

We then train the model, this time also including interactions between the features of the two namespaces:

python
model_advanced = Workspace(
    # arg_str="--interactions ns_sociodemo:ns_job", P=1, enable_logging=True
    arg_str="--redefine a:=ns_job b:=ns_sociodemo -q ab ",
    P=1,
    enable_logging=True,
)

for ex in examples_advanced:
    model_advanced.learn(ex)

model_advanced.finish()
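To picture what a quadratic interaction between two namespaces adds, the following pure-Python sketch enumerates the feature pairs involved. It is only a conceptual illustration: Vowpal Wabbit builds the interaction terms internally on hashed features, and the "^"-joined names used here are a display convention chosen for this example.

```python
from itertools import product

# Column names of the two namespaces defined above.
sociodemo = ["age", "marital-status", "education"]
job = ["occupation", "hours-per-week"]

# One interaction term per (job feature, socio-demographic feature) pair.
interactions = [f"{a}^{b}" for a, b in product(job, sociodemo)]

for term in interactions:
    print(term)
```

With two job features and three socio-demographic features, each example gains 2 x 3 = 6 interaction terms in addition to its original features.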

Finally, we can retrieve the estimated weight associated with each feature in each namespace:

python
[
    (ns.name, feature.name, model_advanced.get_weight_from_name(feature.name, ns.name))
    for ns in [ns_job, ns_sociodemo]
    for feature in ns.features
]