python/docs/source/tutorials/python_simplified_dftovw_tuto.ipynb
import pandas as pd
from vowpalwabbit.dftovw import DFtoVW
from vowpalwabbit import Workspace
DFtoVW.from_column_names
Let's create the following pandas dataframe:
df = pd.DataFrame(
    [
        {
            "income": 0,
            "age": 27,
            "marital-status": "Separated",
            "education": "HS-grad",
            "occupation": "Handlers-cleaners",
            "hours-per-week": 25,
        },
        {
            "income": 1,
            "age": 34,
            "marital-status": "Married-civ-spouse",
            "education": "Bachelors",
            "occupation": "Prof-specialty",
            "hours-per-week": 40,
        },
        {
            "income": 0,
            "age": 44,
            "marital-status": "Never-married",
            "education": "Assoc-voc",
            "occupation": "Priv-house-serv",
            "hours-per-week": 25,
        },
        {
            "income": 1,
            "age": 38,
            "marital-status": "Married-civ-spouse",
            "education": "Bachelors",
            "occupation": "Prof-specialty",
            "hours-per-week": 60,
        },
        {
            "income": 0,
            "age": 34,
            "marital-status": "Married-civ-spouse",
            "education": "HS-grad",
            "occupation": "Other-service",
            "hours-per-week": 36,
        },
    ]
)
Build the examples with the class method DFtoVW.from_column_names, passing the dataframe (df), the label column (y), and the feature columns (x). Calling the convert_df method then performs the conversion to Vowpal Wabbit examples:
converter = DFtoVW.from_column_names(
    df=df,
    y="income",
    x=["age", "marital-status", "education", "occupation", "hours-per-week"],
)
examples = converter.convert_df()
examples
Note that the Vowpal Wabbit format for categorical features is feature_name=feature_value, whereas for numerical features it is feature_name:feature_value. When using the DFtoVW class, the appropriate format is inferred from the dataframe's column types.
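To make the two feature formats concrete, here is a minimal pure-Python sketch that formats one record by hand. The helper row_to_vw_line is hypothetical and not part of vowpalwabbit; it only imitates the type-based rule described above, and the exact string produced by DFtoVW may differ in details.

```python
def row_to_vw_line(row, label_key):
    """Format one record as a VW text example.

    Hypothetical helper for illustration only; not part of vowpalwabbit.
    """
    parts = []
    for name, value in row.items():
        if name == label_key:
            continue  # the label goes before the "|" separator, not among the features
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric:
            parts.append(f"{name}:{value}")  # numerical: name:value
        else:
            parts.append(f"{name}={value}")  # categorical: name=value
    return f"{row[label_key]} | " + " ".join(parts)


row = {
    "income": 0,
    "age": 27,
    "marital-status": "Separated",
    "education": "HS-grad",
    "occupation": "Handlers-cleaners",
    "hours-per-week": 25,
}
line = row_to_vw_line(row, label_key="income")
print(line)
```

The numerical columns (age, hours-per-week) use the colon form, while the string columns use the equals form.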
We then train the model on these examples:
model = Workspace(P=1, enable_logging=True)
for ex in examples:
    model.learn(ex)
model.finish()
The class method DFtoVW.from_column_names is a quick and simple way to build the examples. For more control over how the examples are created, use either the Feature class or the Namespace class to build the features, together with one of the available label classes (listed below), chosen according to the task.
When using the Namespace class (see https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Namespaces for what a namespace is), specify the name of the namespace with the name field and pass a single Feature object or a list of Feature objects to the features field.
The Feature class has a value field, which is the name of the column. The feature can be renamed with the rename_feature field, and a specific type ("numerical" or "categorical") can be enforced with the as_type field.
Regarding the labels, multiple classes are available:
- SimpleLabel for regression
- MulticlassLabel and MultiLabel for classification
- ContextualbanditLabel for contextual bandit problems

In the following example we build two namespaces: one for the socio-demographic features and one for the job features.
from vowpalwabbit.dftovw import SimpleLabel, Namespace, Feature
ns_sociodemo = Namespace(
    features=[Feature(col) for col in ["age", "marital-status", "education"]],
    name="ns_sociodemo",
)
ns_job = Namespace(
    features=[Feature(col) for col in ["occupation", "hours-per-week"]],
    name="ns_job",
)
label = SimpleLabel("income")
converter_advanced = DFtoVW(df=df, namespaces=[ns_sociodemo, ns_job], label=label)
examples_advanced = converter_advanced.convert_df()
examples_advanced[:5]
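In the VW text format, each namespace appears as its own section introduced by |name. The sketch below shows this grouping for the first row, using the same column split as ns_sociodemo and ns_job. The helper format_namespaces is hypothetical (it is not part of vowpalwabbit), and the strings returned by convert_df may differ in details.

```python
def format_namespaces(row, label_key, namespaces):
    """Group features into named VW namespaces.

    Hypothetical helper for illustration only; not part of vowpalwabbit.
    `namespaces` maps a namespace name to the list of columns it contains.
    """
    sections = []
    for ns_name, columns in namespaces.items():
        feats = []
        for col in columns:
            value = row[col]
            is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
            feats.append(f"{col}:{value}" if is_numeric else f"{col}={value}")
        # No space between "|" and the namespace name
        sections.append(f"|{ns_name} " + " ".join(feats))
    return f"{row[label_key]} " + " ".join(sections)


row = {
    "income": 0,
    "age": 27,
    "marital-status": "Separated",
    "education": "HS-grad",
    "occupation": "Handlers-cleaners",
    "hours-per-week": 25,
}
namespaces = {
    "ns_sociodemo": ["age", "marital-status", "education"],
    "ns_job": ["occupation", "hours-per-week"],
}
line = format_namespaces(row, "income", namespaces)
print(line)
```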
We train the model by also including interactions between the variables of the 2 namespaces:
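model_advanced = Workspace(
    # arg_str="--interactions ns_sociodemo:ns_job", P=1, enable_logging=True
    # -q identifies namespaces by their first character, and both names start
    # with "n", so --redefine first maps them to the distinct letters a and b.
    arg_str="--redefine a:=ns_job b:=ns_sociodemo -q ab ",
    P=1,
    enable_logging=True,
)
for ex in examples_advanced:
    model_advanced.learn(ex)
model_advanced.finish()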
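To illustrate what the quadratic interaction adds, the sketch below enumerates the feature pairs formed between the two namespaces for the first row. It is a conceptual illustration under simplifying assumptions: VW hashes each pair into its weight vector rather than storing names, and the dictionaries here are hand-written stand-ins, not objects from vowpalwabbit. Categorical features carry an implicit value of 1, and the value of an interaction feature is the product of the two feature values.

```python
from itertools import product

# Hand-written feature values for the first row (illustrative stand-ins).
job_features = {"occupation=Handlers-cleaners": 1.0, "hours-per-week": 25.0}
sociodemo_features = {
    "age": 27.0,
    "marital-status=Separated": 1.0,
    "education=HS-grad": 1.0,
}

# One interaction feature per (job feature, socio-demographic feature) pair;
# its value is the product of the two feature values.
interactions = {
    (job_name, socio_name): job_value * socio_value
    for (job_name, job_value), (socio_name, socio_value) in product(
        job_features.items(), sociodemo_features.items()
    )
}

for pair, value in interactions.items():
    print(pair, value)
```

With 2 job features and 3 socio-demographic features, this yields 6 additional interaction features per example on top of the 5 original ones.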
Finally, we can get the estimated weight associated with each feature in each namespace:
[
    (ns.name, feature.name, model_advanced.get_weight_from_name(feature.name, ns.name))
    for ns in [ns_job, ns_sociodemo]
    for feature in ns.features
]