Polars `Struct`s are the idiomatic way of working with multiple columns. Creating one is also a free operation: moving columns into a `Struct` does not copy any data!
For this section, let's start with a DataFrame that captures the average rating of a few movies across some states in the U.S.:
{{code_block('user-guide/expressions/structs','ratings_df',['DataFrame'])}}
--8<-- "python/user-guide/expressions/structs.py:setup"
--8<-- "python/user-guide/expressions/structs.py:ratings_df"
A common operation that leads to a `Struct` column is the ever so popular `value_counts` function, which is commonly used in exploratory data analysis. Checking the number of times a state appears in the data is done like so:
{{code_block('user-guide/expressions/structs','state_value_counts',['value_counts'])}}
--8<-- "python/user-guide/expressions/structs.py:state_value_counts"
This output can be quite unexpected, especially if you are coming from tools that do not have such a data type. We're not in peril though: to get back to a more familiar output, all we need to do is `unnest` the `Struct` column into its constituent columns:
{{code_block('user-guide/expressions/structs','struct_unnest',['unnest'])}}
--8<-- "python/user-guide/expressions/structs.py:struct_unnest"
!!! note "Why value_counts returns a Struct"
    Polars expressions always have a `Fn(Series) -> Series` signature, and `Struct` is thus the data type that allows us to provide multiple columns as input/output of an expression. In other words, all expressions have to return a `Series` object, and `Struct` allows us to stay consistent with that requirement.
Polars will interpret a `dict` sent to the `Series` constructor as a `Struct`:
{{code_block('user-guide/expressions/structs','series_struct',['Series'])}}
--8<-- "python/user-guide/expressions/structs.py:series_struct"
!!! note "Constructing Series objects"
    Note that `Series` here was constructed with the `name` of the series at the beginning, followed by the `values`. Providing the latter first is considered an anti-pattern in Polars, and must be avoided.
Let's say that we needed to obtain just the `movie` value in the `Series` that we created above. We can use the `field` method to do so:
{{code_block('user-guide/expressions/structs','series_struct_extract',['field'])}}
--8<-- "python/user-guide/expressions/structs.py:series_struct_extract"
What if we need to rename individual fields of a `Struct` column? We first convert the `rating_Series` object to a `DataFrame` so that we can view the changes easily, and then use the `rename_fields` method:
{{code_block('user-guide/expressions/structs','series_struct_rename',['rename_fields'])}}
--8<-- "python/user-guide/expressions/structs.py:series_struct_rename"
Let's get back to the `ratings` data. We want to identify cases where there are duplicates at a `Movie` and `Theatre` level. This is where the `Struct` datatype shines:
{{code_block('user-guide/expressions/structs','struct_duplicates',['is_duplicated', 'struct'])}}
--8<-- "python/user-guide/expressions/structs.py:struct_duplicates"
We can also identify the unique cases at this level with `is_unique`!
Suppose, given that we know there are duplicates, we want to choose which of them gets a higher priority. We define the `Count` of ratings to be more important than the actual `Avg_Rating` value, and only use the latter to break a tie. We can then do:
{{code_block('user-guide/expressions/structs','struct_ranking',['is_duplicated', 'struct'])}}
--8<-- "python/user-guide/expressions/structs.py:struct_ranking"
That's a pretty complex set of requirements done very elegantly in Polars!
Using `Struct`s to apply a function over multiple columns was discussed in the previous section on User Defined Functions.