web/pandas/pdeps/0008-inplace-methods-in-pandas.md
This PDEP proposes that:
inplace parameter will be removed from any method which can never update the
underlying values of a pandas object inplace or which alters the shape of the object,
and where the inplace=True option is only syntactic sugar for reassigning the result
to the calling DataFrame/Series.inplace parameter is only kept for those methods that can
modify the underlying values of a pandas object inplace, such as fillna or replace.inplace
keyword to avoid a copy of the data.inplace=True option:
self), instead of the current None.The inplace=True keyword has been a controversial topic for many years. It is generally seen (at least by several
pandas maintainers and educators) as bad practice and often unnecessary, but at the same time it is also widely used,
partly because of confusion around the impact of the keyword.
Generally, we assume that people use the keyword for the following reasons:
For the first reason: efficiency is an important aspect. However, in practice it is not always the case
that inplace=True improves anything. Some of the methods with an inplace keyword can actually work inplace, but
others still make a copy under the hood anyway. In addition, with the introduction of Copy-on-Write (PDEP-7), there are now other
ways to avoid making unnecessary copies by default (without needing to specify a keyword). The next section gives a
detailed overview of those different cases.
For the second reason: we are convinced that this is not worth it. While it might save some keystrokes (if you have a long variable name), this code style also has sufficient disadvantages that we think it is not worth providing "two ways" to achieve the same result:
inplace=Trueinplace keyword complicates type annotations (because the return value depends on the value of inplace)inplace=True gives code that mutates the state of an object and thus has side-effects. That can introduce
subtle bugs and is harder to debug.Finally, there are also methods that have a copy keyword instead of an inplace keyword (which also avoids copying
the data when copy=False, but returns a new object referencing the same data instead of updating the calling object),
adding to the inconsistencies. This keyword is also redundant now with the introduction of Copy-on-Write.
Given the above reasons, we are convinced that there is no need for neither the inplace nor the copy keyword, except
for a small subset of methods that can actually update data inplace. Removing those keywords will give a more
consistent and less confusing API. Removing the copy keyword is covered by PDEP-7 about Copy-on-Write,
and this PDEP will focus on the inplace keyword.
Thus, in this PDEP, we aim to standardize behavior across methods to make control of inplace-ness of methods consistent, and compatible with Copy-on-Write.
Note: there are also operations (not methods) that work inplace in pandas, such as indexing (
e.g. df.loc[0, "col"] = val) or inplace operators (e.g. df += 1). This is out of scope for this PDEP, as we focus on
the inplace behaviour of DataFrame and Series methods.
Many methods in pandas currently have the ability to perform an operation inplace. For example, some methods such
as DataFrame.insert only support inplace operations, while other methods use the inplace keyword to control
whether an operation is done inplace or not.
While we generally speak about "inplace" operations, this term is used in various context. Broadly speaking, for this PDEP, we can distinguish two kinds of "inplace" operations:
"values-inplace": an operation that updates the underlying values of a Series or DataFrame columns inplace (without making a copy of the array).
As illustration, an example of such a values-inplace operation without using a method:
:::python
df.loc[0, "col"] = val
"object-inplace": an operation that updates a pandas DataFrame or Series object inplace, but without updating existing column values inplace.
As illustration, an example of such an object-inplace operation without using a method:
:::python
df inplace, but without actuallydf.index = pd.Index(...)
df["col"] = new_values
Object-inplace operations, while not actually modifying existing column values, keep (a subset of) those columns and thus can avoid copying the data of those existing columns.
In addition, several methods supporting the inplace keyword cannot actually be done inplace (in either meaning)
because they make a copy as a
consequence of the operations they perform, regardless of whether inplace is True or not. This, coupled with the
fact that the inplace=True changes the return type of a method from a pandas object to None, makes usage of
the inplace keyword confusing and non-intuitive.
To summarize the status quo of inplace behavior of methods, we have divided methods that can operate inplace or have
an inplace keyword into 4 groups:
Group 1: Methods that always operate inplace (no user-control with inplace keyword)
| Method Name |
|---|
insert |
pop |
update |
isetitem |
This group encompasses both kinds of inplace: update can be values-inplace, while the others are object-inplace
(for example, although isetitem operates on the original pandas object inplace,
it will not change any existing values inplace; rather it will remove the values of the column being set, and insert new values).
Group 2: Methods that can modify the underlying data of the DataFrame/Series object ("values-inplace")
| Method Name |
|---|
where |
fillna |
replace |
mask |
interpolate |
ffill |
bfill |
clip |
These methods don't operate inplace by default, but can be done inplace with inplace=True if the dtypes are compatible
(e.g. the values replacing the old values can be stored in the original array without an astype). All those methods leave
the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of
the DataFrame or Series.
Group 3: Methods that can modify the DataFrame/Series object, but not the pre-existing values ("object-inplace")
| Method Name |
|---|
drop (dropping columns) |
rename |
rename_axis |
reset_index |
set_index |
These methods can change the structure of the DataFrame or Series, such as changing the shape by adding or removing columns, or changing the row/column labels (changing the index/columns attributes), but don't modify the existing underlying column data of the object.
All those methods make a copy of the full data by default, but can be performed object-inplace with
avoiding copying all data (currently enabled with specifying inplace=True).
Note: there are also methods that have a copy keyword instead of an inplace keyword (e.g. set_axis). This serves
a similar purpose (avoid copying all data), but those methods don't update the original object inplace and instead
return a new object referencing the same data.
Group 4: Methods that can never operate inplace
| Method Name |
|---|
drop (dropping rows) |
dropna |
drop_duplicates |
sort_values |
sort_index |
eval |
query |
Although these methods have the inplace keyword, they can never operate inplace, in either meaning, because the nature of the
operation requires copying (such as reordering or dropping rows). For those methods, inplace=True is essentially just
syntactic sugar for reassigning the new result to the calling DataFrame/Series.
Note: in the case of a "no-op" (for example when sorting an already sorted DataFrame), some of those methods might not
need to perform a copy and could be considered as "object-inplace" in that case.
This currently happens with Copy-on-Write (regardless of inplace), but this is considered an
implementation detail for the purpose of this PDEP.
The methods from group 1 (always inplace, no keyword) won't change behavior, and will remain always inplace.
For methods from group 4 (never inplace), the inplace keyword has no actual effect
(except for re-assigning to the calling variable) and is effectively syntactic sugar for
manually re-assigning. For this group, we propose to remove the inplace keyword.
For methods from group 3 (object-inplace), the inplace=True keyword can currently be
used to avoid a copy. However, with the introduction of Copy-on-Write, every operation
will potentially return a shallow copy of the input object by default (if the performed
operation does not require a copy of the data). This future default is therefore
equivalent to the behavior with inplace=True for those methods (minus the return
value).
For the above reasoning, we think there is no benefit of keeping the keyword around for
these methods. To emulate behavior of the inplace keyword, we can reassign the result
of an operation to the same variable:
:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
df = df.reset_index()
df.iloc[0, 1] = ...
All references to the original object will go out of scope when the result of the reset_index operation is re-assigned
to df. As a consequence, iloc will continue to operate inplace, and the underlying data will not be copied (with Copy-on-Write).
Group 2 (values-inplace) methods differ, though, since they modify the underlying data, and therefore can actually happen inplace:
:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
df.replace(to_replace=1, value=100, inplace=True)
Currently, the above updates df values-inplace, without requiring a copy of the data.
For this type of method, however, we can not emulate the above usage of inplace by
re-assigning:
:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
df = df.replace(to_replace=1, value=100)
If we follow the rules of Copy-on-Write1 where "any subset or returned
series/dataframe always behaves as a copy of the original, and thus never modifies the
original", then there is no way of doing this operation inplace by default, because the
original object df would be modified before the reference goes out of scope (pandas
does not know whether you will re-assign it to df or assign it to another variable).
That would violate the Copy-on-Write rules, and therefore the replace() method in the
example always needs to make a copy of the underlying data by default
For this case, an inplace=True option can have an actual benefit, i.e. allowing to
avoid a data copy. Therefore, we propose to keep the inplace argument for this
group of methods.
Summarizing for the inplace keyword, we propose to:
inplace keyword for this subset of methods (group 2) that can update the
underlying values inplace ("values-inplace")inplace keyword from all other methods that either can never work inplace
(group 4) or only update the object (group 3, "object-inplace", which can be emulated
with reassigning).inplace=True, should we silently copy or raise an error if the data has references?For those methods where we would keep the inplace=True option (group 2), there is a complication that actually operating inplace
is not always possible.
For example,
:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
df.replace(to_replace=1, value=100, inplace=True)
can be performed inplace.
This is only true if df does not share the values it stores with another pandas object. For example, the following
operations
:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
view = df[:]
# We can't operate inplace, because view would also be modified!
df.replace(to_replace=1, value=100, inplace=True)
would be incompatible with the Copy-on-Write rules when actually done inplace. In this case we can either
Raising an error here is problematic since oftentimes users do not have control over whether a method would cause a "
lazy copy" to be triggered under Copy-on-Write. It is also hard to fix, adding a copy() before calling a method
with inplace=True might actually be worse than triggering the copy under the hood. We would only copy columns that
share data with another object, not the whole object like .copy() would.
Therefore, we propose to silently copy when needed. The inplace=True option would thus mean "try inplace whenever possible", and not guarantee it is actually done inplace.
In the future, if there is demand for it, it could still be possible to add to option to raise a warning whenever this happens. This would be useful in an IPython shell/Jupyter Notebook setting, where the user would have the opportunity to delete unused references that are causing the copying to be triggered.
For example:
:::ipython
In [1]: import pandas as pd
In [2]: pd.set_option("mode.copy_on_write", True)
In [3]: ser = pd.Series([1,2,3])
In [4]: ser_vals = ser.values # Save values to check inplace-ness
In [5]: ser
Out[5]:
0 1
1 2
2 3
dtype: int64
In [6]: ser = ser[:] # Original series should go out of scope
In [7]: ser.iloc[0] = -1 # This should be inplace
In [8]: ser
Out[8]:
0 -1
1 2
2 3
dtype: int64
In [9]: ser_vals
Out[9]: array([1, 2, 3]) # It's not modified!
In [10]: Out[5] # IPython kept our series alive since we displayed it!
While there are ways to mitigate this2, it may be helpful to let the user know that an operation that they performed was not inplace, since it is possible to go out of memory because of this.
self) also when using inplace=True?One of the downsides of the inplace=True option is that the return type of those methods
depends on the value of inplace, and that method chaining does not work.
Those downsides are still relevant for the cases where we keep inplace=True.
To address this, we can have those methods return the object that was operated on
inplace when inplace=True.
Advantages:
Disadvantages:
df2 = df.method(inplace=True); assert df2 is df)inplace=TrueWe generally assume that changing to return self should not give much problems for
existing usage (typically, the current return value of None is not actively used).
Further, we think the advantages of simplifying return types and enabling methods chains
outweighs the special case of returning an identical object.
Therefore, we propose that for those methods with an inplace=True option, the calling object (self) gets returned.
Removing the inplace keyword is a breaking change, but since the affected behaviour is inplace=True, the default
behaviour when not specifying the keyword (i.e. inplace=False) will not change and the keyword itself can first be
deprecated before it is removed.
inplace keyword altogetherIn the past, it was considered to remove the inplace keyword entirely. This was because many methods with
the inplace keyword did not actually operate inplace, but made a copy and re-assigned the underlying values under
the hood, causing confusion and providing no real benefit to users.
Because a majority of the methods supporting inplace did not operate inplace, it was considered at the time to
deprecate and remove inplace from all methods, and add back the keyword as necessary.3
For methods where the operation actually can be done inplace (group 2), however, removing the inplace
keyword could give a significant performance regression when currently using this keyword with large
DataFrames. Therefore, we decided to keep the inplace keyword for this small subset of methods.
copy keyword instead of inplaceIt may seem more natural to standardize on the copy keyword instead of the inplace keyword, since the copy
keyword already returns a new object instead of None (enabling method chaining) and avoids a copy when it is set to False.
However, the copy keyword is not supported in any of the values-mutating methods listed in Group 2 above
unlike inplace, so semantics of future inplace mutation of values align better with the current behavior of
the inplace keyword, than with the current behavior of the copy keyword.
Furthermore, with the approved Copy-on-Write proposal, the copy keyword also has become superfluous. With Copy-on-Write
enabled, methods that return a new pandas object will always try to avoid a copy whenever possible, regardless of
a copy=False keyword. Thus, the Copy-on-Write PDEP proposes to actually remove the copy keyword from the methods
where it is currently used (so it would be strange to add this as a new keyword to the Group 2 methods).
Currently, when using copy=False in methods where it is supported, a new pandas object is returned as the result
of a method call (same as with copy=True), but with the values backing this object being shared with the calling
object when possible (but the calling object is never modified). With the proposed inplace behavior for Group 2 methods,
a potential copy=False option would return a new pandas object with identical values as the original object (that
was modified inplace, in contrast to current usage of copy=False), which may be confusing for users, and lead to
ambiguity with Copy on Write rules.
The future of the inplace keyword is something that has been debated a lot over the years.
It may be helpful to review those discussions (see links) 4 3 5 to better understand this PDEP.
The inplace keyword is widely used, and thus we need to take considerable time to
deprecate and remove this feature.
inplace keyword will be removed, we add a
DeprecationWarning in the first release after acceptance (2.2 if possible, otherwise
3.0)inplace keyword with the new behaviour
(returning self, working inplace when possible)When introducing the warning in 2.2 (or 3.0), users will already have the ability to
enable Copy-on-Write so they can rewrite their code in a way that avoids the deprecation
warning (remove the usage of inplace) while keeping the no-copy behaviour (which will
be the default with Copy-on-Write).