The suggestion is to add a NoRowIndex class. Internally, it would act a bit like
a RangeIndex, but some methods would be stricter. This would be one
step towards enabling users who do not want to think about indices to not need to.
The Index can be a source of confusion and frustration for pandas users. For example, consider the inputs:
In [37]: ser1 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 5])
In [38]: ser2 = pd.Series([10, 15, 20, 25], index=[1, 2, 3, 4])
Then:
it can be unexpected that adding Series with the same length (but different indices) produces NaNs in the result (https://stackoverflow.com/q/66094702/4451315):
In [41]: ser1 + ser2
Out[41]:
1 20.0
2 30.0
3 40.0
4 NaN
5 NaN
dtype: float64
concatenation, even with ignore_index=True, still aligns on the index (https://github.com/pandas-dev/pandas/issues/25349):
In [42]: pd.concat([ser1, ser2], axis=1, ignore_index=True)
Out[42]:
0 1
1 10.0 10.0
2 15.0 15.0
3 20.0 20.0
5 25.0 NaN
4 NaN 25.0
it can be frustrating to have to repeatedly call .reset_index() (https://twitter.com/chowthedog/status/1559946277315641345):
In [3]: ser1.reset_index(drop=True) + ser2.reset_index(drop=True)
Out[3]:
0 20
1 30
2 40
3 50
dtype: int64
If a user did not want to think about row labels (which they may have ended up with after slicing / concatenating operations),
then NoRowIndex would enable the above to work in a more intuitive
manner (details and examples to follow below).
This proposal deals exclusively with the NoRowIndex class. To allow users to fully "opt-out" of having to think
about row labels, the following could also be useful:
- a pd.set_option('mode.no_row_index', True) mode, which would default to creating new DataFrames and Series with NoRowIndex instead of RangeIndex;
- as_index options for methods which currently create an index (e.g. value_counts, .sum(), .pivot_table), to just insert a new column instead of creating an Index.
However, neither of the above will be discussed here.
The core pandas code would change as little as possible. The additional complexity should be handled
within the NoRowIndex object. It would act just like RangeIndex, but would be a bit stricter
in some cases:
- name could only be None;
- start could only be 0, step 1;
- when appending a NoRowIndex to another NoRowIndex, the result would still be NoRowIndex. Appending a NoRowIndex to any other index (or vice-versa) would raise;
- the NoRowIndex class would be preserved under slicing;
- a NoRowIndex could only be aligned with another Index if it's also NoRowIndex and if it's of the same length;
- DataFrame columns cannot be NoRowIndex (so transpose would need some adjustments when called on a NoRowIndex DataFrame);
- insert and delete should raise. As a consequence, if df is a DataFrame with a NoRowIndex, then df.drop with axis=0 would always raise;
- arithmetic operations (e.g. NoRowIndex(3) + 2) would always raise;
- when printing a DataFrame/Series with a NoRowIndex, the row labels would not be printed;
- a MultiIndex could not be created with a NoRowIndex as one of its levels.
Let's go into more detail for some of these. In the examples that follow, the NoRowIndex will be passed explicitly,
but this is not how users would be expected to use it (see "Usage and Impact" section for details).
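Some of the stricter rules listed above can be pictured with a small pure-Python stand-in. This is only a sketch: ToyNoRowIndex is a hypothetical name, it does not subclass anything in pandas, and the exception types are assumptions rather than something the proposal specifies.

```python
class ToyNoRowIndex:
    """Toy stand-in for the proposed NoRowIndex (hypothetical, not pandas code)."""

    def __init__(self, length, name=None):
        # The proposal fixes start=0 and step=1, so only a length is stored.
        if name is not None:
            raise ValueError("name can only be None")
        if length < 0:
            raise ValueError("length must be non-negative")
        self._length = length

    def __len__(self):
        return self._length

    def __repr__(self):
        return f"ToyNoRowIndex(len={self._length})"

    def insert(self, loc, item):
        raise TypeError("Cannot insert into ToyNoRowIndex")

    def delete(self, loc):
        raise TypeError("Cannot delete from ToyNoRowIndex")

    def __add__(self, other):
        # Assumed exception type; the proposal only says arithmetic would raise.
        raise TypeError("Arithmetic is not supported on ToyNoRowIndex")


idx = ToyNoRowIndex(3)
print(idx)  # ToyNoRowIndex(len=3)
```

Storing nothing but a length is what makes the name, start and step restrictions trivial to enforce in this toy version.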
If one has two DataFrames with NoRowIndex, then one would expect that concatenating them would
result in a DataFrame which still has NoRowIndex. To do this, the following rule could be introduced:
If appending a NoRowIndex of length y to a NoRowIndex of length x, the result will be a NoRowIndex of length x + y.
Example:
In [6]: df1 = pd.DataFrame({'a': [1, 2], 'b': [4, 5]}, index=NoRowIndex(2))
In [7]: df2 = pd.DataFrame({'a': [4], 'b': [0]}, index=NoRowIndex(1))
In [8]: df1
Out[8]:
a b
1 4
2 5
In [9]: df2
Out[9]:
a b
4 0
In [10]: pd.concat([df1, df2])
Out[10]:
a b
1 4
2 5
4 0
In [11]: pd.concat([df1, df2]).index
Out[11]: NoRowIndex(len=3)
Appending anything other than another NoRowIndex would raise.
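The appending rule amounts to adding lengths and rejecting everything else. A minimal sketch, assuming a hypothetical ToyNoRowIndex that only records its length (the TypeError is an assumed choice of exception):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyNoRowIndex:
    """Hypothetical label-free index that only knows its length."""

    length: int

    def append(self, other):
        # Anything other than another label-free index is rejected.
        if not isinstance(other, ToyNoRowIndex):
            raise TypeError("Can only append another ToyNoRowIndex")
        return ToyNoRowIndex(self.length + other.length)


combined = ToyNoRowIndex(2).append(ToyNoRowIndex(1))
print(combined)  # ToyNoRowIndex(length=3)
```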
If one has a DataFrame with NoRowIndex, then one would expect that a slice of it would still have
a NoRowIndex. This could be accomplished with:
If a slice of length x is taken from a NoRowIndex of length y, then one gets a NoRowIndex of length x. Label-based slicing would not be allowed.
Example:
In [12]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))
In [13]: df.loc[df['a']>1, 'b']
Out[13]:
5
6
Name: b, dtype: int64
In [14]: df.loc[df['a']>1, 'b'].index
Out[14]: NoRowIndex(len=2)
Slicing by label, however, would be disallowed:
In [15]: df.loc[0, 'b']
---------------------------------------------------------------------------
IndexError: Cannot use label-based indexing on NoRowIndex!
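The slicing rule can be sketched as a function that maps a positional selection to the length of the resulting label-free index. The function name and the choice to represent boolean masks as plain lists are assumptions made for illustration only.

```python
def length_after_selection(length, key):
    """Length of the label-free index left after a positional selection."""
    if isinstance(key, slice):
        # Positional slice: count the positions it covers.
        return len(range(*key.indices(length)))
    if isinstance(key, list) and all(isinstance(k, bool) for k in key):
        # Boolean mask: one entry per row, keep the True ones.
        if len(key) != length:
            raise IndexError("Boolean mask has the wrong length")
        return sum(key)
    # Everything else is treated as a label, which is not allowed.
    raise IndexError("Cannot use label-based indexing on NoRowIndex!")


print(length_after_selection(3, [False, True, True]))  # 2
print(length_after_selection(3, slice(1, None)))       # 2
```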
Note too that:
- other uses of .loc, such as boolean masks, would still be allowed (see F.A.Q);
- .iloc and .iat would keep working as before;
- .at would raise.
To minimise surprises, the rule would be:
A NoRowIndex can only be aligned with another NoRowIndex of the same length. Attempting to align it with anything else would raise.
Example:
In [1]: ser1 = pd.Series([1, 2, 3], index=NoRowIndex(3))
In [2]: ser2 = pd.Series([4, 5, 6], index=NoRowIndex(3))
In [3]: ser1 + ser2 # works!
Out[3]:
5
7
9
dtype: int64
In [4]: ser1 + ser2.iloc[1:] # errors!
---------------------------------------------------------------------------
TypeError: Cannot join NoRowIndex of different lengths
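Without labels, alignment reduces to a length check followed by element-wise work. A rough sketch using plain lists in place of label-free Series (add_without_labels is a hypothetical helper, not part of pandas or of the draft implementation):

```python
def add_without_labels(left, right):
    """Element-wise sum of two label-free columns, given as lists."""
    if len(left) != len(right):
        # Mirrors the error message shown above.
        raise TypeError("Cannot join NoRowIndex of different lengths")
    return [a + b for a, b in zip(left, right)]


print(add_without_labels([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
```

Raising on a length mismatch, instead of padding with NaN, is what keeps the result the same length as its inputs.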
This proposal deals exclusively with allowing users to not need to think about row labels. There's no suggestion to remove the column labels.
In particular, calling transpose on a NoRowIndex DataFrame
would error. The error would come with a helpful error message, informing
users that they should first set an index. E.g.:
In [4]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))
In [5]: df.transpose()
---------------------------------------------------------------------------
ValueError: Columns cannot be NoRowIndex.
If you got here via `transpose` or an `axis=1` operation, then you should first set an index, e.g.: `df.pipe(lambda _df: _df.set_axis(pd.RangeIndex(len(_df))))`
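The fix suggested in the error message already works in current pandas, where set_axis defaults to axis=0. The sketch below uses an ordinary DataFrame, since NoRowIndex does not exist; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Explicitly attach row labels before transposing.
with_index = df.set_axis(pd.RangeIndex(len(df)))
transposed = with_index.transpose()

print(list(transposed.columns))  # [0, 1, 2]
print(list(transposed.index))    # ['a', 'b']
```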
When printing an object with a NoRowIndex, the row labels would not be shown:
In [15]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3))
In [16]: df
Out[16]:
a b
1 4
2 5
3 6
Of the above changes, this may be the only one that would need implementing within
DataFrameFormatter / SeriesFormatter, as opposed to within NoRowIndex.
Users would not be expected to work with the NoRowIndex class itself directly.
Usage would probably involve a mode which would change the default_index
function to return a NoRowIndex rather than a RangeIndex.
Then, if a mode.no_row_index option was introduced and a user opted in to it with
pd.set_option("mode.no_row_index", True)
then the following would all create a DataFrame with a NoRowIndex (as they
all call default_index):
- df.reset_index();
- pd.concat([df1, df2], ignore_index=True);
- df1.merge(df2, on=col);
- df = pd.DataFrame({'col_1': [1, 2, 3]}).
Further discussion of such a mode is out-of-scope for this proposal. A NoRowIndex would
just be a first step towards getting there.
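To picture how a single option could affect every code path that asks for a default index, here is a toy sketch. It does not use pandas' real options machinery: the _options dict, set_option and this default_index are stand-ins, and the return values are placeholders for RangeIndex and NoRowIndex objects.

```python
# Hypothetical option store; pandas' real options machinery is not used here.
_options = {"mode.no_row_index": False}


def set_option(key, value):
    if key not in _options:
        raise KeyError(key)
    _options[key] = value


def default_index(n):
    """Return the row index that newly created objects would get."""
    if _options["mode.no_row_index"]:
        return ("NoRowIndex", n)  # placeholder for a NoRowIndex of length n
    return range(n)  # placeholder for RangeIndex(n)


print(default_index(3))  # range(0, 3)
set_option("mode.no_row_index", True)
print(default_index(3))  # ('NoRowIndex', 3)
```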
Draft pull request showing proof of concept: https://github.com/pandas-dev/pandas/pull/49693.
Note that implementation details could well change even if this PDEP were
accepted. For example, NoRowIndex would not necessarily need to subclass
RangeIndex, and it would not necessarily need to be accessible to the user
(df.index could well return None).
Q: Couldn't users just use RangeIndex? Why do we need a new class?
A: RangeIndex is not preserved under slicing and appending, e.g.:
In [1]: ser = pd.Series([1, 2, 3])
In [2]: ser[ser != 2].index
Out[2]: Int64Index([0, 2], dtype='int64')
If someone does not want to think about row labels and starts off
with a RangeIndex, they'll very quickly lose it.
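The same point can be checked by looking at the labels rather than the index class, whose exact type depends on the pandas version. A short sketch with illustrative variable names:

```python
import pandas as pd

ser = pd.Series([1, 2, 3])
print(isinstance(ser.index, pd.RangeIndex))  # True

filtered = ser[ser != 2]
# The surviving labels are the original positions, not 0..n-1.
print(filtered.index.tolist())  # [0, 2]

# Getting default labels back takes an explicit reset.
restored = filtered.reset_index(drop=True)
print(restored.index.tolist())  # [0, 1]
```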
Q: Are indices not really powerful?
A: Yes! And they're also confusing to many users, even experienced developers.
Often users are using .reset_index to avoid issues with indices and alignment.
Such users would benefit from being able to not think about indices
and alignment. Indices would be here to stay, and NoRowIndex would not be the
default.
Q: How could one switch a NoRowIndex DataFrame back to one with an index?
A: The simplest way would probably be:
df.set_axis(pd.RangeIndex(len(df)))
There's probably no need to introduce a new method for this.
Conversely, if the mode.no_row_index option was introduced, one could get rid of the index
simply by calling df.reset_index(drop=True).
Q: How would tz_localize and other methods which operate on the index work on a NoRowIndex DataFrame?
A: Same way they work on other NumericIndexs, which would typically be to raise:
In [2]: ser.tz_localize('UTC')
---------------------------------------------------------------------------
TypeError: index is not a valid DatetimeIndex or PeriodIndex
Q: Why not let transpose switch NoRowIndex to RangeIndex under the hood before swapping index and columns?
A: This is the kind of magic that can lead to surprising behaviour that's
difficult to debug. For example, df.transpose().transpose() would not
round-trip. It's easy enough to set an index, after all; it's better to "force" users
to be intentional about what they want and end up with fewer surprises later
on.
Q: What would df.sum(), and other methods which introduce an index, return?
A: Such methods would still set an index and would work the same way they
do now. There may be some way to change that (e.g. introducing as_index
arguments and introducing a mode to set its default) but that's out of scope
for this particular PDEP.
Q: How would a user opt-in to a NoRowIndex DataFrame?
A: This PDEP would only allow it via the constructor, passing
index=NoRowIndex(len(df)). A mode could be introduced to toggle
making that the default, but would be out-of-scope for the current PDEP.
Q: Would .loc stop working?
A: No. It would only raise if used for label-based selection. Other uses
of .loc, such as df.loc[:, col_1] or df.loc[boolean_mask, col_1], would
continue working.
Q: What's unintuitive about Series aligning indices when summing?
A: Not sure, but I once asked a group of experienced developers what the output of
ser1 = pd.Series([1, 1, 1], index=[1, 2, 3])
ser2 = pd.Series([1, 1, 1], index=[3, 4, 5])
print(ser1 + ser2)
would be, and nobody got it right.
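For reference, running the snippet in current pandas shows the alignment at work: the result is indexed by the union of the labels, and only label 3 appears in both Series, so it is the only row with a value. The name result is illustrative.

```python
import pandas as pd

ser1 = pd.Series([1, 1, 1], index=[1, 2, 3])
ser2 = pd.Series([1, 1, 1], index=[3, 4, 5])
result = ser1 + ser2

print(result.index.tolist())     # [1, 2, 3, 4, 5]
print(result.loc[3])             # 2.0
print(int(result.isna().sum()))  # 4
```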
After some discussions, it has become clear that there is not enough support for the proposal in its current state. In short, it would add too much complexity to justify the potential benefits: it would unacceptably increase the maintenance burden and the testing requirements, while the benefits would be minimal.
Concretely:
- beyond the NoRowIndex class itself, some extra logic would need to go into the pandas core codebase, which is already very complex and hard to maintain;
- df.sum() returns a Series with the column names in the index.
In order to make no-index the pandas default and have a chance of benefiting users, a more comprehensive set of changes would need to be made at the same time. This would require a proposal much larger in scope, and would be a much more radical change. It may be that this proposal will be revisited in the future, but in its current state (as an option) it cannot be accepted.
This has still been a useful exercise, though, as it has resulted in two related proposals (see below).
.value_counts behaviour change: https://github.com/pandas-dev/pandas/issues/49497
mode to a separate proposal)