This notebook was prepared by Donne Martin. Source and license info is on GitHub.

Pandas

Credits: The following are notes taken while working through Python for Data Analysis by Wes McKinney

Series
DataFrame
Reindexing
Dropping Entries
Indexing, Selecting, Filtering
Arithmetic and Data Alignment
Function Application and Mapping
Sorting and Ranking
Axis Indices with Duplicate Values
Summarizing and Computing Descriptive Statistics
Cleaning Data (Under Construction)
Input and Output (Under Construction)

python

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels. The data can be any NumPy data type and the labels are the Series' index.

Create a Series:

python

ser_1 = Series([1, 1, 2, -3, -5, 8, 13])
ser_1

Get the array representation of a Series:

python

ser_1.values

Index objects are immutable and hold the axis labels and metadata such as names and axis names.

Get the index of the Series:

python

ser_1.index

Create a Series with a custom index:

python

ser_2 = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2

Get a value from a Series:

python

ser_2[4] == ser_2['e']

Get a set of values from a Series by passing in a list:

python

ser_2[['c', 'a', 'b']]

Get values great than 0:

python

ser_2[ser_2 > 0]

Scalar multiply:

python

ser_2 * 2

Apply a numpy math function:

python

import numpy as np
np.exp(ser_2)

A Series is like a fixed-length, ordered dict.

Create a series by passing in a dict:

python

dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}
ser_3 = Series(dict_1)
ser_3

Re-order a Series by passing in an index (indices not found are NaN):

python

index = ['foo', 'bar', 'baz', 'qux']
ser_4 = Series(dict_1, index=index)
ser_4

Check for NaN with the pandas method:

python

pd.isnull(ser_4)

Check for NaN with the Series method:

python

ser_4.isnull()

Series automatically aligns differently indexed data in arithmetic operations:

python

ser_3 + ser_4

Name a Series:

python

ser_4.name = 'foobarbazqux'

Name a Series index:

python

ser_4.index.name = 'label'

python

ser_4

Rename a Series' index in place:

python

ser_4.index = ['fo', 'br', 'bz', 'qx']
ser_4

DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns. Each column can have a different type. DataFrames have both row and column indices and is analogous to a dict of Series. Row and column operations are treated roughly symmetrically. Columns returned when indexing a DataFrame are views of the underlying data, not a copy. To obtain a copy, use the Series' copy method.

Create a DataFrame:

python

data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1

Create a DataFrame specifying a sequence of columns:

python

df_2 = DataFrame(data_1, columns=['year', 'state', 'pop'])
df_2

Like Series, columns that are not present in the data are NaN:

python

df_3 = DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])
df_3

Retrieve a column by key, returning a Series:

python

df_3['state']

Retrive a column by attribute, returning a Series:

python

df_3.year

Retrieve a row by position:

python

df_3.ix[0]

Update a column by assignment:

python

df_3['unempl'] = np.arange(5)
df_3

Assign a Series to a column (note if assigning a list or array, the length must match the DataFrame, unlike a Series):

python

unempl = Series([6.0, 6.0, 6.1], index=[2, 3, 4])
df_3['unempl'] = unempl
df_3

Assign a new column that doesn't exist to create a new column:

python

df_3['state_dup'] = df_3['state']
df_3

Delete a column:

python

del df_3['state_dup']
df_3

Create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified):

python

pop = {'VA' : {2013 : 5.1, 2014 : 5.2},
       'MD' : {2014 : 4.0, 2015 : 4.1}}
df_4 = DataFrame(pop)
df_4

Transpose the DataFrame:

python

df_4.T

Create a DataFrame from a dict of Series:

python

data_2 = {'VA' : df_4['VA'][1:],
          'MD' : df_4['MD'][2:]}
df_5 = DataFrame(data_2)
df_5

Set the DataFrame index name:

python

df_5.index.name = 'year'
df_5

Set the DataFrame columns name:

python

df_5.columns.name = 'state'
df_5

Return the data contained in a DataFrame as a 2D ndarray:

python

df_5.values

If the columns are different dtypes, the 2D ndarray's dtype will accomodate all of the columns:

python

df_3.values

Reindexing

Create a new object with the data conformed to a new index. Any missing values are set to NaN.

python

df_3

Reindexing rows returns a new frame with the specified index:

python

df_3.reindex(list(reversed(range(0, 6))))

Missing values can be set to something other than NaN:

python

df_3.reindex(range(6, 0), fill_value=0)

Interpolate ordered data like a time series:

python

ser_5 = Series(['foo', 'bar', 'baz'], index=[0, 2, 4])

python

ser_5.reindex(range(5), method='ffill')

python

ser_5.reindex(range(5), method='bfill')

Reindex columns:

python

df_3.reindex(columns=['state', 'pop', 'unempl', 'year'])

Reindex rows and columns while filling rows:

python

df_3.reindex(index=list(reversed(range(0, 6))),
             fill_value=0,
             columns=['state', 'pop', 'unempl', 'year'])

Reindex using ix:

python

df_6 = df_3.ix[range(0, 7), ['state', 'pop', 'unempl', 'year']]
df_6

Dropping Entries

Drop rows from a Series or DataFrame:

python

df_7 = df_6.drop([0, 1])
df_7

Drop columns from a DataFrame:

python

df_7 = df_7.drop('unempl', axis=1)
df_7

Indexing, Selecting, Filtering

Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values.

python

ser_2

Select a value from a Series:

python

ser_2[0] == ser_2['a']

Select a slice from a Series:

python

ser_2[1:4]

Select specific values from a Series:

python

ser_2[['b', 'c', 'd']]

Select from a Series based on a filter:

python

ser_2[ser_2 > 0]

Select a slice from a Series with labels (note the end point is inclusive):

python

ser_2['a':'b']

Assign to a Series slice (note the end point is inclusive):

python

ser_2['a':'b'] = 0
ser_2

Pandas supports indexing into a DataFrame.

python

df_6

Select specified columns from a DataFrame:

python

df_6[['pop', 'unempl']]

Select a slice from a DataFrame:

python

df_6[:2]

Select from a DataFrame based on a filter:

python

df_6[df_6['pop'] > 5]

Perform a scalar comparison on a DataFrame:

python

df_6 > 5

Perform a scalar comparison on a DataFrame, retain the values that pass the filter:

python

df_6[df_6 > 5]

Select a slice of rows from a DataFrame (note the end point is inclusive):

python

df_6.ix[2:3]

Select a slice of rows from a specific column of a DataFrame:

python

df_6.ix[0:2, 'pop']

Select rows based on an arithmetic operation on a specific row:

python

df_6.ix[df_6.unempl > 5.0]

Arithmetic and Data Alignment

Adding Series objects results in the union of index pairs if the pairs are not the same, resulting in NaN for indices that do not overlap:

python

np.random.seed(0)
ser_6 = Series(np.random.randn(5),
               index=['a', 'b', 'c', 'd', 'e'])
ser_6

python

np.random.seed(1)
ser_7 = Series(np.random.randn(5),
               index=['a', 'c', 'e', 'f', 'g'])
ser_7

python

ser_6 + ser_7

Set a fill value instead of NaN for indices that do not overlap:

python

ser_6.add(ser_7, fill_value=0)

Adding DataFrame objects results in the union of index pairs for rows and columns if the pairs are not the same, resulting in NaN for indices that do not overlap:

python

np.random.seed(0)
df_8 = DataFrame(np.random.rand(9).reshape((3, 3)),
                 columns=['a', 'b', 'c'])
df_8

python

np.random.seed(1)
df_9 = DataFrame(np.random.rand(9).reshape((3, 3)),
                 columns=['b', 'c', 'd'])
df_9

python

df_8 + df_9

Set a fill value instead of NaN for indices that do not overlap:

python

df_10 = df_8.add(df_9, fill_value=0)
df_10

Like NumPy, pandas supports arithmetic operations between DataFrames and Series.

Match the index of the Series on the DataFrame's columns, broadcasting down the rows:

python

ser_8 = df_10.ix[0]
df_11 = df_10 - ser_8
df_11

Match the index of the Series on the DataFrame's columns, broadcasting down the rows and union the indices that do not match:

python

ser_9 = Series(range(3), index=['a', 'd', 'e'])
ser_9

python

df_11 - ser_9

Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:

python

df_10

python

ser_10 = Series([100, 200, 300])
ser_10

python

df_10.sub(ser_10, axis=0)

Function Application and Mapping

NumPy ufuncs (element-wise array methods) operate on pandas objects:

python

df_11 = np.abs(df_11)
df_11

Apply a function on 1D arrays to each column:

python

func_1 = lambda x: x.max() - x.min()
df_11.apply(func_1)

Apply a function on 1D arrays to each row:

python

df_11.apply(func_1, axis=1)

Apply a function and return a DataFrame:

python

func_2 = lambda x: Series([x.min(), x.max()], index=['min', 'max'])
df_11.apply(func_2)

Apply an element-wise Python function to a DataFrame:

python

func_3 = lambda x: '%.2f' %x
df_11.applymap(func_3)

Apply an element-wise Python function to a Series:

python

df_11['a'].map(func_3)

Sorting and Ranking

python

ser_4

Sort a Series by its index:

python

ser_4.sort_index()

Sort a Series by its values:

python

ser_4.sort_values()

python

df_12 = DataFrame(np.arange(12).reshape((3, 4)),
                  index=['three', 'one', 'two'],
                  columns=['c', 'a', 'b', 'd'])
df_12

Sort a DataFrame by its index:

python

df_12.sort_index()

Sort a DataFrame by columns in descending order:

python

df_12.sort_index(axis=1, ascending=False)

Sort a DataFrame's values by column:

python

df_12.sort_values(by=['d', 'c'])

Ranking is similar to numpy.argsort except that ties are broken by assigning each group the mean rank:

python

ser_11 = Series([7, -5, 7, 4, 2, 0, 4, 7])
ser_11 = ser_11.sort_values()
ser_11

python

ser_11.rank()

Rank a Series according to when they appear in the data:

python

ser_11.rank(method='first')

Rank a Series in descending order, using the maximum rank for the group:

python

ser_11.rank(ascending=False, method='max')

DataFrames can rank over rows or columns.

python

df_13 = DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],
                   'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],
                   'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})
df_13

Rank a DataFrame over rows:

python

df_13.rank()

Rank a DataFrame over columns:

python

df_13.rank(axis=1)

Axis Indexes with Duplicate Values

Labels do not have to be unique in Pandas:

python

ser_12 = Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
ser_12

python

ser_12.index.is_unique

Select Series elements:

python

ser_12['foo']

Select DataFrame elements:

python

df_14 = DataFrame(np.random.randn(5, 4),
                  index=['foo', 'foo', 'bar', 'bar', 'baz'])
df_14

python

df_14.ix['bar']

Summarizing and Computing Descriptive Statistics

Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are excluded unless the entire row or column is NA.

python

df_6

python

df_6.sum()

Sum over the rows:

python

df_6.sum(axis=1)

Account for NaNs:

python

df_6.sum(axis=1, skipna=False)

Cleaning Data (Under Construction)

Replace
Drop
Concatenate

python

from pandas import Series, DataFrame
import pandas as pd

Setup a DataFrame:

python

data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'population' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1

Replace

Replace all occurrences of a string with another string, in place (no copy):

python

df_1.replace('VA', 'VIRGINIA', inplace=True)
df_1

In a specified column, replace all occurrences of a string with another string, in place (no copy):

python

df_1.replace({'state' : { 'MD' : 'MARYLAND' }}, inplace=True)
df_1

Drop

Drop the 'population' column and return a copy of the DataFrame:

python

df_2 = df_1.drop('population', axis=1)
df_2

Concatenate

Concatenate two DataFrames:

python

data_2 = {'state' : ['NY', 'NY', 'NY', 'FL', 'FL'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'population' : [6.0, 6.1, 6.2, 3.0, 3.1]}
df_3 = DataFrame(data_2)
df_3

python

df_4 = pd.concat([df_1, df_3])
df_4

Input and Output (Under Construction)

Reading
Writing

python

from pandas import Series, DataFrame
import pandas as pd

Reading

Read data from a CSV file into a DataFrame (use sep='\t' for TSV):

python

df_1 = pd.read_csv("../data/ozone.csv")

Get a summary of the DataFrame:

python

df_1.describe()

List the first five rows of the DataFrame:

python

df_1.head()

Writing

Create a copy of the CSV file, encoded in UTF-8 and hiding the index and header labels:

python

df_1.to_csv('../data/ozone_copy.csv', 
            encoding='utf-8', 
            index=False, 
            header=False)

View the data directory:

python

!ls -l ../data/