A large portion of data analysis is taken up by preparation: loading, cleaning, transforming and rearranging. These tasks can take up more than 80% of an analyst's time, because the data stored in files or databases is often not in the right format for analysis. Researchers often do ad-hoc processing of data from one form to another using languages like R or Python. pandas, together with built-in Python features, provides a high-level, flexible and fast set of tools for manipulating data into the right form.
The way missing data is represented in pandas is imperfect but functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number). It is called a sentinel value and can be easily detected.
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
string_data.isnull()
This NA convention was adopted from R, where missing values are referred to as NA. In statistics, NA may refer either to data that does not exist or to data that exists but was not observed. When cleaning data, we should also analyse the missing data itself to identify data collection problems or potential bias caused by the missing values.
string_data[0] = None
string_data.isnull()
We always have the option to filter out missing data by hand using 'isnull' and boolean indexing, but 'dropna' can be pretty useful too. For a Series it returns the Series with only the non-null data and the corresponding index values. For a DataFrame it is a bit more complex: by default dropna drops any row that contains even one missing value. Passing "how='all'" drops only rows that are all NA. To drop columns instead, pass 'axis=1'.
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
data[data.notnull()]
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
cleaned
data.dropna(how='all')
data[4] = NA
data
data.dropna(axis=1, how='all')
A related way to filter DataFrame rows tends to concern time series data. To keep only rows containing a certain number of observations, use the 'thresh' argument: rows with fewer than thresh non-NA values are dropped.
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
df.dropna()
df.dropna(thresh=2)
Rather than dropping NAs and possibly discarding useful data along with them, we can also fill in the missing values in different ways. 'fillna' is the workhorse function for this: the constant we pass replaces the missing values. If we call fillna with a dict, we can use a different fill value for each column.
df.fillna(0)
df.fillna({1:0.5, 2: 0})
By default fillna returns a new object, but we can modify the existing object in-place with 'inplace=True'. The interpolation methods used for reindexing, such as 'ffill', can also be used with fillna. With a bit of creativity you can do much more, like filling with the mean or median value.
_ = df.fillna(0, inplace=True)
df
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
df.fillna(method='ffill')
df.fillna(method='ffill', limit=2)
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
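The same pattern works for the median; a minimal one-line sketch:
data.fillna(data.median())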
So far we have been concerned with handling missing data and rearranging data. Filtering, cleaning and other transformations are another class of important operations.
The DataFrame method 'duplicated' returns a boolean Series indicating whether each row is a duplicate (i.e. has been observed in a previous row). Relatedly, 'drop_duplicates' returns a DataFrame containing only the rows where the 'duplicated' array is False.
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1,1,2,3,3,4,4]})
data
data.duplicated()
data.drop_duplicates()
Both of the above methods consider all of the columns by default; you can also specify any subset of columns to detect duplicates on. By default both keep the first observed value combination; pass "keep='last'" to keep the last one instead.
data['v1'] = range(7)
data.drop_duplicates(['k1'])
data.drop_duplicates(['k1', 'k2'], keep='last')
We sometimes need to perform transformations based on the values in an array, Series or DataFrame column. The map method accepts a function or a dict-like object containing the mapping, and can be used to add or modify a column. Sometimes the column we map on has inconsistent casing relative to the keys in our mapping; in that case we can first convert all the values to lowercase, or simply pass map a function that does both steps for us.
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
lowercased = data['food'].str.lower()
lowercased
data['animal'] = lowercased.map(meat_to_animal)
data
data['food'].map(lambda x: meat_to_animal[x.lower()])
The fillna method is a special case of more general value replacement. While map can be used to modify a subset of values, 'replace' provides a simpler and more flexible way to do so. Passing the sentinel (or garbage) value followed by the replacement value creates a new object with the values replaced; for in-place replacement, pass "inplace=True" (see the sketch after the examples below).
data = pd.Series([1., -999., 2., -999., 3.])
data
data.replace(-999, np.nan)
To replace multiple values with a single value, pass a list of values followed by the substitute value. To use a different replacement for each value, pass a list of substitutes of the same length. You can also pass a dict mapping each value to its substitute. NOTE - 'data.replace' is different from 'data.str.replace'; the latter performs element-wise string substitution.
data.replace([-999, -1000], np.nan)
data.replace([-999, -1000], [np.nan, 0])
data.replace({-999: np.nan, -1000: 0})
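As mentioned above, 'replace' can also modify the Series in place instead of returning a new object. A quick sketch (note that with inplace=True the call itself returns None):
data.replace(-999, np.nan, inplace=True)
data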
Like values, axis labels can also be transformed by a function or mapping to produce new, differently labeled objects. We can also modify the axes in-place without creating a new data structure.
data = pd.DataFrame(np.arange(12).reshape((3,4)),
index = ['Ohio', 'Colorado', 'New York'],
columns = ['one', 'two', 'three', 'four'])
transform = lambda x: x[:4].upper()
data.index.map(transform)
data.index = data.index.map(transform)
data
To get a transformed version of a dataset without modifying the original, use 'rename'. It can be used with a function or with a dict-like object providing new values for a subset of the axis labels, and saves you from copying the DataFrame manually and then assigning its index and columns. To modify the dataset in-place, pass 'inplace=True'.
data.rename(index=str.title, columns=str.upper)
data.rename(index={'OHIO':"INDIANA"},
columns = {'three':'peekaboo'})
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
Continuous data is often discretized or otherwise separated into 'bins' for analysis. To bin a set of continuous data, use pandas' 'cut' function. In the example below we bin a set of ages into groups of 18 to 25, 26 to 35, 36 to 60 and 61 and older.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
cut returns a special Categorical object. The output describes the bin that each element falls in, and can be treated like an array of bin names. Internally it contains a 'categories' array specifying the distinct bin names along with a labeling of the 'ages' data in the 'codes' attribute.
cats.codes
cats.categories
pd.value_counts(cats)
The interval notation used by cut is consistent with standard mathematical notation: a parenthesis means that side is open (exclusive) and a square bracket means it is closed (inclusive). We can change which side is closed by passing 'right=False', and supply our own bin names by passing a list or array to the 'labels' option.
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
group_names = ['Youth', 'YoungAdult', 'MiddleAges', 'Senior']
pd.cut(ages, bins, labels=group_names)
Instead of specifying explicit bin edges, we can pass an integer number of bins to get that many equal-length bins computed from the minimum and maximum of the data. The 'precision' option limits the decimal precision of the bin edges; 'precision=2' limits it to two digits.
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
cut has a closely related function, 'qcut', which bins data based on sample quantiles. Depending on the distribution of the data, cut will usually not give each bin the same number of data points, but because qcut uses sample quantiles you will obtain roughly equal-size bins. We can also pass our own quantiles (numbers between 0 and 1, inclusive) to qcut.
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
pd.value_counts(cats)
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Filtering and transforming outliers is largely a matter of applying array operations. To find values exceeding a threshold, use boolean indexing, combined with functions like 'np.abs()' as needed. To select all rows having at least one value exceeding a threshold, use the 'any(1)' method on the boolean DataFrame. Values can also be set based on these criteria, which lets you cap values to an interval or threshold. 'np.sign()' returns 1 and -1 depending on whether the data is positive or negative, which is handy for capping.
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
col = data[2]
col[np.abs(col) > 3]
data[(np.abs(data) > 3).any(1)]
# Capping outside -3 to 3
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
np.sign(data).head()
Permuting (randomly reordering) a Series or the rows of a DataFrame is easy using 'numpy.random.permutation'. Calling it with the length of the axis you want to permute creates an array of integers indicating the new ordering. That array can then be used with iloc-based indexing or the equivalent 'take' function (the iloc variant is sketched after the example below).
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
sampler = np.random.permutation(5)
sampler
df
df.take(sampler)
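The iloc-based equivalent mentioned above is a one-liner:
df.iloc[sampler]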
Use the 'sample' method to select a random subset without replacement. To generate a sample with replacement (i.e. allowing repeat choices), pass 'replace=True'.
df.sample(n=3)
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
Another type of transformation is converting a categorical variable into a 'dummy' or 'indicator' matrix. If a column has k distinct values, we can derive a matrix or DataFrame with k columns of 1s and 0s; pandas' 'get_dummies' function does this. You may want to add a prefix to the columns of the indicator DataFrame before merging it with other data, which is what get_dummies' 'prefix' argument is for.
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
pd.get_dummies(df['key'])
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
If a row can belong to multiple categories, creating the dummy variables is more complicated. Adding indicator variables for a column like the movie genres below takes a bit of wrangling. First we extract the list of unique genres in the dataset.
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
header=None, names=mnames)
movies[:10]
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
One way to build the indicator DataFrame is to start with a DataFrame of all zeros, then iterate through the rows and set the appropriate entries of 'dummies' to 1. To do this, we use 'dummies.columns.get_indexer' to compute the column indices for each genre, and then use iloc to set the values at those indices. Finally we combine the indicator DataFrame with the original DataFrame.
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen = movies.genres[0]
gen.split('|')
dummies.columns.get_indexer(gen.split('|'))
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
NOTE - For much larger data, this method of constructing indicator variables is not especially fast. It is better to write a lower-level function that writes directly to a NumPy array and then wraps the result in a DataFrame (a sketch of that idea follows the next example). Another useful recipe in statistical applications is to combine get_dummies with a discretization function like cut, which indicates whether a value falls in each bin. In the example below we set the random seed to make the output deterministic.
np.random.seed(12345)
values = np.random.rand(10)
values
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
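Returning to the performance note above, here is a minimal sketch of the lower-level approach, assuming the same movies data; 'make_indicators' is a hypothetical helper, not a pandas function:
def make_indicators(values, sep='|'):
    # split each entry on the separator once, up front
    splitted = [x.split(sep) for x in values]
    # distinct categories across all rows
    categories = pd.unique(np.concatenate(splitted))
    col_index = {cat: i for i, cat in enumerate(categories)}
    # write 1s directly into a preallocated NumPy array
    indicators = np.zeros((len(values), len(categories)), dtype=np.uint8)
    for row, cats in enumerate(splitted):
        for cat in cats:
            indicators[row, col_index[cat]] = 1
    # wrap the result in a DataFrame only at the end
    return pd.DataFrame(indicators, columns=categories, index=values.index)
make_indicators(movies.genres).iloc[0]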
Python is a popular language for raw data manipulation partly because of its ease of use for string and text processing. Most simple text operations can be handled with string objects' built-in methods; for more complex pattern matching and text manipulation, regular expressions are needed. pandas adds to this by enabling you to apply string and regex functions to whole arrays of data.
In many string manipulation scenarios, the built-in string methods are sufficient. A string can be broken into pieces on a separator using 'split', which is often combined with 'strip' to trim whitespace, including line breaks. Substrings can be concatenated with the '+' operator, but a faster and more Pythonic way is to pass a list or tuple of substrings to the 'join' method of the 'stitching' string.
val = 'a,b, guido'
val.split(',')
pieces = [x.strip() for x in val.split(',')]
pieces
first, second, third = pieces
first + '::' + second + '::' + third
'::'.join(pieces)
The 'in' keyword is the best way to detect a substring, though 'index' and 'find' can also be used. The major difference between find and index is that index raises an exception if the substring is not found, whereas find returns -1. 'count' returns the number of occurrences of a particular substring, and 'replace' substitutes occurrences of one pattern for another; it is commonly used to delete patterns by passing an empty string as the replacement.
'guido' in val
val.index(',')
val.find(':')
# val.index(':')
val.count(',')
val.replace(',', '::')
val.replace(',', '')
Regular expressions provide a flexible way to search or match (often complex) string patterns in text. A single expression, commonly called a 'regex', is a string formed according to the regular expression language. Python's built-in 're' module applies regular expressions to strings; its functions fall into three related categories: pattern matching, substitution and splitting. We first describe a pattern we want to locate in the text, which can then be used for all of these purposes.
Once we define a regex, we can split a piece of text with 're.split'. When this is called, the regex is first compiled and then its split method is called on the passed text. We can compile the regex ourselves with 're.compile', which creates a reusable regex object. Compiling the regex beforehand is highly recommended if it will be applied to many strings, as it saves CPU cycles. NOTE - To avoid awkward escaping of backslashes in a regex, use raw string literals, e.g. r'C:\x' instead of 'C:\\x'.
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)
regex = re.compile(r'\s+')
regex.split(text)
The 'findall' method returns a list of all the matches of the regex in a string. To get only the first match, use the 'search' method. 'match' only matches at the beginning of the string, i.e. it checks whether the string starts with the pattern. Passing 're.IGNORECASE' makes the regex case-insensitive.
regex.findall(text)
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
'findall' returns a list of the matching strings. 'search' returns a special match object for the first match, which tells us only the start and end positions of the match within the string. 'match' returns None if the pattern does not match at the beginning of the string, as is the case here. 'sub' returns a new string with occurrences of the pattern replaced by the given replacement string.
regex.findall(text)
m = regex.search(text)
m
text[m.start():m.end()]
print(regex.match(text))
print(regex.sub('REDACTED', text))
To find the pattern and at the same time segment each match into its parts, put parentheses around the parts of the pattern to form groups. When the pattern has groups, 'findall' returns a list of tuples. A match object exposes the groups of a match through its 'groups' method. 'sub' can also access the groups of each match with the special symbols '\1', '\2', and so on, where '\1' corresponds to the first matched group.
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
m = regex.match('wes@bright.net')
m.groups()
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Missing data in string columns complicates data cleaning. We could apply string or regex methods to each value using data.map (with a lambda or other function), but that will fail on the NA values (see the illustration below). To cope with this, Series has array-oriented methods for string operations that skip NA values; they are accessed through the 'str' attribute.
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
data.isnull()
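To see why these specialized methods matter, note that mapping a plain string method over this Series fails on the missing value (the failing call is commented out, as with val.index earlier), while the str version skips it:
# data.map(lambda x: x.upper())  # raises AttributeError because of the NaN
data.str.upper()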
We can check whether each string contains a pattern with 'str.contains'. Regular expressions can be used as well, together with options such as 'IGNORECASE' (a small example follows the code below).
data.str.contains('gmail')
pattern
data.str.findall(pattern, flags=re.IGNORECASE)
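A short sketch showing that str.contains also accepts a regex and flags such as IGNORECASE (recent pandas versions may warn that this pattern contains match groups):
data.str.contains(pattern, flags=re.IGNORECASE)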
For vectorized element retrieval we can use either 'str.get' or indexing into the 'str' attribute. To access elements in the embedded lists, we pass an index to either of these. Strings can also be sliced with this syntax.
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
matches.str.get(1)
matches.str[0]
data.str[:5]
In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not easy to analyze. Knowing how to combine, join and rearrange data is an important part of the data analyst's toolkit.
Hierarchical indexing is an important pandas feature that enables you to have multiple index levels on an axis. It provides a way to work with higher-dimensional data in a lower-dimensional form. When looking at a Series or DataFrame with a MultiIndex, you will see "gaps" in the outer level of the display, which mean "same label as the one above".
import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(9),
index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
data.index
With hierarchically indexed objects, you can use 'partial indexing' to concisely select subsets of the data. Selection is also possible on an "inner" level of the index.
data['b']
data['b':'c']
data.loc[['b', 'd']]
data.loc[:,2]
Hierarchical indexing plays an important role in reshaping data and in group-based operations such as forming a pivot table. For example, you can rearrange this Series into a DataFrame with its 'unstack' method; the inverse operation of unstack is 'stack'.
data.unstack()
data.unstack().stack()
In a DataFrame, either axis can have a hierarchical index. The index levels can have names, which will be shown in the console output. NOTE - Be careful not to mix up the index names with the row labels. With partial column indexing we can select groups of columns. A 'MultiIndex' can also be created by itself and then reused (see the sketch after the example below).
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
index=[['a', 'a', 'b', 'b'], [1,2,1,2]],
columns = [['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
frame
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
frame['Ohio']
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
names = ['state', 'color'])
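Such a MultiIndex can then be reused, for example by passing it straight to a new DataFrame; a small sketch:
columns = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                     ['Green', 'Red', 'Green']],
                                    names=['state', 'color'])
pd.DataFrame(np.arange(12).reshape((4, 3)), columns=columns)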
Sometimes we need to rearrange the order of the levels on an axis or to sort the data by the values in one specific level. 'swaplevel' takes two level numbers or names and returns a new object with those levels interchanged (the data is otherwise unaltered). 'sort_index' sorts the data using the values in a single level; it is often used after swapping levels so that the result is lexicographically sorted. NOTE - Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level.
frame.swaplevel('key1', 'key2')
frame.sort_index(level=1)
frame.swaplevel(0,1).sort_index(level=0)
Many descriptive and summary statistics on DataFrame and Series have a 'level' option, where you specify the level you want to aggregate by on a particular axis; aggregation can be done on either rows or columns. Under the hood, this uses pandas' groupby machinery.
frame.sum(level='key2')
frame.sum(level='color', axis=1)
We may want to use one or more columns of a DataFrame as its row index, or alternatively move the row index into the columns. The 'set_index' method creates a new DataFrame using one or more of its columns as the index. By default those columns are removed from the DataFrame, but we can also leave them in with 'drop=False'. 'reset_index' does the opposite of set_index: it moves the hierarchical index levels into columns.
frame = pd.DataFrame({'a': range(7), 'b':range(7,0,-1),
'c':['one', 'one', 'one', 'two', 'two',
'two', 'two'],
'd': [0,1,2,0,1,2,3]})
frame
frame2 = frame.set_index(['c', 'd'])
frame2
frame.set_index(['c','d'], drop=False)
frame2.reset_index()
Data contained in pandas objects can be combined in several ways:
1. 'pandas.merge' connects rows in DataFrames based on one or more keys. This is similar to a 'join' operation in SQL.
2. 'pandas.concat' concatenates or 'stacks' together objects along an axis.
3. The 'combine_first' method enables splicing together overlapping data to fill in missing values in one object with values from another.
Merge or join operations combine datasets by linking rows using one or more keys; they are central to relational databases. The 'merge' function in pandas is the main entry point for these operations. A many-to-one join is one where one object has multiple rows with the same key value while the other has only one row per key; the result contains a combination of the rows from both objects.
df1 = pd.DataFrame({'key': ['b','b','a','c','a','a','b'],
'data1':range(7)})
df2 = pd.DataFrame({'key':['a','b','d'],
'data2': range(3)})
df1
df2
pd.merge(df1, df2)
It is not strictly necessary to specify which column to join on; if omitted, 'merge' uses the overlapping column names as the keys. It is good practice to specify the key explicitly, though. If the key column names differ between the two objects, you must specify them explicitly with 'left_on' and 'right_on'. By default merge does an 'inner' join, so the keys in the result are the intersection, i.e. the common set found in both objects. The other possible join types are 'left', 'right' and 'outer'; an outer join takes the union of the keys, combining the effect of the left and right joins.
pd.merge(df1, df2, on='key')
df3 = pd.DataFrame({'lkey': ['b','b','a','c','a','a','b'],
'data1':range(7)})
df4 = pd.DataFrame({'rkey':['a','b','d'],
'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
pd.merge(df1, df2, how='outer')
Many-to-many merges have well-defined, though not very intuitive, behaviour: they form the Cartesian product of the matching rows. For example, three rows on the left and two rows on the right with the same key produce six rows in the result.
df1 = pd.DataFrame({'key': ['b','b','a','c','a', 'b'],
'data1':range(6)})
df2 = pd.DataFrame({'key':['a','b','a','b','d'],
'data2': range(5)})
df1
df2
pd.merge(df1, df2, on='key', how='left')
pd.merge(df1, df2, how='inner')
To merge on multiple keys, pass a list of column names. To reason about which key combinations appear in the result, think of an array of tuples formed from the multiple keys being used as a single join key. NOTE - When joining columns on columns, the indexes of the passed DataFrame objects are discarded. Merging may also produce overlapping column names; this can be handled manually by renaming the axis labels, but 'merge' also has a 'suffixes' option for specifying strings to append to the overlapping names in the left and right objects.
left = pd.DataFrame({'key1':['foo', 'foo', 'bar'],
'key2':['one', 'two', 'one'],
'lval':[1, 2, 3]})
right = pd.DataFrame({'key1':['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1', 'key2'], how='outer')
pd.merge(left, right, on='key1')
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
In some scenarios, the merge key(s) are found in the index. You can pass 'left_index=True' or 'right_index=True' (or both) to indicate that the index should be used as the merge key. The default merge method intersects the join keys, but we can instead form their union with an outer join.
left1 = pd.DataFrame({'key':['a','b','a','a','b','c'],
'value':range(6)})
right1 = pd.DataFrame({'group_val':[3.5, 7]}, index=['a','b'])
left1
right1
pd.merge(left1, right1, left_on='key', right_index=True)
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Things are more complicated with hierarchically indexed data, since joining on the index is implicitly a multiple-key merge. You indicate the multiple columns to merge on as a list, and handle the duplicate index values with "how='outer'". Using the indexes of both sides of the merge is also possible.
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'key2': [2000, 2001, 2002, 2001, 2002],
'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6,2)),
index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
'Ohio', 'Ohio'],
[2001, 2000, 2000, 2000, 2001, 2002]],
columns=['event1', 'event2'])
lefth
righth
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
pd.merge(lefth, righth, left_on=['key1', 'key2'],
right_index=True, how='outer')
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
index=['a', 'c', 'e'],
columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
index=['b','c','d','e'],
columns=['Missouri','Alabama'])
left2
right2
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
DataFrame has a 'join' method for merging by index. It can also be used to combine several DataFrame objects that have similar indexes but non-overlapping columns. Partly for legacy reasons, join performs a left join on the keys by default, preserving the calling frame's row index. It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame. For simple index-on-index merges, you can pass a list of DataFrames to join as an alternative to using the 'concat' function.
left2.join(right2, how='outer')
left1.join(right1, on='key')
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
index=['a', 'c', 'e', 'f'],
columns=['New York', 'Oregon'])
another
left2.join([right2, another])
left2.join([right2, another], how='outer')
Another kind of data combination operation is referred to interchangeably as concatenation, binding or stacking. NumPy's 'concatenate' function performs this on NumPy arrays.
arr = np.arange(12).reshape((3,4))
arr
np.concatenate([arr, arr], axis=1)
pandas objects have labeled axes, which let us generalize array concatenation. There are a few things to think about before concatenating pandas objects:
1. If the two objects are indexed differently on the other axes, should we combine the distinct elements (union) or use only the shared values (intersection) along those axes?
2. Do the concatenated chunks need to be identifiable in the resulting object?
3. Does the 'concatenation axis' contain data that needs to be preserved? In many cases, default integer labels are best discarded during concatenation.
The 'concat' function provides a consistent way to address these issues. For Series whose indexes do not overlap, concat simply glues together the values and indexes. By default it works along 'axis=0', producing another Series; if we pass 'axis=1' (the columns), the result is a DataFrame.
s1 = pd.Series([0,1], index=['a', 'b'])
s2 = pd.Series([2,3,4], index=['c', 'd', 'e'])
s3 = pd.Series([5,6], index=['f', 'g'])
pd.concat([s1, s2, s3])
pd.concat([s1, s2, s3], axis=1)
When there is no overlap on the other axis (as in the example above), the result is the sorted union (the 'outer' join) of the indexes. You can get the intersection instead by passing "join='inner'". You can even specify the exact labels to use on the other axes with 'join_axes'.
s4 = pd.concat([s1, s3])
s4
pd.concat([s1, s4], axis=1)
pd.concat([s1, s4], axis=1, join='inner')
pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
A related issue is identifying the concatenated pieces in the result. To create a hierarchical index on the concatenation axis, use the 'keys' argument. When concatenating along axis=1, the keys become the column headers; this holds for both Series and DataFrames. If we pass a dict of objects instead of a list, the dict's keys are used in place of the 'keys' option. There are additional arguments governing how the hierarchical index is created, e.g. the 'names' argument names the created axis levels.
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
result
result.unstack()
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
df1 = pd.DataFrame(np.arange(6).reshape(3,2), index=['a', 'b', 'c'],
columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2,2), index=['a', 'c'],
columns=['three', 'four'])
df1
df2
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
pd.concat({'level1':df1, 'level2': df2}, axis=1)
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
names=['upper', 'lower'])
A last consideration concerns DataFrames in which the row index does not contain any relevant data. To discard it in the result, pass 'ignore_index=True'.
df1 = pd.DataFrame(np.random.randn(3,4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2,3), columns=['b', 'd', 'a'])
df1
df2
pd.concat([df1, df2], ignore_index=True)
Another data combination situation involves two datasets whose indexes overlap fully or partly. NumPy's 'where' function performs the array-oriented equivalent of an if-else expression; pandas has 'combine_first' for this. On Series it performs the equivalent operation but with pandas' data alignment logic, and on DataFrames it does the same column by column. You can think of it as "patching" missing data in the calling object with values from the object you pass.
a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
index=['f','e','d','c','b','a'])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
index=['a','b','c','d','e','f'])
a
b
np.where(pd.isnull(a), b, a)
b.combine_first(a)
df1 = pd.DataFrame({'a':[1., np.nan, 5., np.nan],
'b':[np.nan, 2., np.nan, 6.],
'c':range(2,18,4)})
df2 = pd.DataFrame({'a':[5., 4., np.nan, 3., 7.],
'b':[np.nan, 3., 4., 6., 8.]})
df1
df2
df1.combine_first(df2)
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
1. stack - 'rotates' or pivots from the columns into the rows.
2. unstack - pivots from the rows into the columns.
Calling 'stack' pivots the columns into the rows, producing a Series. If we have a hierarchically indexed Series, we can rearrange the data back into a DataFrame with 'unstack'.
data = pd.DataFrame(np.arange(6).reshape((2,3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns = pd.Index(['one', 'two', 'three'],
name = 'number'))
data
result = data.stack()
result
result.unstack()
By default, the innermost level is stacked or unstacked. You can operate on a different level by passing a level number or name. Unstacking might introduce missing data if not all of the values in the level are found in each subgroup.
result.unstack(0)
result.unstack('state')
s1 = pd.Series([0,1,2,3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4,5,6], index=['c','d','e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2
data2.unstack()
Stacking filters out missing data by default, so the operation is easily invertible. When you unstack in a DataFrame, the level unstacked becomes the lowest level in the resulting columns. When calling stack or unstack, we can also indicate the name of the axis level to operate on.
data2.unstack()
data2.unstack().stack()
data2.unstack().stack(dropna=False)
df = pd.DataFrame({'left': result, 'right': result+5},
columns=pd.Index(['left', 'right'], name='side'))
df
df.unstack('state')
df.unstack('state').stack('side')
A common way to store multiple time series in databases and CSV files is the so-called 'long' or 'stacked' format. One way to build a time index is with 'PeriodIndex', which here combines the year and quarter columns into a kind of time-interval type; we then use these intervals to index the data.
data = pd.read_csv('examples/macrodata.csv')
data.head()
data.columns
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D', 'end')
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:5]
This layout is the 'long' format for multiple time series, or for other observational data with two or more keys: each row in the table represents a single observation. Data is frequently stored this way in relational databases because a fixed schema allows the number of distinct values in the 'item' column to change as more data is added to the table. In the previous example, 'date' and 'item' would be the primary keys, giving both relational integrity and easier joins. In some cases, though, this format is harder to work with; you might prefer a DataFrame with one column per distinct item, indexed by the timestamps in the 'date' column. That is exactly what the 'pivot' method provides.
pivoted = ldata.pivot('date', 'item', 'value')
pivoted[:10]
The first two values passed to pivot are the columns to be used as the row and column index, and the last is an optional value column to fill the DataFrame with. Suppose you had two value columns that you wanted to reshape simultaneously: by omitting the last argument you obtain a DataFrame with hierarchical columns. Note that pivot is equivalent to creating a hierarchical index with 'set_index' followed by a call to 'unstack'.
ldata['value2'] = np.random.randn(len(ldata))
ldata[:5]
pivoted = ldata.pivot('date', 'item')
pivoted[:5]
pivoted['value'][:5]
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked[:5]
An inverse operation to 'pivot' is 'pandas.melt'. Rather than transforming one column into many, it merges multiple columns into one, producing a DataFrame that is longer than the input. When using pandas.melt we must indicate which columns, if any, are group indicators. In this example we use 'key' as the only group indicator.
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
'A': [1,2,3],
'B': [4,5,6],
'C': [7,8,9]})
df
melted = pd.melt(df, ['key'])
melted
Using pivot we can reshape back to the original layout, but since pivot creates an index from the column used as the row labels, we may want to use 'reset_index' to move the data back into a column. You can also specify explicitly which columns are indicators and which are values with the 'id_vars' and 'value_vars' parameters respectively, and pandas.melt can be used without any group identifiers at all.
reshaped = melted.pivot('key', 'variable', 'value')
reshaped
reshaped.reset_index()
pd.melt(df, id_vars=['key'], value_vars=['A','B'])
pd.melt(df, value_vars=['A', 'B', 'C'])
pd.melt(df, value_vars=['key', 'A', 'B'])