Introduction

A large portion of data analysis is taken up by preparation: loading, cleaning, transforming, and rearranging. These tasks can take up 80% or more of an analyst's time, because the way data is stored in files or databases is often not the right format for analysis. Researchers often prefer to do ad hoc processing of data from one form to another using a language like R or Python. pandas, together with built-in Python features, provides a high-level, flexible, and fast set of tools for manipulating data into the right form.

Handling missing data

The way missing data is represented in pandas is imperfect but functional for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number). It acts as a sentinel value and can be easily detected.

In [3]:
import pandas as pd
import numpy as np

string_data = pd.Series(['aardvak', 'artichoke', np.nan, 'avocado'])
string_data
Out[3]:
0      aardvak
1    artichoke
2          NaN
3      avocado
dtype: object
In [4]:
string_data.isnull()
Out[4]:
0    False
1    False
2     True
3    False
dtype: bool

This convention of using NaN was adopted from R, where missing values are referred to as NA. In statistics, NA may be either data that does not exist or data that exists but was not observed. When cleaning data, we should also analyse the missing data itself to identify data collection problems or potential bias introduced by the missing values.

In [5]:
string_data[0] = None

string_data.isnull()
Out[5]:
0     True
1    False
2     True
3    False
dtype: bool

Filtering Out Missing Data

We always have the option to filter out missing data by hand using isnull and boolean indexing, but dropna can be more convenient. For a Series, it returns the Series with only the non-null data and index values. For a DataFrame there are more options: by default, dropna drops any row that contains even one missing value; passing how='all' drops only rows in which every value is NA. To drop columns instead, pass axis=1.

In [6]:
from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
Out[6]:
0    1.0
2    3.5
4    7.0
dtype: float64
In [7]:
data[data.notnull()]
Out[7]:
0    1.0
2    3.5
4    7.0
dtype: float64
In [8]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                   [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()
data
Out[8]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [9]:
cleaned
Out[9]:
0 1 2
0 1.0 6.5 3.0
In [10]:
data.dropna(how='all')
Out[10]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
In [11]:
data[4] = NA
data
Out[11]:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
In [12]:
data.dropna(axis=1, how='all')
Out[12]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0

Another way of trimming a DataFrame comes up with time series data. To keep only rows containing a certain number of observations, use the thresh argument: rows with fewer than thresh non-NA values are dropped.

In [13]:
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

df
Out[13]:
0 1 2
0 -1.152841 NaN NaN
1 1.392019 NaN NaN
2 -0.138522 NaN -0.801143
3 1.727443 NaN 0.011845
4 -0.008346 -0.910039 1.191962
5 0.846771 1.635657 0.192299
6 0.611303 -0.422075 0.599414
In [14]:
df.dropna()
Out[14]:
0 1 2
4 -0.008346 -0.910039 1.191962
5 0.846771 1.635657 0.192299
6 0.611303 -0.422075 0.599414
In [15]:
df.dropna(thresh=2)
Out[15]:
0 1 2
2 -0.138522 NaN -0.801143
3 1.727443 NaN 0.011845
4 -0.008346 -0.910039 1.191962
5 0.846771 1.635657 0.192299
6 0.611303 -0.422075 0.599414

Filling In Missing Data

Rather than removing NAs and discarding other useful data in the same rows, we can fill in the NAs in a number of ways. fillna is the workhorse function for this: the constant we pass replaces the missing values. If we call fillna with a dict, we can fill a different value for each column.

In [16]:
df.fillna(0)
Out[16]:
0 1 2
0 -1.152841 0.000000 0.000000
1 1.392019 0.000000 0.000000
2 -0.138522 0.000000 -0.801143
3 1.727443 0.000000 0.011845
4 -0.008346 -0.910039 1.191962
5 0.846771 1.635657 0.192299
6 0.611303 -0.422075 0.599414
In [17]:
df.fillna({1:0.5, 2: 0})
Out[17]:
0 1 2
0 -1.152841 0.500000 0.000000
1 1.392019 0.500000 0.000000
2 -0.138522 0.500000 -0.801143
3 1.727443 0.500000 0.011845
4 -0.008346 -0.910039 1.191962
5 0.846771 1.635657 0.192299
6 0.611303 -0.422075 0.599414

fillna returns a new object by default, but we can modify the existing object in place with inplace=True. The same interpolation methods available for reindexing, such as 'ffill', can also be used with fillna. With a little creativity, fillna can do lots of other things, such as filling with the mean or median value.

In [18]:
_ = df.fillna(0, inplace=True)

df
Out[18]:
0 1 2
0 -1.152841 0.000000 0.000000
1 1.392019 0.000000 0.000000
2 -0.138522 0.000000 -0.801143
3 1.727443 0.000000 0.011845
4 -0.008346 -0.910039 1.191962
5 0.846771 1.635657 0.192299
6 0.611303 -0.422075 0.599414
In [19]:
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
Out[19]:
0 1 2
0 -1.196822 -1.728530 0.011651
1 0.635502 -1.557243 1.468730
2 -0.357786 NaN 0.239709
3 -0.456500 NaN -0.379100
4 -0.511740 NaN NaN
5 -0.005963 NaN NaN
In [20]:
df.fillna(method='ffill')
Out[20]:
0 1 2
0 -1.196822 -1.728530 0.011651
1 0.635502 -1.557243 1.468730
2 -0.357786 -1.557243 0.239709
3 -0.456500 -1.557243 -0.379100
4 -0.511740 -1.557243 -0.379100
5 -0.005963 -1.557243 -0.379100
In [21]:
df.fillna(method='ffill', limit=2)
Out[21]:
0 1 2
0 -1.196822 -1.728530 0.011651
1 0.635502 -1.557243 1.468730
2 -0.357786 -1.557243 0.239709
3 -0.456500 -1.557243 -0.379100
4 -0.511740 NaN -0.379100
5 -0.005963 NaN -0.379100
In [22]:
data = pd.Series([1., NA, 3.5, NA, 7])

data.fillna(data.mean())
Out[22]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

Data Transformation

So far we have looked at handling missing data and rearranging data. Filtering, cleaning, and other transformations are another class of important operations.

Removing Duplicates

The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (i.e. has been observed in a previous row) or not. Relatedly, drop_duplicates returns a DataFrame containing only the rows where the duplicated array is False.

In [23]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                    'k2': [1,1,2,3,3,4,4]})
data
Out[23]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
In [24]:
data.duplicated()
Out[24]:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
In [25]:
data.drop_duplicates()
Out[25]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4

Both of these methods consider all of the columns by default; alternatively, you can specify any subset of columns to detect duplicates on. Both keep the first observed value combination by default; pass keep='last' to keep the last one instead.

In [26]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])
Out[26]:
k1 k2 v1
0 one 1 0
1 two 1 1
In [27]:
data.drop_duplicates(['k1', 'k2'], keep='last')
Out[27]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6

Transforming Data Using a Function or Mapping

We sometimes need to perform transformations based on the values in an array, Series, or DataFrame column. The map method, given a function or dict-like object containing the mapping, lets us add or change a column this way. Sometimes the column we base our mapping on has inconsistent casing compared with the keys in our map; in that case we can convert all the values to lowercase first, or simply pass a function that does the lookup for us.

In [28]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                             'Pastrami', 'corned beef', 'Bacon',
                            'pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[28]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
In [29]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
In [30]:
lowercased = data['food'].str.lower()
lowercased
Out[30]:
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
In [31]:
data['animal'] = lowercased.map(meat_to_animal)
data
Out[31]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
In [32]:
data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[32]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Replacing Values

Filling in missing data with fillna is a special case of more general value replacement. map can be used to modify a subset of values, but replace provides a simpler and more flexible way to do so. Passing the sentinel (or garbage) value followed by the replacement value produces a new object with the values replaced. If we want in-place replacement, pass inplace=True.

In [33]:
data = pd.Series([1., -999., 2., -999., 3.])
data
Out[33]:
0      1.0
1   -999.0
2      2.0
3   -999.0
4      3.0
dtype: float64
In [34]:
data.replace(-999, np.nan)
Out[34]:
0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64

To replace multiple values with a single value, pass a list of values followed by the substitute. To use a different replacement for each value, pass a list of substitutes of the same length. The argument can also be a dict mapping values to their substitutes. NOTE - data.replace is distinct from data.str.replace; the latter performs element-wise string substitution.

In [35]:
data.replace([-999, -1000], np.nan)
Out[35]:
0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64
In [36]:
data.replace([-999, -1000], [np.nan, 0])
Out[36]:
0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64
In [37]:
data.replace({-999: np.nan, -1000: 0})
Out[37]:
0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64
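
The in-place form mentioned above is not shown in the cells; a minimal sketch (modifying the same Series instead of returning a new one) would be:

data.replace(-999, np.nan, inplace=True)  # mutates data and returns None
data  # the -999 sentinels are now NaN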

Renaming Axis Indexes

Like values, axis labels can be transformed by a function or mapping to produce new, differently labeled objects. We can also modify the axes in place without creating a new data structure.

In [38]:
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index = ['Ohio', 'Colorado', 'New York'],
                   columns = ['one', 'two', 'three', 'four'])
In [39]:
transform = lambda x: x[:4].upper()

data.index.map(transform)
Out[39]:
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
In [40]:
data.index = data.index.map(transform)
data
Out[40]:
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11

To create a transformed version of a dataset without modifying the original, use rename. It can also be used with a dict-like object providing new values for a subset of the axis labels. This saves you from copying the DataFrame manually and then assigning its index and columns. To modify the dataset in place, pass inplace=True.

In [41]:
data.rename(index=str.title, columns=str.upper)
Out[41]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
In [42]:
data.rename(index={'OHIO':"INDIANA"},
           columns = {'three':'peekaboo'})
Out[42]:
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
In [43]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
Out[43]:
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11

Discretization and Binning

Continuous data is often discretized or otherwise separated into 'bins' for analysis. To bin a set of continuous values, use the cut function from pandas. In the example below, we bin a set of ages into the groups 18 to 25, 26 to 35, 36 to 60, and 61 and older.

In [44]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)
cats
Out[44]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

pandas returns a special Categorical object from cut. The output describes the bins computed by cut; you can treat it like an array of bin names, one per element. Internally it contains a categories array specifying the distinct category names, along with a labeling for the ages data in the codes attribute.

In [45]:
cats.codes
Out[45]:
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [46]:
cats.categories
Out[46]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')
In [47]:
pd.value_counts(cats)
Out[47]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

The interval notation used by cut is consistent with standard mathematical notation: a parenthesis means that side is open (exclusive), while a square bracket means it is closed (inclusive). You can change which side is closed by passing right=False. You can also supply your own bin names by passing a list or array to the labels option.

In [48]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[48]:
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
In [49]:
group_names = ['Youth', 'YoungAdult', 'MiddleAges', 'Senior']
pd.cut(ages, bins, labels=group_names)
Out[49]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAges, MiddleAges, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAges < Senior]

Instead of specifying explicit bin edges, you can pass an integer number of bins and cut will compute equal-length bins based on the minimum and maximum values in the data. The precision option limits the decimal precision of the bin edges; precision=2 limits it to two digits.

In [50]:
data = np.random.rand(20)

pd.cut(data, 4, precision=2)
Out[50]:
[(0.51, 0.74], (0.28, 0.51], (0.51, 0.74], (0.055, 0.28], (0.51, 0.74], ..., (0.28, 0.51], (0.055, 0.28], (0.055, 0.28], (0.28, 0.51], (0.055, 0.28]]
Length: 20
Categories (4, interval[float64]): [(0.055, 0.28] < (0.28, 0.51] < (0.51, 0.74] < (0.74, 0.97]]

cut has a closely related function, qcut, which bins the data based on sample quantiles. Depending on the distribution of the data, cut will not usually give each bin the same number of data points; because qcut uses sample quantiles, you will roughly obtain equal-size bins. You can also pass your own quantiles (numbers between 0 and 1) to qcut.

In [51]:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
Out[51]:
[(-3.413, -0.659], (0.0656, 0.778], (0.0656, 0.778], (0.778, 3.118], (-0.659, 0.0656], ..., (0.0656, 0.778], (-0.659, 0.0656], (-0.659, 0.0656], (0.0656, 0.778], (0.778, 3.118]]
Length: 1000
Categories (4, interval[float64]): [(-3.413, -0.659] < (-0.659, 0.0656] < (0.0656, 0.778] < (0.778, 3.118]]
In [52]:
pd.value_counts(cats)
Out[52]:
(0.778, 3.118]      250
(0.0656, 0.778]     250
(-0.659, 0.0656]    250
(-3.413, -0.659]    250
dtype: int64
In [53]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Out[53]:
[(-3.413, -1.172], (0.0656, 1.378], (0.0656, 1.378], (0.0656, 1.378], (-1.172, 0.0656], ..., (0.0656, 1.378], (-1.172, 0.0656], (-1.172, 0.0656], (0.0656, 1.378], (1.378, 3.118]]
Length: 1000
Categories (4, interval[float64]): [(-3.413, -1.172] < (-1.172, 0.0656] < (0.0656, 1.378] < (1.378, 3.118]]

Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. To find values exceeding a threshold, use boolean indexing, combined with functions like abs() as needed. To select all rows having at least one value exceeding a threshold, use the any(1) method. Values can also be set based on these criteria, so you can cap values outside an interval or threshold. The np.sign() function returns 1 or -1 depending on whether each value is positive or negative, which is handy for capping.

In [54]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
Out[54]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.014491 0.014646 -0.012298 0.039840
std 1.010034 1.010854 1.019774 0.967757
min -3.701866 -3.387856 -3.039110 -3.390256
25% -0.657859 -0.654879 -0.685004 -0.619386
50% 0.066684 -0.001560 0.020953 0.058701
75% 0.692534 0.713840 0.628787 0.696540
max 3.058465 3.226376 4.065409 3.088712
In [55]:
col = data[2]
col[np.abs(col) > 3]
Out[55]:
360    3.099117
559    4.065409
583   -3.039110
976    3.180584
Name: 2, dtype: float64
In [56]:
data[(np.abs(data) > 3).any(1)]
Out[56]:
0 1 2 3
186 3.058465 0.016649 -1.722464 1.367502
228 -2.056428 -1.475965 -0.003494 -3.390256
360 0.124732 -0.404802 3.099117 -0.237581
388 -0.253419 3.105240 0.078645 -0.148384
470 -3.701866 -2.025745 0.727206 -1.468939
472 -0.484421 -3.387856 -0.112233 -0.068768
508 0.323240 3.226376 -1.808320 -0.094250
559 -0.579666 1.194872 4.065409 -1.209842
583 0.658055 1.219842 -3.039110 0.021802
636 -3.087210 1.114326 -0.585877 0.080978
664 -1.837414 -0.281346 -0.871586 3.088712
678 -3.232977 0.249556 2.571467 1.558432
735 0.080877 -0.120672 -1.181248 -3.254930
976 0.713734 0.905299 3.180584 0.013758
In [57]:
# Capping outside -3 to 3
data[np.abs(data) > 3] = np.sign(data) * 3 
data.describe()
Out[57]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.015455 0.014702 -0.013604 0.040396
std 1.006524 1.008608 1.015092 0.965337
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.657859 -0.654879 -0.685004 -0.619386
50% 0.066684 -0.001560 0.020953 0.058701
75% 0.692534 0.713840 0.628787 0.696540
max 3.000000 3.000000 3.000000 3.000000
In [58]:
np.sign(data).head()
Out[58]:
0 1 2 3
0 1.0 1.0 -1.0 -1.0
1 -1.0 -1.0 1.0 -1.0
2 -1.0 1.0 1.0 -1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 -1.0 1.0

Permutation and Random Sampling

We can easily permute (randomly reorder) a Series or the rows of a DataFrame using numpy.random.permutation. Calling it with the length of the axis you want to permute creates an array of integers indicating the new ordering. We can then use that array with iloc-based indexing or the equivalent take function.

In [59]:
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
sampler = np.random.permutation(5)
sampler
Out[59]:
array([1, 0, 4, 2, 3])
In [60]:
df
Out[60]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
In [61]:
df.take(sampler)
Out[61]:
0 1 2 3
1 4 5 6 7
0 0 1 2 3
4 16 17 18 19
2 8 9 10 11
3 12 13 14 15
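
As mentioned above, an iloc-based selection with the same permutation array gives an equivalent result; a small sketch:

df.iloc[sampler]  # same reordering as df.take(sampler)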

Use the sample method to select a random subset without replacement. To generate a sample with replacement (i.e. allowing repeated choices), pass replace=True.

In [62]:
df.sample(n=3)
Out[62]:
0 1 2 3
1 4 5 6 7
2 8 9 10 11
4 16 17 18 19
In [63]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
Out[63]:
0    5
1    7
0    5
4    4
4    4
4    4
4    4
3    6
0    5
2   -1
dtype: int64

Computing Indicator / Dummy Variables

Another transformation is converting a categorical variable into a 'dummy' or 'indicator' matrix. If a column has k distinct values, we can derive a matrix or DataFrame with k columns of 1s and 0s. pandas has the get_dummies function for doing this. You may want to add a prefix to the columns in the indicator DataFrame, for example so it can be merged with other data; get_dummies has a prefix argument for this.

In [64]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                  'data1': range(6)})
pd.get_dummies(df['key'])
Out[64]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In [65]:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
Out[65]:
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0

If a row can belong to multiple categories, building the dummies is more involved. Adding indicator columns for something like a genre field takes a bit of data wrangling. First we get all the unique categorical values appearing in the DataFrame.

In [66]:
mnames = ['movie_id', 'title', 'genres']

movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                      header=None, names=mnames)
movies[:10]
C:\Users\adity\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  after removing the cwd from sys.path.
Out[66]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
In [67]:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
    
genres = pd.unique(all_genres)
genres
Out[67]:
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

One way to build the indicators is to start with a DataFrame of all zeros, then iterate through each row and set the appropriate entries of dummies to 1. We use dummies.columns to compute the column indices for each genre, then use iloc to set the values based on those indices. Finally we combine this indicator DataFrame with the original DataFrame.

In [68]:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
In [69]:
gen = movies.genres[0]
gen.split('|')
Out[69]:
['Animation', "Children's", 'Comedy']
In [70]:
dummies.columns.get_indexer(gen.split('|'))
Out[70]:
array([0, 1, 2], dtype=int64)
In [71]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
In [72]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
Out[72]:
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, dtype: object

NOTE - For much larger data, this method of constructing indicator variables with a Python loop is not especially fast. It would be better to write a lower-level function that writes directly to a NumPy array and then wraps the result in a DataFrame (a rough sketch of this idea appears at the end of this section). A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut; this shows which bin each value falls into. For the example below we set the random seed to make it deterministic.

In [73]:
np.random.seed(12345)
values = np.random.rand(10)
values
Out[73]:
array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])
In [74]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
Out[74]:
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 0 0 1
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 1 0 0
6 0 0 0 0 1
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0
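
As noted above, a lower-level approach fills a NumPy array directly and only wraps it in a DataFrame at the end. The following is only a sketch, assuming movies is loaded as earlier; the helper name genre_indicators is illustrative, not part of pandas:

def genre_indicators(genre_series, sep='|'):
    # split each string once and collect the distinct genres
    split = [s.split(sep) for s in genre_series]
    genres = pd.unique([g for parts in split for g in parts])
    col_idx = {g: j for j, g in enumerate(genres)}
    # write the 1s straight into a NumPy array instead of row-by-row iloc assignment
    mat = np.zeros((len(genre_series), len(genres)), dtype=np.uint8)
    for i, parts in enumerate(split):
        for g in parts:
            mat[i, col_idx[g]] = 1
    return pd.DataFrame(mat, columns=genres, index=genre_series.index)

# dummies_fast = genre_indicators(movies['genres'])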

String Manipulation

Python is a popular language for raw data manipulation in part because of its ease of use for string and text processing. Simple text operations can be handled with a string object's built-in methods. For more complex pattern matching and text manipulation we can use regular expressions. pandas adds to this by letting us apply both string and regex operations to whole arrays of data.

String Object Methods

For many string manipulation scenarios, the built-in string methods are sufficient. A string can be broken into pieces on a separator using split. split is often combined with strip to trim whitespace, including line breaks. Substrings can be concatenated with the '+' operator, but a faster and more Pythonic way is to pass a list or tuple of substrings to the join method of the 'stitching' string.

In [78]:
val = 'a,b,   guido'

val.split(',')
Out[78]:
['a', 'b', '   guido']
In [79]:
pieces = [x.strip() for x in val.split(',')]
pieces
Out[79]:
['a', 'b', 'guido']
In [80]:
first, second, third = pieces
first + '::' + second + '::' + third
Out[80]:
'a::b::guido'
In [81]:
'::'.join(pieces)
Out[81]:
'a::b::guido'

The 'in' keyword is the best way to detect a substring, although index and find can also be used. The major difference between them is that index raises an exception if the substring is not found, whereas find returns -1. count returns the number of occurrences of a particular substring. replace substitutes occurrences of one pattern for another; it is also commonly used to delete a pattern by passing an empty string as the replacement.

In [83]:
'guido' in val
Out[83]:
True
In [84]:
val.index(',')
Out[84]:
1
In [85]:
val.find(':')
Out[85]:
-1
In [86]:
# val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-86-2c016e7367ac> in <module>()
----> 1 val.index(':')

ValueError: substring not found
In [87]:
val.count(',')
Out[87]:
2
In [88]:
val.replace(',', '::')
Out[88]:
'a::b::   guido'
In [89]:
val.replace(',', '')
Out[89]:
'ab   guido'

Regular Expressions

Regular expressions provide a flexible way to search for or match (often complex) string patterns in text. A single expression, commonly called a regex, is a string written in the regular expression language. Python's built-in re module applies regular expressions to strings; its functions fall into three related categories: pattern matching, substitution, and splitting. We first describe a pattern that we want to locate in the text, which can then be used for many purposes.

Once we define a regex, we can use it to split text with re.split. In that call, the regex is first compiled and then its split method is called on the passed text. We can compile a regex ourselves with re.compile, forming a reusable regex object. Compiling the regex beforehand is highly recommended if you intend to apply it to many strings; it saves CPU cycles. NOTE - To avoid cluttered escape sequences in a regex, use raw string literals, e.g. r'C:\x' instead of the equivalent 'C:\\x'.

In [92]:
import re

text = "foo bar\t baz   \tqux"
re.split('\s+', text)
Out[92]:
['foo', 'bar', 'baz', 'qux']
In [94]:
regex = re.compile('\s+')
regex.split(text)
Out[94]:
['foo', 'bar', 'baz', 'qux']

We can use the findall method to get a list of all matches of the regex in a string. To get only the first match, use the search method. match is stricter: it only matches at the beginning of the string, i.e. it checks whether the string starts with the pattern. Passing the IGNORECASE flag makes the regex case-insensitive.

In [96]:
regex.findall(text)
Out[96]:
[' ', '\t ', '   \t']
In [100]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

findall returns a list of the matching patterns. search returns a special match object for only the first match; the match object can only tell us the start and end positions of the match within the string. match returns None here because the pattern does not occur at the start of the string. sub returns a new string with occurrences of the pattern replaced by a new string.

In [102]:
regex.findall(text)
Out[102]:
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
In [103]:
m = regex.search(text)
m
Out[103]:
<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
In [104]:
text[m.start():m.end()]
Out[104]:
'dave@google.com'
In [105]:
print(regex.match(text))
None
In [106]:
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED

To find a pattern and simultaneously segment each match into its parts, put parentheses around the parts of the pattern you want to capture. When the pattern has groups, findall returns a list of tuples. For a single match, the match object's groups method returns a tuple of the pattern components. sub can also access the groups in each match using the special symbols \1, \2, and so on, where \1 corresponds to the first matched group.

In [108]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
In [109]:
regex.findall(text)
Out[109]:
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
In [111]:
m = regex.match('wes@bright.net')
m.groups()
Out[111]:
('wes', 'bright', 'net')
In [112]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com

Vectorized String Functions in pandas

Missing values in string columns complicate data cleaning. We could apply string or regex methods to each value (passing a lambda or other function) using data.map, but that will fail on the NA values. To cope with this, Series has array-oriented methods for string operations that skip NA values; they are accessed through the str attribute.

In [116]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Out[116]:
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object
In [117]:
data.isnull()
Out[117]:
Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

We can check whether each string contains a pattern with the str.contains method. It also accepts regular expressions, along with re options such as IGNORECASE.

In [119]:
data.str.contains('gmail')
Out[119]:
Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
In [120]:
pattern
Out[120]:
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
In [121]:
data.str.findall(pattern, flags=re.IGNORECASE)
Out[121]:
Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object
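
str.contains accepts regex flags directly, so the case-insensitive matching mentioned above could be written, for example, as:

data.str.contains('GMAIL', flags=re.IGNORECASE)  # regex match, ignoring case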

To retrieve elements from these vectorized results, we can either use str.get or index into the str attribute. To access elements in the embedded lists, we pass an index to either of these functions. Strings can also be sliced with this syntax.

In [123]:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Out[123]:
Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
In [124]:
matches.str.get(1)
Out[124]:
Dave    NaN
Steve   NaN
Rob     NaN
Wes     NaN
dtype: float64
In [125]:
matches.str[0]
Out[125]:
Dave    NaN
Steve   NaN
Rob     NaN
Wes     NaN
dtype: float64
In [126]:
data.str[:5]
Out[126]:
Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

Introduction

In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not easy to analyze. Knowing how to combine, join, and rearrange data is therefore an important part of the data analyst's toolkit.

Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that enables you to have multiple index levels on an axis. It provides a way to work with higher-dimensional data in a lower-dimensional form. When looking at a Series or DataFrame with a multi-level index, you will see "gaps" in the outer index display, which mean "same as the label above".

In [1]:
import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),
                index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                      [1, 2, 3, 1, 3, 1, 2, 2, 3]])

data
Out[1]:
a  1    1.407820
   2   -1.461955
   3    1.228871
b  1    0.154511
   3   -0.003190
c  1   -0.027980
   2   -0.095242
d  2   -0.562603
   3   -0.681595
dtype: float64
In [2]:
data.index
Out[2]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])

With a hierarchically indexed object, you can use so-called partial indexing to concisely select subsets of the data. Selection is also possible on an "inner" level of the index.

In [3]:
data['b']
Out[3]:
1    0.154511
3   -0.003190
dtype: float64
In [4]:
data['b':'c']
Out[4]:
b  1    0.154511
   3   -0.003190
c  1   -0.027980
   2   -0.095242
dtype: float64
In [5]:
data.loc[['b', 'd']]
Out[5]:
b  1    0.154511
   3   -0.003190
d  2   -0.562603
   3   -0.681595
dtype: float64
In [6]:
data.loc[:,2]
Out[6]:
a   -1.461955
c   -0.095242
d   -0.562603
dtype: float64

Hierarchical indexing plays an important role in reshaping data and in group-based operations such as forming a pivot table. For example, you can rearrange this data into a DataFrame using its unstack method; the inverse operation of unstack is stack.

In [7]:
data.unstack()
Out[7]:
1 2 3
a 1.407820 -1.461955 1.228871
b 0.154511 NaN -0.003190
c -0.027980 -0.095242 NaN
d NaN -0.562603 -0.681595
In [8]:
data.unstack().stack()
Out[8]:
a  1    1.407820
   2   -1.461955
   3    1.228871
b  1    0.154511
   3   -0.003190
c  1   -0.027980
   2   -0.095242
d  2   -0.562603
   3   -0.681595
dtype: float64

In a DataFrame, either axis can have a hierarchical index. The levels of a hierarchical index can have names, which are shown in the console output. NOTE - Be careful not to confuse the index names with the row labels. With partial column indexing we can select groups of columns. A MultiIndex can also be created by itself and then reused.

In [9]:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                    index=[['a', 'a', 'b', 'b'], [1,2,1,2]],
                    columns = [['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame
Out[9]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [10]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
Out[10]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [11]:
frame['Ohio']
Out[11]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
In [12]:
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                      names = ['state', 'color'])
Out[12]:
MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
           labels=[[1, 1, 0], [0, 1, 0]],
           names=['state', 'color'])
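
The constructed MultiIndex can then be reused, for example as the columns of a new DataFrame; a minimal sketch (the variable name cols is just illustrative):

cols = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                  ['Green', 'Red', 'Green']],
                                 names=['state', 'color'])
pd.DataFrame(np.arange(12).reshape((4, 3)),
             index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
             columns=cols)  # same layout as frame above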

Reordering and Sorting Levels

Sometimes we need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. swaplevel takes two level numbers or names and returns a new object with those levels interchanged (the data is otherwise unaltered). sort_index sorts the data using only the values in a single level. We often use sort_index when swapping levels so that the result is lexicographically sorted. NOTE - Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level.

In [13]:
frame.swaplevel('key1', 'key2')
Out[13]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
In [14]:
frame.sort_index(level=1)
Out[14]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
In [15]:
frame.swaplevel(0,1).sort_index(level=0)
Out[15]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11

Summary Statistics by Level

Many descriptive and summary statistics on DataFrame and Series have a level option, with which you can specify the level you want to aggregate by on a particular axis, so we can aggregate over either rows or columns. Under the hood, this uses pandas' groupby machinery.

In [16]:
frame.sum(level='key2')
Out[16]:
state Ohio Colorado
color Green Red Green
key2
1 6 8 10
2 12 14 16
In [17]:
frame.sum(level='color', axis=1)
Out[17]:
color Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
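
The groupby form makes that underlying machinery explicit; a small sketch of equivalent calls for the pandas version used in this notebook:

frame.groupby(level='key2').sum()            # same as frame.sum(level='key2')
frame.groupby(level='color', axis=1).sum()   # same as frame.sum(level='color', axis=1)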

Indexing with a DataFrame's columns

We may want to use one or more columns of a DataFrame as its row index, or alternatively move the row index into the columns. The set_index method creates a new DataFrame using one or more of its columns as the index. By default those columns are removed from the DataFrame, though we can also leave them in. reset_index does the opposite of set_index: it moves the hierarchical index levels back into columns.

In [18]:
frame = pd.DataFrame({'a': range(7), 'b':range(7,0,-1),
                     'c':['one', 'one', 'one', 'two', 'two',
                          'two', 'two'],
                      'd': [0,1,2,0,1,2,3]})
frame                 
Out[18]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
In [19]:
frame2 = frame.set_index(['c', 'd'])
frame2
Out[19]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
In [20]:
frame.set_index(['c','d'], drop=False)
Out[20]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
In [21]:
frame2.reset_index()
Out[21]:
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1

Combining and Merging Datasets

Data in pandas can be combined in a number of ways:

1. pandas.merge connects rows in DataFrames based on one or more keys. This is similar to 'join' operations in SQL.
2. pandas.concat concatenates or 'stacks' together objects along an axis.
3. The combine_first method enables splicing together overlapping data to fill in missing values in one object with values from another.

Database-Style DataFrame Joins

Merge or join operations combine datasets by linking rows using one or more keys; they are central to relational databases. The merge function in pandas is the main entry point for these operations on your data. One common case is a many-to-one join, where one object has multiple rows with the same key while the other has only one row per key; the result combines the rows from both objects.

In [24]:
df1 = pd.DataFrame({'key': ['b','b','a','c','a','a','b'],
                   'data1':range(7)})
df2 = pd.DataFrame({'key':['a','b','d'],
                   'data2': range(3)})
df1
Out[24]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
In [25]:
df2
Out[25]:
key data2
0 a 0
1 b 1
2 d 2
In [26]:
pd.merge(df1, df2)
Out[26]:
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0

Note that we did not specify which column to join on; if not specified, merge uses the overlapping column names as the keys. It is good practice to specify them explicitly, though. If the key column names differ between the two objects, you must specify them in the merge call. By default merge does an 'inner' join, so the keys in the result are the intersection, i.e. the common set found in both objects. The other possible join types are 'left', 'right', and 'outer'; an outer join takes the union of the keys, combining the effect of the left and right joins.

In [28]:
pd.merge(df1, df2, on='key')
Out[28]:
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
In [29]:
df3 = pd.DataFrame({'lkey': ['b','b','a','c','a','a','b'],
                   'data1':range(7)})
df4 = pd.DataFrame({'rkey':['a','b','d'],
                   'data2': range(3)})

pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[29]:
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
In [30]:
pd.merge(df1, df2, how='outer')
Out[30]:
key data1 data2
0 b 0.0 1.0
1 b 1.0 1.0
2 b 6.0 1.0
3 a 2.0 0.0
4 a 4.0 0.0
5 a 5.0 0.0
6 c 3.0 NaN
7 d NaN 2.0

Many-to-many merges have well-defined, though not necessarily intuitive, behaviour: they form the Cartesian product of the matching rows. For example, three rows on the left and two rows on the right with the same key produce six rows in the result.

In [33]:
df1 = pd.DataFrame({'key': ['b','b','a','c','a', 'b'],
                   'data1':range(6)})
df2 = pd.DataFrame({'key':['a','b','a','b','d'],
                   'data2': range(5)})

df1
Out[33]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
In [34]:
df2
Out[34]:
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
In [35]:
pd.merge(df1, df2, on='key', how='left')
Out[35]:
key data1 data2
0 b 0 1.0
1 b 0 3.0
2 b 1 1.0
3 b 1 3.0
4 a 2 0.0
5 a 2 2.0
6 c 3 NaN
7 a 4 0.0
8 a 4 2.0
9 b 5 1.0
10 b 5 3.0
In [36]:
pd.merge(df1, df2, how='inner')
Out[36]:
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2

To merge on multiple keys, pass a list of column names. To understand which key combinations appear in the result, think of the multiple keys as forming an array of tuples used as a single join key. NOTE - When joining columns on columns, the indexes of the passed DataFrame objects are discarded. A merge can also produce overlapping column names. We could address this manually by renaming the axis labels, but merge has a suffixes option for specifying strings to append to the overlapping names in the left and right objects.

In [38]:
left = pd.DataFrame({'key1':['foo', 'foo', 'bar'],
                    'key2':['one', 'two', 'one'],
                    'lval':[1, 2, 3]})
right = pd.DataFrame({'key1':['foo', 'foo', 'bar', 'bar'],
                     'key2': ['one', 'one', 'one', 'two'],
                     'rval': [4, 5, 6, 7]})

pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[38]:
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0
In [39]:
pd.merge(left, right, on='key1')
Out[39]:
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
In [40]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Out[40]:
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7

Merging on Index

In some cases, the merge key(s) will be found in a DataFrame's index. You can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key. The default for merge is to intersect the join keys, but we can instead form their union with an outer join (how='outer').

In [43]:
left1 = pd.DataFrame({'key':['a','b','a','a','b','c'],
                     'value':range(6)})
right1 = pd.DataFrame({'group_val':[3.5, 7]}, index=['a','b'])

left1
Out[43]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
In [44]:
right1
Out[44]:
group_val
a 3.5
b 7.0
In [45]:
pd.merge(left1, right1, left_on='key', right_index=True)
Out[45]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
In [46]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[46]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN

Things are a bit more complicated with hierarchically indexed data, as joining on the index is implicitly a multiple-key merge. You indicate the columns to merge on as a list (and handle the duplicate index values with how='outer'). Using the indexes of both sides of the merge is also possible.

In [48]:
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                     'key2': [2000, 2001, 2002, 2001, 2002],
                     'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6,2)),
                     index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
                            'Ohio', 'Ohio'],
                           [2001, 2000, 2000, 2000, 2001, 2002]],
                     columns=['event1', 'event2'])

lefth
Out[48]:
key1 key2 data
0 Ohio 2000 0.0
1 Ohio 2001 1.0
2 Ohio 2002 2.0
3 Nevada 2001 3.0
4 Nevada 2002 4.0
In [49]:
righth
Out[49]:
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
In [50]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
Out[50]:
key1 key2 data event1 event2
0 Ohio 2000 0.0 4 5
0 Ohio 2000 0.0 6 7
1 Ohio 2001 1.0 8 9
2 Ohio 2002 2.0 10 11
3 Nevada 2001 3.0 0 1
In [51]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], 
         right_index=True, how='outer')
Out[51]:
key1 key2 data event1 event2
0 Ohio 2000 0.0 4.0 5.0
0 Ohio 2000 0.0 6.0 7.0
1 Ohio 2001 1.0 8.0 9.0
2 Ohio 2002 2.0 10.0 11.0
3 Nevada 2001 3.0 0.0 1.0
4 Nevada 2002 4.0 NaN NaN
4 Nevada 2000 NaN 2.0 3.0
In [52]:
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                    index=['a', 'c', 'e'],
                    columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                     index=['b','c','d','e'],
                     columns=['Missouri','Alabama'])

left2
Out[52]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
In [53]:
right2
Out[53]:
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
In [54]:
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[54]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0

DataFrame has a convenient join method for merging by index. It can also be used to combine many DataFrame objects that have the same or similar indexes but non-overlapping columns. Partly for legacy reasons, join performs a left join on the keys, preserving the left frame's row index. It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame. For simple index-on-index merges, you can pass a list of DataFrames to join as an alternative to using the more general concat function.

In [56]:
left2.join(right2, how='outer')
Out[56]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
In [57]:
left1.join(right1, on='key')
Out[57]:
key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN
In [58]:
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                      index=['a', 'c', 'e', 'f'],
                      columns=['New York', 'Oregon'])
another
Out[58]:
New York Oregon
a 7.0 8.0
c 9.0 10.0
e 11.0 12.0
f 16.0 17.0
In [59]:
left2.join([right2, another])
Out[59]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
In [60]:
left2.join([right2, another], how='outer')
Out[60]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
b NaN NaN 7.0 8.0 NaN NaN
c 3.0 4.0 9.0 10.0 9.0 10.0
d NaN NaN 11.0 12.0 NaN NaN
e 5.0 6.0 13.0 14.0 11.0 12.0
f NaN NaN NaN NaN 16.0 17.0

Concatenating along an Axis

Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy's concatenate function performs this on NumPy arrays.

In [62]:
arr = np.arange(12).reshape((3,4))
arr
Out[62]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [63]:
np.concatenate([arr, arr], axis=1)
Out[63]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In pandas objects we have labeled axes, which let us generalize array concatenation. In particular, there are a few things to think about before concatenating pandas objects:

 1. If the objects are indexed differently on the other axis, should we combine the distinct elements along that axis or use only the shared values (the intersection)?
 2. Do the concatenated chunks of data need to be identifiable in the resulting object?
 3. Does the 'concatenation axis' contain data that needs to be preserved? In many cases, the default integer labels are best discarded during concatenation.

The concat function provides a consistent way to address each of these concerns. For Series with no index overlap, concat simply glues together the values and indexes. By default it works along axis=0; if we pass axis=1 (columns) instead, the result will be a DataFrame.

In [66]:
s1 = pd.Series([0,1], index=['a', 'b'])
s2 = pd.Series([2,3,4], index=['c', 'd', 'e'])
s3 = pd.Series([5,6], index=['f', 'g'])
In [67]:
pd.concat([s1, s2, s3])
Out[67]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
In [68]:
pd.concat([s1, s2, s3], axis=1)
Out[68]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0

In this case there is no overlap on the other axis, and we get the 'sorted union' (outer join) of the indexes. You can instead intersect them by passing join='inner'. To specify exactly which labels to use on the other axes, use the join_axes argument.

In [70]:
s4 = pd.concat([s1, s3])
s4
Out[70]:
a    0
b    1
f    5
g    6
dtype: int64
In [72]:
pd.concat([s1, s4], axis=1)
Out[72]:
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
In [73]:
pd.concat([s1, s4], axis=1, join='inner')
Out[73]:
0 1
a 0 0
b 1 1
In [74]:
pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
Out[74]:
0 1
a 0.0 0.0
c NaN NaN
b 1.0 1.0
e NaN NaN

A potential issue is that the concatenated pieces are not identifiable in the result. To create a hierarchical index along the concatenation axis, use the keys argument. When combining along axis=1, the keys instead become the column headers; this holds for both Series and DataFrames. If you pass a dict of objects instead of a list, the dict's keys will be used for the keys option. There are additional arguments governing how the hierarchical index is created; for example, you can name the created index levels with the names argument.

In [76]:
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
result
Out[76]:
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
In [77]:
result.unstack()
Out[77]:
a b c d e f g
one 0.0 1.0 NaN NaN NaN NaN NaN
two NaN NaN 2.0 3.0 4.0 NaN NaN
three NaN NaN NaN NaN NaN 5.0 6.0
In [78]:
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
Out[78]:
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
In [80]:
df1 = pd.DataFrame(np.arange(6).reshape(3,2), index=['a', 'b', 'c'],
                  columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2,2), index=['a', 'c'],
                  columns=['three', 'four'])

df1
Out[80]:
one two
a 0 1
b 2 3
c 4 5
In [81]:
df2
Out[81]:
three four
a 5 6
c 7 8
In [82]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
Out[82]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
In [83]:
pd.concat({'level1':df1, 'level2': df2}, axis=1)
Out[83]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
In [84]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
         names=['upper', 'lower'])
Out[84]:
upper level1 level2
lower one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0

A last consideration concerns DataFrames in which the row index does not contain any relevant data. In that case, pass ignore_index=True so the result gets a fresh default index.

In [86]:
df1 = pd.DataFrame(np.random.randn(3,4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2,3), columns=['b', 'd', 'a'])
df1
Out[86]:
a b c d
0 -0.588973 1.051040 1.465350 2.065064
1 0.629663 0.600635 -0.489396 0.897867
2 -1.732375 -0.397218 -1.519736 -1.329098
In [87]:
df2
Out[87]:
b d a
0 0.279414 -1.949706 -0.988094
1 1.090094 1.267435 -0.928241
In [88]:
pd.concat([df1, df2], ignore_index=True)
Out[88]:
a b c d
0 -0.588973 1.051040 1.465350 2.065064
1 0.629663 0.600635 -0.489396 0.897867
2 -1.732375 -0.397218 -1.519736 -1.329098
3 -0.988094 0.279414 NaN -1.949706
4 -0.928241 1.090094 NaN 1.267435

Combining Data with Overlap

Another data combination situation involves two datasets whose indexes overlap in full or in part. NumPy's where function performs the array-oriented equivalent of an if-else expression; pandas has combine_first for the same purpose. With Series, combine_first performs the equivalent operation but with pandas' usual data alignment logic; with DataFrames, it does the same thing column by column. You can think of it as 'patching' missing data in the calling object with values from the object you pass.

In [90]:
a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
             index=['f','e','d','c','b','a'])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
             index=['a','b','c','d','e','f'])

a
Out[90]:
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64
In [91]:
b
Out[91]:
a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64
In [93]:
np.where(pd.isnull(a), b, a)
Out[93]:
array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
In [94]:
b.combine_first(a)
Out[94]:
a    0.0
b    4.5
c    2.0
d    0.0
e    2.5
f    5.0
dtype: float64
In [96]:
df1 = pd.DataFrame({'a':[1., np.nan, 5., np.nan],
                   'b':[np.nan, 2., np.nan, 6.],
                   'c':range(2,18,4)})
df2 = pd.DataFrame({'a':[5., 4., np.nan, 3., 7.],
                   'b':[np.nan, 3., 4., 6., 8.]})

df1
Out[96]:
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
In [97]:
df2
Out[97]:
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
In [98]:
df1.combine_first(df2)
Out[98]:
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
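Note that the direction of the call matters: the calling object's values take priority wherever both objects have data. As a small illustrative sketch (reusing the df1 and df2 just defined), swapping the caller and the argument makes df2's values win and uses df1 only to patch df2's holes:

# Compare with df1.combine_first(df2) above: here df2's non-null values
# are kept and df1 only fills in where df2 is missing.
df2.combine_first(df1)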

Reshaping and PivotingΒΆ

There are a number of basic operations for rearranging tabular data. These are called 'reshape' or 'pivot' operations.

Reshaping with Hierarchical IndexingΒΆ

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are 2 primary actions:

 1. stack - 'rotates' or pivots from the columns in the data to the rows.
 2. unstack - pivots from the rows into the columns.

Calling 'stack' pivots the columns into the rows, producing a Series. From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with 'unstack'.

In [101]:
data = pd.DataFrame(np.arange(6).reshape((2,3)),
                   index=pd.Index(['Ohio', 'Colorado'], name='state'),
                   columns = pd.Index(['one', 'two', 'three'],
                                     name = 'number'))
data
Out[101]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
In [103]:
result = data.stack()
result
Out[103]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
In [104]:
result.unstack()
Out[104]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5

By default the innermost level gets 'stacked' or 'unstacked'. You can unstack a different level by passing a level number or name. Unstacking may introduce missing data if all of the values in the level aren't found in each of the subgroups.

In [106]:
result.unstack(0)
Out[106]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
In [107]:
result.unstack('state')
Out[107]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
In [108]:
s1 = pd.Series([0,1,2,3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4,5,6], index=['c','d','e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

data2
Out[108]:
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
In [109]:
data2.unstack()
Out[109]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0

Stacking filters out missing data by default, so the operation is easily invertible (pass dropna=False to keep the NAs). When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result. When calling stack or unstack, you can indicate the level to operate on by name.

In [111]:
data2.unstack()
Out[111]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
In [113]:
data2.unstack().stack()
Out[113]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
In [114]:
data2.unstack().stack(dropna=False)
Out[114]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
In [115]:
df = pd.DataFrame({'left': result, 'right': result+5},
                 columns=pd.Index(['left', 'right'], name='side'))
df
Out[115]:
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
In [116]:
df.unstack('state')
Out[116]:
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
In [117]:
df.unstack('state').stack('side')
Out[117]:
state Colorado Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7

Pivoting "Long" to "Wide" FormatΒΆ

A common way to store multiple time series is the so-called 'long' or 'stacked' format. One way to build the time index is with a 'PeriodIndex', which combines the year and quarter columns to create a kind of time interval type; we then use these intervals to index the data.
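Before working with the full dataset, here is a minimal, self-contained sketch of how a PeriodIndex combines separate year and quarter columns into quarterly periods (the values are made up for illustration, and the field-based constructor matches the pandas version used in these examples):

import pandas as pd

# Hypothetical year/quarter columns combined into quarterly periods
# (1959Q1 through 1959Q4).
periods = pd.PeriodIndex(year=[1959, 1959, 1959, 1959],
                         quarter=[1, 2, 3, 4], name='date')

# Periods can be converted to concrete timestamps, e.g. quarter-end dates.
periods.to_timestamp('D', 'end')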

In [124]:
data = pd.read_csv('examples/macrodata.csv')
data.head()
Out[124]:
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
In [125]:
data.columns
Out[125]:
Index(['year', 'quarter', 'realgdp', 'realcons', 'realinv', 'realgovt',
       'realdpi', 'cpi', 'm1', 'tbilrate', 'unemp', 'pop', 'infl', 'realint'],
      dtype='object')
In [126]:
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
                        name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D', 'end')
ldata = data.stack().reset_index().rename(columns={0:'value'})
In [128]:
ldata[:5]
Out[128]:
date item value
0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340

This layout is called long format for multiple time series, or for other observational data with two or more keys. Each row in the table represents a single observation. Data is frequently stored this way in relational databases because a fixed schema allows the number of distinct values in the 'item' column to change as more data is added to the table. In the previous example, date and item would be the primary keys, offering both relational integrity and easier joins. In some cases, though, the data is harder to work with in this form; you might prefer a DataFrame with one column per distinct 'item' value, indexed by the timestamps in the date column. That is exactly what the 'pivot' method provides.

In [134]:
pivoted = ldata.pivot('date', 'item', 'value')
pivoted[:10]
Out[134]:
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
1960-06-30 0.14 2834.390 5.2
1960-09-30 2.70 2839.022 5.6
1960-12-31 1.21 2802.616 6.3
1961-03-31 -0.40 2819.264 6.8
1961-06-30 1.47 2872.005 7.0
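The same reshape can also be written with explicit keyword arguments, which some find more readable; a sketch assuming the ldata frame built above:

# Equivalent to ldata.pivot('date', 'item', 'value'), with the role of
# each column spelled out.
pivoted = ldata.pivot(index='date', columns='item', values='value')
pivoted[:10]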

The first two values passed to pivot are the columns to use as the row and column index, and the optional last value is the column to fill the DataFrame with. Suppose you had two value columns that you wanted to reshape simultaneously: by omitting the last argument you get a DataFrame with hierarchical columns. Note that pivot is equivalent to creating a hierarchical index with 'set_index' followed by a call to 'unstack'.

In [136]:
ldata['value2'] = np.random.randn(len(ldata))
ldata[:5]
Out[136]:
date item value value2
0 1959-03-31 realgdp 2710.349 -0.129071
1 1959-03-31 infl 0.000 -0.997675
2 1959-03-31 unemp 5.800 -0.388757
3 1959-06-30 realgdp 2778.801 -0.892829
4 1959-06-30 infl 2.340 -1.909896
In [137]:
pivoted = ldata.pivot('date', 'item')
pivoted[:5]
Out[137]:
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 -0.997675 -0.129071 -0.388757
1959-06-30 2.34 2778.801 5.1 -1.909896 -0.892829 0.348295
1959-09-30 2.74 2775.488 5.3 -0.327600 -0.433838 0.236335
1959-12-31 0.27 2785.204 5.6 -0.903713 0.328514 -1.249765
1960-03-31 2.31 2847.699 5.2 -1.634700 -0.421725 -0.299042
In [138]:
pivoted['value'][:5]
Out[138]:
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
In [139]:
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked[:5]
Out[139]:
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 -0.997675 -0.129071 -0.388757
1959-06-30 2.34 2778.801 5.1 -1.909896 -0.892829 0.348295
1959-09-30 2.74 2775.488 5.3 -0.327600 -0.433838 0.236335
1959-12-31 0.27 2785.204 5.6 -0.903713 0.328514 -1.249765
1960-03-31 2.31 2847.699 5.2 -1.634700 -0.421725 -0.299042
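Going back the other way only takes a stack: stacking the 'item' level of the hierarchical columns and resetting the index recovers a long-format table. A minimal sketch assuming the unstacked frame from the previous cell (the melt function in the next section offers another route):

# Stack the 'item' column level back into the rows and turn the
# resulting (date, item) index back into ordinary columns.
long_again = unstacked.stack('item').reset_index()
long_again[:5]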

Pivoting "Wide" to "Long" FormatΒΆ

An inverse operation to 'pivot' is 'pandas.melt'. Rather than transforming one column into many, it merges multiple columns into one, producing a DataFrame that is longer than the input. When using pandas.melt, you must indicate which columns, if any, are group indicators. In the following example, 'key' is the only group indicator.

In [145]:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                 'A': [1,2,3],
                 'B': [4,5,6],
                 'C': [7,8,9]})
df
Out[145]:
key A B C
0 foo 1 4 7
1 bar 2 5 8
2 baz 3 6 9
In [146]:
melted = pd.melt(df, ['key'])
melted
Out[146]:
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
6 foo C 7
7 bar C 8
8 baz C 9

Using pivot, we can reshape back to the original layout. Since pivot creates an index from the column used as the row labels, we may want to use 'reset_index' to move the data back into a column. You can also be explicit about which columns are indicators and which are values with the 'id_vars' and 'value_vars' parameters, respectively, and pandas.melt can be used without any group identifiers at all.

In [148]:
reshaped = melted.pivot('key', 'variable', 'value')
reshaped
Out[148]:
variable A B C
key
bar 2 5 8
baz 3 6 9
foo 1 4 7
In [149]:
reshaped.reset_index()
Out[149]:
variable key A B C
0 bar 2 5 8
1 baz 3 6 9
2 foo 1 4 7
In [150]:
pd.melt(df, id_vars=['key'], value_vars=['A','B'])
Out[150]:
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
In [151]:
pd.melt(df, value_vars=['A', 'B', 'C'])
Out[151]:
variable value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
In [152]:
pd.melt(df, value_vars=['key', 'A', 'B'])
Out[152]:
variable value
0 key foo
1 key bar
2 key baz
3 A 1
4 A 2
5 A 3
6 B 4
7 B 5
8 B 6
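One last detail: the 'variable' and 'value' column names in the melted output are just defaults. The var_name and value_name arguments let you pick more descriptive names; a small sketch using the same df (the names 'field' and 'measurement' are arbitrary):

# Rename the columns generated by melt.
pd.melt(df, id_vars=['key'], value_vars=['A', 'B', 'C'],
        var_name='field', value_name='measurement')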

Source:


McKinney, W. (2013). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media. ISBN: 9789351100065.
