A large portion of data analysis is taken up by preparation: loading, cleaning, transforming and rearranging. These tasks can take up more than 80% of an analyst's time, because the data stored in files or databases is often not in the right format for analysis. Researchers often do ad-hoc processing of data from one form to another using languages like R or Python. pandas, together with built-in Python features, provides a high-level, flexible and fast set of tools for manipulating data into the right form.
The way missing data is represented in pandas is imperfect but functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number). It is called a sentinel value and can be easily detected.
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
string_data.isnull()
This NA convention was adopted from R, where missing values are referred to as NA. In statistics, NA may refer either to data that does not exist or to data that exists but was not observed. When cleaning data, we should also analyse the missing data itself to identify data collection problems or potential bias caused by the missing values.
string_data[0] = None
string_data.isnull()
We always have the option to filter out missing data by hand using 'isnull' and boolean indexing, but 'dropna' can be pretty useful too. For a Series it returns the Series with only the non-null data and the corresponding index values. For a DataFrame it is a bit more complex: by default dropna drops any row that contains even one missing value. Passing "how='all'" drops only rows that are all NA. To drop columns instead, pass 'axis=1'.
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
data[data.notnull()]
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
cleaned
data.dropna(how='all')
data[4] = NA
data
data.dropna(axis=1, how='all')
A related way to filter DataFrame rows tends to concern time series data. To keep only rows containing a certain number of observations, use the 'thresh' argument: rows with fewer than thresh non-NA values are dropped.
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
df.dropna()
df.dropna(thresh=2)
Rather than dropping NAs and possibly discarding useful data along with them, we can also fill in the missing values in different ways. 'fillna' is the workhorse function for this: the constant we pass replaces the missing values. If we call fillna with a dict, we can use a different fill value for each column.
df.fillna(0)
df.fillna({1:0.5, 2: 0})
By default fillna returns a new object, but we can modify the existing object in-place with 'inplace=True'. The interpolation methods used for reindexing, such as 'ffill', can also be used with fillna. With a bit of creativity you can do much more, like filling with the mean or median value.
_ = df.fillna(0, inplace=True)
df
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
df.fillna(method='ffill')
df.fillna(method='ffill', limit=2)
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
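The same pattern works for the median; a minimal one-line sketch:
data.fillna(data.median())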
So far we have been concerned with handling missing data and rearranging data. Filtering, cleaning and other transformations are another class of important operations.
The DataFrame method 'duplicated' returns a boolean Series indicating whether each row is a duplicate (i.e. has been observed in a previous row). Relatedly, 'drop_duplicates' returns a DataFrame containing only the rows where the 'duplicated' array is False.
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1,1,2,3,3,4,4]})
data
data.duplicated()
data.drop_duplicates()
Both of the above methods consider all of the columns by default; you can also specify any subset of columns to detect duplicates on. By default both keep the first observed value combination; pass "keep='last'" to keep the last one instead.
data['v1'] = range(7)
data.drop_duplicates(['k1'])
data.drop_duplicates(['k1', 'k2'], keep='last')
We sometimes need to perform transformations based on the values in an array, Series or DataFrame column. The map method accepts a function or a dict-like object containing the mapping, and can be used to add or modify a column. Sometimes the column we map on has inconsistent casing relative to the keys in our mapping; in that case we can first convert all the values to lowercase, or simply pass map a function that does both steps for us.
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
lowercased = data['food'].str.lower()
lowercased
data['animal'] = lowercased.map(meat_to_animal)
data
data['food'].map(lambda x: meat_to_animal[x.lower()])
The fillna method is a special case of more general value replacement. While map can be used to modify a subset of values, 'replace' provides a simpler and more flexible way to do so. Passing the sentinel (or garbage) value followed by the replacement value creates a new object with the values replaced; for in-place replacement, pass "inplace=True" (see the sketch after the examples below).
data = pd.Series([1., -999., 2., -999., 3.])
data
data.replace(-999, np.nan)
To replace multiple values with a single value, pass a list of values followed by the substitute value. To use a different replacement for each value, pass a list of substitutes of the same length. You can also pass a dict mapping each value to its substitute. NOTE - 'data.replace' is different from 'data.str.replace'; the latter performs element-wise string substitution.
data.replace([-999, -1000], np.nan)
data.replace([-999, -1000], [np.nan, 0])
data.replace({-999: np.nan, -1000: 0})
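As mentioned above, 'replace' can also modify the Series in place instead of returning a new object. A quick sketch (note that with inplace=True the call itself returns None):
data.replace(-999, np.nan, inplace=True)
data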
Like values, axis labels can also be transformed by a function or mapping to produce new, differently labeled objects. We can also modify the axes in-place without creating a new data structure.
data = pd.DataFrame(np.arange(12).reshape((3,4)),
index = ['Ohio', 'Colorado', 'New York'],
columns = ['one', 'two', 'three', 'four'])
transform = lambda x: x[:4].upper()
data.index.map(transform)
data.index = data.index.map(transform)
data
To get a transformed version of a dataset without modifying the original, use 'rename'. It can be used with a function or with a dict-like object providing new values for a subset of the axis labels, and saves you from copying the DataFrame manually and then assigning its index and columns. To modify the dataset in-place, pass 'inplace=True'.
data.rename(index=str.title, columns=str.upper)
data.rename(index={'OHIO':"INDIANA"},
columns = {'three':'peekaboo'})
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
Continuous data is often discretized or otherwise separated into 'bins' for analysis. To bin a set of continuous data, use pandas' 'cut' function. In the example below we bin a set of ages into groups of 18 to 25, 26 to 35, 36 to 60 and 61 and older.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
cut returns a special Categorical object. The output describes the bin that each element falls in, and can be treated like an array of bin names. Internally it contains a 'categories' array specifying the distinct bin names along with a labeling of the 'ages' data in the 'codes' attribute.
cats.codes
cats.categories
pd.value_counts(cats)
The interval notation used by cut is consistent with standard mathematical notation: a parenthesis means that side is open (exclusive) and a square bracket means it is closed (inclusive). We can change which side is closed by passing 'right=False', and supply our own bin names by passing a list or array to the 'labels' option.
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
group_names = ['Youth', 'YoungAdult', 'MiddleAges', 'Senior']
pd.cut(ages, bins, labels=group_names)
Instead of specifying explicit bin edges, we can pass an integer number of bins to get that many equal-length bins computed from the minimum and maximum of the data. The 'precision' option limits the decimal precision of the bin edges; 'precision=2' limits it to two digits.
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
cut has a closely related function, 'qcut', which bins data based on sample quantiles. Depending on the distribution of the data, cut will usually not give each bin the same number of data points, but because qcut uses sample quantiles you will obtain roughly equal-size bins. We can also pass our own quantiles (numbers between 0 and 1, inclusive) to qcut.
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
pd.value_counts(cats)
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Filtering and transforming outliers is largely a matter of applying array operations. To find values exceeding a threshold, use boolean indexing, combined with functions like 'np.abs()' as needed. To select all rows having at least one value exceeding a threshold, use the 'any(1)' method on the boolean DataFrame. Values can also be set based on these criteria, which lets you cap values to an interval or threshold. 'np.sign()' returns 1 and -1 depending on whether the data is positive or negative, which is handy for capping.
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
col = data[2]
col[np.abs(col) > 3]
data[(np.abs(data) > 3).any(1)]
# Capping outside -3 to 3
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
np.sign(data).head()
Permuting (randomly reordering) a Series or the rows of a DataFrame is easy using 'numpy.random.permutation'. Calling it with the length of the axis you want to permute creates an array of integers indicating the new ordering. That array can then be used with iloc-based indexing or the equivalent 'take' function (the iloc variant is sketched after the example below).
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
sampler = np.random.permutation(5)
sampler
df
df.take(sampler)
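The iloc-based equivalent mentioned above is a one-liner:
df.iloc[sampler]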
Use the 'sample' method to select a random subset without replacement. To generate a sample with replacement (i.e. allowing repeat choices), pass 'replace=True'.
df.sample(n=3)
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
Another type of transformation is converting a categorical variable into a 'dummy' or 'indicator' matrix. If a column has k distinct values, we can derive a matrix or DataFrame with k columns of 1s and 0s; pandas' 'get_dummies' function does this. You may want to add a prefix to the columns of the indicator DataFrame before merging it with other data, which is what get_dummies' 'prefix' argument is for.
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
pd.get_dummies(df['key'])
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
If a row can belong to multiple categories, creating the dummy variables is more complicated. Adding indicator variables for a column like the movie genres below takes a bit of wrangling. First we extract the list of unique genres in the dataset.
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
header=None, names=mnames)
movies[:10]
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
One way to build the indicator DataFrame is to start with a DataFrame of all zeros, then iterate through the rows and set the appropriate entries of 'dummies' to 1. To do this, we use 'dummies.columns.get_indexer' to compute the column indices for each genre, and then use iloc to set the values at those indices. Finally we combine the indicator DataFrame with the original DataFrame.
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen = movies.genres[0]
gen.split('|')
dummies.columns.get_indexer(gen.split('|'))
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
NOTE - For much larger data, this method of constructing indicator variables is not especially fast. It is better to write a lower-level function that writes directly to a NumPy array and then wraps the result in a DataFrame (a sketch of that idea follows the next example). Another useful recipe in statistical applications is to combine get_dummies with a discretization function like cut, which indicates whether a value falls in each bin. In the example below we set the random seed to make the output deterministic.
np.random.seed(12345)
values = np.random.rand(10)
values
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
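Returning to the performance note above, here is a minimal sketch of the lower-level approach, assuming the same movies data; 'make_indicators' is a hypothetical helper, not a pandas function:
def make_indicators(values, sep='|'):
    # split each entry on the separator once, up front
    splitted = [x.split(sep) for x in values]
    # distinct categories across all rows
    categories = pd.unique(np.concatenate(splitted))
    col_index = {cat: i for i, cat in enumerate(categories)}
    # write 1s directly into a preallocated NumPy array
    indicators = np.zeros((len(values), len(categories)), dtype=np.uint8)
    for row, cats in enumerate(splitted):
        for cat in cats:
            indicators[row, col_index[cat]] = 1
    # wrap the result in a DataFrame only at the end
    return pd.DataFrame(indicators, columns=categories, index=values.index)
make_indicators(movies.genres).iloc[0]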
Python is a popular language for raw data manipulation partly because of its ease of use for string and text processing. Most simple text operations can be handled with string objects' built-in methods; for more complex pattern matching and text manipulation, regular expressions are needed. pandas adds to this by enabling you to apply string and regex functions to whole arrays of data.
In many string manipulation scenarios, the built-in string methods are sufficient. A string can be broken into pieces on a separator using 'split', which is often combined with 'strip' to trim whitespace, including line breaks. Substrings can be concatenated with the '+' operator, but a faster and more Pythonic way is to pass a list or tuple of substrings to the 'join' method of the 'stitching' string.
val = 'a,b, guido'
val.split(',')
pieces = [x.strip() for x in val.split(',')]
pieces
first, second, third = pieces
first + '::' + second + '::' + third
'::'.join(pieces)
The 'in' keyword is the best way to detect a substring, though 'index' and 'find' can also be used. The major difference between find and index is that index raises an exception if the substring is not found, whereas find returns -1. 'count' returns the number of occurrences of a particular substring, and 'replace' substitutes occurrences of one pattern for another; it is commonly used to delete patterns by passing an empty string as the replacement.
'guido' in val
val.index(',')
val.find(':')
# val.index(':')
val.count(',')
val.replace(',', '::')
val.replace(',', '')
Regular expressions provide a flexible way to search or match (often complex) string patterns in text. A single expression, commonly called a 'regex', is a string formed according to the regular expression language. Python's built-in 're' module applies regular expressions to strings; its functions fall into three related categories: pattern matching, substitution and splitting. We first describe a pattern we want to locate in the text, which can then be used for all of these purposes.
Once we define a regex, we can split a piece of text with 're.split'. When this is called, the regex is first compiled and then its split method is called on the passed text. We can compile the regex ourselves with 're.compile', which creates a reusable regex object. Compiling the regex beforehand is highly recommended if it will be applied to many strings, as it saves CPU cycles. NOTE - To avoid awkward escaping of backslashes in a regex, use raw string literals, e.g. r'C:\x' instead of 'C:\\x'.
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)
regex = re.compile(r'\s+')
regex.split(text)
The 'findall' method returns a list of all the matches of the regex in a string. To get only the first match, use the 'search' method. 'match' only matches at the beginning of the string, i.e. it checks whether the string starts with the pattern. Passing 're.IGNORECASE' makes the regex case-insensitive.
regex.findall(text)
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
'findall' returns a list of the matching strings. 'search' returns a special match object for the first match, which tells us only the start and end positions of the match within the string. 'match' returns None if the pattern does not match at the beginning of the string, as is the case here. 'sub' returns a new string with occurrences of the pattern replaced by the given replacement string.
regex.findall(text)
m = regex.search(text)
m
text[m.start():m.end()]
print(regex.match(text))
print(regex.sub('REDACTED', text))
To find the pattern and at the same time segment each match into its parts, put parentheses around the parts of the pattern to form groups. When the pattern has groups, 'findall' returns a list of tuples. A match object exposes the groups of a match through its 'groups' method. 'sub' can also access the groups of each match with the special symbols '\1', '\2', and so on, where '\1' corresponds to the first matched group.
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
m = regex.match('wes@bright.net')
m.groups()
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Missing data in string columns complicates data cleaning. We could apply string or regex methods to each value using data.map (with a lambda or other function), but that will fail on the NA values (see the illustration below). To cope with this, Series has array-oriented methods for string operations that skip NA values; they are accessed through the 'str' attribute.
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
data.isnull()
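To see why these specialized methods matter, note that mapping a plain string method over this Series fails on the missing value (the failing call is commented out, as with val.index earlier), while the str version skips it:
# data.map(lambda x: x.upper())  # raises AttributeError because of the NaN
data.str.upper()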
We can check whether each string contains a pattern with 'str.contains'. Regular expressions can be used as well, together with options such as 'IGNORECASE' (a small example follows the code below).
data.str.contains('gmail')
pattern
data.str.findall(pattern, flags=re.IGNORECASE)
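A short sketch showing that str.contains also accepts a regex and flags such as IGNORECASE (recent pandas versions may warn that this pattern contains match groups):
data.str.contains(pattern, flags=re.IGNORECASE)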
For vectorized element retrieval we can use either 'str.get' or indexing into the 'str' attribute. To access elements in the embedded lists, we pass an index to either of these. Strings can also be sliced with this syntax.
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
matches.str.get(1)
matches.str[0]
data.str[:5]
In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not easy to analyze. Knowing how to combine, join and rearrange data is an important part of the data analyst's toolkit.
Hierarchical indexing is an important pandas feature that enables you to have multiple index levels on an axis. It provides a way to work with higher-dimensional data in a lower-dimensional form. When looking at a Series or DataFrame with a MultiIndex, you will see "gaps" in the outer level of the display, which mean "same label as the one above".
import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(9),
index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
data.index
With hierarchically indexed objects, you can use 'partial indexing' to concisely select subsets of the data. Selection is also possible on an "inner" level of the index.
data['b']
data['b':'c']
data.loc[['b', 'd']]
data.loc[:,2]
Hierarchical indexing plays an important role in reshaping data and in group-based operations such as forming a pivot table. For example, you can rearrange this Series into a DataFrame with its 'unstack' method; the inverse operation of unstack is 'stack'.
data.unstack()
data.unstack().stack()
In a DataFrame, either axis can have a hierarchical index. The index levels can have names, which will be shown in the console output. NOTE - Be careful not to mix up the index names with the row labels. With partial column indexing we can select groups of columns. A 'MultiIndex' can also be created by itself and then reused (see the sketch after the example below).
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
index=[['a', 'a', 'b', 'b'], [1,2,1,2]],
columns = [['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
frame
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
frame['Ohio']
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
names = ['state', 'color'])
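Such a MultiIndex can then be reused, for example by passing it straight to a new DataFrame; a small sketch:
columns = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                     ['Green', 'Red', 'Green']],
                                    names=['state', 'color'])
pd.DataFrame(np.arange(12).reshape((4, 3)), columns=columns)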
Sometimes we need to rearrange the order of the levels on an axis or to sort the data by the values in one specific level. 'swaplevel' takes two level numbers or names and returns a new object with those levels interchanged (the data is otherwise unaltered). 'sort_index' sorts the data using the values in a single level; it is often used after swapping levels so that the result is lexicographically sorted. NOTE - Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level.
frame.swaplevel('key1', 'key2')
frame.sort_index(level=1)
frame.swaplevel(0,1).sort_index(level=0)
Many descriptive and summary statistics on DataFrame and Series have a 'level' option, where you specify the level you want to aggregate by on a particular axis; aggregation can be done on either rows or columns. Under the hood, this uses pandas' groupby machinery.
frame.sum(level='key2')
frame.sum(level='color', axis=1)
We may want to use one or more columns of a DataFrame as its row index, or alternatively move the row index into the columns. The 'set_index' method creates a new DataFrame using one or more of its columns as the index. By default those columns are removed from the DataFrame, but we can also leave them in with 'drop=False'. 'reset_index' does the opposite of set_index: it moves the hierarchical index levels into columns.
frame = pd.DataFrame({'a': range(7), 'b':range(7,0,-1),
'c':['one', 'one', 'one', 'two', 'two',
'two', 'two'],
'd': [0,1,2,0,1,2,3]})
frame
frame2 = frame.set_index(['c', 'd'])
frame2
frame.set_index(['c','d'], drop=False)
frame2.reset_index()
Data contained in pandas objects can be combined in several ways:
1. 'pandas.merge' connects rows in DataFrames based on one or more keys. This is similar to a 'join' operation in SQL.
2. 'pandas.concat' concatenates or 'stacks' together objects along an axis.
3. The 'combine_first' method enables splicing together overlapping data to fill in missing values in one object with values from another.
Merge or join operations combine datasets by linking rows using one or more keys; they are central to relational databases. The 'merge' function in pandas is the main entry point for these operations. A many-to-one join is one where one object has multiple rows with the same key value while the other has only one row per key; the result contains a combination of the rows from both objects.
df1 = pd.DataFrame({'key': ['b','b','a','c','a','a','b'],
'data1':range(7)})
df2 = pd.DataFrame({'key':['a','b','d'],
'data2': range(3)})
df1
df2
pd.merge(df1, df2)
It is not strictly necessary to specify which column to join on; if omitted, 'merge' uses the overlapping column names as the keys. It is good practice to specify the key explicitly, though. If the key column names differ between the two objects, you must specify them explicitly with 'left_on' and 'right_on'. By default merge does an 'inner' join, so the keys in the result are the intersection, i.e. the common set found in both objects. The other possible join types are 'left', 'right' and 'outer'; an outer join takes the union of the keys, combining the effect of the left and right joins.
pd.merge(df1, df2, on='key')
df3 = pd.DataFrame({'lkey': ['b','b','a','c','a','a','b'],
'data1':range(7)})
df4 = pd.DataFrame({'rkey':['a','b','d'],
'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
pd.merge(df1, df2, how='outer')
Many-to-many merges have well-defined, though not very intuitive, behaviour: they form the Cartesian product of the matching rows. For example, three rows on the left and two rows on the right with the same key produce six rows in the result.
df1 = pd.DataFrame({'key': ['b','b','a','c','a', 'b'],
'data1':range(6)})
df2 = pd.DataFrame({'key':['a','b','a','b','d'],
'data2': range(5)})
df1
df2
pd.merge(df1, df2, on='key', how='left')
pd.merge(df1, df2, how='inner')
To merge on multiple keys, pass a list of column names. To reason about which key combinations appear in the result, think of an array of tuples formed from the multiple keys being used as a single join key. NOTE - When joining columns on columns, the indexes of the passed DataFrame objects are discarded. Merging may also produce overlapping column names; this can be handled manually by renaming the axis labels, but 'merge' also has a 'suffixes' option for specifying strings to append to the overlapping names in the left and right objects.
left = pd.DataFrame({'key1':['foo', 'foo', 'bar'],
'key2':['one', 'two', 'one'],
'lval':[1, 2, 3]})
right = pd.DataFrame({'key1':['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1', 'key2'], how='outer')
pd.merge(left, right, on='key1')
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
In some scenarios, the merge key(s) are found in the index. You can pass 'left_index=True' or 'right_index=True' (or both) to indicate that the index should be used as the merge key. The default merge method intersects the join keys, but we can instead form their union with an outer join.
left1 = pd.DataFrame({'key':['a','b','a','a','b','c'],
'value':range(6)})
right1 = pd.DataFrame({'group_val':[3.5, 7]}, index=['a','b'])
left1
right1
pd.merge(left1, right1, left_on='key', right_index=True)
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Things are more complicated with hierarchically indexed data, since joining on the index is implicitly a multiple-key merge. You indicate the multiple columns to merge on as a list, and handle the duplicate index values with "how='outer'". Using the indexes of both sides of the merge is also possible.
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'key2': [2000, 2001, 2002, 2001, 2002],
'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6,2)),
index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
'Ohio', 'Ohio'],
[2001, 2000, 2000, 2000, 2001, 2002]],
columns=['event1', 'event2'])
lefth
righth
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
pd.merge(lefth, righth, left_on=['key1', 'key2'],
right_index=True, how='outer')
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
index=['a', 'c', 'e'],
columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
index=['b','c','d','e'],
columns=['Missouri','Alabama'])
left2
right2
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
DataFrame has a 'join' method for merging by index. It can also be used to combine several DataFrame objects that have similar indexes but non-overlapping columns. Partly for legacy reasons, join performs a left join on the keys by default, preserving the calling frame's row index. It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame. For simple index-on-index merges, you can pass a list of DataFrames to join as an alternative to using the 'concat' function.
left2.join(right2, how='outer')
left1.join(right1, on='key')
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
index=['a', 'c', 'e', 'f'],
columns=['New York', 'Oregon'])
another
left2.join([right2, another])
left2.join([right2, another], how='outer')
Another kind of data combination operation is referred to interchangeably as concatenation, binding or stacking. NumPy's 'concatenate' function performs this on NumPy arrays.
arr = np.arange(12).reshape((3,4))
arr
np.concatenate([arr, arr], axis=1)
pandas objects have labeled axes, which let us generalize array concatenation. There are a few things to think about before concatenating pandas objects:
1. If the two objects are indexed differently on the other axes, should we combine the distinct elements (union) or use only the shared values (intersection) along those axes?
2. Do the concatenated chunks need to be identifiable in the resulting object?
3. Does the 'concatenation axis' contain data that needs to be preserved? In many cases, default integer labels are best discarded during concatenation.
The 'concat' function provides a consistent way to address these issues. For Series whose indexes do not overlap, concat simply glues together the values and indexes. By default it works along 'axis=0', producing another Series; if we pass 'axis=1' (the columns), the result is a DataFrame.
s1 = pd.Series([0,1], index=['a', 'b'])
s2 = pd.Series([2,3,4], index=['c', 'd', 'e'])
s3 = pd.Series([5,6], index=['f', 'g'])
pd.concat([s1, s2, s3])
pd.concat([s1, s2, s3], axis=1)
When there is no overlap on the other axis (as in the example above), the result is the sorted union (the 'outer' join) of the indexes. You can get the intersection instead by passing "join='inner'". You can even specify the exact labels to use on the other axes with 'join_axes'.
s4 = pd.concat([s1, s3])
s4
pd.concat([s1, s4], axis=1)
pd.concat([s1, s4], axis=1, join='inner')
pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
A related issue is identifying the concatenated pieces in the result. To create a hierarchical index on the concatenation axis, use the 'keys' argument. When concatenating along axis=1, the keys become the column headers; this holds for both Series and DataFrames. If we pass a dict of objects instead of a list, the dict's keys are used in place of the 'keys' option. There are additional arguments governing how the hierarchical index is created, e.g. the 'names' argument names the created axis levels.
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
result
result.unstack()
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
df1 = pd.DataFrame(np.arange(6).reshape(3,2), index=['a', 'b', 'c'],
columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2,2), index=['a', 'c'],
columns=['three', 'four'])
df1
df2
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
pd.concat({'level1':df1, 'level2': df2}, axis=1)
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
names=['upper', 'lower'])
A last consideration concerns DataFrames in which the row index does not contain any relevant data. To discard it in the result, pass 'ignore_index=True'.
df1 = pd.DataFrame(np.random.randn(3,4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2,3), columns=['b', 'd', 'a'])
df1
df2
pd.concat([df1, df2], ignore_index=True)
Another data combination situation involves two datasets whose indexes overlap fully or partly. NumPy's 'where' function performs the array-oriented equivalent of an if-else expression; pandas has 'combine_first' for this. On Series it performs the equivalent operation but with pandas' data alignment logic, and on DataFrames it does the same column by column. You can think of it as "patching" missing data in the calling object with values from the object you pass.
a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
index=['f','e','d','c','b','a'])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
index=['a','b','c','d','e','f'])
a
b
np.where(pd.isnull(a), b, a)
b.combine_first(a)
df1 = pd.DataFrame({'a':[1., np.nan, 5., np.nan],
'b':[np.nan, 2., np.nan, 6.],
'c':range(2,18,4)})
df2 = pd.DataFrame({'a':[5., 4., np.nan, 3., 7.],
'b':[np.nan, 3., 4., 6., 8.]})
df1
df2
df1.combine_first(df2)
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
1. stack - 'rotates' or pivots from the columns into the rows.
2. unstack - pivots from the rows into the columns.
Calling 'stack' pivots the columns into the rows, producing a Series. If we have a hierarchically indexed Series, we can rearrange the data back into a DataFrame with 'unstack'.
data = pd.DataFrame(np.arange(6).reshape((2,3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns = pd.Index(['one', 'two', 'three'],
name = 'number'))
data
result = data.stack()
result
result.unstack()
By default, the innermost level is stacked or unstacked. You can operate on a different level by passing a level number or name. Unstacking might introduce missing data if not all of the values in the level are found in each subgroup.
result.unstack(0)
result.unstack('state')
s1 = pd.Series([0,1,2,3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4,5,6], index=['c','d','e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2
data2.unstack()
Stacking filters out missing data by default, so the operation is easily invertible. When you unstack in a DataFrame, the level unstacked becomes the lowest level in the resulting columns. When calling stack or unstack, we can also indicate the name of the axis level to operate on.
data2.unstack()
data2.unstack().stack()
data2.unstack().stack(dropna=False)
df = pd.DataFrame({'left': result, 'right': result+5},
columns=pd.Index(['left', 'right'], name='side'))
df
df.unstack('state')
df.unstack('state').stack('side')
A common way to store multiple time series in databases and CSV files is the so-called 'long' or 'stacked' format. One way to build a time index is with 'PeriodIndex', which here combines the year and quarter columns into a kind of time-interval type; we then use these intervals to index the data.
data = pd.read_csv('examples/macrodata.csv')
data.head()
data.columns
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,
name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D', 'end')
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:5]
This layout is the 'long' format for multiple time series, or for other observational data with two or more keys: each row in the table represents a single observation. Data is frequently stored this way in relational databases because a fixed schema allows the number of distinct values in the 'item' column to change as more data is added to the table. In the previous example, 'date' and 'item' would be the primary keys, giving both relational integrity and easier joins. In some cases, though, this format is harder to work with; you might prefer a DataFrame with one column per distinct item, indexed by the timestamps in the 'date' column. That is exactly what the 'pivot' method provides.
pivoted = ldata.pivot('date', 'item', 'value')
pivoted[:10]
The first two values passed to pivot are the columns to be used as the row and column index, and the last is an optional value column to fill the DataFrame with. Suppose you had two value columns that you wanted to reshape simultaneously: by omitting the last argument you obtain a DataFrame with hierarchical columns. Note that pivot is equivalent to creating a hierarchical index with 'set_index' followed by a call to 'unstack'.
ldata['value2'] = np.random.randn(len(ldata))
ldata[:5]
pivoted = ldata.pivot('date', 'item')
pivoted[:5]
pivoted['value'][:5]
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked[:5]
An inverse operation to 'pivot' is 'pandas.melt'. Rather than transforming one column into many, it merges multiple columns into one, producing a DataFrame that is longer than the input. When using pandas.melt we must indicate which columns, if any, are group indicators. In this example we use 'key' as the only group indicator.
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
'A': [1,2,3],
'B': [4,5,6],
'C': [7,8,9]})
df
melted = pd.melt(df, ['key'])
melted
Using pivot we can reshape back to the original layout, but since pivot creates an index from the column used as the row labels, we may want to use 'reset_index' to move the data back into a column. You can also specify explicitly which columns are indicators and which are values with the 'id_vars' and 'value_vars' parameters respectively, and pandas.melt can be used without any group identifiers at all.
reshaped = melted.pivot('key', 'variable', 'value')
reshaped
reshaped.reset_index()
pd.melt(df, id_vars=['key'], value_vars=['A','B'])
pd.melt(df, value_vars=['A', 'B', 'C'])
pd.melt(df, value_vars=['key', 'A', 'B'])