Pandas DataFrame: How to concatenate with Python examples
Concatenation of DataFrames and Series with Python in Pandas.
How to concatenate DataFrames and Series in Pandas with Python
Pandas is a powerful and versatile Python library designed for data manipulation and analysis. It provides two primary data structures: DataFrames and Series, which are used to represent tabular data and one-dimensional arrays, respectively. These structures make it easy to work with large datasets, clean data, perform calculations and visualize results.
DataFrames are essentially tables with labeled rows and columns, similar to spreadsheets or SQL tables. They can store a variety of data types, including strings, integers and floats. Series, on the other hand, are one-dimensional labeled arrays that can hold any data type; each column of a DataFrame is a Series.
In data science and engineering, it’s common to encounter situations where you need to combine multiple datasets or manipulate them in various ways. For example, you might need to combine data from different sources and remove duplicate records. One operation that handles the combining step is concatenation. In the context of Pandas, concatenation describes the process of joining DataFrames or Series together along an axis.
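As a first illustration of the scenario described above, here is a minimal sketch that stacks records from two sources and then drops the duplicated row. The data, the column names and the variable names are made up for the example; pd.concat() itself is covered in detail in the next section.

```python
import pandas as pd

# Hypothetical records from two sources; the row for id 2 appears in both.
source_a = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})
source_b = pd.DataFrame({"id": [2, 3], "name": ["Grace", "Alan"]})

# Stack the rows of both sources into one DataFrame with a fresh index.
combined = pd.concat([source_a, source_b], ignore_index=True)

# Drop rows that are exact duplicates, then renumber the index again.
deduplicated = combined.drop_duplicates().reset_index(drop=True)

print(deduplicated)
```

Note that drop_duplicates() only removes rows that match in every column; near-duplicates would need their own cleaning step.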
Pandas concat() method
The concat() method in Pandas is a powerful tool that lets you combine DataFrames or Series along a particular axis (either rows or columns). It’s especially useful for merging and analyzing datasets with similar structures.
Here’s a quick overview of the concat() method and its parameters:
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
And here’s a breakdown of the key parameters and what they do:
- ‘objs’: A sequence or mapping of the DataFrame or Series objects to concatenate.
- ‘axis’: The axis to concatenate along. The default, 0, stacks the objects vertically (row-wise); 1 places them side by side (column-wise).
- ‘join’: Specifies how to handle indexes on the other axis. Options include ‘outer’, which takes the union of all indexes, or ‘inner’, which takes their intersection. It defaults to ‘outer’.
- ‘ignore_index’: If set to True, discards the existing index labels along the concatenation axis and numbers the result from 0 to n - 1. It’s set to False by default.
- ‘keys’: An optional sequence of labels, one per input object, used to build the outermost level of a hierarchical index on the concatenated axis.
- ‘levels’: Specific unique values to use when constructing a MultiIndex. If omitted, they are inferred from the keys.
- ‘names’: Names for the levels in the resulting hierarchical index.
- ‘verify_integrity’: If set to True, checks whether the new concatenated axis contains duplicates and raises a ValueError if it does. It defaults to False.
- ‘sort’: If set to True, sorts the non-concatenation axis when the inputs are not already aligned on it. By default, it’s set to False.
- ‘copy’: When set to False, avoids copying data from the input objects, if possible. It’s set to True by default.
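To make two of these parameters concrete, the following sketch shows verify_integrity rejecting overlapping index labels and ignore_index avoiding the problem. The DataFrames and the overlap_detected flag are illustrative names, not part of the Pandas API.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

# Both frames carry the default index 0, 1, so stacking them
# produces duplicate labels and verify_integrity=True raises.
try:
    pd.concat([df1, df2], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(overlap_detected)

# With ignore_index=True the rows are renumbered, so the check passes.
renumbered = pd.concat([df1, df2], ignore_index=True, verify_integrity=True)
print(renumbered.index.tolist())
```

The integrity check adds some cost on large inputs, which is likely why it is off by default.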
Pandas concat() examples
As the number of parameters suggests, the Pandas concat() method is versatile and can be customized to suit a variety of data analysis tasks. The examples below demonstrate a few of the most common ways to use it.
Combine DataFrame objects with concat()
To stack two DataFrames with the same columns on top of each other (that is, to concatenate them vertically), pass them to concat() as a list. The example below shows how to concatenate DataFrame objects vertically with the default parameters.
Input:
import pandas as pd
data1 = {'A': [1, 2], 'B': [3, 4]}
data2 = {'A': [5, 6], 'B': [7, 8]}
# Create two Pandas DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Use the concat() method to concatenate the DataFrames and create a new DataFrame
result = pd.concat([df1, df2])
print(result)
Output:
   A  B
0  1  3
1  2  4
0  5  7
1  6  8
Notice that the index values are preserved from the original DataFrames. If you want to reset the index in the resulting DataFrame, set the ignore_index parameter to True:
Input:
result = pd.concat([df1, df2], ignore_index=True)
print(result)
Output:
   A  B
0  1  3
1  2  4
2  5  7
3  6  8
Concatenating DataFrames horizontally
To concatenate DataFrames horizontally (i.e., side by side), set the axis parameter to 1:
Input:
result = pd.concat([df1, df2], axis=1)
print(result)
Output:
   A  B  A  B
0  1  3  5  7
1  2  4  6  8
Note that the column names are preserved from the original DataFrames, so the result has two columns named A and two named B. To tell them apart, you can use the keys parameter to add an outer level of column labels, which creates a hierarchical column index:
Input:
result = pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])
print(result)
Output:
  df1    df2
    A  B   A  B
0   1  3   5  7
1   2  4   6  8
Concatenating Series
The concat() method is also useful for concatenating Series objects. Let’s create two Series and concatenate them vertically.
Input:
import pandas as pd
# Create two Series
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
# Concatenate the Series vertically using the default parameters
result = pd.concat([s1, s2])
print(result)
Output:
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64
As with DataFrames, you can reset the index by setting the ignore_index parameter to True. Setting the axis parameter to 1 concatenates the Series horizontally, which returns a DataFrame with one column per Series.
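A short sketch of both Series options follows. The Series names ‘left’ and ‘right’ are added here only so that the horizontal result has readable column labels; they are not part of the earlier example.

```python
import pandas as pd

# The names are illustrative; they become column labels when axis=1.
s1 = pd.Series([1, 2, 3], name="left")
s2 = pd.Series([4, 5, 6], name="right")

# Vertical concatenation with a fresh 0..5 index; the result is a Series.
stacked = pd.concat([s1, s2], ignore_index=True)
print(stacked)

# Horizontal concatenation; the result is a DataFrame with two columns.
side_by_side = pd.concat([s1, s2], axis=1)
print(side_by_side)
```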
Using the join keyword argument
The join keyword argument specifies how to handle indexes on the other axis (the one you are not concatenating along). Options include the default ‘outer’ (union of all indexes) and ‘inner’ (intersection of indexes). The example below demonstrates the output of an ‘inner’ join. Because the axis is set to 1, the DataFrames are placed side by side and the join applies to their row indexes.
Input:
import pandas as pd
data1 = {'A': [1, 2], 'B': [3, 4]}
data2 = {'A': [5, 6], 'B': [7, 8]}
# Create two DataFrames with different indexes
df1 = pd.DataFrame(data1, index=['a', 'b'])
df2 = pd.DataFrame(data2, index=['b', 'c'])
# Create a new DataFrame using the concat() method and optional join parameter
result = pd.concat([df1, df2], axis=1, join='inner')
print(result)
Output:
   A  B  A  B
b  2  4  5  7
In the example above, two DataFrames with different indexes are concatenated using an inner join. The resulting DataFrame contains only row ‘b’, the one index label present in both inputs.
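For contrast, here is a sketch of the same two DataFrames concatenated with the default ‘outer’ join. Every index label from either input is kept, and positions with no data are filled with NaN. The variable names outer and missing_cells are chosen for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["a", "b"])
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]}, index=["b", "c"])

# join='outer' is the default, so the row index is the union a, b, c.
outer = pd.concat([df1, df2], axis=1)
print(outer)

# Rows 'a' and 'c' each exist in only one input, so half of their
# cells are missing in the combined result.
missing_cells = int(outer.isna().to_numpy().sum())
print(missing_cells)
```

Because NaN is a floating-point value, integer columns that receive missing cells are typically converted to float in the result.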
Assigning keys to indexes
The keys parameter creates a hierarchical index for the concatenated objects, which is useful for tracking the original DataFrames after concatenation.
Input:
import pandas as pd
data1 = {'A': [1, 2], 'B': [3, 4]}
data2 = {'A': [5, 6], 'B': [7, 8]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Concatenate DataFrames and assign keys to indexes
result = pd.concat([df1, df2], keys=['first', 'second'])
print(result)
Output:
          A  B
first  0  1  3
       1  2  4
second 0  5  7
       1  6  8
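Once keys are assigned, they can be used to pull an original DataFrame’s rows back out. The sketch below also passes the names parameter to label the two index levels; the level names ‘source’ and ‘row’ are arbitrary choices for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# keys labels each input; names labels the two resulting index levels.
result = pd.concat(
    [df1, df2], keys=["first", "second"], names=["source", "row"]
)

# Select only the rows that came from the second DataFrame.
second_rows = result.loc["second"]
print(second_rows)

# Turn the 'source' level into an ordinary column for filtering or grouping.
flat = result.reset_index(level="source")
print(flat)
```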
Other Pandas methods for merging and joining data
While the concat() method is powerful for combining DataFrames and Series, Pandas also offers other methods for merging and joining data, such as join() and merge(). These methods provide more flexibility in certain situations and can be more suitable depending on specific needs.
How to join two DataFrames: Pandas join() method
The join() method combines two DataFrames based on their index values. It allows merging DataFrames with different columns while preserving the index structure. The basic syntax for the join() method is:
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
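As a minimal sketch of join(), assume two DataFrames that share some index labels but hold different columns. The data and variable names below are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: shared index labels, different columns.
left = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"B": [10, 20]}, index=["b", "c"])

# how='left' is the default: every row of `left` is kept, and rows
# with no match in `right` get NaN in the joined columns.
joined = left.join(right)
print(joined)

# how='inner' keeps only the index labels present in both frames.
inner = left.join(right, how="inner")
print(inner)
```

If both DataFrames had a column with the same name, join() would raise an error unless lsuffix or rsuffix is supplied to disambiguate them.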
How to merge two DataFrames: Pandas merge() method
The merge() method combines two DataFrames based on a common column or index. It resembles SQL’s JOIN operation and offers more control over how DataFrames are combined. The basic syntax for the merge() method is:
pd.merge(left, right, on=None, left_on=None, right_on=None,
         left_index=False, right_index=False,
         how='inner', suffixes=('_x', '_y'), copy=True)
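The sketch below shows merge() on a shared column, first with the default inner join and then with a left join. The tables, the emp_id column and the other names are hypothetical.

```python
import pandas as pd

# Hypothetical tables that share an 'emp_id' column.
employees = pd.DataFrame(
    {"emp_id": [1, 2, 3], "name": ["Ada", "Grace", "Alan"]}
)
salaries = pd.DataFrame({"emp_id": [2, 3, 4], "salary": [100, 200, 300]})

# how='inner' is the default: only emp_id values found in both tables remain.
inner = pd.merge(employees, salaries, on="emp_id")
print(inner)

# how='left' keeps every employee; those without a salary row get NaN.
left = pd.merge(employees, salaries, on="emp_id", how="left")
print(left)
```

Unlike concat(), the rows here are matched on the values in a column rather than on index labels, which is what makes merge() behave like a SQL JOIN.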
Choosing between concat(), join(), and merge()
Selecting between concat(), join() and merge() depends on specific needs and the data structure you’re working with. Here are some general guidelines:
- Use ‘concat()’ for stacking DataFrames or Series along a particular axis (rows or columns). It aligns the inputs on the labels of the other axis but does not match rows on the values of key columns, so it’s best suited for combining datasets with similar structures for further analysis.
- Use ‘join()’ for combining DataFrames based on their index values. It’s useful when DataFrames have different columns but share an index structure.
- Use ‘merge()’ for combining DataFrames based on a common column or index. It provides more control over how DataFrames are combined and resembles SQL’s JOIN operation.
Wrapping up: Data analysis and Pandas
The examples above provide a good starting point for using the Pandas concat(), join() and merge() methods to perform data analysis. For more ways to use Python for data analysis, consider moving into data visualization with libraries like Matplotlib or Seaborn.
For additional resources, check out the official Pandas documentation. You can also explore Capital One’s open source DataComPy package to discover more techniques using Pandas.