Subset Pandas Dataframe by Column Name and Index

subset-pandas-dataframe-by-column-name-and-index-or-number-python

While you are working with data, especially pandas dataframe, subset data is a common task you need to perform. In this article, I will discuss about different methods to subset Pandas dataframe by column name and column index or column number.

I will divide this entire article into two parts:

  • Subset pandas dataframe by column name
  • Subset pandas dataframe by column number or index

Before we start let me tell you that in this article I will not explain the basics of Pandas. If you are new to Pandas library or data analysis field, I will suggest you to take this Udemy course: Data Analysis with Pandas and Python.

Subset Pandas DataFrame by Column Name

Before we go into data manipulation part, let’s first create a dummy pandas data frame. I am going to use this dummy data in this entire tutorial.

import pandas as pd
from numpy.random import randn

# Pandas create dummy dataframe
df = pd.DataFrame({'Load': randn(10), 'Speed 1T': randn(10), 'Eccentricity 1H': randn(10), 'Speed 2T': randn(10)})
df
create-pandas-dummy-dataframe-in-python

Now how do you select a DataFrame by column name in Pandas? Let’s see different methods for that.

Method1: Using Double Brackets

Let’s say we want to select only Speed 1T and Speed 2T column from our sample dataframe. To do that you can use double bracket technique of pandas library in Python. Inside the double square brackets, you need to provide a list of column names that you want to select from the input pandas dataframe.

# Using Double Brackets (pandas select multiple columns by name)
df1 = df[['Speed 1T', 'Speed 2T']]
df1
pandas-select-multiple-columns-by-name

Method2: Using loc[]

The loc[] method allows you to subset DataFrame by both rows and columns using labels. To subset by columns, you can specify a single column name or a list of column names of your input pandas dataframe.

# loc[] method
df2 = df.loc[:, ['Speed 1T', 'Speed 2T']]
df2
pandas-select-multiple-columns-by-name

Method3: Using the filter()

The filter() method allows you to select or subset multiple columns based on a regex pattern. For example, in the below code, I just wanted to select columns with name “Speed”.

# Subset pandas dataframe using the filter() method
df3 = df.filter(like='Speed')
df3
pandas-select-multiple-columns-by-name

Instead of like, you can also use regex (regular expression) to filter pandas dataframe by column name in Python. In the below code, we are selecting those columns whose name contains text “Ecc” or “Speed”.

# Using Regex (filter pandas dataframe by regular expression)
df3 = df.filter(regex='Ecc|Speed')
df3
subset-or-filter-pandas-dataframe-based-on-column-name-using-regular-expression-in-python

Method4: Using boolean indexing

You can create a boolean Series that masks the columns you want to select and then use it to subset the DataFrame. Below is the code to do that.

# Subset pandas dataframe by column name using boolean indexing
selected_columns = df.columns[df.columns.isin(['Speed 1T', 'Speed 2T'])]
df4 = df[selected_columns]
df4
pandas-select-multiple-columns-by-name

Method5: Using for Loop

Yes, you can also use for loop to select multiple columns by name in pandas. In the below code, we are doing similar thing like method 3. We are just looking for those columns whose name contains text “Speed”. If match found select those columns.

# Using for loop
df5 = df[[col for col in df.columns if "Speed" in col]]
df5
pandas-select-multiple-columns-by-name

Method6: Using drop()

You can also drop unwanted columns from your input pandas data frame to get a subset of the required column. In the below code, I am dropping all unwanted columns to keep only Speed 1T and Speed 2T columns.

# Subset pandas dataframe using drop() method
df6 = df.drop(columns=['Load', 'Eccentricity 1H'])
df6
pandas-select-multiple-columns-by-name

Note: I tested this code (method 6) in Python 3.7. This code (method 6) will not work in older versions of Python. In the older version, you may get this error: TypeError: drop() got an unexpected keyword argument ‘columns’.

Also Read:  Learn Web Scraping using Beautifulsoup and Python

Subset by Column Number or Index

Let’s now see different methods or techniques to select multiple columns by their index or column number. I am going to use same input data for this step also.

Method1: Using .iloc[] with boolean indexing

You can create a boolean list or Series to mask the columns you want to select and then use .iloc[] with this boolean mask. This is a good approach if you have less number of columns in your input dataframe.

# Using .iloc[] with boolean indexing (to subset or select columns by index )
selected_columns_mask = [False, True, False, True]
df7 = df.iloc[:, selected_columns_mask]
df7
pandas-select-multiple-columns-by-name

Method2: Using .iloc[] Column integer position

If you want to get a list of all column names based on their integer positions, you can use the .iloc[] method. In my opinion this is the best method to select or subset pandas dataframe by column index (number).

In the below code, I am selecting the 2nd and 4th column from our input data to make our subset.

# Using .iloc[] Column integer position (subset list of columns by index position)
column_positions = [1, 3]
df8 = df.iloc[:, column_positions]
df8
pandas-select-multiple-columns-by-name

Method3: Select Range of Columns

You can also use slice notation with .iloc[] to select a range of columns by integer positions. This technique you should implement only when you need to select some range of columns. You can not select any specific columns with this technique.

Here end_position is the desired column index – 1.

# Using .iloc[] with column integer slice
# End position is desired column index - 1
start_position = 1
end_position = 3
df9 = df.iloc[:, start_position:end_position]
df9
pandas-select-columns-by-index-with-iloc

Just to Say

In this article, I listed down all possible methods to subset or filter pandas dataframe by column name or column index (number). These are the techniques I prefer to use in my daily analysis work.

Also Read:  What is the difference between ' ' and " " in python?

In this article, I did not explain the basics of Pandas. If you are new to Pandas library of data analysis field, I will suggest you to take this Udemy course: Data Analysis with Pandas and Python.

Leave a comment