Subset Pandas Dataframe by Column Value

pandas-subset-or-filter-or-select-row-of-dataframe-by-column-value

This is another Pandas data manipulation related post. In this article, I will discuss different methods to subset a pandas dataframe by its column value.

Before we start let me tell you that in this article I will not explain the basics of Pandas. If you are new to Pandas library of data analysis field, I will suggest you to take this Udemy course: Data Analysis with Pandas and Python.

Subset by Column Value Example

Before we go into data manipulation part, let’s first create a dummy pandas dataframe. I am going to use this dummy data in this entire tutorial.

import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'Col_1': range(0,10), 'Col_2': range(21,31), 'Col_3': range(42,52), 'Col_4': range(63,73)})
df
pandas-create-dummy-data-frame-in-python

Now let me show you different methods to subset or select rows of a pandas dataframe by specific or multiple conditions by column value.

Method1: Using boolean condition indexing

You can subset or select rows of a pandas dataframe based on specific condition of a column value. For example the below Python code is to select only rows where the value in “Col_1” is greater than 5.

# Using boolean condition indexing
df1 = df[df['Col_1'] > 5]
df1
pandas-select-rows-by-condition

We can achieve the same result using pandas loc[] function. Below code is to filter pandas data frame using loc[] function.

# Using loc[] condition
df7 = df.loc[df['Col_1'] > 5]
df7

As you can see row name of the output dataframe does not start from 0 and it may not be in a range of numbers. This happens when you row-wise subset any pandas dataframe. To solve this you need to apply reset index in output pandas dataframe.

Also Read:  Download YouTube Videos in Python With PyTube

Method2: Using query() method

You can also filter a dataframe by column value using query() method of pandas library. Below is the code to do that.

# filter rows using query() method
df2 = df.query('Col_1 > 5')
df2
pandas-select-rows-by-condition

Method3: Using iloc[] index position

To select a specific range of rows of a pandas dataframe, you can use iloc[] function. Below code is to select rows between 2nd to 5th index of our input dataframe. This is not a column-specific condition, it applies to entire data frame.

You can find similarity of this technique with head() or tail() function of pandas. Only difference is that, in head() or tail() function you can only select first or last specific number of rows. But with iloc[] you can select any specific range of rows.

# Select rows using iloc[] index position
start_position = 1
end_position = 5
df3 = df.iloc[start_position:end_position]
df3
select-range-of-rows-of-a-pandas-dataframe-using-iloc

Note: You can also use loc[] function to achieve same result. Below is the Python code with loc[] method.

# Using loc[] method
start_label = 2
end_label = 5
df5 = df.loc[start_label:end_label]
df5

Method4: Using iloc[] index list

In this method, we will also use pandas iloc[] function to select rows based on column value in list. Here we can specify list of rows we want to select from our input dataframe.

# Using iloc[] index list
# pandas select rows based on column value in list
index_position_list = [1,3,5,6]
df4 = df.iloc[index_position_list]
df4
pandas-select-rows-based-on-column-value-in-list

Note: You can also use loc[] function to achieve same result. Below is the Python code with loc[] method.

# Using loc[] index list
index_position_list = [1,3,5,6]
df6 = df.loc[index_position_list]
df6

Method5: Filter by Multiple Condition

Techniques mentioned in the above method can be applied to filter pandas dataframe by column value with multiple conditions.

Also Read:  Rename columns in R
Normal Way
# Combine multiple conditions in loc[]
df8 = df[(df['Col_1'] >= 5) & (df['Col_3'] <= 50)]
df8
loc[] Way
# Combine multiple conditions in loc[]
df8 = df.loc[(df['Col_1'] >= 5) & (df['Col_3'] <= 50)]
df8

The same output for both techniques

pandas-filter-by-column-value-multiple-conditions

End Note

In this article, I listed all possible methods to subset or filter pandas dataframe by column name or column index (number). These are the techniques I prefer to use in my daily analysis work.

In this article, I did not explain the basics of Pandas. If you are new to Pandas library of data analysis field, I will suggest you to take this Udemy course: Data Analysis with Pandas and Python.

Leave a comment