Advanced operations with pandas
Announcements#
CAPES and survey available!
Goals of this lecture#
Previously, we discussed several basic operations we can perform with a pandas.DataFrame.
In today’s lecture, we’ll cover more advanced operations:
- Filtering a DataFrame using a conditional statement.
- Using groupby to aggregate across members of a category.
- Identifying missing or nan values.
### Always remember to import the package!
import pandas as pd
Review: reading in a csv file#
pandas.read_csv takes in a filepath as an argument, and returns a DataFrame.
## Pass the filepath into pd.read_csv
df_pokemon = pd.read_csv("data/pokemon.csv")
df_pokemon.head(2)
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
Filtering a DataFrame#
Often, we want to filter a DataFrame so we only see observations that meet certain conditions.
- “Only show trials from the Congruent condition.”
- “Only show neighborhoods that are NEAR BAY.”
- “Only show scores below 70%.”
Fundamentally, this is done by using a conditional statement.
Refresher: conditionals#
In a nutshell, a conditional is a statement that checks for whether some condition is met.
We can use the if statement to control which lines of code are executed.
def square_if_even(x):
    if x % 2 == 0:
        return x ** 2
    else:
        return x
square_if_even(2)
4
square_if_even(3)
3
Conditionals and filtering#
Filtering a DataFrame is analogous to applying a condition. I.e., “show me only the rows that meet this condition.”
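Under the hood, comparing a column to a value produces a boolean Series (a “mask”), and indexing with that mask keeps only the True rows. A minimal sketch with a toy DataFrame (the column name and values here are illustrative, not from the Pokemon data):

```python
import pandas as pd

# Toy DataFrame for illustration
df = pd.DataFrame({'score': [55, 80, 65, 90]})

# The comparison alone returns a boolean Series (the "mask")...
mask = df['score'] > 70
print(mask.tolist())  # [False, True, False, True]

# ...and indexing the DataFrame with that mask keeps only the True rows
print(df[mask])
```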
Example 1: filtering on a categorical variable#
The Legendary column is a categorical variable, meaning there are several discrete categories.
## How many legendary pokemon?
df_pokemon[df_pokemon['Legendary']==True]
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
156 | 144 | Articuno | Ice | Flying | 580 | 90 | 85 | 100 | 95 | 125 | 85 | 1 | True |
157 | 145 | Zapdos | Electric | Flying | 580 | 90 | 90 | 85 | 125 | 90 | 100 | 1 | True |
158 | 146 | Moltres | Fire | Flying | 580 | 90 | 100 | 90 | 125 | 85 | 90 | 1 | True |
162 | 150 | Mewtwo | Psychic | NaN | 680 | 106 | 110 | 90 | 154 | 90 | 130 | 1 | True |
163 | 150 | MewtwoMega Mewtwo X | Psychic | Fighting | 780 | 106 | 190 | 100 | 154 | 100 | 130 | 1 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
65 rows × 13 columns
## Compare to actual length of dataframe
len(df_pokemon)
800
Example 2: filtering on a continuous variable#
The HP column is a continuous variable. Let’s show only the rows for Pokemon with HP > 150.
df_pokemon[df_pokemon['HP'] > 150]
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
121 | 113 | Chansey | Normal | NaN | 450 | 250 | 5 | 5 | 35 | 105 | 50 | 1 | False |
155 | 143 | Snorlax | Normal | NaN | 540 | 160 | 110 | 65 | 65 | 110 | 30 | 1 | False |
217 | 202 | Wobbuffet | Psychic | NaN | 405 | 190 | 33 | 58 | 33 | 58 | 33 | 2 | False |
261 | 242 | Blissey | Normal | NaN | 540 | 255 | 10 | 10 | 75 | 135 | 55 | 2 | False |
351 | 321 | Wailord | Water | NaN | 500 | 170 | 90 | 45 | 90 | 45 | 60 | 3 | False |
655 | 594 | Alomomola | Water | NaN | 470 | 165 | 75 | 80 | 40 | 45 | 65 | 5 | False |
## Compare to actual length of df_pokemon
len(df_pokemon)
800
Interim summary: filtering syntax#
To filter a DataFrame, we can use the bracket ([]) syntax:
df_name[df_name['col_name']==CONDITION]
(Note that ==CONDITION would be replaced with whatever condition you want to filter on.)
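This same bracket syntax extends to multiple conditions: combine them with & (and) or | (or), wrapping each condition in parentheses. A short sketch using a toy stand-in for df_pokemon (the values are illustrative):

```python
import pandas as pd

# Toy stand-in for df_pokemon (illustrative values)
df = pd.DataFrame({'HP': [45, 160, 90, 250],
                   'Legendary': [False, False, True, False]})

# Combine conditions with &; each condition must be parenthesized
subset = df[(df['HP'] > 100) & (df['Legendary'] == False)]
print(len(subset))  # 2
```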
Check-in#
Write a statement to filter df_pokemon to show only the Pokemon where Attack is greater than or equal to 50.
How many rows are in this subset?
What proportion of Pokemon meet this criterion?
What is the mean HP in this subset?
### Your code here
Solution#
## First, write the filter statement
df_subset = df_pokemon[df_pokemon['Attack'] >= 50]
### How many rows are in this subset?
len(df_subset)
667
### Proportion of dataset ... most of them
len(df_subset) / len(df_pokemon)
0.83375
## Mean HP
df_subset['HP'].mean()
72.55472263868066
Check-in#
Now write a statement to filter df_pokemon to show only the Pokemon where Attack is less than or equal to 50.
How many rows are in this subset?
What proportion of Pokemon meet this criterion?
What is the mean HP in this subset?
## First, write the filter statement
df_subset = df_pokemon[df_pokemon['Attack'] <= 50]
### How many rows are in this subset?
len(df_subset)
170
### Proportion of dataset ... about a fifth
len(df_subset) / len(df_pokemon)
0.2125
## Mean HP
df_subset['HP'].mean()
53.72941176470588
Check-in#
What is the mean HP of Pokemon that are Legendary vs. those that aren’t?
Hint: Create two subset DataFrames, filtering for == True and == False.
### Your code here
Solution#
## First, write the filter statement
df_legendary = df_pokemon[df_pokemon['Legendary'] == True]
df_not_legendary = df_pokemon[df_pokemon['Legendary'] == False]
## Mean for legendary
print(df_legendary['HP'].mean())
## Mean for not legendary
print(df_not_legendary['HP'].mean())
92.73846153846154
67.18231292517007
Interim summary#
We can apply conditionals to a DataFrame to filter that DataFrame. “Show me only observations that meet some criteria.”
We also used this to look at the mean value of some column (HP) as a function of some other column (Legendary). But there’s actually an even easier way to do this!
Using groupby#
The groupby function allows you to split data (i.e., along different categories), then apply some function to each split of that data (e.g., mean).
Examples:
- Calculate the mean HP for each category in Type 1.
- Calculate the sum of costs for each month.
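The second example above might look like this (the costs data here are hypothetical, made up purely for illustration):

```python
import pandas as pd

# Hypothetical monthly costs (names and values are made up)
df_costs = pd.DataFrame({'month': ['Jan', 'Jan', 'Feb', 'Feb'],
                         'costs': [100, 50, 200, 25]})

# Split by month, then sum the costs within each group
totals = df_costs.groupby('month')['costs'].sum()
print(totals)
```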
Simple example#
Let’s demo this in a simple DataFrame.
First, we create a DataFrame with information about Reaction Time in two different Conditions. Are people faster or slower for Congruent vs. Incongruent conditions? E.g., suppose these are results from the Stroop task.
df_rt = pd.DataFrame({'Condition': ['Congruent', 'Congruent', 'Congruent',
'Incongruent', 'Incongruent', 'Incongruent'],
'Reaction Time': [500, 460, 532, 730, 720, 670]})
df_rt
| | Condition | Reaction Time |
|---|---|---|
| 0 | Congruent | 500 |
| 1 | Congruent | 460 |
| 2 | Congruent | 532 |
| 3 | Incongruent | 730 |
| 4 | Incongruent | 720 |
| 5 | Incongruent | 670 |
Using groupby#
To calculate the mean Reaction Time in each condition, use:
df_rt.groupby("Condition").mean()
| Condition | Reaction Time |
|---|---|
| Congruent | 497.333333 |
| Incongruent | 706.666667 |
Using groupby and describe#
To get more statistics about each condition, use describe:
df_rt.groupby("Condition").describe()
Reaction Time statistics by Condition:
| Condition | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
Congruent | 3.0 | 497.333333 | 36.073998 | 460.0 | 480.0 | 500.0 | 516.0 | 532.0 |
Incongruent | 3.0 | 706.666667 | 32.145503 | 670.0 | 695.0 | 720.0 | 725.0 | 730.0 |
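If you only need a few specific statistics rather than the full describe table, the agg method lets you request them by name. A sketch, recreating the same df_rt so the example is self-contained:

```python
import pandas as pd

# Same df_rt as above, recreated here so the example stands alone
df_rt = pd.DataFrame({'Condition': ['Congruent'] * 3 + ['Incongruent'] * 3,
                      'Reaction Time': [500, 460, 532, 730, 720, 670]})

# Request specific statistics per condition by name
stats = df_rt.groupby('Condition')['Reaction Time'].agg(['mean', 'min', 'max'])
print(stats)
```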
Check-in#
Use groupby to calculate the average HP for each category in Legendary. Which category has the highest average?
### Your code here
Solution#
Note: By default, this will calculate the mean of all available columns (including HP). In newer versions of pandas, you may need to pass numeric_only=True to mean() so that non-numeric columns like Name are skipped rather than raising an error.
df_pokemon.groupby("Legendary").mean()
| Legendary | # | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
|---|---|---|---|---|---|---|---|---|---|
False | 353.315646 | 417.213605 | 67.182313 | 75.669388 | 71.559184 | 68.454422 | 68.892517 | 65.455782 | 3.284354 |
True | 470.215385 | 637.384615 | 92.738462 | 116.676923 | 99.661538 | 122.184615 | 105.938462 | 100.184615 | 3.769231 |
Restricting output to a single column#
To look at only a specific column, you can first slice out the columns you’re interested in:
df_pokemon[['Legendary', 'HP']].head(2)
| | Legendary | HP |
|---|---|---|
0 | False | 45 |
1 | False | 60 |
Then we can call groupby on this simpler DataFrame object with fewer columns.
df_pokemon[['Legendary', 'HP']].groupby("Legendary").mean()
| Legendary | HP |
|---|---|
False | 67.182313 |
True | 92.738462 |
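A common shorthand is to select the column after grouping instead of slicing first; on this dataset that would be df_pokemon.groupby("Legendary")["HP"].mean(). A self-contained sketch with a toy stand-in for df_pokemon (illustrative values):

```python
import pandas as pd

# Toy stand-in for df_pokemon (illustrative values)
df = pd.DataFrame({'Legendary': [False, False, True, True],
                   'HP': [40, 60, 90, 110]})

# Group first, then select the single column before aggregating
means = df.groupby('Legendary')['HP'].mean()
print(means)
```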
Missing data!#
In practice, real-world data is often messy. This includes missing values, which take on the value/label NaN (“Not a Number”).
Dealing with NaN values is one of the main challenges in CSS!
Where do missing values come from?#
Lots of reasons for missing data!
Experiment crashed, so some trials are missing.
Some people don’t answer all questions on a survey.
Some people “drop” out of a longitudinal study.
| Subject | Condition | RT |
|---|---|---|
| 1 | Congruent | 500 |
| 2 | Congruent | 550 |
| 3 | Incongruent | 670 |
| 4 | NaN | NaN |
Why does missing data matter?#
Again, lots of reasons:
- If you’re unaware of it, then you might be overestimating how many observations you have. (E.g., you think N = 500, but actually N = 450.)
- If missing values are non-randomly distributed, your results might be biased. (E.g., maybe the Placebo group is more likely to drop out of a longitudinal study?)
- Missing values may complicate your analysis.
How to deal with missing data#
Identify missing values.
Analyze whether they’re randomly or non-randomly distributed.
Decide how to handle them:
Remove any rows with missing data from all analyses?
Ignore rows with missing data for some analyses but not all?
Try to impute (“fill in”) the missing data using other information?
Not an easy problem!
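The third option, imputation, can be done with fillna. A minimal sketch, filling a missing value with the column mean (the toy data here are illustrative, and mean-imputation is just one of many strategies):

```python
import pandas as pd
import numpy as np

# Toy column with one missing value
df = pd.DataFrame({'HP': [40, np.nan, 60]})

# Impute the missing value with the mean of the observed values
df['HP'] = df['HP'].fillna(df['HP'].mean())
print(df['HP'].tolist())  # [40.0, 50.0, 60.0]
```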
Identifying missing data#
The first step is identifying missing values.
For this, you can use isna, which returns a vector of True/False values indicating whether a given observation is NaN.
### Note: row 4 has a NaN in "Type 2",
### so its entry below is True
df_pokemon.isna().head(5)
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | True | False | False | False | False | False | False | False | False | False |
Which columns have NaN?#
We can discover whether any column has NaN using isna().any().
### Tells you which columns have NaN values
df_pokemon.isna().any()
# False
Name False
Type 1 False
Type 2 True
Total False
HP False
Attack False
Defense False
Sp. Atk False
Sp. Def False
Speed False
Generation False
Legendary False
dtype: bool
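Since True counts as 1, chaining .sum() instead of .any() counts the missing values per column; on this dataset, df_pokemon.isna().sum() would show 386 for Type 2. A self-contained sketch with toy data:

```python
import pandas as pd
import numpy as np

# Toy DataFrame with some missing values
df = pd.DataFrame({'Type 1': ['Fire', 'Water', 'Grass'],
                   'Type 2': [np.nan, np.nan, 'Poison']})

# True counts as 1, so summing the mask counts NaNs per column
counts = df.isna().sum()
print(counts)
```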
Inspecting Type 2#
### These rows have NaN for "Type 2"
df_pokemon[df_pokemon['Type 2'].isna()].head(5)
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
5 | 5 | Charmeleon | Fire | NaN | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
9 | 7 | Squirtle | Water | NaN | 314 | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False |
10 | 8 | Wartortle | Water | NaN | 405 | 59 | 63 | 80 | 65 | 80 | 58 | 1 | False |
11 | 9 | Blastoise | Water | NaN | 530 | 79 | 83 | 100 | 85 | 105 | 78 | 1 | False |
len(df_pokemon[df_pokemon['Type 2'].isna()])
386
dropna#
One approach is to drop these observations altogether. You can do this using dropna, passing the column name(s) via the subset argument.
df_pokemon_subset = df_pokemon.dropna(subset = "Type 2")
len(df_pokemon_subset)
414
### No more NaN
df_pokemon_subset.isna().any()
# False
Name False
Type 1 False
Type 2 False
Total False
HP False
Attack False
Defense False
Sp. Atk False
Sp. Def False
Speed False
Generation False
Legendary False
dtype: bool
Filtering on ~isna#
You can also drop observations by filtering on values that aren’t NaN.
The ~ syntax is a way to write not in pandas filtering expressions.
df_pokemon_subset = df_pokemon[~df_pokemon['Type 2'].isna()]
len(df_pokemon_subset)
414
### No more NaN
df_pokemon_subset.isna().any()
# False
Name False
Type 1 False
Type 2 False
Total False
HP False
Attack False
Defense False
Sp. Atk False
Sp. Def False
Speed False
Generation False
Legendary False
dtype: bool
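pandas also provides notna, the mirror image of isna, which lets you write the same filter without the ~. A short sketch with toy data:

```python
import pandas as pd
import numpy as np

# Toy column with one missing value
df = pd.DataFrame({'Type 2': ['Poison', np.nan, 'Flying']})

# notna() returns True where the value is NOT missing
subset = df[df['Type 2'].notna()]
print(len(subset))  # 2
```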
Do we want to remove NaN?#
In this particular case, removing NaN may not be necessary. Lots of Pokemon just don’t have a secondary Type.
But in other cases, NaN actually implies missing data, not just a feature of some observations. In those cases, we might want to remove those observations.
Conclusion#
This concludes our unit on pandas.
- pandas is central to data manipulation and data analysis in Python.
- It’s a handy way to read .csv files (read_csv).
- As you’ll see in CSS 2, pandas can also be used in conjunction with visualization and analysis packages.