Advanced operations with pandas#
Announcements#
CAPES and survey available!
Goals of this lecture#
Previously, we discussed several basic operations we can perform with a pandas.DataFrame.
In today’s lecture, we’ll cover more advanced operations:
- Filtering a `DataFrame` using a conditional statement.
- Using `groupby` to aggregate across members of a category.
- Identifying missing or `NaN` values.
### Always remember to import the package!
import pandas as pd
Review: reading in a csv file#
pandas.read_csv takes in a filepath as an argument, and returns a DataFrame.
## Pass the filepath into pd.read_csv
df_pokemon = pd.read_csv("data/pokemon.csv")
df_pokemon.head(2)
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
Filtering a DataFrame#
Often, we want to filter a DataFrame so we only see observations that meet certain conditions.
- “Only show trials from the `Congruent` condition.”
- “Only show neighborhoods that are `NEAR BAY`.”
- “Only show scores below 70%.”
Fundamentally, this is done by using a conditional statement.
Refresher: conditionals#
In a nutshell, a conditional is a statement that checks whether some condition is met.
We can use the `if` statement to control which lines of code are executed.
def square_if_even(x):
    if x % 2 == 0:
        return x ** 2
    else:
        return x
square_if_even(2)
4
square_if_even(3)
3
Conditionals and filtering#
- Filtering a `DataFrame` is analogous to applying a condition.
- I.e., “show me only the rows that meet this condition.” (See the sketch below.)
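To make that concrete, here is a minimal sketch using the df_pokemon loaded above: a condition on a column produces a boolean Series, and indexing the DataFrame with that Series keeps only the rows where it is True.
## A condition on a column produces a boolean Series
mask = df_pokemon['Legendary'] == True
mask.head(3)
## Indexing the DataFrame with that Series keeps only the matching rows
df_pokemon[mask].head(3)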
Example 1: filtering on a categorical variable#
The `Legendary` column is a categorical variable, meaning there are several discrete categories.
## How many legendary pokemon?
df_pokemon[df_pokemon['Legendary']==True]
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 156 | 144 | Articuno | Ice | Flying | 580 | 90 | 85 | 100 | 95 | 125 | 85 | 1 | True |
| 157 | 145 | Zapdos | Electric | Flying | 580 | 90 | 90 | 85 | 125 | 90 | 100 | 1 | True |
| 158 | 146 | Moltres | Fire | Flying | 580 | 90 | 100 | 90 | 125 | 85 | 90 | 1 | True |
| 162 | 150 | Mewtwo | Psychic | NaN | 680 | 106 | 110 | 90 | 154 | 90 | 130 | 1 | True |
| 163 | 150 | MewtwoMega Mewtwo X | Psychic | Fighting | 780 | 106 | 190 | 100 | 154 | 100 | 130 | 1 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
| 799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
65 rows × 13 columns
## Compare to actual length of dataframe
len(df_pokemon)
800
Example 2: filtering on a continuous variable#
- The `HP` column is a continuous variable.
- Let’s show only the rows for Pokemon with `HP > 150`.
df_pokemon[df_pokemon['HP'] > 150]
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 121 | 113 | Chansey | Normal | NaN | 450 | 250 | 5 | 5 | 35 | 105 | 50 | 1 | False |
| 155 | 143 | Snorlax | Normal | NaN | 540 | 160 | 110 | 65 | 65 | 110 | 30 | 1 | False |
| 217 | 202 | Wobbuffet | Psychic | NaN | 405 | 190 | 33 | 58 | 33 | 58 | 33 | 2 | False |
| 261 | 242 | Blissey | Normal | NaN | 540 | 255 | 10 | 10 | 75 | 135 | 55 | 2 | False |
| 351 | 321 | Wailord | Water | NaN | 500 | 170 | 90 | 45 | 90 | 45 | 60 | 3 | False |
| 655 | 594 | Alomomola | Water | NaN | 470 | 165 | 75 | 80 | 40 | 45 | 65 | 5 | False |
## Compare to actual length of df_pokemon
len(df_pokemon)
800
Interim summary: filtering syntax#
To filter a DataFrame, we can use the bracket ([]) syntax:
df_name[df_name['col_name']==CONDITION]
(Note that ==CONDITION would be replaced with whatever condition you want to filter on.)
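For instance, applying this pattern to the Type 1 column (using the value "Fire", which appears in the rows above) might look like the following sketch; the output is omitted here.
## Filter to only the rows where Type 1 is "Fire"
df_pokemon[df_pokemon['Type 1'] == 'Fire'].head(2)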
Check-in#
Write a statement to filter df_pokemon to show only the pokemon where Attack is greater than or equal to 50.
- How many rows are in this subset?
- What proportion of Pokemon meet this criterion?
- What is the mean `HP` in this subset?
### Your code here
Solution#
## First, write the filter statement
df_subset = df_pokemon[df_pokemon['Attack'] >= 50]
### How many rows are in this subset?
len(df_subset)
667
### Proportion of dataset ... about 83%
len(df_subset) / len(df_pokemon)
0.83375
## Mean HP
df_subset['HP'].mean()
72.55472263868066
Check-in#
Now write a statement to filter df_pokemon to show only the Pokemon where Attack is less than or equal to 50.
- How many rows are in this subset?
- What proportion of Pokemon meet this criterion?
- What is the mean `HP` in this subset?
## First, write the filter statement
df_subset = df_pokemon[df_pokemon['Attack'] <= 50]
### How many rows are in this subset?
len(df_subset)
170
### Proportion of dataset ... about 21%
len(df_subset) / len(df_pokemon)
0.2125
## Mean HP
df_subset['HP'].mean()
53.72941176470588
Check-in#
What is the mean HP of Pokemon that are Legendary vs. those that aren’t?
Hint: Create two subset DataFrames, filtering for == True and == False.
### Your code here
Solution#
## First, write the filter statement
df_legendary = df_pokemon[df_pokemon['Legendary'] == True]
df_not_legendary = df_pokemon[df_pokemon['Legendary'] == False]
## Mean for legendary
print(df_legendary['HP'].mean())
## Mean for not legendary
print(df_not_legendary['HP'].mean())
92.73846153846154
67.18231292517007
Interim summary#
- We can apply conditionals to a `DataFrame` to filter that `DataFrame`: “Show me only observations that meet some criteria.”
- We also used this to look at the `mean` value of some column (`HP`) as a function of some other column (`Legendary`).
- But there’s actually an even easier way to do this!
Using groupby#
The `groupby` function allows you to split data (i.e., along different categories), then apply some function to each split of that data (e.g., `mean`).
Examples:
- Calculate the `mean` `HP` for each category in `Type 1` (see the sketch below).
- Calculate the `sum` of `costs` for each `month`.
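A minimal sketch of that first example, using the df_pokemon loaded earlier (output omitted; restricting the columns this way is covered later in the lecture):
## Mean HP for each category in Type 1
df_pokemon[['Type 1', 'HP']].groupby("Type 1").mean()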
Simple example#
- Let’s demo this in a simple `DataFrame`.
- First, we create a `DataFrame` with information about `Reaction Time` in two different `Condition`s.
- Are people faster or slower for `Congruent` vs. `Incongruent` conditions?
- E.g., suppose these are results from the Stroop task.
df_rt = pd.DataFrame({'Condition': ['Congruent', 'Congruent', 'Congruent',
                                    'Incongruent', 'Incongruent', 'Incongruent'],
                      'Reaction Time': [500, 460, 532, 730, 720, 670]})
df_rt
| | Condition | Reaction Time |
|---|---|---|
| 0 | Congruent | 500 |
| 1 | Congruent | 460 |
| 2 | Congruent | 532 |
| 3 | Incongruent | 730 |
| 4 | Incongruent | 720 |
| 5 | Incongruent | 670 |
Using groupby#
To calculate the mean Reaction Time in each condition, use:
df_rt.groupby("Condition").mean()
| | Reaction Time |
|---|---|
| Condition | |
| Congruent | 497.333333 |
| Incongruent | 706.666667 |
Using groupby and describe#
To get more statistics about each condition, use describe:
df_rt.groupby("Condition").describe()
| | Reaction Time | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | count | mean | std | min | 25% | 50% | 75% | max |
| Condition | ||||||||
| Congruent | 3.0 | 497.333333 | 36.073998 | 460.0 | 480.0 | 500.0 | 516.0 | 532.0 |
| Incongruent | 3.0 | 706.666667 | 32.145503 | 670.0 | 695.0 | 720.0 | 725.0 | 730.0 |
Check-in#
Use groupby to calculate the average HP for each category in Legendary. Which category has the highest average?
### Your code here
Solution#
Note: By default, this will calculate the mean of all the numeric columns (including HP).
df_pokemon.groupby("Legendary").mean()
| | # | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
|---|---|---|---|---|---|---|---|---|---|
| Legendary | |||||||||
| False | 353.315646 | 417.213605 | 67.182313 | 75.669388 | 71.559184 | 68.454422 | 68.892517 | 65.455782 | 3.284354 |
| True | 470.215385 | 637.384615 | 92.738462 | 116.676923 | 99.661538 | 122.184615 | 105.938462 | 100.184615 | 3.769231 |
Restricting output to a single column#
To look at only a specific column, you can first slice out just the columns you’re interested in:
df_pokemon[['Legendary', 'HP']].head(2)
| | Legendary | HP |
|---|---|---|
| 0 | False | 45 |
| 1 | False | 60 |
Then we can call groupby on this simpler DataFrame object with fewer columns.
df_pokemon[['Legendary', 'HP']].groupby("Legendary").mean()
| | HP |
|---|---|
| Legendary | |
| False | 67.182313 |
| True | 92.738462 |
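As an aside, an equivalent pattern you may encounter (not used above) is to select the column of interest directly from the groupby result; this produces the same means.
## Equivalent: group first, then select the HP column
df_pokemon.groupby("Legendary")["HP"].mean()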
Missing data!#
In practice, real-world data is often messy.
- This includes missing values, which take on the value/label `NaN`.
- `NaN` = “Not a Number”.
- Dealing with `NaN` values is one of the main challenges in CSS!
Where do missing values come from?#
Lots of reasons for missing data!
- Experiment crashed, so some trials are missing.
- Some people don’t answer all questions on a survey.
- Some people “drop” out of a longitudinal study.
| Subject | Condition | RT |
|---|---|---|
| 1 | Congruent | 500 |
| 2 | Congruent | 550 |
| 3 | Incongruent | 670 |
| 4 | NaN | NaN |
Why does missing data matter?#
Again, lots of reasons:
- If you’re unaware of it, then you might be overestimating how many observations you have. (E.g., you think `N = 500`, but actually `N = 450`.)
- If missing values are non-randomly distributed, your results might be biased. (E.g., maybe the `Placebo` group is more likely to drop out of a longitudinal study?)
- Missing values may complicate your analysis.
How to deal with missing data#
- Identify missing values.
- Analyze whether they’re randomly or non-randomly distributed.
- Decide how to handle them:
  - Remove any rows with missing data from all analyses?
  - Ignore rows with missing data for some analyses but not all?
  - Try to impute (“fill in”) the missing data using other information? (A sketch appears after this list.)
Not an easy problem!
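To illustrate the imputation option, here is a minimal sketch (not part of the original lecture) that recreates the small table above and fills the missing RT with the mean of the observed RTs. Whether mean imputation is appropriate depends on the analysis; this only shows the mechanics.
import numpy as np
## Recreate the small table above, with a missing Condition and RT for Subject 4
df_subjects = pd.DataFrame({'Subject': [1, 2, 3, 4],
                            'Condition': ['Congruent', 'Congruent', 'Incongruent', np.nan],
                            'RT': [500, 550, 670, np.nan]})
## Impute the missing RT using the mean of the observed RTs
df_subjects['RT'] = df_subjects['RT'].fillna(df_subjects['RT'].mean())
df_subjects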
Identifying missing data#
The first step is identifying missing values.
- For this, you can use `isna`.
- Returns a vector of `True`/`False` values indicating whether a given observation is `NaN`.
### True indicates a NaN value
### (e.g., row 4 is missing "Type 2")
df_pokemon.isna().head(5)
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | True | False | False | False | False | False | False | False | False | False |
Which columns have NaN?#
We can discover whether any column has NaN using isna().any().
### Tells you which columns have NaN values
df_pokemon.isna().any()
# False
Name False
Type 1 False
Type 2 True
Total False
HP False
Attack False
Defense False
Sp. Atk False
Sp. Def False
Speed False
Generation False
Legendary False
dtype: bool
Inspecting Type 2#
### These rows have NaN for "Type 2"
df_pokemon[df_pokemon['Type 2'].isna()].head(5)
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| 5 | 5 | Charmeleon | Fire | NaN | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
| 9 | 7 | Squirtle | Water | NaN | 314 | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False |
| 10 | 8 | Wartortle | Water | NaN | 405 | 59 | 63 | 80 | 65 | 80 | 58 | 1 | False |
| 11 | 9 | Blastoise | Water | NaN | 530 | 79 | 83 | 100 | 85 | 105 | 78 | 1 | False |
len(df_pokemon[df_pokemon['Type 2'].isna()])
386
dropna#
One approach is to drop these observations altogether. You can do this using dropna(subset=column_name).
df_pokemon_subset = df_pokemon.dropna(subset = "Type 2")
len(df_pokemon_subset)
414
### No more NaN
df_pokemon_subset.isna().any()
# False
Name False
Type 1 False
Type 2 False
Total False
HP False
Attack False
Defense False
Sp. Atk False
Sp. Def False
Speed False
Generation False
Legendary False
dtype: bool
Filtering on ~isna#
You can also drop observations by filtering on values that aren’t NaN.
The `~` syntax is a way to write `not` in `pandas` syntax.
df_pokemon_subset = df_pokemon[~df_pokemon['Type 2'].isna()]
len(df_pokemon_subset)
414
### No more NaN
df_pokemon_subset.isna().any()
# False
Name False
Type 1 False
Type 2 False
Total False
HP False
Attack False
Defense False
Sp. Atk False
Sp. Def False
Speed False
Generation False
Legendary False
dtype: bool
Do we want to remove NaN?#
- In this particular case, removing `NaN` may not be necessary (an alternative is sketched below).
- Lots of Pokemon just don’t have a secondary Type.
- But in other cases, `NaN` actually implies missing data, not just a feature of some observations.
- In those cases, we might want to remove those observations.
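For example, rather than dropping those rows, one alternative (a sketch, not from the original) is to relabel the missing secondary type with a placeholder string; the label "None" here is just one reasonable choice.
## Replace NaN in Type 2 with a placeholder label instead of dropping rows
df_pokemon_filled = df_pokemon.copy()
df_pokemon_filled['Type 2'] = df_pokemon_filled['Type 2'].fillna("None")
## No more NaN in Type 2
df_pokemon_filled['Type 2'].isna().any()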
Conclusion#
- This concludes our unit on pandas.
- `pandas` is central to data manipulation and data analysis in Python.
- It’s a handy way to read `.csv` files (`read_csv`).
- As you’ll see in CSS 2, `pandas` can also be used in conjunction with visualization and analysis packages.