Advanced operations with pandas#

Announcements#

Goals of this lecture#

Previously, we discussed several basic operations we can perform with a pandas.DataFrame.

In today’s lecture, we’ll cover more advanced operations:

  • Filtering a DataFrame using a conditional statement.

  • Using groupby to aggregate across members of a category.

  • Identifying missing or nan values.

### Always remember to import the package!
import pandas as pd

Review: reading in a csv file#

pandas.read_csv takes in a filepath as an argument, and returns a DataFrame.

## Pass the filepath into pd.read_csv
df_pokemon = pd.read_csv("data/pokemon.csv")
df_pokemon.head(2)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False

Filtering a DataFrame#

Often, we want to filter a DataFrame so we only see observations that meet certain conditions.

  • “Only show trials from the Congruent condition”.

  • “Only show neighborhoods that are NEAR BAY.”

  • “Only show scores below 70%.”

Fundamentally, this is done by using a conditional statement.

Refresher: conditionals#

In a nutshell, a conditional is a statement that checks whether some condition is met.

We can use the if command to control which lines of code are executed.

def square_if_even(x):
    if x % 2 == 0:
        return x ** 2
    else:
        return x
square_if_even(2)
4
square_if_even(3)
3

Conditionals and filtering#

  • Filtering a DataFrame is analogous to applying a condition.

  • I.e., “show me only the rows that meet this condition.”

Example 1: filtering on a categorical variable#

  • The Legendary column is a categorical variable, meaning it takes on a discrete set of categories.

## How many legendary pokemon?
df_pokemon[df_pokemon['Legendary']==True]
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
156 144 Articuno Ice Flying 580 90 85 100 95 125 85 1 True
157 145 Zapdos Electric Flying 580 90 90 85 125 90 100 1 True
158 146 Moltres Fire Flying 580 90 100 90 125 85 90 1 True
162 150 Mewtwo Psychic NaN 680 106 110 90 154 90 130 1 True
163 150 MewtwoMega Mewtwo X Psychic Fighting 780 106 190 100 154 100 130 1 True
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock Fairy 700 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
799 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True

65 rows × 13 columns

## Compare to actual length of dataframe
len(df_pokemon)
800

Example 2: filtering on a continuous variable#

  • The HP column is a continuous variable.

  • Let’s show only the rows for Pokemon with a HP > 150.

df_pokemon[df_pokemon['HP'] > 150]
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
121 113 Chansey Normal NaN 450 250 5 5 35 105 50 1 False
155 143 Snorlax Normal NaN 540 160 110 65 65 110 30 1 False
217 202 Wobbuffet Psychic NaN 405 190 33 58 33 58 33 2 False
261 242 Blissey Normal NaN 540 255 10 10 75 135 55 2 False
351 321 Wailord Water NaN 500 170 90 45 90 45 60 3 False
655 594 Alomomola Water NaN 470 165 75 80 40 45 65 5 False
## Compare to actual length of df_pokemon
len(df_pokemon)
800

Interim summary: filtering syntax#

To filter a DataFrame, we can use the bracket ([]) syntax:

df_name[df_name['col_name']==CONDITION]

(Note that ==CONDITION would be replaced with whatever condition you want to filter on.)
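Multiple conditions can also be combined with & (and) and | (or); each comparison must be wrapped in parentheses. A minimal sketch on a toy DataFrame (the column names here are invented for illustration, not from the Pokemon data):

```python
import pandas as pd

# Toy DataFrame for illustration only
df = pd.DataFrame({'score': [55, 80, 95, 40],
                   'group': ['A', 'B', 'A', 'B']})

# Rows where score > 50 AND group is 'A'
both = df[(df['score'] > 50) & (df['group'] == 'A')]

# Rows where score < 50 OR group is 'A'
either = df[(df['score'] < 50) | (df['group'] == 'A')]

print(len(both))    # 2
print(len(either))  # 3
```

Note that Python's plain and/or keywords do not work here; pandas requires the element-wise operators & and |.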

Check-in#

Write a statement to filter df_pokemon to show only the pokemon where Attack is greater than or equal to 50.

  • How many rows are in this subset?

  • What proportion of Pokemon meet this criterion?

  • What is the mean HP in this subset?

### Your code here

Solution#

## First, write the filter statement
df_subset = df_pokemon[df_pokemon['Attack'] >= 50]
### How many rows are in this subset? 
len(df_subset)
667
### Proportion of dataset ... about 83%
len(df_subset) / len(df_pokemon)
0.83375
## Mean HP
df_subset['HP'].mean()
72.55472263868066

Check-in#

Now write a statement to filter df_pokemon to show only the Pokemon where Attack is less than or equal to 50.

  • How many rows are in this subset?

  • What proportion of Pokemon meet this criterion?

  • What is the mean HP in this subset?

## First, write the filter statement
df_subset = df_pokemon[df_pokemon['Attack'] <= 50]
### How many rows are in this subset? 
len(df_subset)
170
### Proportion of dataset ... about a fifth
len(df_subset) / len(df_pokemon)
0.2125
## Mean HP
df_subset['HP'].mean()
53.72941176470588

Check-in#

What is the mean HP of Pokemon that are Legendary vs. those that aren’t?

Hint: Create two subset DataFrames, filtering for == True and == False.

### Your code here

Solution#

## First, write the filter statement
df_legendary = df_pokemon[df_pokemon['Legendary'] == True]
df_not_legendary = df_pokemon[df_pokemon['Legendary'] == False]
## Mean for legendary
print(df_legendary['HP'].mean())
## Mean for not legendary
print(df_not_legendary['HP'].mean())
92.73846153846154
67.18231292517007

Interim summary#

  • We can apply conditionals to a DataFrame to filter that DataFrame.

    • “Show me only observations that meet some criteria.”

  • We also used this to look at the mean value of some column (HP) as a function of some other column (Legendary).

  • But there’s actually an even easier way to do this!

Using groupby#

The groupby function allows you to split data into groups (e.g., by category), then apply some function (e.g., mean) to each group.

Examples:

  • Calculate the mean HP for each category in Type 1.

  • Calculate the sum of costs for each month.
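The first example above can be sketched on a toy DataFrame (values invented, with column names mirroring the Pokemon data):

```python
import pandas as pd

# Toy data with the same column names as the Pokemon dataset (values invented)
df = pd.DataFrame({'Type 1': ['Grass', 'Grass', 'Fire'],
                   'HP': [45, 60, 39]})

# Mean HP for each category in Type 1
mean_hp = df.groupby('Type 1')['HP'].mean()
print(mean_hp)
# Fire     39.0
# Grass    52.5
```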

Simple example#

  • Let’s demo this in a simple DataFrame.

  • First, we create a DataFrame with information about Reaction Time in two different Conditions.

  • Are people faster or slower for Congruent vs. Incongruent conditions?

    • E.g., suppose these are results from the Stroop task.

df_rt = pd.DataFrame({'Condition': ['Congruent', 'Congruent', 'Congruent',
                                    'Incongruent', 'Incongruent', 'Incongruent'],
                          'Reaction Time': [500, 460, 532, 730, 720, 670]})
df_rt
Condition Reaction Time
0 Congruent 500
1 Congruent 460
2 Congruent 532
3 Incongruent 730
4 Incongruent 720
5 Incongruent 670

Using groupby#

To calculate the mean Reaction Time in each condition, use:

df_rt.groupby("Condition").mean()
Reaction Time
Condition
Congruent 497.333333
Incongruent 706.666667

Using groupby and describe#

To get more statistics about each condition, use describe:

df_rt.groupby("Condition").describe()
Reaction Time
count mean std min 25% 50% 75% max
Condition
Congruent 3.0 497.333333 36.073998 460.0 480.0 500.0 516.0 532.0
Incongruent 3.0 706.666667 32.145503 670.0 695.0 720.0 725.0 730.0

Check-in#

Use groupby to calculate the average HP for each category in Legendary. Which category has the highest average?

### Your code here

Solution#

Note: By default, this calculates the mean of every numeric column (including HP). In recent versions of pandas, you need to pass numeric_only=True so that non-numeric columns (like Name) are skipped rather than raising an error.

df_pokemon.groupby("Legendary").mean(numeric_only=True)
# Total HP Attack Defense Sp. Atk Sp. Def Speed Generation
Legendary
False 353.315646 417.213605 67.182313 75.669388 71.559184 68.454422 68.892517 65.455782 3.284354
True 470.215385 637.384615 92.738462 116.676923 99.661538 122.184615 105.938462 100.184615 3.769231

Restricting output to a single column#

To look at only a specific column, you can first slice only the columns you’re interested in:

df_pokemon[['Legendary', 'HP']].head(2)
Legendary HP
0 False 45
1 False 60

Then we can call groupby on this simpler DataFrame object with fewer columns.

df_pokemon[['Legendary', 'HP']].groupby("Legendary").mean()
HP
Legendary
False 67.182313
True 92.738462
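An equivalent and common idiom is to select the column after the groupby rather than before. A sketch on a toy DataFrame (values invented for illustration):

```python
import pandas as pd

# Toy frame with the same columns as the Pokemon data (values invented)
df = pd.DataFrame({'Legendary': [False, False, True],
                   'HP': [45, 60, 106]})

# Select the column of interest after grouping
hp_means = df.groupby('Legendary')['HP'].mean()
print(hp_means)
# False     52.5
# True     106.0
```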

Missing data!#

Where do missing values come from?#

  • Lots of reasons for missing data!

    • Experiment crashed, so some trials are missing.

    • Some people don’t answer all questions on a survey.

    • Some people “drop” out of a longitudinal study.

Subject Condition RT
1 Congruent 500
2 Congruent 550
3 Incongruent 670
4 NaN NaN

Why does missing data matter?#

Again, lots of reasons:

  1. If you’re unaware of it, then you might be overestimating how many observations you have. (E.g., you think N = 500, but actually N = 450.)

  2. If missing values are non-randomly distributed, your results might be biased. (E.g., maybe the Placebo group is more likely to drop out of a longitudinal study?)

  3. Missing values may complicate your analysis.

How to deal with missing data#

  • Identify missing values.

  • Analyze whether they’re randomly or non-randomly distributed.

  • Decide how to handle them:

    • Remove any rows with missing data from all analyses?

    • Ignore rows with missing data for some analyses but not all?

    • Try to impute (“fill in”) the missing data using other information?

  • Not an easy problem!
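The imputation option can be sketched with fillna. Here we fill a missing reaction time with the mean of the observed values (a toy example; mean imputation is just one simple strategy among many):

```python
import pandas as pd
import numpy as np

# Toy reaction-time data with one missing trial
df = pd.DataFrame({'RT': [500.0, 550.0, np.nan, 670.0]})

# Impute the missing value with the mean of the observed values
df['RT'] = df['RT'].fillna(df['RT'].mean())
print(df['RT'].isna().any())  # False
```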

Identifying missing data#

  • The first step is identifying missing values.

  • For this, you can use isna.

    • Returns True/False for each cell, indicating whether that value is NaN.

### Note that Type 2 is NaN in row 4;
### all other cells shown are False
df_pokemon.isna().head(5)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 False False False False False False False False False False False False False
1 False False False False False False False False False False False False False
2 False False False False False False False False False False False False False
3 False False False False False False False False False False False False False
4 False False False True False False False False False False False False False

Which columns have NaN?#

We can discover whether any column has NaN using isna().any().

### Tells you which columns have NaN values
df_pokemon.isna().any()
#             False
Name          False
Type 1        False
Type 2         True
Total         False
HP            False
Attack        False
Defense       False
Sp. Atk       False
Sp. Def       False
Speed         False
Generation    False
Legendary     False
dtype: bool
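To count how many missing values each column has (not just whether any exist), chain .sum() instead of .any(); True counts as 1, so the sum is the number of NaN per column. A sketch on a toy frame (values invented):

```python
import pandas as pd
import numpy as np

# Toy frame mirroring Type 1 / Type 2, with one missing secondary type
df = pd.DataFrame({'Type 1': ['Grass', 'Fire'],
                   'Type 2': ['Poison', np.nan]})

# Number of NaN values per column
na_counts = df.isna().sum()
print(na_counts)
# Type 1    0
# Type 2    1
```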

Inspecting Type 2#

### These rows have NaN for "Type 2"
df_pokemon[df_pokemon['Type 2'].isna()].head(5)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
5 5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
9 7 Squirtle Water NaN 314 44 48 65 50 64 43 1 False
10 8 Wartortle Water NaN 405 59 63 80 65 80 58 1 False
11 9 Blastoise Water NaN 530 79 83 100 85 105 78 1 False
len(df_pokemon[df_pokemon['Type 2'].isna()])
386

dropna#

One approach is to drop these observations altogether. You can do this using dropna(subset=column_name).

df_pokemon_subset = df_pokemon.dropna(subset = "Type 2")
len(df_pokemon_subset)
414
### No more NaN
df_pokemon_subset.isna().any()
#             False
Name          False
Type 1        False
Type 2        False
Total         False
HP            False
Attack        False
Defense       False
Sp. Atk       False
Sp. Def       False
Speed         False
Generation    False
Legendary     False
dtype: bool

Filtering on ~isna#

You can also drop observations by filtering on values that aren’t NaN.

  • The ~ operator is how you write not in pandas: it negates each value of a boolean Series.

df_pokemon_subset = df_pokemon[~df_pokemon['Type 2'].isna()]
len(df_pokemon_subset)
414
### No more NaN
df_pokemon_subset.isna().any()
#             False
Name          False
Type 1        False
Type 2        False
Total         False
HP            False
Attack        False
Defense       False
Sp. Atk       False
Sp. Def       False
Speed         False
Generation    False
Legendary     False
dtype: bool

Do we want to remove NaN?#

  • In this particular case, removing NaN may not be necessary.

    • Lots of Pokemon just don’t have a secondary Type.

  • But in other cases, NaN actually implies missing data, not just a feature of some observations.

  • In those cases, we might want to remove those observations.

Conclusion#

This concludes our unit on pandas.

  • pandas is central to data manipulation and data analysis in Python.

  • It’s a handy way to read .csv files (read_csv).

  • As you’ll see in CSS 2, pandas can also be used in conjunction with visualization and analysis packages.