Advanced operations with pandas#

Announcements#

Goals of this lecture#

Previously, we discussed several basic operations we can perform with a pandas.DataFrame.

In today’s lecture, we’ll cover more advanced operations:

  • Filtering a DataFrame using a conditional statement.

  • Using groupby to aggregate across members of a category.

  • Identifying missing or nan values.

### Always remember to import the package!
import pandas as pd

Review: reading in a csv file#

pandas.read_csv takes in a filepath as an argument, and returns a DataFrame.

## Pass the filepath into pd.read_csv
df_pokemon = pd.read_csv("data/pokemon.csv")
df_pokemon.head(2)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False

Filtering a DataFrame#

Often, we want to filter a DataFrame so we only see observations that meet certain conditions.

  • “Only show trials from the Congruent condition”.

  • “Only show neighborhoods that are NEAR BAY.”

  • “Only show scores below 70%.”

Fundamentally, this is done by using a conditional statement.

Refresher: conditionals#

In a nutshell, a conditional is a statement that checks whether some condition is met.

We can use the if command to control which lines of code are executed.

def square_if_even(x):
    if x % 2 == 0:
        return x ** 2
    else:
        return x
square_if_even(2)
4
square_if_even(3)
3

Conditionals and filtering#

  • Filtering a DataFrame is analogous to applying a condition.

  • I.e., “show me only the rows that meet this condition.”

Example 1: filtering on a categorical variable#

  • The Legendary column is a categorical variable, meaning it takes on a discrete set of categories.

## How many legendary pokemon?
df_pokemon[df_pokemon['Legendary']==True]
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
156 144 Articuno Ice Flying 580 90 85 100 95 125 85 1 True
157 145 Zapdos Electric Flying 580 90 90 85 125 90 100 1 True
158 146 Moltres Fire Flying 580 90 100 90 125 85 90 1 True
162 150 Mewtwo Psychic NaN 680 106 110 90 154 90 130 1 True
163 150 MewtwoMega Mewtwo X Psychic Fighting 780 106 190 100 154 100 130 1 True
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock Fairy 700 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
799 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True

65 rows × 13 columns

## Compare to actual length of dataframe
len(df_pokemon)
800

Example 2: filtering on a continuous variable#

  • The HP column is a continuous variable.

  • Let’s show only the rows for Pokemon with a HP > 150.

df_pokemon[df_pokemon['HP'] > 150]
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
121 113 Chansey Normal NaN 450 250 5 5 35 105 50 1 False
155 143 Snorlax Normal NaN 540 160 110 65 65 110 30 1 False
217 202 Wobbuffet Psychic NaN 405 190 33 58 33 58 33 2 False
261 242 Blissey Normal NaN 540 255 10 10 75 135 55 2 False
351 321 Wailord Water NaN 500 170 90 45 90 45 60 3 False
655 594 Alomomola Water NaN 470 165 75 80 40 45 65 5 False
## Compare to actual length of df_pokemon
len(df_pokemon)
800

Interim summary: filtering syntax#

To filter a DataFrame, we can use the bracket ([]) syntax:

df_name[df_name['col_name']==CONDITION]

(Note that ==CONDITION would be replaced with whatever condition you want to filter on.)
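Multiple conditions can also be combined with & (and) and | (or); each comparison must be wrapped in parentheses. A minimal sketch on a toy DataFrame (the column names here are invented for illustration, not from the Pokemon data):

```python
import pandas as pd

# Toy DataFrame for illustration only
df = pd.DataFrame({'score': [55, 80, 95, 40],
                   'group': ['A', 'B', 'A', 'B']})

# Rows where score > 50 AND group is 'A'
both = df[(df['score'] > 50) & (df['group'] == 'A')]

# Rows where score < 50 OR group is 'A'
either = df[(df['score'] < 50) | (df['group'] == 'A')]

print(len(both))    # 2
print(len(either))  # 3
```

Note that Python's plain and/or keywords do not work here; pandas requires the element-wise operators & and |.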

Check-in#

Write a statement to filter df_pokemon to show only the pokemon where Attack is greater than or equal to 50.

  • How many rows are in this subset?

  • What proportion of Pokemon meet this criterion?

  • What is the mean HP in this subset?

### Your code here

Solution#

## First, write the filter statement
df_subset = df_pokemon[df_pokemon['Attack'] >= 50]
### How many rows are in this subset? 
len(df_subset)
667
### Proportion of dataset ... about 83%
len(df_subset) / len(df_pokemon)
0.83375
## Mean HP
df_subset['HP'].mean()
72.55472263868066

Check-in#

Now write a statement to filter df_pokemon to show only the Pokemon where Attack is less than or equal to 50.

  • How many rows are in this subset?

  • What proportion of Pokemon meet this criterion?

  • What is the mean HP in this subset?

## First, write the filter statement
df_subset = df_pokemon[df_pokemon['Attack'] <= 50]
### How many rows are in this subset? 
len(df_subset)
170
### Proportion of dataset ... about a fifth
len(df_subset) / len(df_pokemon)
0.2125
## Mean HP
df_subset['HP'].mean()
53.72941176470588

Check-in#

What is the mean HP of Pokemon that are Legendary vs. those that aren’t?

Hint: Create two subset DataFrames, filtering for == True and == False.

### Your code here

Solution#

## First, write the filter statement
df_legendary = df_pokemon[df_pokemon['Legendary'] == True]
df_not_legendary = df_pokemon[df_pokemon['Legendary'] == False]
## Mean for legendary
print(df_legendary['HP'].mean())
## Mean for not legendary
print(df_not_legendary['HP'].mean())
92.73846153846154
67.18231292517007

Interim summary#

  • We can apply conditionals to a DataFrame to filter that DataFrame.

    • “Show me only observations that meet some criteria.”

  • We also used this to look at the mean value of some column (HP) as a function of some other column (Legendary).

  • But there’s actually an even easier way to do this!

Using groupby#

The groupby function allows you to split data into groups (e.g., by category), then apply some function (e.g., mean) to each group.

Examples:

  • Calculate the mean HP for each category in Type 1.

  • Calculate the sum of costs for each month.
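The first example above can be sketched on a toy DataFrame (values invented, with column names mirroring the Pokemon data):

```python
import pandas as pd

# Toy data with the same column names as the Pokemon dataset (values invented)
df = pd.DataFrame({'Type 1': ['Grass', 'Grass', 'Fire'],
                   'HP': [45, 60, 39]})

# Mean HP for each category in Type 1
mean_hp = df.groupby('Type 1')['HP'].mean()
print(mean_hp)
# Fire     39.0
# Grass    52.5
```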

Simple example#

  • Let’s demo this in a simple DataFrame.

  • First, we create a DataFrame with information about Reaction Time in two different Conditions.

  • Are people faster or slower for Congruent vs. Incongruent conditions?

    • E.g., suppose these are results from the Stroop task.

df_rt = pd.DataFrame({'Condition': ['Congruent', 'Congruent', 'Congruent',
                                    'Incongruent', 'Incongruent', 'Incongruent'],
                          'Reaction Time': [500, 460, 532, 730, 720, 670]})
df_rt
Condition Reaction Time
0 Congruent 500
1 Congruent 460
2 Congruent 532
3 Incongruent 730
4 Incongruent 720
5 Incongruent 670

Using groupby#

To calculate the mean Reaction Time in each condition, use:

df_rt.groupby("Condition").mean()
Reaction Time
Condition
Congruent 497.333333
Incongruent 706.666667

Using groupby and describe#

To get more statistics about each condition, use describe:

df_rt.groupby("Condition").describe()
Reaction Time
count mean std min 25% 50% 75% max
Condition
Congruent 3.0 497.333333 36.073998 460.0 480.0 500.0 516.0 532.0
Incongruent 3.0 706.666667 32.145503 670.0 695.0 720.0 725.0 730.0

Check-in#

Use groupby to calculate the average HP for each category in Legendary. Which category has the highest average?

### Your code here

Solution#

Note: By default, this calculates the mean of every numeric column (including HP). In recent versions of pandas, you need to pass numeric_only=True so that non-numeric columns (like Name) are skipped rather than raising an error.

df_pokemon.groupby("Legendary").mean(numeric_only=True)
# Total HP Attack Defense Sp. Atk Sp. Def Speed Generation
Legendary
False 353.315646 417.213605 67.182313 75.669388 71.559184 68.454422 68.892517 65.455782 3.284354
True 470.215385 637.384615 92.738462 116.676923 99.661538 122.184615 105.938462 100.184615 3.769231

Restricting output to a single column#

To look at only a specific column, you can first slice only the columns you’re interested in:

df_pokemon[['Legendary', 'HP']].head(2)
Legendary HP
0 False 45
1 False 60

Then we can call groupby on this simpler DataFrame object with fewer columns.

df_pokemon[['Legendary', 'HP']].groupby("Legendary").mean()
HP
Legendary
False 67.182313
True 92.738462
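An equivalent and common idiom is to select the column after the groupby rather than before. A sketch on a toy DataFrame (values invented for illustration):

```python
import pandas as pd

# Toy frame with the same columns as the Pokemon data (values invented)
df = pd.DataFrame({'Legendary': [False, False, True],
                   'HP': [45, 60, 106]})

# Select the column of interest after grouping
hp_means = df.groupby('Legendary')['HP'].mean()
print(hp_means)
# False     52.5
# True     106.0
```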

Missing data!#

Where do missing values come from?#

  • Lots of reasons for missing data!

    • Experiment crashed, so some trials are missing.

    • Some people don’t answer all questions on a survey.

    • Some people “drop” out of a longitudinal study.

Subject Condition RT
1 Congruent 500
2 Congruent 550
3 Incongruent 670
4 NaN NaN

Why does missing data matter?#

Again, lots of reasons:

  1. If you’re unaware of it, then you might be overestimating how many observations you have. (E.g., you think N = 500, but actually N = 450.)

  2. If missing values are non-randomly distributed, your results might be biased. (E.g., maybe the Placebo group is more likely to drop out of a longitudinal study?)

  3. Missing values may complicate your analysis.

How to deal with missing data#

  • Identify missing values.

  • Analyze whether they’re randomly or non-randomly distributed.

  • Decide how to handle them:

    • Remove any rows with missing data from all analyses?

    • Ignore rows with missing data for some analyses but not all?

    • Try to impute (“fill in”) the missing data using other information?

  • Not an easy problem!
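The imputation option can be sketched with fillna. Here we fill a missing reaction time with the mean of the observed values (a toy example; mean imputation is just one simple strategy among many):

```python
import pandas as pd
import numpy as np

# Toy reaction-time data with one missing trial
df = pd.DataFrame({'RT': [500.0, 550.0, np.nan, 670.0]})

# Impute the missing value with the mean of the observed values
df['RT'] = df['RT'].fillna(df['RT'].mean())
print(df['RT'].isna().any())  # False
```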

Identifying missing data#

  • The first step is identifying missing values.

  • For this, you can use isna.

    • Returns True/False for each cell, indicating whether that value is NaN.

### Note that Type 2 is NaN in row 4;
### all other cells shown are False
df_pokemon.isna().head(5)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 False False False False False False False False False False False False False
1 False False False False False False False False False False False False False
2 False False False False False False False False False False False False False
3 False False False False False False False False False False False False False
4 False False False True False False False False False False False False False

Which columns have NaN?#

We can discover whether any column has NaN using isna().any().

### Tells you which columns have NaN values
df_pokemon.isna().any()
#             False
Name          False
Type 1        False
Type 2         True
Total         False
HP            False
Attack        False
Defense       False
Sp. Atk       False
Sp. Def       False
Speed         False
Generation    False
Legendary     False
dtype: bool
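To count how many missing values each column has (not just whether any exist), chain .sum() instead of .any(); True counts as 1, so the sum is the number of NaN per column. A sketch on a toy frame (values invented):

```python
import pandas as pd
import numpy as np

# Toy frame mirroring Type 1 / Type 2, with one missing secondary type
df = pd.DataFrame({'Type 1': ['Grass', 'Fire'],
                   'Type 2': ['Poison', np.nan]})

# Number of NaN values per column
na_counts = df.isna().sum()
print(na_counts)
# Type 1    0
# Type 2    1
```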

Inspecting Type 2#

### These rows have NaN for "Type 2"
df_pokemon[df_pokemon['Type 2'].isna()].head(5)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
5 5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
9 7 Squirtle Water NaN 314 44 48 65 50 64 43 1 False
10 8 Wartortle Water NaN 405 59 63 80 65 80 58 1 False
11 9 Blastoise Water NaN 530 79 83 100 85 105 78 1 False
len(df_pokemon[df_pokemon['Type 2'].isna()])
386

dropna#

One approach is to drop these observations altogether. You can do this using dropna(subset=column_name).

df_pokemon_subset = df_pokemon.dropna(subset = "Type 2")
len(df_pokemon_subset)
414
### No more NaN
df_pokemon_subset.isna().any()
#             False
Name          False
Type 1        False
Type 2        False
Total         False
HP            False
Attack        False
Defense       False
Sp. Atk       False
Sp. Def       False
Speed         False
Generation    False
Legendary     False
dtype: bool

Filtering on ~isna#

You can also drop observations by filtering on values that aren’t NaN.

  • The ~ operator is how you write not in pandas: it negates each value of a boolean Series.

df_pokemon_subset = df_pokemon[~df_pokemon['Type 2'].isna()]
len(df_pokemon_subset)
414
### No more NaN
df_pokemon_subset.isna().any()
#             False
Name          False
Type 1        False
Type 2        False
Total         False
HP            False
Attack        False
Defense       False
Sp. Atk       False
Sp. Def       False
Speed         False
Generation    False
Legendary     False
dtype: bool

Do we want to remove NaN?#

  • In this particular case, removing NaN may not be necessary.

    • Lots of Pokemon just don’t have a secondary Type.

  • But in other cases, NaN actually implies missing data, not just a feature of some observations.

  • In those cases, we might want to remove those observations.

Conclusion#

This concludes our unit on pandas.

  • pandas is central to data manipulation and data analysis in Python.

  • It’s a handy way to read .csv files (read_csv).

  • As you’ll see in CSS 2, pandas can also be used in conjunction with visualization and analysis packages.