Basic operations with `pandas`#

Goals of this lecture#

Previously, we discussed why pandas.DataFrame is so useful for representing data.

In today’s lecture, we’ll cover:

How to read in a .csv file as a DataFrame using pandas.read_csv.
Sorting a DataFrame with sort_values.
Basic descriptive statistics with pandas (mean(), etc.)
Creating a new column of a DataFrame.

### Always remember to import the package!
import pandas as pd

Review: What is tabular data?#

Tabular data is data organized in a table with rows and columns.

This kind of data is two-dimensional.
Typically, each row represents an “observation”.
- A person.
- A country.
- An experimental trial.
Typicallly, each column represents an attribute.
- height
- gdp or population
- reaction_time or experimental_condition

Example: test scores#

Check-in: What are the columns of this table?

Participant	Score
1	85
2	92
3	78
4	98
5	76

Reading in a `.csv` file#

A .csv file is a comma-separated values file.

Tabular data is often represented in a .csv file.

The values of each column are separated by a comma (1,85).
Each row is represented by a new line.

A .csv file can also be opened in Excel or Google Sheets.

Review: file paths#

Previously, we’ve discussed how a file path represents the “location” of a file on your computer.

An absolute file path specifies the location of a file relative to some root directory.
A relative file path specifies the location of a file relative to the current directory (i.e., the one you’re in right now).

We already know how to read in .txt and .json files, but .csv files are new.

A sample `.csv` file#

The data directory contains a file called scores_sample.csv.

For a series of made-up research participants, this file contains:

The student number.
Their score on a reading comprehension task.

Now we need to read this file in!

Check-in#

If we’re currently in lectures, and the file is in lectures/data/scores_sample.csv, what would the relative file path be?

Solution#

The filepath would be: data/scores_sample.csv

Using `read_csv`#

pandas.read_csv takes in a filepath as an argument, and returns a DataFrame.

## Pass the filepath into pd.read_csv
df_scores = pd.read_csv("data/scores_sample.csv")

## Let's inspect the first couple of rows
df_scores.head(2)

	Participant	Score
0	1	83
1	2	84

Check-in#

What are the column names of this DataFrame? What does each row represent?

df_scores.head(5)

	Participant	Score
0	1	83
1	2	84
2	3	89
3	4	80
4	5	81

Check-in#

How would we figure out how many rows and columns this DataFrame has?

### Your code here

Solution#

## First number = number of rows
## Second number = number of columns
df_scores.shape

(20, 2)

Check-in#

Now let’s try to read in a new .csv file. This one is also located in data, and it’s called pokemon.csv.

The original data can be found on Kaggle, a website for data science competitions).
It contains information about different Pokemon.

Call the new DataFrame df_pokemon.

### Your code here

Solution#

df_pokemon = pd.read_csv("data/pokemon.csv")
df_pokemon.head(2)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False

df_pokemon.shape

(800, 13)

Sorting a `DataFrame`#

Often, we might find it useful to sort a DataFrame according to the values in a particular column.

`sort_values`#

By default, sort_values will sort from lowest to highest.

df_scores.sort_values("Score").head(2)

	Participant	Score
19	20	75
18	19	79

df_scores.sort_values("Score").tail(2)

	Participant	Score
17	18	94
11	12	98

`ascending = False`#

We can change that with the ascending parameter.

df_scores.sort_values("Score", ascending = False).head(2)

	Participant	Score
11	12	98
17	18	94

Check-in#

Use sort_values to find the block with the lowest HP. Then, answer the following questions:

What is the Speed of this Pokemon?
What is the Legendary status of this Pokemon?
What is the Type 1 and Type 2 of this Pokemon?

### Your code here

Solution#

df_pokemon.sort_values("HP").head(1)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
316	292	Shedinja	Bug	Ghost	236	1	90	45	30	30	40	3	False

Check-in#

Use sort_values to find the block with the highest HP. Then, answer the following questions:

What is the Speed of this Pokemon?
What is the Legendary status of this Pokemon?
What is the Type 1 and Type 2 of this Pokemon?

### Your code here

Solution#

df_pokemon.sort_values("HP", ascending = False).head(1)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
261	242	Blissey	Normal	NaN	540	255	10	10	75	135	55	2	False

Basic descriptive statistics in `pandas`#

Descriptive statistics are ways to summarize and organize data.

We’ve already covered using numpy to calculate descriptive statistics.
But we can also use pandas in a very similar way!
- (A column of a DataFrame is roughly analogous to a numpy.ndarray.)

`sum`#

The sum of a set of numbers is simply the result of adding each number together.

The sum of a column can be calculated by calling:

df_name[column_name].sum()

df_scores['Score'].sum()

`mean`#

The mean of a set of scores is the sum of those scores divided by the number of the observations.

The mean of a column can be calculated by calling:

df_name[column_name].mean()

df_scores['Score'].mean()

86.4

`describe`#

The describe function gives you a bunch of handy descriptive statistics in a single call.

df_scores['Score'].describe()

count    20.000000
mean     86.400000
std       5.374403
min      75.000000
25%      83.750000
50%      86.500000
75%      89.250000
max      98.000000
Name: Score, dtype: float64

Check-in#

What is the mean HP in our Pokemon dataset? What about the minimum and maximum?

### Your code here

Solution#

print(df_pokemon['HP'].mean())
print(df_pokemon['HP'].max())
print(df_pokemon['HP'].min())

69.25875
255
1

Check-in#

What is the mean Attack Pokemon dataset? What about the minimum and maximum?

### Your code here

Solution#

print(df_pokemon['Attack'].mean())
print(df_pokemon['Attack'].max())
print(df_pokemon['Attack'].min())

79.00125
190
5

Check-in#

Could we calculate the mean of the Type 1 variable? Why or why not?

### Your code here

Solution#

No, because Type 1 is a qualitative variable. We can only count how many of each category there are.

df_pokemon['Type 1'].value_counts()

Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Dragon       32
Ground       32
Ghost        32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64

Creating a new column#

pandas also makes it relatively easy to make a new column.

Syntax:

dataframe_name[new_column_name] = ...

### Setting the column to a single value 
### just duplicates that value throughout the column
df_scores['Researcher'] = 'Bergen'

df_scores.head(2)

	Participant	Score	Researcher
0	1	83	Bergen
1	2	84	Bergen

Using an existing column to make a new column#

You can also use values in an existing column to make a new column.

Example: a new column called Pass that encodes whether a score is >= 70.

df_scores['Pass'] = df_scores['Score'] >= 70

df_scores.head(2)

	Participant	Score	Researcher	Pass
0	1	83	Bergen	True
1	2	84	Bergen	True

Check-in#

Create a new column of df_pokemon that calculates the ratio between the Attack of a Pokeon and its Defense.

### Your code here

Solution#

df_pokemon['attack_defense_ratio'] = df_pokemon['Attack'] / df_pokemon['Defense']
df_pokemon.head(2)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary	attack_defense_ratio
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False	1.000000
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False	0.984127

Conclusion#

This concludes our second lecture on pandas. Now you know:

How to read in a .csv file with read_csv.
How to calculate basic descriptive statistics with pandas.
How to sort a DataFrame and create a new column.

CSS 1

Basic operations with pandas

Contents

Basic operations with `pandas`#

Goals of this lecture#

Review: What is tabular data?#

Example: test scores#

Reading in a `.csv` file#

Review: file paths#

A sample `.csv` file#

Check-in#

Solution#

Using `read_csv`#

Check-in#

Check-in#

Solution#

Check-in#

Solution#

Sorting a `DataFrame`#

`sort_values`#

`ascending = False`#

Check-in#

Solution#

Check-in#

Solution#

Basic descriptive statistics in `pandas`#

`sum`#

`mean`#

`describe`#

Check-in#

Solution#

Check-in#

Solution#

Check-in#

Solution#

Creating a new column#

Using an existing column to make a new column#

Check-in#

Solution#

Conclusion#

CSS 1

Basic operations with pandas

Contents

Basic operations with pandas#

Goals of this lecture#

Review: What is tabular data?#

Example: test scores#

Reading in a .csv file#

Review: file paths#

A sample .csv file#

Check-in#

Solution#

Using read_csv#

Check-in#

Check-in#

Solution#

Check-in#

Solution#

Sorting a DataFrame#

sort_values#

ascending = False#

Check-in#

Solution#

Check-in#

Solution#

Basic descriptive statistics in pandas#

sum#

mean#

describe#

Check-in#

Solution#

Check-in#

Solution#

Check-in#

Solution#

Creating a new column#

Using an existing column to make a new column#

Check-in#

Solution#

Conclusion#

Basic operations with `pandas`#

Reading in a `.csv` file#

A sample `.csv` file#

Using `read_csv`#

Sorting a `DataFrame`#

`sort_values`#

`ascending = False`#

Basic descriptive statistics in `pandas`#

`sum`#

`mean`#

`describe`#