Basic operations with pandas#

Goals of this lecture#

Previously, we discussed why pandas.DataFrame is so useful for representing data.

In today’s lecture, we’ll cover:

  • How to read in a .csv file as a DataFrame using pandas.read_csv.

  • Sorting a DataFrame with sort_values.

  • Basic descriptive statistics with pandas (mean(), etc.)

  • Creating a new column of a DataFrame.

### Always remember to import the package!
import pandas as pd

Review: What is tabular data?#

Tabular data is data organized in a table with rows and columns.

  • This kind of data is two-dimensional.

  • Typically, each row represents an “observation”.

    • A person.

    • A country.

    • An experimental trial.

  • Typicallly, each column represents an attribute.

    • height

    • gdp or population

    • reaction_time or experimental_condition

Example: test scores#

Check-in: What are the columns of this table?

Participant

Score

1

85

2

92

3

78

4

98

5

76

Reading in a .csv file#

A .csv file is a comma-separated values file.

Tabular data is often represented in a .csv file.

  • The values of each column are separated by a comma (1,85).

  • Each row is represented by a new line.

A .csv file can also be opened in Excel or Google Sheets.

Review: file paths#

Previously, we’ve discussed how a file path represents the “location” of a file on your computer.

  • An absolute file path specifies the location of a file relative to some root directory.

  • A relative file path specifies the location of a file relative to the current directory (i.e., the one you’re in right now).

We already know how to read in .txt and .json files, but .csv files are new.

A sample .csv file#

The data directory contains a file called scores_sample.csv.

For a series of made-up research participants, this file contains:

  • The student number.

  • Their score on a reading comprehension task.

Now we need to read this file in!

Check-in#

If we’re currently in lectures, and the file is in lectures/data/scores_sample.csv, what would the relative file path be?

Solution#

The filepath would be: data/scores_sample.csv

Using read_csv#

pandas.read_csv takes in a filepath as an argument, and returns a DataFrame.

## Pass the filepath into pd.read_csv
df_scores = pd.read_csv("data/scores_sample.csv")
## Let's inspect the first couple of rows
df_scores.head(2)
Participant Score
0 1 83
1 2 84

Check-in#

What are the column names of this DataFrame? What does each row represent?

df_scores.head(5)
Participant Score
0 1 83
1 2 84
2 3 89
3 4 80
4 5 81

Check-in#

How would we figure out how many rows and columns this DataFrame has?

### Your code here

Solution#

## First number = number of rows
## Second number = number of columns
df_scores.shape
(20, 2)

Check-in#

Now let’s try to read in a new .csv file. This one is also located in data, and it’s called pokemon.csv.

Call the new DataFrame df_pokemon.

### Your code here

Solution#

df_pokemon = pd.read_csv("data/pokemon.csv")
df_pokemon.head(2)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
df_pokemon.shape
(800, 13)

Sorting a DataFrame#

Often, we might find it useful to sort a DataFrame according to the values in a particular column.

sort_values#

  • By default, sort_values will sort from lowest to highest.

df_scores.sort_values("Score").head(2)
Participant Score
19 20 75
18 19 79
df_scores.sort_values("Score").tail(2)
Participant Score
17 18 94
11 12 98

ascending = False#

We can change that with the ascending parameter.

df_scores.sort_values("Score", ascending = False).head(2)
Participant Score
11 12 98
17 18 94

Check-in#

Use sort_values to find the block with the lowest HP. Then, answer the following questions:

  • What is the Speed of this Pokemon?

  • What is the Legendary status of this Pokemon?

  • What is the Type 1 and Type 2 of this Pokemon?

### Your code here

Solution#

df_pokemon.sort_values("HP").head(1)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
316 292 Shedinja Bug Ghost 236 1 90 45 30 30 40 3 False

Check-in#

Use sort_values to find the block with the highest HP. Then, answer the following questions:

  • What is the Speed of this Pokemon?

  • What is the Legendary status of this Pokemon?

  • What is the Type 1 and Type 2 of this Pokemon?

### Your code here

Solution#

df_pokemon.sort_values("HP", ascending = False).head(1)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
261 242 Blissey Normal NaN 540 255 10 10 75 135 55 2 False

Basic descriptive statistics in pandas#

Descriptive statistics are ways to summarize and organize data.

sum#

The sum of a set of numbers is simply the result of adding each number together.

The sum of a column can be calculated by calling:

df_name[column_name].sum()
df_scores['Score'].sum()
1728

mean#

The mean of a set of scores is the sum of those scores divided by the number of the observations.

The mean of a column can be calculated by calling:

df_name[column_name].mean()
df_scores['Score'].mean()
86.4

describe#

The describe function gives you a bunch of handy descriptive statistics in a single call.

df_scores['Score'].describe()
count    20.000000
mean     86.400000
std       5.374403
min      75.000000
25%      83.750000
50%      86.500000
75%      89.250000
max      98.000000
Name: Score, dtype: float64

Check-in#

What is the mean HP in our Pokemon dataset? What about the minimum and maximum?

### Your code here

Solution#

print(df_pokemon['HP'].mean())
print(df_pokemon['HP'].max())
print(df_pokemon['HP'].min())
69.25875
255
1

Check-in#

What is the mean Attack Pokemon dataset? What about the minimum and maximum?

### Your code here

Solution#

print(df_pokemon['Attack'].mean())
print(df_pokemon['Attack'].max())
print(df_pokemon['Attack'].min())
79.00125
190
5

Check-in#

Could we calculate the mean of the Type 1 variable? Why or why not?

### Your code here

Solution#

No, because Type 1 is a qualitative variable. We can only count how many of each category there are.

df_pokemon['Type 1'].value_counts()
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Dragon       32
Ground       32
Ghost        32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64

Creating a new column#

pandas also makes it relatively easy to make a new column.

Syntax:

dataframe_name[new_column_name] = ...
### Setting the column to a single value 
### just duplicates that value throughout the column
df_scores['Researcher'] = 'Bergen'
df_scores.head(2)
Participant Score Researcher
0 1 83 Bergen
1 2 84 Bergen

Using an existing column to make a new column#

You can also use values in an existing column to make a new column.

  • Example: a new column called Pass that encodes whether a score is >= 70.

df_scores['Pass'] = df_scores['Score'] >= 70
df_scores.head(2)
Participant Score Researcher Pass
0 1 83 Bergen True
1 2 84 Bergen True

Check-in#

Create a new column of df_pokemon that calculates the ratio between the Attack of a Pokeon and its Defense.

### Your code here

Solution#

df_pokemon['attack_defense_ratio'] = df_pokemon['Attack'] / df_pokemon['Defense']
df_pokemon.head(2)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary attack_defense_ratio
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False 1.000000
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False 0.984127

Conclusion#

This concludes our second lecture on pandas. Now you know:

  • How to read in a .csv file with read_csv.

  • How to calculate basic descriptive statistics with pandas.

  • How to sort a DataFrame and create a new column.