Basic operations with pandas
Contents
Basic operations with pandas
#
Goals of this lecture#
Previously, we discussed why pandas.DataFrame
is so useful for representing data.
In today’s lecture, we’ll cover:
How to read in a
.csv
file as aDataFrame
usingpandas.read_csv
.Sorting a
DataFrame
withsort_values
.Basic descriptive statistics with
pandas
(mean()
, etc.)Creating a new column of a
DataFrame
.
### Always remember to import the package!
import pandas as pd
Review: What is tabular data?#
Tabular data is data organized in a table with rows and columns.
This kind of data is two-dimensional.
Typically, each row represents an “observation”.
A person.
A country.
An experimental trial.
Typicallly, each column represents an attribute.
height
gdp
orpopulation
reaction_time
orexperimental_condition
Example: test scores#
Check-in: What are the columns of this table?
Participant |
Score |
---|---|
1 |
85 |
2 |
92 |
3 |
78 |
4 |
98 |
5 |
76 |
Reading in a .csv
file#
A
.csv
file is a comma-separated values file.
Tabular data is often represented in a .csv
file.
The values of each column are separated by a comma (
1,85
).Each row is represented by a new line.
A .csv
file can also be opened in Excel or Google Sheets.
Review: file paths#
Previously, we’ve discussed how a file path represents the “location” of a file on your computer.
An absolute file path specifies the location of a file relative to some root directory.
A relative file path specifies the location of a file relative to the current directory (i.e., the one you’re in right now).
We already know how to read in .txt
and .json
files, but .csv
files are new.
A sample .csv
file#
The data
directory contains a file called scores_sample.csv
.
For a series of made-up research participants, this file contains:
The student number.
Their score on a reading comprehension task.
Now we need to read this file in!
Check-in#
If we’re currently in lectures
, and the file is in lectures/data/scores_sample.csv
, what would the relative file path be?
Solution#
The filepath would be: data/scores_sample.csv
Using read_csv
#
pandas.read_csv
takes in a filepath as an argument, and returns a DataFrame
.
## Pass the filepath into pd.read_csv
df_scores = pd.read_csv("data/scores_sample.csv")
## Let's inspect the first couple of rows
df_scores.head(2)
Participant | Score | |
---|---|---|
0 | 1 | 83 |
1 | 2 | 84 |
Check-in#
What are the column names of this DataFrame
? What does each row represent?
df_scores.head(5)
Participant | Score | |
---|---|---|
0 | 1 | 83 |
1 | 2 | 84 |
2 | 3 | 89 |
3 | 4 | 80 |
4 | 5 | 81 |
Check-in#
How would we figure out how many rows and columns this DataFrame
has?
### Your code here
Solution#
## First number = number of rows
## Second number = number of columns
df_scores.shape
(20, 2)
Check-in#
Now let’s try to read in a new .csv
file. This one is also located in data
, and it’s called pokemon.csv
.
The original data can be found on Kaggle, a website for data science competitions).
It contains information about different Pokemon.
Call the new DataFrame
df_pokemon
.
### Your code here
Solution#
df_pokemon = pd.read_csv("data/pokemon.csv")
df_pokemon.head(2)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
df_pokemon.shape
(800, 13)
Sorting a DataFrame
#
Often, we might find it useful to sort a DataFrame
according to the values in a particular column.
sort_values
#
By default,
sort_values
will sort from lowest to highest.
df_scores.sort_values("Score").head(2)
Participant | Score | |
---|---|---|
19 | 20 | 75 |
18 | 19 | 79 |
df_scores.sort_values("Score").tail(2)
Participant | Score | |
---|---|---|
17 | 18 | 94 |
11 | 12 | 98 |
ascending = False
#
We can change that with the ascending
parameter.
df_scores.sort_values("Score", ascending = False).head(2)
Participant | Score | |
---|---|---|
11 | 12 | 98 |
17 | 18 | 94 |
Check-in#
Use sort_values
to find the block with the lowest HP
. Then, answer the following questions:
What is the
Speed
of this Pokemon?What is the
Legendary
status of this Pokemon?What is the
Type 1
andType 2
of this Pokemon?
### Your code here
Solution#
df_pokemon.sort_values("HP").head(1)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
316 | 292 | Shedinja | Bug | Ghost | 236 | 1 | 90 | 45 | 30 | 30 | 40 | 3 | False |
Check-in#
Use sort_values
to find the block with the highest HP
. Then, answer the following questions:
What is the
Speed
of this Pokemon?What is the
Legendary
status of this Pokemon?What is the
Type 1
andType 2
of this Pokemon?
### Your code here
Solution#
df_pokemon.sort_values("HP", ascending = False).head(1)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
261 | 242 | Blissey | Normal | NaN | 540 | 255 | 10 | 10 | 75 | 135 | 55 | 2 | False |
Basic descriptive statistics in pandas
#
Descriptive statistics are ways to summarize and organize data.
We’ve already covered using
numpy
to calculate descriptive statistics.But we can also use
pandas
in a very similar way!(A column of a
DataFrame
is roughly analogous to anumpy.ndarray
.)
sum
#
The sum of a set of numbers is simply the result of adding each number together.
The sum
of a column can be calculated by calling:
df_name[column_name].sum()
df_scores['Score'].sum()
1728
mean
#
The mean of a set of scores is the sum of those scores divided by the number of the observations.
The mean
of a column can be calculated by calling:
df_name[column_name].mean()
df_scores['Score'].mean()
86.4
describe
#
The describe
function gives you a bunch of handy descriptive statistics in a single call.
df_scores['Score'].describe()
count 20.000000
mean 86.400000
std 5.374403
min 75.000000
25% 83.750000
50% 86.500000
75% 89.250000
max 98.000000
Name: Score, dtype: float64
Check-in#
What is the mean HP
in our Pokemon dataset? What about the minimum and maximum?
### Your code here
Solution#
print(df_pokemon['HP'].mean())
print(df_pokemon['HP'].max())
print(df_pokemon['HP'].min())
69.25875
255
1
Check-in#
What is the mean Attack
Pokemon dataset? What about the minimum and maximum?
### Your code here
Solution#
print(df_pokemon['Attack'].mean())
print(df_pokemon['Attack'].max())
print(df_pokemon['Attack'].min())
79.00125
190
5
Check-in#
Could we calculate the mean
of the Type 1
variable? Why or why not?
### Your code here
Solution#
No, because Type 1
is a qualitative variable. We can only count how many of each category there are.
df_pokemon['Type 1'].value_counts()
Water 112
Normal 98
Grass 70
Bug 69
Psychic 57
Fire 52
Electric 44
Rock 44
Dragon 32
Ground 32
Ghost 32
Dark 31
Poison 28
Steel 27
Fighting 27
Ice 24
Fairy 17
Flying 4
Name: Type 1, dtype: int64
Creating a new column#
pandas
also makes it relatively easy to make a new column.
Syntax:
dataframe_name[new_column_name] = ...
### Setting the column to a single value
### just duplicates that value throughout the column
df_scores['Researcher'] = 'Bergen'
df_scores.head(2)
Participant | Score | Researcher | |
---|---|---|---|
0 | 1 | 83 | Bergen |
1 | 2 | 84 | Bergen |
Using an existing column to make a new column#
You can also use values in an existing column to make a new column.
Example: a new column called
Pass
that encodes whether a score is>= 70
.
df_scores['Pass'] = df_scores['Score'] >= 70
df_scores.head(2)
Participant | Score | Researcher | Pass | |
---|---|---|---|---|
0 | 1 | 83 | Bergen | True |
1 | 2 | 84 | Bergen | True |
Check-in#
Create a new column of df_pokemon
that calculates the ratio between the Attack
of a Pokeon and its Defense
.
### Your code here
Solution#
df_pokemon['attack_defense_ratio'] = df_pokemon['Attack'] / df_pokemon['Defense']
df_pokemon.head(2)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | attack_defense_ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | 1.000000 |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | 0.984127 |
Conclusion#
This concludes our second lecture on pandas
. Now you know:
How to read in a
.csv
file withread_csv
.How to calculate basic descriptive statistics with
pandas
.How to sort a
DataFrame
and create a new column.