{ "cells": [ { "cell_type": "markdown", "id": "357a45b3", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Advanced operations with `pandas`" ] }, { "cell_type": "markdown", "id": "9fdaec8e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Announcements\n", "\n", "- CAPES and survey available!\n", " - Survey: https://forms.gle/xox1KgV6FwCYoESX6\n", " - CAPES: https://cape.ucsd.edu/student/instructions.html\n" ] }, { "cell_type": "markdown", "id": "08a31438", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Goals of this lecture\n", "\n", "Previously, we discussed several basic operations we can performan with a [`pandas.DataFrame`](23-pandas).\n", "\n", "In today's lecture, we'll cover more advanced operations:\n", "\n", "- Filtering a `DataFrame` using a conditional statement. \n", "- Using `groupby` to aggregate across members of a category. \n", "- Identifying missing or `nan` values. " ] }, { "cell_type": "code", "execution_count": 2, "id": "bb926eb1", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "### Always remember to import the package!\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "cb7a04ec", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Review: reading in a `csv` file\n", "\n", "`pandas.read_csv` takes in a **filepath** as an argument, and returns a `DataFrame`." ] }, { "cell_type": "code", "execution_count": 3, "id": "5222fb7c", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
01BulbasaurGrassPoison3184549496565451False
12IvysaurGrassPoison4056062638080601False
\n", "
" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def \\\n", "0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 \n", "1 2 Ivysaur Grass Poison 405 60 62 63 80 80 \n", "\n", " Speed Generation Legendary \n", "0 45 1 False \n", "1 60 1 False " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Pass the filepath into pd.read_csv\n", "df_pokemon = pd.read_csv(\"data/pokemon.csv\")\n", "df_pokemon.head(2)" ] }, { "cell_type": "markdown", "id": "8eaa4452", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Filtering a `DataFrame`\n", "\n", "Often, we want to **filter** a `DataFrame` so we only see observations that meet certain **conditions**.\n", "\n", "- \"Only show trials from the `Congruent` condition\". \n", "- \"Only show neighborhoods that are `NEAR BAY`.\" \n", "- \"Only show scores below 70%.\"\n", "\n", "Fundamentally, this is done by using a **conditional statement**." ] }, { "cell_type": "markdown", "id": "1421cff6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Refresher: conditionals\n", "\n", "> In a nutshell, a **conditional** is a statement that checks for whether some *condition* is met.\n", "\n", "We can use the `if` command to **control** which lines of code are executed." ] }, { "cell_type": "code", "execution_count": 4, "id": "d488f152", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "def square_if_even(x):\n", " if x % 2 == 0:\n", " return x ** 2\n", " else:\n", " return x" ] }, { "cell_type": "code", "execution_count": 5, "id": "aeb1fe8d", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_if_even(2)" ] }, { "cell_type": "code", "execution_count": 6, "id": "45636b60", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "square_if_even(3)" ] }, { "cell_type": "markdown", "id": "56f9742a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Conditionals and filtering\n", "\n", "- **Filtering** a `DataFrame` is analogous to applying a *condition*. \n", "- I.e., \"show me only the rows that meet *this condition*." ] }, { "cell_type": "markdown", "id": "96392e85", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example 1: filtering on a categorical variable\n", "\n", "- The `legendary` column is a *categorical* variable, meaning there are several discrete categories. " ] }, { "cell_type": "code", "execution_count": 7, "id": "4a5b124a", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
156144ArticunoIceFlying580908510095125851True
157145ZapdosElectricFlying580909085125901001True
158146MoltresFireFlying580901009012585901True
162150MewtwoPsychicNaN68010611090154901301True
163150MewtwoMega Mewtwo XPsychicFighting7801061901001541001301True
..........................................
795719DiancieRockFairy60050100150100150506True
796719DiancieMega DiancieRockFairy700501601101601101106True
797720HoopaHoopa ConfinedPsychicGhost6008011060150130706True
798720HoopaHoopa UnboundPsychicDark6808016060170130806True
799721VolcanionFireWater6008011012013090706True
\n", "

65 rows × 13 columns

\n", "
" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack \\\n", "156 144 Articuno Ice Flying 580 90 85 \n", "157 145 Zapdos Electric Flying 580 90 90 \n", "158 146 Moltres Fire Flying 580 90 100 \n", "162 150 Mewtwo Psychic NaN 680 106 110 \n", "163 150 MewtwoMega Mewtwo X Psychic Fighting 780 106 190 \n", ".. ... ... ... ... ... ... ... \n", "795 719 Diancie Rock Fairy 600 50 100 \n", "796 719 DiancieMega Diancie Rock Fairy 700 50 160 \n", "797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 \n", "798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 \n", "799 721 Volcanion Fire Water 600 80 110 \n", "\n", " Defense Sp. Atk Sp. Def Speed Generation Legendary \n", "156 100 95 125 85 1 True \n", "157 85 125 90 100 1 True \n", "158 90 125 85 90 1 True \n", "162 90 154 90 130 1 True \n", "163 100 154 100 130 1 True \n", ".. ... ... ... ... ... ... \n", "795 150 100 150 50 6 True \n", "796 110 160 110 110 6 True \n", "797 60 150 130 70 6 True \n", "798 60 170 130 80 6 True \n", "799 120 130 90 70 6 True \n", "\n", "[65 rows x 13 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## How many legendary pokemon?\n", "df_pokemon[df_pokemon['Legendary']==True]" ] }, { "cell_type": "code", "execution_count": 8, "id": "bde470db", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "800" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Compare to actual length of dataframe\n", "len(df_pokemon)" ] }, { "cell_type": "markdown", "id": "8d3caf00", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example 2: filtering on a continuous variable\n", "\n", "- The `HP` column is a *continuous* variable.\n", "- Let's show only the rows for Pokemon with a `HP > 150`." ] }, { "cell_type": "code", "execution_count": 9, "id": "87a1eb47", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
121113ChanseyNormalNaN4502505535105501False
155143SnorlaxNormalNaN5401601106565110301False
217202WobbuffetPsychicNaN40519033583358332False
261242BlisseyNormalNaN540255101075135552False
351321WailordWaterNaN50017090459045603False
655594AlomomolaWaterNaN47016575804045655False
\n", "
" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack Defense Sp. Atk \\\n", "121 113 Chansey Normal NaN 450 250 5 5 35 \n", "155 143 Snorlax Normal NaN 540 160 110 65 65 \n", "217 202 Wobbuffet Psychic NaN 405 190 33 58 33 \n", "261 242 Blissey Normal NaN 540 255 10 10 75 \n", "351 321 Wailord Water NaN 500 170 90 45 90 \n", "655 594 Alomomola Water NaN 470 165 75 80 40 \n", "\n", " Sp. Def Speed Generation Legendary \n", "121 105 50 1 False \n", "155 110 30 1 False \n", "217 58 33 2 False \n", "261 135 55 2 False \n", "351 45 60 3 False \n", "655 45 65 5 False " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pokemon[df_pokemon['HP'] > 150]" ] }, { "cell_type": "code", "execution_count": 10, "id": "342b1c5c", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "800" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Compare to actual length of df_pokemon\n", "len(df_pokemon)" ] }, { "cell_type": "markdown", "id": "f2c707e0", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Interim summary: filtering syntax\n", "\n", "To filter a `DataFrame`, we can use the bracket (`[]`) syntax.:\n", "\n", "```python\n", "df_name[df_name['col_name']==CONDITION]\n", "```\n", "\n", "(Note that `==CONDITION` would be replaced with whatever **condition** you want to filter on.)" ] }, { "cell_type": "markdown", "id": "8a477e39", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Check-in\n", "\n", "Write a statement to **filter** `df_pokemon` to show only the pokemon where `Attack` is greater than or equal to 50. \n", "\n", "- How many rows are in this subset? \n", "- What proportion of Pokemon meet this criteria?\n", "- What is the mean `HP` in this subset?" ] }, { "cell_type": "code", "execution_count": 19, "id": "f3fe61c6", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "### Your code here" ] }, { "cell_type": "markdown", "id": "1d6f2720", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Solution" ] }, { "cell_type": "code", "execution_count": 21, "id": "eded1ec3", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "## First, write the filter statement\n", "df_subset = df_pokemon[df_pokemon['Attack'] >= 50]" ] }, { "cell_type": "code", "execution_count": 22, "id": "15e1c3a1", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "667" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### How many rows are in this subset? \n", "len(df_subset)" ] }, { "cell_type": "code", "execution_count": 24, "id": "b433d8f9", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0.83375" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### Proportion of dataset ... about half\n", "len(df_subset) / len(df_pokemon)" ] }, { "cell_type": "code", "execution_count": 25, "id": "bfe69b76", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "72.55472263868066" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Mean HP\n", "df_subset['HP'].mean()" ] }, { "cell_type": "markdown", "id": "8769132a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Check-in\n", "\n", "Now write a statement to **filter** `df_pokemon` to show only the Pokemon where `Attack` is *less than* or equal to 50. \n", "\n", "- How many rows are in this subset? \n", "- What proportion of Pokemon meet this criteria?\n", "- What is the mean `HP` in this subset?" ] }, { "cell_type": "code", "execution_count": 29, "id": "8918b6a5", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "## First, write the filter statement\n", "df_subset = df_pokemon[df_pokemon['Attack'] <= 50]" ] }, { "cell_type": "code", "execution_count": 30, "id": "b7889b12", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "170" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### How many rows are in this subset? \n", "len(df_subset)" ] }, { "cell_type": "code", "execution_count": 31, "id": "edadf730", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0.2125" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### Proportion of dataset ... about half\n", "len(df_subset) / len(df_pokemon)" ] }, { "cell_type": "code", "execution_count": 32, "id": "3e2e9363", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "53.72941176470588" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Mean *median* house value\n", "df_subset['HP'].mean()" ] }, { "cell_type": "markdown", "id": "9206fe58", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Check-in\n", "\n", "What is the mean `HP` of Pokemon that are `Legendary` vs. those that aren't?\n", "\n", "**Hint**: Create *two* subset `DataFrame`s, filtering for `== True` and `!= False`." ] }, { "cell_type": "code", "execution_count": 33, "id": "3402110a", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "### Your code here" ] }, { "cell_type": "markdown", "id": "7b33a406", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Solution" ] }, { "cell_type": "code", "execution_count": 34, "id": "2854aee1", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "## First, write the filter statement\n", "df_legendary = df_pokemon[df_pokemon['Legendary'] == True]\n", "df_not_legendary = df_pokemon[df_pokemon['Legendary'] == False]" ] }, { "cell_type": "code", "execution_count": 36, "id": "0ca435e9", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "92.73846153846154\n", "67.18231292517007\n" ] } ], "source": [ "## Mean for legendary\n", "print(df_legendary['HP'].mean())\n", "## Mean for not legendary\n", "print(df_not_legendary['HP'].mean())" ] }, { "cell_type": "markdown", "id": "dd02c796", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Interim summary\n", "\n", "- We can apply **conditionals** to a `DataFrame` to *filter* that `DataFrame`.\n", " - \"Show me only observations that meet some criteria.\"\n", "- We also used this to look at the `mean` value of some column (`HP`) as a function of some other column (`Legendary`).\n", "- But there's actually an even easier way to do this!" ] }, { "cell_type": "markdown", "id": "f9a1c007", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Using `groupby`\n", "\n", "> The [`groupby` function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) allows you to **split data** (i.e., along different categories) then **apply** some function to each split of that data (e.g., `mean`).\n", "\n", "Examples:\n", "\n", "- Calculate the `mean` `HP` for each category in `Type 1`.\n", "- Calculate the `sum` of `costs` for each `month`." ] }, { "cell_type": "markdown", "id": "caec9384", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Simple example\n", "\n", "- Let's demo this in a simple `DataFrame`.\n", "- First, we create a `DataFrame` with information about `Reaction Time` in two different `Condition`s.\n", "- Are people faster or slower for `Congruent` vs. `Incongruent` conditions?\n", " - E.g., supppose these are results from the [Stroop task](https://en.wikipedia.org/wiki/Stroop_effect). " ] }, { "cell_type": "code", "execution_count": 37, "id": "7e1e79bf", "metadata": { "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ConditionReaction Time
0Congruent500
1Congruent460
2Congruent532
3Incongruent730
4Incongruent720
5Incongruent670
\n", "
" ], "text/plain": [ " Condition Reaction Time\n", "0 Congruent 500\n", "1 Congruent 460\n", "2 Congruent 532\n", "3 Incongruent 730\n", "4 Incongruent 720\n", "5 Incongruent 670" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_rt = pd.DataFrame({'Condition': ['Congruent', 'Congruent', 'Congruent',\n", " 'Incongruent', 'Incongruent', 'Incongruent'],\n", " 'Reaction Time': [500, 460, 532, 730, 720, 670]})\n", "df_rt" ] }, { "cell_type": "markdown", "id": "ea5b7ce7", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Using `groupby`\n", "\n", "To calculate the `mean` `Reaction Time` in each condition, use:" ] }, { "cell_type": "code", "execution_count": 38, "id": "66269e65", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Reaction Time
Condition
Congruent497.333333
Incongruent706.666667
\n", "
" ], "text/plain": [ " Reaction Time\n", "Condition \n", "Congruent 497.333333\n", "Incongruent 706.666667" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_rt.groupby(\"Condition\").mean()" ] }, { "cell_type": "markdown", "id": "7ca251bc", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Using `groupby` and `describe`\n", "\n", "To get more statistics about each condition, use `describe`:" ] }, { "cell_type": "code", "execution_count": 39, "id": "1b0ef464", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Reaction Time
countmeanstdmin25%50%75%max
Condition
Congruent3.0497.33333336.073998460.0480.0500.0516.0532.0
Incongruent3.0706.66666732.145503670.0695.0720.0725.0730.0
\n", "
" ], "text/plain": [ " Reaction Time \\\n", " count mean std min 25% 50% 75% \n", "Condition \n", "Congruent 3.0 497.333333 36.073998 460.0 480.0 500.0 516.0 \n", "Incongruent 3.0 706.666667 32.145503 670.0 695.0 720.0 725.0 \n", "\n", " \n", " max \n", "Condition \n", "Congruent 532.0 \n", "Incongruent 730.0 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_rt.groupby(\"Condition\").describe()" ] }, { "cell_type": "markdown", "id": "8f4a9dc4", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Check-in\n", "\n", "Use `groupby` to calculate the average `HP` for each category in `Legendary`. Which category has the highest average?" ] }, { "cell_type": "code", "execution_count": 40, "id": "3c795c91", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "### Your code here" ] }, { "cell_type": "markdown", "id": "9707a1fd", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Solution\n", "\n", "**Note**: By default, this will calculate the `mean` of **all available columns** (including `HP`)." ] }, { "cell_type": "code", "execution_count": 41, "id": "d9d374d1", "metadata": { "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#TotalHPAttackDefenseSp. AtkSp. DefSpeedGeneration
Legendary
False353.315646417.21360567.18231375.66938871.55918468.45442268.89251765.4557823.284354
True470.215385637.38461592.738462116.67692399.661538122.184615105.938462100.1846153.769231
\n", "
" ], "text/plain": [ " # Total HP Attack Defense \\\n", "Legendary \n", "False 353.315646 417.213605 67.182313 75.669388 71.559184 \n", "True 470.215385 637.384615 92.738462 116.676923 99.661538 \n", "\n", " Sp. Atk Sp. Def Speed Generation \n", "Legendary \n", "False 68.454422 68.892517 65.455782 3.284354 \n", "True 122.184615 105.938462 100.184615 3.769231 " ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pokemon.groupby(\"Legendary\").mean()" ] }, { "cell_type": "markdown", "id": "41ebc656", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Restricting output to a single column\n", "\n", "To look only a specific column, you can first **slice** only the columns you're interested in:" ] }, { "cell_type": "code", "execution_count": 43, "id": "c907be09", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LegendaryHP
0False45
1False60
\n", "
" ], "text/plain": [ " Legendary HP\n", "0 False 45\n", "1 False 60" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pokemon[['Legendary', 'HP']].head(2)" ] }, { "cell_type": "markdown", "id": "86b163ee", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Then we can call `groupby` on this simpler `DataFrame` object with fewer columns." ] }, { "cell_type": "code", "execution_count": 44, "id": "33b0c796", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
HP
Legendary
False67.182313
True92.738462
\n", "
" ], "text/plain": [ " HP\n", "Legendary \n", "False 67.182313\n", "True 92.738462" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pokemon[['Legendary', 'HP']].groupby(\"Legendary\").mean()" ] }, { "cell_type": "markdown", "id": "7b7652e0", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Missing data!\n", "\n", "- In practice, **real-world** data is often messy. \n", "- This includes **missing values**, which take on the value/label `NaN`.\n", " - `NaN` = \"Not a Number\".\n", "- [Dealing with `NaN` values is one of the main challenges in CSS!](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)" ] }, { "cell_type": "markdown", "id": "92c70100", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Where do missing values come from?\n", "\n", "- Lots of reasons for [missing data](https://en.wikipedia.org/wiki/Missing_data)!\n", " - Experiment crashed, so some trials are missing.\n", " - Some people don't answer all questions on a survey.\n", " - Some people \"drop\" out of a longitudinal study. \n", "\n", "| Subject | Condition | RT |\n", "| ------- | ---------- | --- | \n", "| 1 | Congruent | 500 |\n", "| 2 | Congruent | 550 |\n", "| 3 | Incongruent | 670 |\n", "| 4 | NaN | NaN |" ] }, { "cell_type": "markdown", "id": "b25acec0", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Why does missing data matter?\n", "\n", "Again, [lots of reasons](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/):\n", "\n", "1. If you're unaware of it, then you might be **overestimating** how many observations you have. (E.g., you think `N = 500`, but actually `N = 450`.)\n", "2. If missing values are **non-randomly distributed**, your results might be **biased**. (E.g., maybe the `Placebo` group is more likely to drop out of a longitudinal study?)\n", "3. Missing values may complicate your analysis. \n" ] }, { "cell_type": "markdown", "id": "e0c2a277", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### How to deal with missing data\n", "\n", "- **Identify** missing values. \n", "- **Analyze** whether they're randomly or non-randomly distributed. \n", "- **Decide** how to handle them:\n", " - *Remove* any rows with missing data from all analyses? \n", " - *Ignore* rows with missing data for *some* analyses but not all? \n", " - Try to *impute* (\"fill in\") the missing data using other information?\n", "- Not an easy problem!" ] }, { "cell_type": "markdown", "id": "b503d234", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Identifying missing data\n", "\n", "- The first step is **identifying** missing values.\n", "- For this, you can use `isna`.\n", " - Returns a vector of `True`/`False` values indicating whether a given observation is `NaN`." ] }, { "cell_type": "code", "execution_count": 45, "id": "f2eed699", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
0FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
4FalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse
\n", "
" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack Defense Sp. Atk \\\n", "0 False False False False False False False False False \n", "1 False False False False False False False False False \n", "2 False False False False False False False False False \n", "3 False False False False False False False False False \n", "4 False False False True False False False False False \n", "\n", " Sp. Def Speed Generation Legendary \n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### This means none of those columns\n", "### have NaN in top 5 rows\n", "df_pokemon.isna().head(5)" ] }, { "cell_type": "markdown", "id": "555826e4", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Which columns have `NaN`?\n", "\n", "We can discover whether any column has `NaN` using `isna().any()`." ] }, { "cell_type": "code", "execution_count": 46, "id": "7417355f", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "# False\n", "Name False\n", "Type 1 False\n", "Type 2 True\n", "Total False\n", "HP False\n", "Attack False\n", "Defense False\n", "Sp. Atk False\n", "Sp. Def False\n", "Speed False\n", "Generation False\n", "Legendary False\n", "dtype: bool" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### Tells you which columns have NaN values\n", "df_pokemon.isna().any()" ] }, { "cell_type": "markdown", "id": "ca0cd67b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Inspecting `Type 2`" ] }, { "cell_type": "code", "execution_count": 47, "id": "74152c8a", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#NameType 1Type 2TotalHPAttackDefenseSp. AtkSp. DefSpeedGenerationLegendary
44CharmanderFireNaN3093952436050651False
55CharmeleonFireNaN4055864588065801False
97SquirtleWaterNaN3144448655064431False
108WartortleWaterNaN4055963806580581False
119BlastoiseWaterNaN530798310085105781False
\n", "
" ], "text/plain": [ " # Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def \\\n", "4 4 Charmander Fire NaN 309 39 52 43 60 50 \n", "5 5 Charmeleon Fire NaN 405 58 64 58 80 65 \n", "9 7 Squirtle Water NaN 314 44 48 65 50 64 \n", "10 8 Wartortle Water NaN 405 59 63 80 65 80 \n", "11 9 Blastoise Water NaN 530 79 83 100 85 105 \n", "\n", " Speed Generation Legendary \n", "4 65 1 False \n", "5 80 1 False \n", "9 43 1 False \n", "10 58 1 False \n", "11 78 1 False " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### These rows have NaN for \"total_bedrooms\"\n", "df_pokemon[df_pokemon['Type 2'].isna()].head(5)" ] }, { "cell_type": "code", "execution_count": 49, "id": "fe114860", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "386" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_pokemon[df_pokemon['Type 2'].isna()])" ] }, { "cell_type": "markdown", "id": "d46ddb16", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### `drop_na`\n", "\n", "One approach is to **drop** these observations altogether. You can do this using `dropna(column_name)`." ] }, { "cell_type": "code", "execution_count": 51, "id": "023eae85", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "414" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pokemon_subset = df_pokemon.dropna(subset = \"Type 2\")\n", "len(df_pokemon_subset)" ] }, { "cell_type": "code", "execution_count": 52, "id": "99ecbe49", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "# False\n", "Name False\n", "Type 1 False\n", "Type 2 False\n", "Total False\n", "HP False\n", "Attack False\n", "Defense False\n", "Sp. Atk False\n", "Sp. Def False\n", "Speed False\n", "Generation False\n", "Legendary False\n", "dtype: bool" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### No more NaN\n", "df_pokemon_subset.isna().any()" ] }, { "cell_type": "markdown", "id": "2c6abdbb", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Filtering on `~isna`\n", "\n", "You can also **drop** observations by *filtering* on values that *aren't* `NaN`.\n", "\n", "- The `~` syntax is a way to write `not` in `pandas` syntax." ] }, { "cell_type": "code", "execution_count": 54, "id": "5b8047c1", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "414" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pokemon_subset = df_pokemon[~df_pokemon['Type 2'].isna()]\n", "len(df_pokemon_subset)" ] }, { "cell_type": "code", "execution_count": 55, "id": "f5b7f1a3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "# False\n", "Name False\n", "Type 1 False\n", "Type 2 False\n", "Total False\n", "HP False\n", "Attack False\n", "Defense False\n", "Sp. Atk False\n", "Sp. Def False\n", "Speed False\n", "Generation False\n", "Legendary False\n", "dtype: bool" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### No more NaN\n", "df_pokemon_subset.isna().any()" ] }, { "cell_type": "markdown", "id": "10b3dcd5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Do we *want* to remove `NaN`?\n", "\n", "- In this particular case, removing `NaN` may not be necessary.\n", " - Lots of Pokeon just don't *have* a secondary Type. \n", "- But in other cases, `NaN` actually implies **missing data**, not just a feature of some observations.\n", "- In those cases, we might want to remove those observations." ] }, { "cell_type": "markdown", "id": "9c9760c6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Conclusion\n", "\n", "This concludes our unit on `pandas`.\n", "\n", "- `pandas` is central to data manipulation and data analysis in Python. \n", "- It's a handy way to read `.csv` files (`read_csv`). \n", "- As you'll see in CSS 2, `pandas` can also be used in conjunction with **visualization** and **analysis** packages." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }