{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "357a45b3",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Advanced operations with `pandas`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fdaec8e",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Announcements\n",
    "\n",
    "- CAPES and survey available!\n",
    "   - Survey: https://forms.gle/xox1KgV6FwCYoESX6\n",
    "   - CAPES: https://cape.ucsd.edu/student/instructions.html\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08a31438",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Goals of this lecture\n",
    "\n",
    "Previously, we discussed several basic operations we can performan with a [`pandas.DataFrame`](23-pandas).\n",
    "\n",
    "In today's lecture, we'll cover more advanced operations:\n",
    "\n",
    "- Filtering a `DataFrame` using a conditional statement.  \n",
    "- Using `groupby` to aggregate across members of a category.  \n",
    "- Identifying missing or `nan` values. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bb926eb1",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "### Always remember to import the package!\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb7a04ec",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Review: reading in a `csv` file\n",
    "\n",
    "`pandas.read_csv` takes in a **filepath** as an argument, and returns a `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5222fb7c",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>#</th>\n",
       "      <th>Name</th>\n",
       "      <th>Type 1</th>\n",
       "      <th>Type 2</th>\n",
       "      <th>Total</th>\n",
       "      <th>HP</th>\n",
       "      <th>Attack</th>\n",
       "      <th>Defense</th>\n",
       "      <th>Sp. Atk</th>\n",
       "      <th>Sp. Def</th>\n",
       "      <th>Speed</th>\n",
       "      <th>Generation</th>\n",
       "      <th>Legendary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Bulbasaur</td>\n",
       "      <td>Grass</td>\n",
       "      <td>Poison</td>\n",
       "      <td>318</td>\n",
       "      <td>45</td>\n",
       "      <td>49</td>\n",
       "      <td>49</td>\n",
       "      <td>65</td>\n",
       "      <td>65</td>\n",
       "      <td>45</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Ivysaur</td>\n",
       "      <td>Grass</td>\n",
       "      <td>Poison</td>\n",
       "      <td>405</td>\n",
       "      <td>60</td>\n",
       "      <td>62</td>\n",
       "      <td>63</td>\n",
       "      <td>80</td>\n",
       "      <td>80</td>\n",
       "      <td>60</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   #       Name Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  \\\n",
       "0  1  Bulbasaur  Grass  Poison    318  45      49       49       65       65   \n",
       "1  2    Ivysaur  Grass  Poison    405  60      62       63       80       80   \n",
       "\n",
       "   Speed  Generation  Legendary  \n",
       "0     45           1      False  \n",
       "1     60           1      False  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Pass the filepath into pd.read_csv\n",
    "df_pokemon = pd.read_csv(\"data/pokemon.csv\")\n",
    "df_pokemon.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8eaa4452",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Filtering a `DataFrame`\n",
    "\n",
    "Often, we want to **filter** a `DataFrame` so we only see observations that meet certain **conditions**.\n",
    "\n",
    "- \"Only show trials from the `Congruent` condition\".  \n",
    "- \"Only show neighborhoods that are `NEAR BAY`.\" \n",
    "- \"Only show scores below 70%.\"\n",
    "\n",
    "Fundamentally, this is done by using a **conditional statement**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1421cff6",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Refresher: conditionals\n",
    "\n",
    "> In a nutshell, a **conditional** is a statement that checks for whether some *condition* is met.\n",
    "\n",
    "We can use the `if` command to **control** which lines of code are executed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d488f152",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "def square_if_even(x):\n",
    "    if x % 2 == 0:\n",
    "        return x ** 2\n",
    "    else:\n",
    "        return x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "aeb1fe8d",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_if_even(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "45636b60",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "square_if_even(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56f9742a",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Conditionals and filtering\n",
    "\n",
    "- **Filtering** a `DataFrame` is analogous to applying a *condition*.  \n",
    "- I.e., \"show me only the rows that meet *this condition*."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96392e85",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Example 1: filtering on a categorical variable\n",
    "\n",
    "- The `legendary` column is a *categorical* variable, meaning there are several discrete categories.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4a5b124a",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>#</th>\n",
       "      <th>Name</th>\n",
       "      <th>Type 1</th>\n",
       "      <th>Type 2</th>\n",
       "      <th>Total</th>\n",
       "      <th>HP</th>\n",
       "      <th>Attack</th>\n",
       "      <th>Defense</th>\n",
       "      <th>Sp. Atk</th>\n",
       "      <th>Sp. Def</th>\n",
       "      <th>Speed</th>\n",
       "      <th>Generation</th>\n",
       "      <th>Legendary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>156</th>\n",
       "      <td>144</td>\n",
       "      <td>Articuno</td>\n",
       "      <td>Ice</td>\n",
       "      <td>Flying</td>\n",
       "      <td>580</td>\n",
       "      <td>90</td>\n",
       "      <td>85</td>\n",
       "      <td>100</td>\n",
       "      <td>95</td>\n",
       "      <td>125</td>\n",
       "      <td>85</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>157</th>\n",
       "      <td>145</td>\n",
       "      <td>Zapdos</td>\n",
       "      <td>Electric</td>\n",
       "      <td>Flying</td>\n",
       "      <td>580</td>\n",
       "      <td>90</td>\n",
       "      <td>90</td>\n",
       "      <td>85</td>\n",
       "      <td>125</td>\n",
       "      <td>90</td>\n",
       "      <td>100</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>158</th>\n",
       "      <td>146</td>\n",
       "      <td>Moltres</td>\n",
       "      <td>Fire</td>\n",
       "      <td>Flying</td>\n",
       "      <td>580</td>\n",
       "      <td>90</td>\n",
       "      <td>100</td>\n",
       "      <td>90</td>\n",
       "      <td>125</td>\n",
       "      <td>85</td>\n",
       "      <td>90</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>162</th>\n",
       "      <td>150</td>\n",
       "      <td>Mewtwo</td>\n",
       "      <td>Psychic</td>\n",
       "      <td>NaN</td>\n",
       "      <td>680</td>\n",
       "      <td>106</td>\n",
       "      <td>110</td>\n",
       "      <td>90</td>\n",
       "      <td>154</td>\n",
       "      <td>90</td>\n",
       "      <td>130</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>163</th>\n",
       "      <td>150</td>\n",
       "      <td>MewtwoMega Mewtwo X</td>\n",
       "      <td>Psychic</td>\n",
       "      <td>Fighting</td>\n",
       "      <td>780</td>\n",
       "      <td>106</td>\n",
       "      <td>190</td>\n",
       "      <td>100</td>\n",
       "      <td>154</td>\n",
       "      <td>100</td>\n",
       "      <td>130</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>795</th>\n",
       "      <td>719</td>\n",
       "      <td>Diancie</td>\n",
       "      <td>Rock</td>\n",
       "      <td>Fairy</td>\n",
       "      <td>600</td>\n",
       "      <td>50</td>\n",
       "      <td>100</td>\n",
       "      <td>150</td>\n",
       "      <td>100</td>\n",
       "      <td>150</td>\n",
       "      <td>50</td>\n",
       "      <td>6</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>796</th>\n",
       "      <td>719</td>\n",
       "      <td>DiancieMega Diancie</td>\n",
       "      <td>Rock</td>\n",
       "      <td>Fairy</td>\n",
       "      <td>700</td>\n",
       "      <td>50</td>\n",
       "      <td>160</td>\n",
       "      <td>110</td>\n",
       "      <td>160</td>\n",
       "      <td>110</td>\n",
       "      <td>110</td>\n",
       "      <td>6</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>797</th>\n",
       "      <td>720</td>\n",
       "      <td>HoopaHoopa Confined</td>\n",
       "      <td>Psychic</td>\n",
       "      <td>Ghost</td>\n",
       "      <td>600</td>\n",
       "      <td>80</td>\n",
       "      <td>110</td>\n",
       "      <td>60</td>\n",
       "      <td>150</td>\n",
       "      <td>130</td>\n",
       "      <td>70</td>\n",
       "      <td>6</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>798</th>\n",
       "      <td>720</td>\n",
       "      <td>HoopaHoopa Unbound</td>\n",
       "      <td>Psychic</td>\n",
       "      <td>Dark</td>\n",
       "      <td>680</td>\n",
       "      <td>80</td>\n",
       "      <td>160</td>\n",
       "      <td>60</td>\n",
       "      <td>170</td>\n",
       "      <td>130</td>\n",
       "      <td>80</td>\n",
       "      <td>6</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>799</th>\n",
       "      <td>721</td>\n",
       "      <td>Volcanion</td>\n",
       "      <td>Fire</td>\n",
       "      <td>Water</td>\n",
       "      <td>600</td>\n",
       "      <td>80</td>\n",
       "      <td>110</td>\n",
       "      <td>120</td>\n",
       "      <td>130</td>\n",
       "      <td>90</td>\n",
       "      <td>70</td>\n",
       "      <td>6</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>65 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       #                 Name    Type 1    Type 2  Total   HP  Attack  \\\n",
       "156  144             Articuno       Ice    Flying    580   90      85   \n",
       "157  145               Zapdos  Electric    Flying    580   90      90   \n",
       "158  146              Moltres      Fire    Flying    580   90     100   \n",
       "162  150               Mewtwo   Psychic       NaN    680  106     110   \n",
       "163  150  MewtwoMega Mewtwo X   Psychic  Fighting    780  106     190   \n",
       "..   ...                  ...       ...       ...    ...  ...     ...   \n",
       "795  719              Diancie      Rock     Fairy    600   50     100   \n",
       "796  719  DiancieMega Diancie      Rock     Fairy    700   50     160   \n",
       "797  720  HoopaHoopa Confined   Psychic     Ghost    600   80     110   \n",
       "798  720   HoopaHoopa Unbound   Psychic      Dark    680   80     160   \n",
       "799  721            Volcanion      Fire     Water    600   80     110   \n",
       "\n",
       "     Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  \n",
       "156      100       95      125     85           1       True  \n",
       "157       85      125       90    100           1       True  \n",
       "158       90      125       85     90           1       True  \n",
       "162       90      154       90    130           1       True  \n",
       "163      100      154      100    130           1       True  \n",
       "..       ...      ...      ...    ...         ...        ...  \n",
       "795      150      100      150     50           6       True  \n",
       "796      110      160      110    110           6       True  \n",
       "797       60      150      130     70           6       True  \n",
       "798       60      170      130     80           6       True  \n",
       "799      120      130       90     70           6       True  \n",
       "\n",
       "[65 rows x 13 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## How many legendary pokemon?\n",
    "df_pokemon[df_pokemon['Legendary']==True]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "bde470db",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "800"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Compare to actual length of dataframe\n",
    "len(df_pokemon)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d3caf00",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Example 2: filtering on a continuous variable\n",
    "\n",
    "- The `HP` column is a *continuous* variable.\n",
    "- Let's show only the rows for Pokemon with a `HP > 150`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "87a1eb47",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>#</th>\n",
       "      <th>Name</th>\n",
       "      <th>Type 1</th>\n",
       "      <th>Type 2</th>\n",
       "      <th>Total</th>\n",
       "      <th>HP</th>\n",
       "      <th>Attack</th>\n",
       "      <th>Defense</th>\n",
       "      <th>Sp. Atk</th>\n",
       "      <th>Sp. Def</th>\n",
       "      <th>Speed</th>\n",
       "      <th>Generation</th>\n",
       "      <th>Legendary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>121</th>\n",
       "      <td>113</td>\n",
       "      <td>Chansey</td>\n",
       "      <td>Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>450</td>\n",
       "      <td>250</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>35</td>\n",
       "      <td>105</td>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>155</th>\n",
       "      <td>143</td>\n",
       "      <td>Snorlax</td>\n",
       "      <td>Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>540</td>\n",
       "      <td>160</td>\n",
       "      <td>110</td>\n",
       "      <td>65</td>\n",
       "      <td>65</td>\n",
       "      <td>110</td>\n",
       "      <td>30</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>217</th>\n",
       "      <td>202</td>\n",
       "      <td>Wobbuffet</td>\n",
       "      <td>Psychic</td>\n",
       "      <td>NaN</td>\n",
       "      <td>405</td>\n",
       "      <td>190</td>\n",
       "      <td>33</td>\n",
       "      <td>58</td>\n",
       "      <td>33</td>\n",
       "      <td>58</td>\n",
       "      <td>33</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>261</th>\n",
       "      <td>242</td>\n",
       "      <td>Blissey</td>\n",
       "      <td>Normal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>540</td>\n",
       "      <td>255</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>75</td>\n",
       "      <td>135</td>\n",
       "      <td>55</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351</th>\n",
       "      <td>321</td>\n",
       "      <td>Wailord</td>\n",
       "      <td>Water</td>\n",
       "      <td>NaN</td>\n",
       "      <td>500</td>\n",
       "      <td>170</td>\n",
       "      <td>90</td>\n",
       "      <td>45</td>\n",
       "      <td>90</td>\n",
       "      <td>45</td>\n",
       "      <td>60</td>\n",
       "      <td>3</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>655</th>\n",
       "      <td>594</td>\n",
       "      <td>Alomomola</td>\n",
       "      <td>Water</td>\n",
       "      <td>NaN</td>\n",
       "      <td>470</td>\n",
       "      <td>165</td>\n",
       "      <td>75</td>\n",
       "      <td>80</td>\n",
       "      <td>40</td>\n",
       "      <td>45</td>\n",
       "      <td>65</td>\n",
       "      <td>5</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       #       Name   Type 1 Type 2  Total   HP  Attack  Defense  Sp. Atk  \\\n",
       "121  113    Chansey   Normal    NaN    450  250       5        5       35   \n",
       "155  143    Snorlax   Normal    NaN    540  160     110       65       65   \n",
       "217  202  Wobbuffet  Psychic    NaN    405  190      33       58       33   \n",
       "261  242    Blissey   Normal    NaN    540  255      10       10       75   \n",
       "351  321    Wailord    Water    NaN    500  170      90       45       90   \n",
       "655  594  Alomomola    Water    NaN    470  165      75       80       40   \n",
       "\n",
       "     Sp. Def  Speed  Generation  Legendary  \n",
       "121      105     50           1      False  \n",
       "155      110     30           1      False  \n",
       "217       58     33           2      False  \n",
       "261      135     55           2      False  \n",
       "351       45     60           3      False  \n",
       "655       45     65           5      False  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_pokemon[df_pokemon['HP'] > 150]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "342b1c5c",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "800"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Compare to actual length of df_pokemon\n",
    "len(df_pokemon)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2c707e0",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Interim summary: filtering syntax\n",
    "\n",
    "To filter a `DataFrame`, we can use the bracket (`[]`) syntax.:\n",
    "\n",
    "```python\n",
    "df_name[df_name['col_name']==CONDITION]\n",
    "```\n",
    "\n",
    "(Note that `==CONDITION` would be replaced with whatever **condition** you want to filter on.)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a477e39",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Check-in\n",
    "\n",
    "Write a statement to **filter** `df_pokemon` to show only the pokemon where `Attack` is greater than or equal to 50. \n",
    "\n",
    "- How many rows are in this subset?  \n",
    "- What proportion of Pokemon meet this criteria?\n",
    "- What is the mean `HP` in this subset?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "f3fe61c6",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "### Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d6f2720",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "eded1ec3",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "## First, write the filter statement\n",
    "df_subset = df_pokemon[df_pokemon['Attack'] >= 50]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "15e1c3a1",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "667"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### How many rows are in this subset? \n",
    "len(df_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "b433d8f9",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.83375"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### Proportion of dataset ... about half\n",
    "len(df_subset) / len(df_pokemon)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "bfe69b76",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "72.55472263868066"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Mean HP\n",
    "df_subset['HP'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8769132a",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Check-in\n",
    "\n",
    "Now write a statement to **filter** `df_pokemon` to show only the Pokemon where `Attack` is *less than* or equal to 50. \n",
    "\n",
    "- How many rows are in this subset?  \n",
    "- What proportion of Pokemon meet this criteria?\n",
    "- What is the mean `HP` in this subset?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "8918b6a5",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "## First, write the filter statement\n",
    "df_subset = df_pokemon[df_pokemon['Attack'] <= 50]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "b7889b12",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "170"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### How many rows are in this subset? \n",
    "len(df_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "edadf730",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.2125"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### Proportion of dataset ... about half\n",
    "len(df_subset) / len(df_pokemon)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "3e2e9363",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "53.72941176470588"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Mean *median* house value\n",
    "df_subset['HP'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9206fe58",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Check-in\n",
    "\n",
    "What is the mean `HP` of Pokemon that are `Legendary` vs. those that aren't?\n",
    "\n",
    "**Hint**: Create *two* subset `DataFrame`s, filtering for `== True` and `!= False`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "3402110a",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "### Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b33a406",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "2854aee1",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "## First, write the filter statement\n",
    "df_legendary = df_pokemon[df_pokemon['Legendary'] == True]\n",
    "df_not_legendary = df_pokemon[df_pokemon['Legendary'] == False]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "0ca435e9",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "92.73846153846154\n",
      "67.18231292517007\n"
     ]
    }
   ],
   "source": [
    "## Mean for legendary\n",
    "print(df_legendary['HP'].mean())\n",
    "## Mean for not legendary\n",
    "print(df_not_legendary['HP'].mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd02c796",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Interim summary\n",
    "\n",
    "- We can apply **conditionals** to a `DataFrame` to *filter* that `DataFrame`.\n",
    "   - \"Show me only observations that meet some criteria.\"\n",
    "- We also used this to look at the `mean` value of some column (`HP`) as a function of some other column (`Legendary`).\n",
    "- But there's actually an even easier way to do this!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9a1c007",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Using `groupby`\n",
    "\n",
    "> The [`groupby` function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) allows you to **split data** (i.e., along different categories) then **apply** some function to each split of that data (e.g., `mean`).\n",
    "\n",
    "Examples:\n",
    "\n",
    "- Calculate the `mean` `HP` for each category in `Type 1`.\n",
    "- Calculate the `sum` of `costs` for each `month`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caec9384",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Simple example\n",
    "\n",
    "- Let's demo this in a simple `DataFrame`.\n",
    "- First, we create a `DataFrame` with information about `Reaction Time` in two different `Condition`s.\n",
    "- Are people faster or slower for `Congruent` vs. `Incongruent` conditions?\n",
    "   - E.g., supppose these are results from the [Stroop task](https://en.wikipedia.org/wiki/Stroop_effect). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "7e1e79bf",
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Condition</th>\n",
       "      <th>Reaction Time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Congruent</td>\n",
       "      <td>500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Congruent</td>\n",
       "      <td>460</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Congruent</td>\n",
       "      <td>532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Incongruent</td>\n",
       "      <td>730</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Incongruent</td>\n",
       "      <td>720</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Incongruent</td>\n",
       "      <td>670</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Condition  Reaction Time\n",
       "0    Congruent            500\n",
       "1    Congruent            460\n",
       "2    Congruent            532\n",
       "3  Incongruent            730\n",
       "4  Incongruent            720\n",
       "5  Incongruent            670"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_rt = pd.DataFrame({'Condition': ['Congruent', 'Congruent', 'Congruent',\n",
    "                                    'Incongruent', 'Incongruent', 'Incongruent'],\n",
    "                          'Reaction Time': [500, 460, 532, 730, 720, 670]})\n",
    "df_rt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea5b7ce7",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Using `groupby`\n",
    "\n",
    "To calculate the `mean` `Reaction Time` in each condition, use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "66269e65",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Reaction Time</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Condition</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Congruent</th>\n",
       "      <td>497.333333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Incongruent</th>\n",
       "      <td>706.666667</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Reaction Time\n",
       "Condition                 \n",
       "Congruent       497.333333\n",
       "Incongruent     706.666667"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_rt.groupby(\"Condition\").mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ca251bc",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Using `groupby` and `describe`\n",
    "\n",
    "To get more statistics about each condition, use `describe`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "1b0ef464",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"8\" halign=\"left\">Reaction Time</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>min</th>\n",
       "      <th>25%</th>\n",
       "      <th>50%</th>\n",
       "      <th>75%</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Condition</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Congruent</th>\n",
       "      <td>3.0</td>\n",
       "      <td>497.333333</td>\n",
       "      <td>36.073998</td>\n",
       "      <td>460.0</td>\n",
       "      <td>480.0</td>\n",
       "      <td>500.0</td>\n",
       "      <td>516.0</td>\n",
       "      <td>532.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Incongruent</th>\n",
       "      <td>3.0</td>\n",
       "      <td>706.666667</td>\n",
       "      <td>32.145503</td>\n",
       "      <td>670.0</td>\n",
       "      <td>695.0</td>\n",
       "      <td>720.0</td>\n",
       "      <td>725.0</td>\n",
       "      <td>730.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Reaction Time                                                     \\\n",
       "                    count        mean        std    min    25%    50%    75%   \n",
       "Condition                                                                      \n",
       "Congruent             3.0  497.333333  36.073998  460.0  480.0  500.0  516.0   \n",
       "Incongruent           3.0  706.666667  32.145503  670.0  695.0  720.0  725.0   \n",
       "\n",
       "                    \n",
       "               max  \n",
       "Condition           \n",
       "Congruent    532.0  \n",
       "Incongruent  730.0  "
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_rt.groupby(\"Condition\").describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f4a9dc4",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Check-in\n",
    "\n",
    "Use `groupby` to calculate the average `HP` for each category in `Legendary`. Which category has the highest average?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "3c795c91",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "### Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9707a1fd",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Solution\n",
    "\n",
    "**Note**: By default, this will calculate the `mean` of **all available columns** (including `HP`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "d9d374d1",
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>#</th>\n",
       "      <th>Total</th>\n",
       "      <th>HP</th>\n",
       "      <th>Attack</th>\n",
       "      <th>Defense</th>\n",
       "      <th>Sp. Atk</th>\n",
       "      <th>Sp. Def</th>\n",
       "      <th>Speed</th>\n",
       "      <th>Generation</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Legendary</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>False</th>\n",
       "      <td>353.315646</td>\n",
       "      <td>417.213605</td>\n",
       "      <td>67.182313</td>\n",
       "      <td>75.669388</td>\n",
       "      <td>71.559184</td>\n",
       "      <td>68.454422</td>\n",
       "      <td>68.892517</td>\n",
       "      <td>65.455782</td>\n",
       "      <td>3.284354</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>470.215385</td>\n",
       "      <td>637.384615</td>\n",
       "      <td>92.738462</td>\n",
       "      <td>116.676923</td>\n",
       "      <td>99.661538</td>\n",
       "      <td>122.184615</td>\n",
       "      <td>105.938462</td>\n",
       "      <td>100.184615</td>\n",
       "      <td>3.769231</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    #       Total         HP      Attack    Defense  \\\n",
       "Legendary                                                             \n",
       "False      353.315646  417.213605  67.182313   75.669388  71.559184   \n",
       "True       470.215385  637.384615  92.738462  116.676923  99.661538   \n",
       "\n",
       "              Sp. Atk     Sp. Def       Speed  Generation  \n",
       "Legendary                                                  \n",
       "False       68.454422   68.892517   65.455782    3.284354  \n",
       "True       122.184615  105.938462  100.184615    3.769231  "
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_pokemon.groupby(\"Legendary\").mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41ebc656",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Restricting output to a single column\n",
    "\n",
    "To look only a specific column, you can first **slice** only the columns you're interested in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "c907be09",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Legendary</th>\n",
       "      <th>HP</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>False</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>False</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Legendary  HP\n",
       "0      False  45\n",
       "1      False  60"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_pokemon[['Legendary', 'HP']].head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86b163ee",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "Then we can call `groupby` on this simpler `DataFrame` object with fewer columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "33b0c796",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>HP</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Legendary</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>False</th>\n",
       "      <td>67.182313</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>92.738462</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  HP\n",
       "Legendary           \n",
       "False      67.182313\n",
       "True       92.738462"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_pokemon[['Legendary', 'HP']].groupby(\"Legendary\").mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b7652e0",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Missing data!\n",
    "\n",
    "- In practice, **real-world** data is often messy.  \n",
    "- This includes **missing values**, which take on the value/label `NaN`.\n",
    "   - `NaN` = \"Not a Number\".\n",
    "- [Dealing with `NaN` values is one of the main challenges in CSS!](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92c70100",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Where do missing values come from?\n",
    "\n",
    "- Lots of reasons for [missing data](https://en.wikipedia.org/wiki/Missing_data)!\n",
    "  - Experiment crashed, so some trials are missing.\n",
    "  - Some people don't answer all questions on a survey.\n",
    "  - Some people \"drop\" out of a longitudinal study.  \n",
    "\n",
    "| Subject | Condition | RT |\n",
    "| ------- | ---------- | --- | \n",
    "| 1     | Congruent        | 500 |\n",
    "| 2     | Congruent      | 550 |\n",
    "| 3     | Incongruent      | 670 |\n",
    "| 4     | NaN      | NaN |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b25acec0",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Why does missing data matter?\n",
    "\n",
    "Again, [lots of reasons](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/):\n",
    "\n",
    "1. If you're unaware of it, then you might be **overestimating** how many observations you have. (E.g., you think `N = 500`, but actually `N = 450`.)\n",
    "2. If missing values are **non-randomly distributed**, your results might be **biased**. (E.g., maybe the `Placebo` group is more likely to drop out of a longitudinal study?)\n",
    "3. Missing values may complicate your analysis. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0c2a277",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### How to deal with missing data\n",
    "\n",
    "- **Identify** missing values.  \n",
    "- **Analyze** whether they're randomly or non-randomly distributed.  \n",
    "- **Decide** how to handle them:\n",
    "  - *Remove* any rows with missing data from all analyses?  \n",
    "  - *Ignore* rows with missing data for *some* analyses but not all?  \n",
    "  - Try to *impute* (\"fill in\") the missing data using other information?\n",
    "- Not an easy problem!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b503d234",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Identifying missing data\n",
    "\n",
    "- The first step is **identifying** missing values.\n",
    "- For this, you can use `isna`.\n",
    "   - Returns a vector of `True`/`False` values indicating whether a given observation is `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "f2eed699",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>#</th>\n",
       "      <th>Name</th>\n",
       "      <th>Type 1</th>\n",
       "      <th>Type 2</th>\n",
       "      <th>Total</th>\n",
       "      <th>HP</th>\n",
       "      <th>Attack</th>\n",
       "      <th>Defense</th>\n",
       "      <th>Sp. Atk</th>\n",
       "      <th>Sp. Def</th>\n",
       "      <th>Speed</th>\n",
       "      <th>Generation</th>\n",
       "      <th>Legendary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       #   Name  Type 1  Type 2  Total     HP  Attack  Defense  Sp. Atk  \\\n",
       "0  False  False   False   False  False  False   False    False    False   \n",
       "1  False  False   False   False  False  False   False    False    False   \n",
       "2  False  False   False   False  False  False   False    False    False   \n",
       "3  False  False   False   False  False  False   False    False    False   \n",
       "4  False  False   False    True  False  False   False    False    False   \n",
       "\n",
       "   Sp. Def  Speed  Generation  Legendary  \n",
       "0    False  False       False      False  \n",
       "1    False  False       False      False  \n",
       "2    False  False       False      False  \n",
       "3    False  False       False      False  \n",
       "4    False  False       False      False  "
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### This means none of those columns\n",
    "### have NaN in top 5 rows\n",
    "df_pokemon.isna().head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "555826e4",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Which columns have `NaN`?\n",
    "\n",
    "We can discover whether any column has `NaN` using `isna().any()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "7417355f",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "#             False\n",
       "Name          False\n",
       "Type 1        False\n",
       "Type 2         True\n",
       "Total         False\n",
       "HP            False\n",
       "Attack        False\n",
       "Defense       False\n",
       "Sp. Atk       False\n",
       "Sp. Def       False\n",
       "Speed         False\n",
       "Generation    False\n",
       "Legendary     False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### Tells you which columns have NaN values\n",
    "df_pokemon.isna().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca0cd67b",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Inspecting `Type 2`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "74152c8a",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>#</th>\n",
       "      <th>Name</th>\n",
       "      <th>Type 1</th>\n",
       "      <th>Type 2</th>\n",
       "      <th>Total</th>\n",
       "      <th>HP</th>\n",
       "      <th>Attack</th>\n",
       "      <th>Defense</th>\n",
       "      <th>Sp. Atk</th>\n",
       "      <th>Sp. Def</th>\n",
       "      <th>Speed</th>\n",
       "      <th>Generation</th>\n",
       "      <th>Legendary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>Charmander</td>\n",
       "      <td>Fire</td>\n",
       "      <td>NaN</td>\n",
       "      <td>309</td>\n",
       "      <td>39</td>\n",
       "      <td>52</td>\n",
       "      <td>43</td>\n",
       "      <td>60</td>\n",
       "      <td>50</td>\n",
       "      <td>65</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>Charmeleon</td>\n",
       "      <td>Fire</td>\n",
       "      <td>NaN</td>\n",
       "      <td>405</td>\n",
       "      <td>58</td>\n",
       "      <td>64</td>\n",
       "      <td>58</td>\n",
       "      <td>80</td>\n",
       "      <td>65</td>\n",
       "      <td>80</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>7</td>\n",
       "      <td>Squirtle</td>\n",
       "      <td>Water</td>\n",
       "      <td>NaN</td>\n",
       "      <td>314</td>\n",
       "      <td>44</td>\n",
       "      <td>48</td>\n",
       "      <td>65</td>\n",
       "      <td>50</td>\n",
       "      <td>64</td>\n",
       "      <td>43</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>8</td>\n",
       "      <td>Wartortle</td>\n",
       "      <td>Water</td>\n",
       "      <td>NaN</td>\n",
       "      <td>405</td>\n",
       "      <td>59</td>\n",
       "      <td>63</td>\n",
       "      <td>80</td>\n",
       "      <td>65</td>\n",
       "      <td>80</td>\n",
       "      <td>58</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>9</td>\n",
       "      <td>Blastoise</td>\n",
       "      <td>Water</td>\n",
       "      <td>NaN</td>\n",
       "      <td>530</td>\n",
       "      <td>79</td>\n",
       "      <td>83</td>\n",
       "      <td>100</td>\n",
       "      <td>85</td>\n",
       "      <td>105</td>\n",
       "      <td>78</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    #        Name Type 1 Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  \\\n",
       "4   4  Charmander   Fire    NaN    309  39      52       43       60       50   \n",
       "5   5  Charmeleon   Fire    NaN    405  58      64       58       80       65   \n",
       "9   7    Squirtle  Water    NaN    314  44      48       65       50       64   \n",
       "10  8   Wartortle  Water    NaN    405  59      63       80       65       80   \n",
       "11  9   Blastoise  Water    NaN    530  79      83      100       85      105   \n",
       "\n",
       "    Speed  Generation  Legendary  \n",
       "4      65           1      False  \n",
       "5      80           1      False  \n",
       "9      43           1      False  \n",
       "10     58           1      False  \n",
       "11     78           1      False  "
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### These rows have NaN for \"total_bedrooms\"\n",
    "df_pokemon[df_pokemon['Type 2'].isna()].head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "fe114860",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "386"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df_pokemon[df_pokemon['Type 2'].isna()])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d46ddb16",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### `drop_na`\n",
    "\n",
    "One approach is to **drop** these observations altogether. You can do this using `dropna(column_name)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "023eae85",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "414"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_pokemon_subset = df_pokemon.dropna(subset = \"Type 2\")\n",
    "len(df_pokemon_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "99ecbe49",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "#             False\n",
       "Name          False\n",
       "Type 1        False\n",
       "Type 2        False\n",
       "Total         False\n",
       "HP            False\n",
       "Attack        False\n",
       "Defense       False\n",
       "Sp. Atk       False\n",
       "Sp. Def       False\n",
       "Speed         False\n",
       "Generation    False\n",
       "Legendary     False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### No more NaN\n",
    "df_pokemon_subset.isna().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c6abdbb",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Filtering on `~isna`\n",
    "\n",
    "You can also **drop** observations by *filtering* on values that *aren't* `NaN`.\n",
    "\n",
    "- The `~` syntax is a way to write `not` in `pandas` syntax."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "5b8047c1",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "414"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_pokemon_subset = df_pokemon[~df_pokemon['Type 2'].isna()]\n",
    "len(df_pokemon_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "f5b7f1a3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "#             False\n",
       "Name          False\n",
       "Type 1        False\n",
       "Type 2        False\n",
       "Total         False\n",
       "HP            False\n",
       "Attack        False\n",
       "Defense       False\n",
       "Sp. Atk       False\n",
       "Sp. Def       False\n",
       "Speed         False\n",
       "Generation    False\n",
       "Legendary     False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### No more NaN\n",
    "df_pokemon_subset.isna().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10b3dcd5",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Do we *want* to remove `NaN`?\n",
    "\n",
    "- In this particular case, removing `NaN` may not be necessary.\n",
    "  - Lots of Pokeon just don't *have* a secondary Type. \n",
    "- But in other cases, `NaN` actually implies **missing data**, not just a feature of some observations.\n",
    "- In those cases, we might want to remove those observations."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c9760c6",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Conclusion\n",
    "\n",
    "This concludes our unit on `pandas`.\n",
    "\n",
    "- `pandas` is central to data manipulation and data analysis in Python.  \n",
    "- It's a handy way to read `.csv` files (`read_csv`).  \n",
    "- As you'll see in CSS 2, `pandas` can also be used in conjunction with **visualization** and **analysis** packages."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}