Commit d74a1dea by Patryk Czarnik

finished notebooks

parent 9308b618
{
"cells": [
{
"cell_type": "markdown",
"id": "413742af-1218-433a-bea3-6684adb0517a",
"metadata": {},
"source": [
"For \"business data analysis\" we will use the `pandas` library. We also import `numpy` here, although it is usually not needed. By convention, the imported modules are given the short aliases `pd` and `np`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "833db353-396e-46a5-ba91-695ad955799a",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "176ff729-bf4e-4c94-a002-a557fc26ed4f",
"metadata": {},
"source": [
"## Loading data\n",
"\n",
"We usually start by loading data from an external source. Most often this is a CSV file, but pandas can also read Excel files and many other formats, and can fetch data from SQL databases."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "518a1c12-6c15-40aa-984d-96723ce0a770",
"metadata": {},
"outputs": [],
"source": [
"# this loads the data in the most basic way\n",
"# emps = pd.read_csv('pliki/emps.csv', sep=';')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f24a4023-5669-4411-8ffe-528d1cbf6276",
"metadata": {},
"outputs": [],
"source": [
"# now we add further options: use employee_id as the index and parse hire_date as dates\n",
"emps = pd.read_csv('pliki/emps.csv', sep=';', index_col='employee_id', parse_dates=['hire_date'])"
]
},
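{
"cell_type": "markdown",
"id": "sketch-read-csv-options",
"metadata": {},
"source": [
"A minimal, self-contained sketch of what the extra `read_csv` options change. The inline CSV text is made up for illustration and only imitates a few columns of `pliki/emps.csv`; the names `rows`, `csv_text`, `plain` and `tuned` are hypothetical.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Made-up stand-in for the real file (assumption: semicolon-separated, ISO dates)\n",
"rows = [\n",
"    'employee_id;last_name;salary;hire_date',\n",
"    '100;King;24000;1997-06-17',\n",
"    '101;Kochhar;17000;1999-09-21',\n",
"]\n",
"csv_text = '\\n'.join(rows)\n",
"\n",
"# Basic load: positional index, hire_date left unparsed\n",
"plain = pd.read_csv(io.StringIO(csv_text), sep=';')\n",
"\n",
"# Load with an index column and date parsing\n",
"tuned = pd.read_csv(io.StringIO(csv_text), sep=';',\n",
"                    index_col='employee_id', parse_dates=['hire_date'])\n",
"\n",
"print(plain.index.tolist())\n",
"print(tuned.index.tolist())\n",
"print(plain['hire_date'].dtype, tuned['hire_date'].dtype)\n",
"```\n",
"\n",
"With `index_col`, rows can later be addressed by `employee_id`; with `parse_dates`, `hire_date` becomes a datetime column instead of text."
]
},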
{
"cell_type": "code",
"execution_count": 7,
"id": "d1df7179-f3f2-4c0d-b26b-5cad84a0097a",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24000</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>2003-01-13</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>Programmer</td>\n",
" <td>9000</td>\n",
" <td>2000-01-03</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>Programmer</td>\n",
" <td>6000</td>\n",
" <td>2001-05-21</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>Pat</td>\n",
" <td>Fay</td>\n",
" <td>Marketing Representative</td>\n",
" <td>6000</td>\n",
" <td>2007-08-17</td>\n",
" <td>Marketing</td>\n",
" <td>147 Spadina Ave</td>\n",
" <td>M5V 2L7</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>Susan</td>\n",
" <td>Mavris</td>\n",
" <td>Human Resources Representative</td>\n",
" <td>6500</td>\n",
" <td>2004-06-07</td>\n",
" <td>Human Resources</td>\n",
" <td>8204 Arthur St</td>\n",
" <td>NaN</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>Hermann</td>\n",
" <td>Baer</td>\n",
" <td>Public Relations Representative</td>\n",
" <td>10000</td>\n",
" <td>2004-06-07</td>\n",
" <td>Public Relations</td>\n",
" <td>Schwanthalerstr. 7031</td>\n",
" <td>80925</td>\n",
" <td>Munich</td>\n",
" <td>Germany</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>Shelley</td>\n",
" <td>Higgins</td>\n",
" <td>Accounting Manager</td>\n",
" <td>12000</td>\n",
" <td>2004-06-07</td>\n",
" <td>Accounting</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>William</td>\n",
" <td>Gietz</td>\n",
" <td>Public Accountant</td>\n",
" <td>8300</td>\n",
" <td>2004-06-07</td>\n",
" <td>Accounting</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>107 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24000 \n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"102 Lex De Haan Administration Vice President 17000 \n",
"103 Alexander Hunold Programmer 9000 \n",
"104 Bruce Ernst Programmer 6000 \n",
"... ... ... ... ... \n",
"202 Pat Fay Marketing Representative 6000 \n",
"203 Susan Mavris Human Resources Representative 6500 \n",
"204 Hermann Baer Public Relations Representative 10000 \n",
"205 Shelley Higgins Accounting Manager 12000 \n",
"206 William Gietz Public Accountant 8300 \n",
"\n",
" hire_date department_name address postal_code \\\n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 \n",
"101 1999-09-21 Executive 2004 Charade Rd 98199 \n",
"102 2003-01-13 Executive 2004 Charade Rd 98199 \n",
"103 2000-01-03 IT 2014 Jabberwocky Rd 26192 \n",
"104 2001-05-21 IT 2014 Jabberwocky Rd 26192 \n",
"... ... ... ... ... \n",
"202 2007-08-17 Marketing 147 Spadina Ave M5V 2L7 \n",
"203 2004-06-07 Human Resources 8204 Arthur St NaN \n",
"204 2004-06-07 Public Relations Schwanthalerstr. 7031 80925 \n",
"205 2004-06-07 Accounting 2004 Charade Rd 98199 \n",
"206 2004-06-07 Accounting 2004 Charade Rd 98199 \n",
"\n",
" city country \n",
"employee_id \n",
"100 Seattle United States of America \n",
"101 Seattle United States of America \n",
"102 Seattle United States of America \n",
"103 Southlake United States of America \n",
"104 Southlake United States of America \n",
"... ... ... \n",
"202 Toronto Canada \n",
"203 London United Kingdom \n",
"204 Munich Germany \n",
"205 Seattle United States of America \n",
"206 Seattle United States of America \n",
"\n",
"[107 rows x 10 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps"
]
},
{
"cell_type": "markdown",
"id": "806bcbce-a21a-49a2-8348-e6f4f3462756",
"metadata": {},
"source": [
"We will also load a second file..."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d5947655-9698-42ad-996f-23d4271792c6",
"metadata": {},
"outputs": [],
"source": [
"sprzedaz = pd.read_csv('pliki/sprzedaz.csv', sep=',', parse_dates=['data'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "30fc9f3c-630b-4014-881c-de1aec07cb6a",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data</th>\n",
" <th>miasto</th>\n",
" <th>sklep</th>\n",
" <th>kategoria</th>\n",
" <th>towar</th>\n",
" <th>cena</th>\n",
" <th>sztuk</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2014-11-23</td>\n",
" <td>Łódź</td>\n",
" <td>Wdowiak</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2017-05-07</td>\n",
" <td>Radom</td>\n",
" <td>Czarnecki</td>\n",
" <td>wyposażenie szkolne</td>\n",
" <td>tablica</td>\n",
" <td>590.00</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2017-05-05</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>szkolno-biurowe</td>\n",
" <td>flamaster</td>\n",
" <td>0.99</td>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016-10-19</td>\n",
" <td>Kraków</td>\n",
" <td>Wróbel</td>\n",
" <td>wyposażenie szkolne</td>\n",
" <td>gąbka</td>\n",
" <td>4.00</td>\n",
" <td>250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016-04-08</td>\n",
" <td>Poznań</td>\n",
" <td>Borowik</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9995</th>\n",
" <td>2016-05-22</td>\n",
" <td>Katowice</td>\n",
" <td>Gaińska</td>\n",
" <td>szkolno-biurowe</td>\n",
" <td>dziurkacz</td>\n",
" <td>7.50</td>\n",
" <td>178</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9996</th>\n",
" <td>2016-11-19</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9997</th>\n",
" <td>2016-09-30</td>\n",
" <td>Łódź</td>\n",
" <td>Wdowiak</td>\n",
" <td>szkolno-biurowe</td>\n",
" <td>długopis</td>\n",
" <td>1.49</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9998</th>\n",
" <td>2015-05-01</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9999</th>\n",
" <td>2016-08-26</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>wyposażenie szkolne</td>\n",
" <td>gąbka</td>\n",
" <td>4.00</td>\n",
" <td>152</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10000 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" data miasto sklep kategoria towar cena \\\n",
"0 2014-11-23 Łódź Wdowiak meble biurko 149.99 \n",
"1 2017-05-07 Radom Czarnecki wyposażenie szkolne tablica 590.00 \n",
"2 2017-05-05 Kraków Kozłowski szkolno-biurowe flamaster 0.99 \n",
"3 2016-10-19 Kraków Wróbel wyposażenie szkolne gąbka 4.00 \n",
"4 2016-04-08 Poznań Borowik meble biurko 149.99 \n",
"... ... ... ... ... ... ... \n",
"9995 2016-05-22 Katowice Gaińska szkolno-biurowe dziurkacz 7.50 \n",
"9996 2016-11-19 Kraków Kozłowski meble biurko 149.99 \n",
"9997 2016-09-30 Łódź Wdowiak szkolno-biurowe długopis 1.49 \n",
"9998 2015-05-01 Kraków Kozłowski meble biurko 149.99 \n",
"9999 2016-08-26 Kraków Kozłowski wyposażenie szkolne gąbka 4.00 \n",
"\n",
" sztuk \n",
"0 4 \n",
"1 2 \n",
"2 51 \n",
"3 250 \n",
"4 9 \n",
"... ... \n",
"9995 178 \n",
"9996 7 \n",
"9997 87 \n",
"9998 10 \n",
"9999 152 \n",
"\n",
"[10000 rows x 7 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz"
]
},
{
"cell_type": "markdown",
"id": "6e55dad7-dc62-43a2-90e0-3199a86bba22",
"metadata": {},
"source": [
"## Data types etc.\n",
"\n",
"A table of data, the counterpart of an Excel sheet or a database table, is a `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "32594f3d-21a1-4bcd-8c53-f0ad1c09926b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(emps)"
]
},
{
"cell_type": "markdown",
"id": "30dad998-30cd-4f10-8787-9b00e4d667b8",
"metadata": {},
"source": [
"A single column, a \"data series\", is of type `Series`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "68dab7ee-4ab8-44c0-a432-9576cfba60ec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(emps.last_name)"
]
},
{
"cell_type": "markdown",
"id": "dc3bb090-1949-4c02-9b28-8d03bdd9358d",
"metadata": {},
"source": [
"What are the types of the columns?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "3e3d2466-d010-4c0a-8808-c68986e531cc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"data datetime64[ns]\n",
"miasto object\n",
"sklep object\n",
"kategoria object\n",
"towar object\n",
"cena float64\n",
"sztuk int64\n",
"dtype: object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz.dtypes"
]
},
{
"cell_type": "markdown",
"id": "3ecee7dc-6b81-4d06-9b33-86d26d35d3aa",
"metadata": {},
"source": [
"Column names:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "0bca2b80-503a-4ecd-8bcf-ca0424b8c56d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['data', 'miasto', 'sklep', 'kategoria', 'towar', 'cena', 'sztuk'], dtype='object')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz.columns"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b55585f4-c7bb-4b8a-abab-dc381049770e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'sklep'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz.columns[2]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "af56bc84-5d80-44a8-88d4-72d79e8453fa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(107, 10)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.shape"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "3a411a55-e813-4037-9f25-7482c72a1c4f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1070"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.size"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "56e657a0-5c17-4741-bcb7-41cddc570993",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"107"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(emps)"
]
},
{
"cell_type": "markdown",
"id": "f8cc6bb1-b7c8-4b56-a086-b4e0c05ecb5c",
"metadata": {},
"source": [
"## Indexing\n",
"\n",
"that is, access by coordinates.\n",
"\n",
"- ``.iloc`` - access by numeric position, as in `numpy`, counting from zero\n",
"- ``.loc`` - access by index value and column name"
]
},
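{
"cell_type": "markdown",
"id": "sketch-iloc-vs-loc",
"metadata": {},
"source": [
"A small sketch of the difference between the two accessors, using a made-up three-row frame rather than the real data (the names `df`, `by_position` and `by_label` are hypothetical):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up miniature of the employee table, indexed by employee_id\n",
"df = pd.DataFrame(\n",
"    {'first_name': ['Steven', 'Neena', 'Lex'],\n",
"     'salary': [24000, 17000, 17000]},\n",
"    index=pd.Index([100, 101, 102], name='employee_id'),\n",
")\n",
"\n",
"by_position = df.iloc[1, 0]           # second row, first column\n",
"by_label = df.loc[101, 'first_name']  # row labelled 101, column 'first_name'\n",
"print(by_position, by_label)\n",
"```\n",
"\n",
"Both expressions address the same cell; `.iloc` ignores the index labels entirely, while `.loc` ignores the positions."
]
},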
{
"cell_type": "code",
"execution_count": 18,
"id": "4be9c4e3-5db4-4aef-ab9b-8848cf6f33f5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Steven'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.iloc[0, 0]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "4f8411c4-4735-4712-b541-1f6d3dd86efe",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"17000"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.iloc[2, 3]"
]
},
{
"cell_type": "markdown",
"id": "0efb33fa-aaff-4996-80bc-ac21f9a05ee5",
"metadata": {},
"source": [
"`DataFrame` and `Series` are mutable: individual values can be changed in place."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "c12fa03f-4fbc-464f-8a59-cbb0a9ca2d35",
"metadata": {},
"outputs": [],
"source": [
"emps.iloc[0, 3] += 1"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "cc3281fc-f38b-4add-8850-11f5c8649b8a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24001</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>2003-01-13</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>Programmer</td>\n",
" <td>9000</td>\n",
" <td>2000-01-03</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>Programmer</td>\n",
" <td>6000</td>\n",
" <td>2001-05-21</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24001 \n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"102 Lex De Haan Administration Vice President 17000 \n",
"103 Alexander Hunold Programmer 9000 \n",
"104 Bruce Ernst Programmer 6000 \n",
"\n",
" hire_date department_name address postal_code \\\n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 \n",
"101 1999-09-21 Executive 2004 Charade Rd 98199 \n",
"102 2003-01-13 Executive 2004 Charade Rd 98199 \n",
"103 2000-01-03 IT 2014 Jabberwocky Rd 26192 \n",
"104 2001-05-21 IT 2014 Jabberwocky Rd 26192 \n",
"\n",
" city country \n",
"employee_id \n",
"100 Seattle United States of America \n",
"101 Seattle United States of America \n",
"102 Seattle United States of America \n",
"103 Southlake United States of America \n",
"104 Southlake United States of America "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "f83593f7-7a4f-4501-b733-2aa8c30a4634",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>120</th>\n",
" <td>Matthew</td>\n",
" <td>Weiss</td>\n",
" <td>Stock Manager</td>\n",
" <td>8000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>121</th>\n",
" <td>Adam</td>\n",
" <td>Fripp</td>\n",
" <td>Stock Manager</td>\n",
" <td>8200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>122</th>\n",
" <td>Payam</td>\n",
" <td>Kaufling</td>\n",
" <td>Stock Manager</td>\n",
" <td>7900</td>\n",
" </tr>\n",
" <tr>\n",
" <th>123</th>\n",
" <td>Shanta</td>\n",
" <td>Vollman</td>\n",
" <td>Stock Manager</td>\n",
" <td>6500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124</th>\n",
" <td>Kevin</td>\n",
" <td>Mourgos</td>\n",
" <td>Stock Manager</td>\n",
" <td>5800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>125</th>\n",
" <td>Julia</td>\n",
" <td>Nayer</td>\n",
" <td>Stock Clerk</td>\n",
" <td>3200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>126</th>\n",
" <td>Irene</td>\n",
" <td>Mikkilineni</td>\n",
" <td>Stock Clerk</td>\n",
" <td>2700</td>\n",
" </tr>\n",
" <tr>\n",
" <th>127</th>\n",
" <td>James</td>\n",
" <td>Landry</td>\n",
" <td>Stock Clerk</td>\n",
" <td>2400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>128</th>\n",
" <td>Steven</td>\n",
" <td>Markle</td>\n",
" <td>Stock Clerk</td>\n",
" <td>2200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129</th>\n",
" <td>Laura</td>\n",
" <td>Bissot</td>\n",
" <td>Stock Clerk</td>\n",
" <td>3300</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary\n",
"employee_id \n",
"120 Matthew Weiss Stock Manager 8000\n",
"121 Adam Fripp Stock Manager 8200\n",
"122 Payam Kaufling Stock Manager 7900\n",
"123 Shanta Vollman Stock Manager 6500\n",
"124 Kevin Mourgos Stock Manager 5800\n",
"125 Julia Nayer Stock Clerk 3200\n",
"126 Irene Mikkilineni Stock Clerk 2700\n",
"127 James Landry Stock Clerk 2400\n",
"128 Steven Markle Stock Clerk 2200\n",
"129 Laura Bissot Stock Clerk 3300"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.iloc[20:30, :4]"
]
},
{
"cell_type": "markdown",
"id": "44dcabe7-9e7e-4789-a55a-6ba6b4869dd4",
"metadata": {},
"source": [
"`.loc` gives access by the \"business\" index and by column names"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "7506e308-23b4-453b-87dc-934040bb9516",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Kochhar'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.loc[101, 'last_name']"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "79604102-9c56-402f-95b4-fbd6dc6d6a8c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>Programmer</td>\n",
" <td>9000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>Programmer</td>\n",
" <td>6000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>105</th>\n",
" <td>David</td>\n",
" <td>Austin</td>\n",
" <td>Programmer</td>\n",
" <td>4800</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary\n",
"employee_id \n",
"102 Lex De Haan Administration Vice President 17000\n",
"103 Alexander Hunold Programmer 9000\n",
"104 Bruce Ernst Programmer 6000\n",
"105 David Austin Programmer 4800"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.loc[102:105, 'first_name':'salary']"
]
},
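{
"cell_type": "markdown",
"id": "sketch-slice-endpoints",
"metadata": {},
"source": [
"Note that label slices with `.loc` include the end label, whereas positional slices with `.iloc` exclude the end position, as ordinary Python slices do. A minimal sketch on made-up data (the names `df`, `label_slice` and `position_slice` are hypothetical):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up salaries indexed by employee_id\n",
"df = pd.DataFrame({'salary': [24000, 17000, 17000, 9000, 6000]},\n",
"                  index=pd.Index([100, 101, 102, 103, 104], name='employee_id'))\n",
"\n",
"label_slice = df.loc[101:103]   # labels 101, 102 and 103: end label included\n",
"position_slice = df.iloc[1:3]   # positions 1 and 2: end position excluded\n",
"print(len(label_slice), len(position_slice))\n",
"```"
]
},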
{
"cell_type": "code",
"execution_count": 31,
"id": "20d4942e-0a49-42e2-8ea7-860b7cfb22cb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>salary</th>\n",
" <th>city</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>24001</td>\n",
" <td>Seattle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>17000</td>\n",
" <td>Seattle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>17000</td>\n",
" <td>Seattle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>9000</td>\n",
" <td>Southlake</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>6000</td>\n",
" <td>Southlake</td>\n",
" </tr>\n",
" <tr>\n",
" <th>105</th>\n",
" <td>David</td>\n",
" <td>Austin</td>\n",
" <td>4800</td>\n",
" <td>Southlake</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name salary city\n",
"employee_id \n",
"100 Steven King 24001 Seattle\n",
"101 Neena Kochhar 17000 Seattle\n",
"102 Lex De Haan 17000 Seattle\n",
"103 Alexander Hunold 9000 Southlake\n",
"104 Bruce Ernst 6000 Southlake\n",
"105 David Austin 4800 Southlake"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.loc[:105, ['first_name', 'last_name', 'salary', 'city']]"
]
},
{
"cell_type": "markdown",
"id": "73a9600a-9078-4bb0-a376-cffd7b402482",
"metadata": {},
"source": [
"Reading a whole selected column is even simpler:\n",
"- attribute notation, available only if the column name contains no spaces or other special characters:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "30c45466-5e94-49bb-b330-4388189f5f14",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 24001\n",
"101 17000\n",
"102 17000\n",
"103 9000\n",
"104 6000\n",
" ... \n",
"202 6000\n",
"203 6500\n",
"204 10000\n",
"205 12000\n",
"206 8300\n",
"Name: salary, Length: 107, dtype: int64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary"
]
},
{
"cell_type": "markdown",
"id": "5a858350-06b3-4ad3-a726-7bcc9e2ded36",
"metadata": {},
"source": [
"- dictionary notation"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "ac767f05-1b0f-43f6-b11b-3b3809636e7f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 1997-06-17\n",
"101 1999-09-21\n",
"102 2003-01-13\n",
"103 2000-01-03\n",
"104 2001-05-21\n",
" ... \n",
"202 2007-08-17\n",
"203 2004-06-07\n",
"204 2004-06-07\n",
"205 2004-06-07\n",
"206 2004-06-07\n",
"Name: hire_date, Length: 107, dtype: datetime64[ns]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps['hire_date']"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "71eadaec-b8a8-4d44-918e-a3062549d319",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>last_name</th>\n",
" <th>hire_date</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>King</td>\n",
" <td>1997-06-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Kochhar</td>\n",
" <td>1999-09-21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>De Haan</td>\n",
" <td>2003-01-13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Hunold</td>\n",
" <td>2000-01-03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Ernst</td>\n",
" <td>2001-05-21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>Fay</td>\n",
" <td>2007-08-17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>Mavris</td>\n",
" <td>2004-06-07</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>Baer</td>\n",
" <td>2004-06-07</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>Higgins</td>\n",
" <td>2004-06-07</td>\n",
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>Gietz</td>\n",
" <td>2004-06-07</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>107 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" last_name hire_date\n",
"employee_id \n",
"100 King 1997-06-17\n",
"101 Kochhar 1999-09-21\n",
"102 De Haan 2003-01-13\n",
"103 Hunold 2000-01-03\n",
"104 Ernst 2001-05-21\n",
"... ... ...\n",
"202 Fay 2007-08-17\n",
"203 Mavris 2004-06-07\n",
"204 Baer 2004-06-07\n",
"205 Higgins 2004-06-07\n",
"206 Gietz 2004-06-07\n",
"\n",
"[107 rows x 2 columns]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps[['last_name', 'hire_date']]"
]
},
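{
"cell_type": "markdown",
"id": "sketch-attribute-vs-brackets",
"metadata": {},
"source": [
"A short sketch of when the bracket notation is required. The frame is made up; besides names with spaces, a column whose name coincides with an existing `DataFrame` attribute (here `size`) is also reachable only through brackets.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up frame: one name contains a space, another shadows a DataFrame attribute\n",
"df = pd.DataFrame({'salary': [24000, 17000],\n",
"                   'job title': ['President', 'Vice President'],\n",
"                   'size': ['L', 'M']})\n",
"\n",
"print(df.salary.tolist())        # attribute notation works for a plain name\n",
"print(df['job title'].tolist())  # a name with a space needs brackets\n",
"print(df.size)                   # the built-in element count, not the 'size' column\n",
"print(df['size'].tolist())       # brackets reach the column itself\n",
"```"
]
},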
{
"cell_type": "markdown",
"id": "50c95e65-80ab-476e-aa37-0cff3fc6f9b0",
"metadata": {},
"source": [
"Iterating over all rows\n",
"(rarely used in practice; if at all, then in a `.py` program rather than in a Jupyter Notebook)."
]
},
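{
"cell_type": "markdown",
"id": "sketch-row-iteration",
"metadata": {},
"source": [
"One possible way to write such a loop is `DataFrame.itertuples()`, sketched here on made-up data; this is an illustration under that assumption, not necessarily the approach used in this notebook. Column-wise operations usually make an explicit loop unnecessary.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up two-row frame indexed by employee_id\n",
"df = pd.DataFrame({'first_name': ['Steven', 'Neena'],\n",
"                   'last_name': ['King', 'Kochhar'],\n",
"                   'salary': [24000, 17000]},\n",
"                  index=pd.Index([100, 101], name='employee_id'))\n",
"\n",
"lines = []\n",
"for row in df.itertuples():\n",
"    # row.Index is the index value; each column is available as an attribute\n",
"    line = f'{row.Index}: {row.first_name} {row.last_name} earns {row.salary}'\n",
"    lines.append(line)\n",
"    print(line)\n",
"\n",
"# An aggregate computed column-wise, without a loop\n",
"total = df['salary'].sum()\n",
"print(total)\n",
"```"
]
},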
{
"cell_type": "code",
"execution_count": 41,
"id": "0e7725bb-bfac-404f-8e16-25033562a435",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Osoba Steven King zarabia 24001\n",
"Osoba Neena Kochhar zarabia 17000\n",
"Osoba Lex De Haan zarabia 17000\n",
"Osoba Alexander Hunold zarabia 9000\n",
"Osoba Bruce Ernst zarabia 6000\n",
"Osoba David Austin zarabia 4800\n",
"Osoba Valli Pataballa zarabia 4800\n",
"Osoba Diana Lorentz zarabia 4200\n",
"Osoba Nancy Greenberg zarabia 12000\n",
"Osoba Daniel Faviet zarabia 9000\n",
"Osoba John Chen zarabia 8200\n",
"Osoba Ismael Sciarra zarabia 7700\n",
"Osoba Jose Manuel Urman zarabia 7800\n",
"Osoba Luis Popp zarabia 6900\n",
"Osoba Den Raphaely zarabia 11000\n",
"Osoba Alexander Khoo zarabia 3100\n",
"Osoba Shelli Baida zarabia 2900\n",
"Osoba Sigal Tobias zarabia 2800\n",
"Osoba Guy Himuro zarabia 2600\n",
"Osoba Karen Colmenares zarabia 2500\n",
"Osoba Matthew Weiss zarabia 8000\n",
"Osoba Adam Fripp zarabia 8200\n",
"Osoba Payam Kaufling zarabia 7900\n",
"Osoba Shanta Vollman zarabia 6500\n",
"Osoba Kevin Mourgos zarabia 5800\n",
"Osoba Julia Nayer zarabia 3200\n",
"Osoba Irene Mikkilineni zarabia 2700\n",
"Osoba James Landry zarabia 2400\n",
"Osoba Steven Markle zarabia 2200\n",
"Osoba Laura Bissot zarabia 3300\n",
"Osoba Mozhe Atkinson zarabia 2800\n",
"Osoba James Marlow zarabia 2500\n",
"Osoba TJ Olson zarabia 2100\n",
"Osoba Jason Mallin zarabia 3300\n",
"Osoba Michael Rogers zarabia 2900\n",
"Osoba Ki Gee zarabia 2400\n",
"Osoba Hazel Philtanker zarabia 2200\n",
"Osoba Renske Ladwig zarabia 3600\n",
"Osoba Stephen Stiles zarabia 3200\n",
"Osoba John Seo zarabia 2700\n",
"Osoba Joshua Patel zarabia 2500\n",
"Osoba Trenna Rajs zarabia 3500\n",
"Osoba Curtis Davies zarabia 3100\n",
"Osoba Randall Matos zarabia 2600\n",
"Osoba Peter Vargas zarabia 2500\n",
"Osoba John Russell zarabia 14000\n",
"Osoba Karen Partners zarabia 13500\n",
"Osoba Alberto Errazuriz zarabia 12000\n",
"Osoba Gerald Cambrault zarabia 11000\n",
"Osoba Eleni Zlotkey zarabia 10500\n",
"Osoba Peter Tucker zarabia 10000\n",
"Osoba David Bernstein zarabia 9500\n",
"Osoba Peter Hall zarabia 9000\n",
"Osoba Christopher Olsen zarabia 8000\n",
"Osoba Nanette Cambrault zarabia 7500\n",
"Osoba Oliver Tuvault zarabia 7000\n",
"Osoba Janette King zarabia 10000\n",
"Osoba Patrick Sully zarabia 9500\n",
"Osoba Allan McEwen zarabia 9000\n",
"Osoba Lindsey Smith zarabia 8000\n",
"Osoba Louise Doran zarabia 7500\n",
"Osoba Sarath Sewall zarabia 7000\n",
"Osoba Clara Vishney zarabia 10500\n",
"Osoba Danielle Greene zarabia 9500\n",
"Osoba Mattea Marvins zarabia 7200\n",
"Osoba David Lee zarabia 6800\n",
"Osoba Sundar Ande zarabia 6400\n",
"Osoba Amit Banda zarabia 6200\n",
"Osoba Lisa Ozer zarabia 11500\n",
"Osoba Harrison Bloom zarabia 10000\n",
"Osoba Tayler Fox zarabia 9600\n",
"Osoba William Smith zarabia 7400\n",
"Osoba Elizabeth Bates zarabia 7300\n",
"Osoba Sundita Kumar zarabia 6100\n",
"Osoba Ellen Abel zarabia 11000\n",
"Osoba Alyssa Hutton zarabia 8800\n",
"Osoba Jonathon Taylor zarabia 8600\n",
"Osoba Jack Livingston zarabia 8400\n",
"Osoba Kimberely Grant zarabia 7000\n",
"Osoba Charles Johnson zarabia 6200\n",
"Osoba Winston Taylor zarabia 3200\n",
"Osoba Jean Fleaur zarabia 3100\n",
"Osoba Martha Sullivan zarabia 2500\n",
"Osoba Girard Geoni zarabia 2800\n",
"Osoba Nandita Sarchand zarabia 4200\n",
"Osoba Alexis Bull zarabia 4100\n",
"Osoba Julia Dellinger zarabia 3400\n",
"Osoba Anthony Cabrio zarabia 3000\n",
"Osoba Kelly Chung zarabia 3800\n",
"Osoba Jennifer Dilly zarabia 3600\n",
"Osoba Timothy Gates zarabia 2900\n",
"Osoba Randall Perkins zarabia 2500\n",
"Osoba Sarah Bell zarabia 4000\n",
"Osoba Britney Everett zarabia 3900\n",
"Osoba Samuel McCain zarabia 3200\n",
"Osoba Vance Jones zarabia 2800\n",
"Osoba Alana Walsh zarabia 3100\n",
"Osoba Kevin Feeney zarabia 3000\n",
"Osoba Donald OConnell zarabia 2600\n",
"Osoba Douglas Grant zarabia 2600\n",
"Osoba Jennifer Whalen zarabia 4400\n",
"Osoba Michael Hartstein zarabia 13000\n",
"Osoba Pat Fay zarabia 6000\n",
"Osoba Susan Mavris zarabia 6500\n",
"Osoba Hermann Baer zarabia 10000\n",
"Osoba Shelley Higgins zarabia 12000\n",
"Osoba William Gietz zarabia 8300\n"
]
}
],
"source": [
"for idx, row in emps.iterrows():\n",
" print(f'Osoba {row.first_name} {row.last_name} zarabia {row[\"salary\"]}')"
]
},
{
"cell_type": "markdown",
"id": "c35cd855-26e2-4b5a-b614-5aeab4ac6422",
"metadata": {},
"source": [
"## Filtering data\n",
"\n",
"Selecting rows by a logical condition."
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "b499577f-04a8-4779-a699-27ad6bd65a89",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24001</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>2003-01-13</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24001 \n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"102 Lex De Haan Administration Vice President 17000 \n",
"\n",
" hire_date department_name address postal_code city \\\n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 Seattle \n",
"101 1999-09-21 Executive 2004 Charade Rd 98199 Seattle \n",
"102 2003-01-13 Executive 2004 Charade Rd 98199 Seattle \n",
"\n",
" country \n",
"employee_id \n",
"100 United States of America \n",
"101 United States of America \n",
"102 United States of America "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps[emps.salary >= 15000]"
]
},
{
"cell_type": "markdown",
"id": "12666367-22b8-4cdd-8cf1-3d963ee418a1",
"metadata": {},
"source": [
"Technically, the expression `emps.salary >= 15000` produces a series of True/False values, one per row."
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "b884e0ff-a189-4cec-ba17-720bb43e654d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 True\n",
"101 True\n",
"102 True\n",
"103 False\n",
"104 False\n",
" ... \n",
"202 False\n",
"203 False\n",
"204 False\n",
"205 False\n",
"206 False\n",
"Name: salary, Length: 107, dtype: bool"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary >= 15000"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "ffe6b0e2-6295-451a-a4fb-35652d3e842a",
"metadata": {},
"outputs": [],
"source": [
"warunki = emps.salary >= 15000"
]
},
{
"cell_type": "markdown",
"id": "2867f3a0-d8a5-4f4d-966b-f10b48a918c7",
"metadata": {},
"source": [
"When we pass such a series in square brackets, the result contains those records for which the corresponding position holds True."
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "d8249bbf-e26b-457d-bb55-d3eb9153859a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24001</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>2003-01-13</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24001 \n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"102 Lex De Haan Administration Vice President 17000 \n",
"\n",
" hire_date department_name address postal_code city \\\n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 Seattle \n",
"101 1999-09-21 Executive 2004 Charade Rd 98199 Seattle \n",
"102 2003-01-13 Executive 2004 Charade Rd 98199 Seattle \n",
"\n",
" country \n",
"employee_id \n",
"100 United States of America \n",
"101 United States of America \n",
"102 United States of America "
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps[warunki]"
]
},
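{
"cell_type": "markdown",
"id": "6a1bd071-6a02-49ba-a88a-87523b4f0872",
"metadata": {},
"source": [
"### Compound logical conditions\n",
"\n",
"Combine conditions with the element-wise operators ``&`` and ``|``, not with Python's `and` and `or`. Wrap each component condition in parentheses, because ``&`` and ``|`` bind more tightly than comparisons."
]
},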
{
"cell_type": "code",
"execution_count": 48,
"id": "c8fc1a87-2aab-40d6-8a52-18386b0ae922",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>Programmer</td>\n",
" <td>9000</td>\n",
" <td>2000-01-03</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>Programmer</td>\n",
" <td>6000</td>\n",
" <td>2001-05-21</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary hire_date \\\n",
"employee_id \n",
"103 Alexander Hunold Programmer 9000 2000-01-03 \n",
"104 Bruce Ernst Programmer 6000 2001-05-21 \n",
"\n",
" department_name address postal_code city \\\n",
"employee_id \n",
"103 IT 2014 Jabberwocky Rd 26192 Southlake \n",
"104 IT 2014 Jabberwocky Rd 26192 Southlake \n",
"\n",
" country \n",
"employee_id \n",
"103 United States of America \n",
"104 United States of America "
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps[(emps.job_title == 'Programmer') & (emps.salary >= 5000)]"
]
},
{
"cell_type": "markdown",
"id": "05ac332e-00ab-48e6-bb60-c0c188f0ffa5",
"metadata": {},
"source": [
"## Aggregate functions\n",
"\n",
"Statistics and similar summaries.\n",
"\n",
"The simplest approach is to call a function on a single column:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "67d95ad3-04a1-41f4-ac16-124e8c0c02e2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6461.691588785046"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary.mean()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "e31c1359-33c7-4724-9547-fd17460cb51f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2100, 24001, 6461.691588785046, 6200.0, 691401, 107, 3909.408070359112)"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary.min(), emps.salary.max(), emps.salary.mean(), emps.salary.median(), emps.salary.sum(), emps.salary.count(), emps.salary.std()"
]
},
{
"cell_type": "markdown",
"id": "614c91fb-35a8-4859-a276-d23b1344bcaf",
"metadata": {},
"source": [
"Combining filtering with computing statistics, we can, for example, calculate the mean salary of programmers:"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "06eab5e5-3393-4900-beea-0829baf098a8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5760.0"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps[emps.job_title == 'Programmer'].salary.mean()"
]
},
{
"cell_type": "markdown",
"id": "c33b801b-a4f2-4666-a42a-7a6f4167d7b5",
"metadata": {},
"source": [
"As a side note, the filter can also be applied to the column itself:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "83bda502-f8e7-49da-97fc-6dc0a74ef68b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5760.0"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary[emps.job_title == 'Programmer'].mean()"
]
},
{
"cell_type": "markdown",
"id": "5b6c0a5f-1c90-4af3-9c36-b367dff5a60a",
"metadata": {},
"source": [
"### Functions that compute several things at once"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "c02553b9-a581-4552-9e46-1218d7cb8fd6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 107.000000\n",
"mean 6461.691589\n",
"std 3909.408070\n",
"min 2100.000000\n",
"25% 3100.000000\n",
"50% 6200.000000\n",
"75% 8900.000000\n",
"max 24001.000000\n",
"Name: salary, dtype: float64"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary.describe()"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "e302a438-f917-4131-b388-999b982d5583",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 107.000000\n",
"mean 6461.691589\n",
"std 3909.408070\n",
"min 2100.000000\n",
"10% 2560.000000\n",
"20% 2900.000000\n",
"50% 6200.000000\n",
"max 24001.000000\n",
"Name: salary, dtype: float64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary.describe(percentiles=[0.1, 0.2, 0.5])"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "9a38a54d-e31e-4907-a9c4-f1bd7bb971b6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 107.000000\n",
"mean 6461.691589\n",
"std 3909.408070\n",
"min 2100.000000\n",
"0% 2100.000000\n",
"10% 2560.000000\n",
"20% 2900.000000\n",
"30% 3200.000000\n",
"40% 4040.000000\n",
"50% 6200.000000\n",
"60% 7260.000000\n",
"70% 8200.000000\n",
"80% 9500.000000\n",
"90% 11000.000000\n",
"max 24001.000000\n",
"Name: salary, dtype: float64"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary.describe(percentiles=np.arange(0, 1, 0.1))"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "c34e86d4-541d-416e-94c7-0483be5bd9e9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 106\n",
"unique 7\n",
"top South San Francisco\n",
"freq 45\n",
"Name: city, dtype: object"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.city.describe()"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "8dd0410f-9b35-48a1-91cc-333916f79ef8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"city\n",
"South San Francisco 45\n",
"Oxford 34\n",
"Seattle 18\n",
"Southlake 5\n",
"Toronto 2\n",
"London 1\n",
"Munich 1\n",
"Name: count, dtype: int64"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.city.value_counts()"
]
},
{
"cell_type": "markdown",
"id": "5cc30484-9027-48e7-b4ad-223ad27090f9",
"metadata": {},
"source": [
"The `agg` operation computes several aggregate functions over the same data.\n",
"\n",
"It is especially useful in combination with grouping, which is covered shortly..."
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "6447e85e-2078-4667-b536-b49fce6f7e4b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"min 2100.000000\n",
"mean 6461.691589\n",
"max 24001.000000\n",
"Name: salary, dtype: float64"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary.agg(['min', 'mean', 'max'])"
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "b49cd299-4d7d-4dae-90e2-13e4a11afc99",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 5.0\n",
"min 4200.0\n",
"mean 5760.0\n",
"median 4800.0\n",
"max 9000.0\n",
"Name: salary, dtype: float64"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps[emps.job_title == 'Programmer'].salary.agg(['count', 'min', 'mean', 'median', 'max'])"
]
},
{
"cell_type": "markdown",
"id": "28a14de8-ddb8-4f16-a2be-cfc300bdf2f2",
"metadata": {},
"source": [
"## Working with the sales example"
]
},
{
"cell_type": "markdown",
"id": "65096c60-0917-4939-8aa0-95de69a4eafd",
"metadata": {},
"source": [
"The loaded table has the columns `cena` (unit price) and `sztuk` (number of units); only their product gives the value of a transaction.\n",
"\n",
"We will add a new column, `wartosc` (value), to hold that product."
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "a335695f-9e7f-4977-94be-484404276027",
"metadata": {},
"outputs": [],
"source": [
"sprzedaz['wartosc'] = sprzedaz.cena * sprzedaz.sztuk"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "cfa4857f-99bb-4e89-a7e4-6c916efdcb11",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data</th>\n",
" <th>miasto</th>\n",
" <th>sklep</th>\n",
" <th>kategoria</th>\n",
" <th>towar</th>\n",
" <th>cena</th>\n",
" <th>sztuk</th>\n",
" <th>wartosc</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2014-11-23</td>\n",
" <td>Łódź</td>\n",
" <td>Wdowiak</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>4</td>\n",
" <td>599.96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2017-05-07</td>\n",
" <td>Radom</td>\n",
" <td>Czarnecki</td>\n",
" <td>wyposażenie szkolne</td>\n",
" <td>tablica</td>\n",
" <td>590.00</td>\n",
" <td>2</td>\n",
" <td>1180.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2017-05-05</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>szkolno-biurowe</td>\n",
" <td>flamaster</td>\n",
" <td>0.99</td>\n",
" <td>51</td>\n",
" <td>50.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016-10-19</td>\n",
" <td>Kraków</td>\n",
" <td>Wróbel</td>\n",
" <td>wyposażenie szkolne</td>\n",
" <td>gąbka</td>\n",
" <td>4.00</td>\n",
" <td>250</td>\n",
" <td>1000.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016-04-08</td>\n",
" <td>Poznań</td>\n",
" <td>Borowik</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>9</td>\n",
" <td>1349.91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9995</th>\n",
" <td>2016-05-22</td>\n",
" <td>Katowice</td>\n",
" <td>Gaińska</td>\n",
" <td>szkolno-biurowe</td>\n",
" <td>dziurkacz</td>\n",
" <td>7.50</td>\n",
" <td>178</td>\n",
" <td>1335.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9996</th>\n",
" <td>2016-11-19</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>7</td>\n",
" <td>1049.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9997</th>\n",
" <td>2016-09-30</td>\n",
" <td>Łódź</td>\n",
" <td>Wdowiak</td>\n",
" <td>szkolno-biurowe</td>\n",
" <td>długopis</td>\n",
" <td>1.49</td>\n",
" <td>87</td>\n",
" <td>129.63</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9998</th>\n",
" <td>2015-05-01</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>10</td>\n",
" <td>1499.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9999</th>\n",
" <td>2016-08-26</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>wyposażenie szkolne</td>\n",
" <td>gąbka</td>\n",
" <td>4.00</td>\n",
" <td>152</td>\n",
" <td>608.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10000 rows × 8 columns</p>\n",
"</div>"
],
"text/plain": [
" data miasto sklep kategoria towar cena \\\n",
"0 2014-11-23 Łódź Wdowiak meble biurko 149.99 \n",
"1 2017-05-07 Radom Czarnecki wyposażenie szkolne tablica 590.00 \n",
"2 2017-05-05 Kraków Kozłowski szkolno-biurowe flamaster 0.99 \n",
"3 2016-10-19 Kraków Wróbel wyposażenie szkolne gąbka 4.00 \n",
"4 2016-04-08 Poznań Borowik meble biurko 149.99 \n",
"... ... ... ... ... ... ... \n",
"9995 2016-05-22 Katowice Gaińska szkolno-biurowe dziurkacz 7.50 \n",
"9996 2016-11-19 Kraków Kozłowski meble biurko 149.99 \n",
"9997 2016-09-30 Łódź Wdowiak szkolno-biurowe długopis 1.49 \n",
"9998 2015-05-01 Kraków Kozłowski meble biurko 149.99 \n",
"9999 2016-08-26 Kraków Kozłowski wyposażenie szkolne gąbka 4.00 \n",
"\n",
" sztuk wartosc \n",
"0 4 599.96 \n",
"1 2 1180.00 \n",
"2 51 50.49 \n",
"3 250 1000.00 \n",
"4 9 1349.91 \n",
"... ... ... \n",
"9995 178 1335.00 \n",
"9996 7 1049.93 \n",
"9997 87 129.63 \n",
"9998 10 1499.90 \n",
"9999 152 608.00 \n",
"\n",
"[10000 rows x 8 columns]"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz"
]
},
{
"cell_type": "markdown",
"id": "baf91a7e-573e-4335-91b2-041cd147df3f",
"metadata": {},
"source": [
"### Exercises:\n",
"1. Compute the total transaction value for the whole file."
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "a8dc418a-d810-4cdc-8d01-4ba730318706",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8049567.3"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz.wartosc.sum()"
]
},
{
"cell_type": "markdown",
"id": "505a1ff6-1ce6-4d6a-b818-8ae3b31b00c1",
"metadata": {},
"source": [
"2. Compute the total transaction value in Katowice."
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "b284f678-9e34-4cfe-a215-92527a366ed9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1456316.08"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz[sprzedaz.miasto == 'Katowice'].wartosc.sum()"
]
},
{
"cell_type": "markdown",
"id": "dd64c472-980a-488d-b529-1103b78f92c4",
"metadata": {},
"source": [
"3. Compute the number of transactions and the total value (and, if you can, the total number of units) for the product `biurko` (desk) in Katowice."
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "bf3ce0ea-4885-4e87-aaac-abc30aaf2b29",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 248.0\n",
"sum 391473.9\n",
"Name: wartosc, dtype: float64"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz[(sprzedaz.towar == 'biurko') & (sprzedaz.miasto == 'Katowice')].wartosc.agg(['count', 'sum'])"
]
},
{
"cell_type": "markdown",
"id": "753d3b1d-64b7-4d46-8498-dffe519e2819",
"metadata": {},
"source": [
"The `agg` operation can also be applied to a `DataFrame` by passing a **dictionary** that specifies which functions to compute for which columns. Combinations that were not requested appear as NaN in the result."
]
},
{
"cell_type": "code",
"execution_count": 75,
"id": "a77019c4-8e4d-4065-b99b-3495e5da0ca5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sztuk</th>\n",
" <th>wartosc</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>sum</th>\n",
" <td>2610.0</td>\n",
" <td>391473.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>NaN</td>\n",
" <td>248.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sztuk wartosc\n",
"sum 2610.0 391473.9\n",
"count NaN 248.0"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz[(sprzedaz.towar == 'biurko') & (sprzedaz.miasto == 'Katowice')].agg({'sztuk': ['sum'], 'wartosc': ['count', 'sum']})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "147e601a-49b2-4c20-96d3-e65b98acbca7",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "d8a0074a",
"metadata": {},
"outputs": [],
"source": [
"import pandas"
]
},
{
"cell_type": "markdown",
"id": "a7661422",
"metadata": {},
"source": [
"The two most important data structures are `Series` and `DataFrame`.\n",
"\n",
"`Series`: a sequence of values of the same type, indexed by integers from 0 (like a list) and optionally by a custom index. A series corresponds to a single column in Excel or in a database table.\n",
"\n",
"In practice, series are rarely used on their own. Most often they arise from reading a column of a `DataFrame` or as the result of a computation."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b53feca6",
"metadata": {},
"outputs": [],
"source": [
"imiona = pandas.Series(['Ala', 'Ola', 'Jan', 'Andrzej'])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "00a0eba4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Ala\n",
"1 Ola\n",
"2 Jan\n",
"3 Andrzej\n",
"dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"imiona"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5e76f8e2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 20\n",
"1 30\n",
"2 40\n",
"3 50\n",
"dtype: int64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"liczby = pandas.Series(range(20, 60, 10))\n",
"liczby"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "17d4a896",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"35.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"liczby.mean()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "48a880e9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 3\n",
"2 5\n",
"3 7\n",
"4 9\n",
" ... \n",
"4995 9991\n",
"4996 9993\n",
"4997 9995\n",
"4998 9997\n",
"4999 9999\n",
"Length: 5000, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"duzo = pandas.Series(range(1, 10001, 2))\n",
"duzo"
]
},
{
"cell_type": "markdown",
"id": "5c1e7868",
"metadata": {},
"source": [
"By default, a series is indexed by integers from 0 (as in Python lists).\n",
"You can also specify your own index; the series can then be used like a dictionary that stores values under the given keys."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d55a6195",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PL Polska\n",
"CZ Czechy\n",
"FR Francja\n",
"DE Niemcy\n",
"UA Ukraina\n",
"RU Rosja\n",
"dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kraje = pandas.Series(['Polska', 'Czechy', 'Francja', 'Niemcy', 'Ukraina', 'Rosja'],\n",
" index=['PL', 'CZ', 'FR', 'DE', 'UA', 'RU'])\n",
"kraje"
]
},
{
"cell_type": "markdown",
"id": "1db17181",
"metadata": {},
"source": [
"\"Dictionary-style\" access:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4c4ca0e2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Niemcy'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kraje['DE']"
]
},
{
"cell_type": "markdown",
"id": "2f6e2eab",
"metadata": {},
"source": [
"Access by numeric position still works, as in ordinary lists (newer pandas versions deprecate this fallback in favor of `.iloc`):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "9c25aef5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Polska'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kraje[0]"
]
},
{
"cell_type": "markdown",
"id": "cad1ab9f",
"metadata": {},
"source": [
"What if the `index` consists of integers? Then the `index` takes precedence, and lookup by position number no longer works."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "618f280c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 poniedziałek\n",
"2 wtorek\n",
"3 środa\n",
"4 czwartek\n",
"5 piątek\n",
"6 sobota\n",
"7 niedziela\n",
"dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dni_tygodnia = pandas.Series(['poniedziałek', 'wtorek', 'środa', 'czwartek', 'piątek', 'sobota', 'niedziela'],\n",
" index=[1, 2, 3, 4, 5, 6, 7])\n",
"dni_tygodnia"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c7707bc0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'poniedziałek'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dni_tygodnia[1]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "3e5aad15",
"metadata": {},
"outputs": [],
"source": [
"# dni_tygodnia[0]\n",
"# raises KeyError: 0 is not a label in the index"
]
},
{
"cell_type": "markdown",
"id": "6133a321",
"metadata": {},
"source": [
"A `DataFrame` is a table of data. Think of it as an Excel sheet or a database table. A `DataFrame` consists of multiple column series that share a common index.\n",
"\n",
"A `DataFrame` can be created in several ways.\n",
"\n",
"Row by row: we pass a **list** of rows, each row being a sequence of elements (a tuple or a list)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "e2ac4092",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ala</td>\n",
" <td>20</td>\n",
" <td>k</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Ola</td>\n",
" <td>30</td>\n",
" <td>k</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Jan</td>\n",
" <td>40</td>\n",
" <td>m</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Andrzej</td>\n",
" <td>50</td>\n",
" <td>m</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 Ala 20 k\n",
"1 Ola 30 k\n",
"2 Jan 40 m\n",
"3 Andrzej 50 m"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pandas.DataFrame([('Ala', 20, 'k'),\n",
" ('Ola', 30, 'k'),\n",
" ('Jan', 40, 'm'),\n",
" ('Andrzej', 50, 'm')])\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "cfe252c6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 20\n",
"1 30\n",
"2 40\n",
"3 50\n",
"Name: 1, dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1[1]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "43dab2a0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Ola\n",
"1 30\n",
"2 k\n",
"Name: 1, dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.loc[1]"
]
},
{
"cell_type": "markdown",
"id": "dc4c030e",
"metadata": {},
"source": [
"You can additionally provide your own index and column names."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2f1651c1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>imię</th>\n",
" <th>wiek</th>\n",
" <th>płeć</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Ala</td>\n",
" <td>20</td>\n",
" <td>k</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Ola</td>\n",
" <td>30</td>\n",
" <td>k</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Jan</td>\n",
" <td>40</td>\n",
" <td>m</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Andrzej</td>\n",
" <td>50</td>\n",
" <td>m</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" imię wiek płeć\n",
"101 Ala 20 k\n",
"102 Ola 30 k\n",
"103 Jan 40 m\n",
"104 Andrzej 50 m"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = pandas.DataFrame([['Ala', 20, 'k'],\n",
" ['Ola', 30, 'k'],\n",
" ['Jan', 40, 'm'],\n",
" ['Andrzej', 50, 'm']],\n",
" columns=['imię', 'wiek', 'płeć'],\n",
" index=[101, 102, 103, 104])\n",
"df2"
]
},
{
"cell_type": "markdown",
"id": "210e4c21",
"metadata": {},
"source": [
"Column by column: pass a **dictionary** in which each key is a column name and each value is the sequence of data for that column."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "b014269e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>imie</th>\n",
" <th>wiek</th>\n",
" <th>plec</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Ala</td>\n",
" <td>20</td>\n",
" <td>k</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Ola</td>\n",
" <td>30</td>\n",
" <td>k</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Jan</td>\n",
" <td>40</td>\n",
" <td>m</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Andrzej</td>\n",
" <td>50</td>\n",
" <td>m</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" imie wiek plec\n",
"101 Ala 20 k\n",
"102 Ola 30 k\n",
"103 Jan 40 m\n",
"104 Andrzej 50 m"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3 = pandas.DataFrame({\n",
" 'imie': ['Ala', 'Ola', 'Jan', 'Andrzej'],\n",
" 'wiek': [20, 30, 40, 50],\n",
" 'plec': ['k', 'k', 'm', 'm'],\n",
"}, index=[101, 102, 103, 104])\n",
"df3"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "1d698936",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>imie</th>\n",
" <th>wiek</th>\n",
" <th>plec</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Ala</td>\n",
" <td>20</td>\n",
" <td>k</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Ola</td>\n",
" <td>30</td>\n",
" <td>k</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Jan</td>\n",
" <td>40</td>\n",
" <td>m</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Andrzej</td>\n",
" <td>50</td>\n",
" <td>m</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" imie wiek plec\n",
"0 Ala 20 k\n",
"1 Ola 30 k\n",
"2 Jan 40 m\n",
"3 Andrzej 50 m"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lista = ['k', 'k', 'm', 'm']\n",
"df4 = pandas.DataFrame({\n",
" 'imie': imiona,\n",
" 'wiek': range(20, 60, 10),\n",
" 'plec': lista})\n",
"df4"
]
},
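{
"cell_type": "markdown",
"id": "9a1f0c47",
"metadata": {},
"source": [
"Another option, shown here only as a sketch, is to pass a list of dictionaries, one per row; the dictionary keys become the column names. The name `df5` is an assumption made for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d6e2b58",
"metadata": {},
"outputs": [],
"source": [
"import pandas\n",
"\n",
"# each dictionary describes one row; keys are used as column names\n",
"df5 = pandas.DataFrame([\n",
"    {'imie': 'Ala', 'wiek': 20, 'plec': 'k'},\n",
"    {'imie': 'Jan', 'wiek': 40, 'plec': 'm'},\n",
"])\n",
"df5"
]
},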
{
"cell_type": "markdown",
"id": "6c01a478",
"metadata": {},
"source": [
"A column taken from a `DataFrame` is of type `Series`."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "648b89a0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 20\n",
"1 30\n",
"2 40\n",
"3 50\n",
"Name: wiek, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4.wiek"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "913e271e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df4.wiek)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fec0926d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "31b629c0",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "91de1be9",
"metadata": {},
"outputs": [],
"source": [
"emps = pd.read_csv('emps.csv', sep=';', index_col=0, parse_dates=['hire_date'])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "bd20b79f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24000</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>2003-01-13</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>Programmer</td>\n",
" <td>9000</td>\n",
" <td>2000-01-03</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>Programmer</td>\n",
" <td>6000</td>\n",
" <td>2001-05-21</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>Pat</td>\n",
" <td>Fay</td>\n",
" <td>Marketing Representative</td>\n",
" <td>6000</td>\n",
" <td>2007-08-17</td>\n",
" <td>Marketing</td>\n",
" <td>147 Spadina Ave</td>\n",
" <td>M5V 2L7</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>Susan</td>\n",
" <td>Mavris</td>\n",
" <td>Human Resources Representative</td>\n",
" <td>6500</td>\n",
" <td>2004-06-07</td>\n",
" <td>Human Resources</td>\n",
" <td>8204 Arthur St</td>\n",
" <td>NaN</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>Hermann</td>\n",
" <td>Baer</td>\n",
" <td>Public Relations Representative</td>\n",
" <td>10000</td>\n",
" <td>2004-06-07</td>\n",
" <td>Public Relations</td>\n",
" <td>Schwanthalerstr. 7031</td>\n",
" <td>80925</td>\n",
" <td>Munich</td>\n",
" <td>Germany</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>Shelley</td>\n",
" <td>Higgins</td>\n",
" <td>Accounting Manager</td>\n",
" <td>12000</td>\n",
" <td>2004-06-07</td>\n",
" <td>Accounting</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>William</td>\n",
" <td>Gietz</td>\n",
" <td>Public Accountant</td>\n",
" <td>8300</td>\n",
" <td>2004-06-07</td>\n",
" <td>Accounting</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>107 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \n",
"employee_id \n",
"100 Steven King President 24000 \\\n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"102 Lex De Haan Administration Vice President 17000 \n",
"103 Alexander Hunold Programmer 9000 \n",
"104 Bruce Ernst Programmer 6000 \n",
"... ... ... ... ... \n",
"202 Pat Fay Marketing Representative 6000 \n",
"203 Susan Mavris Human Resources Representative 6500 \n",
"204 Hermann Baer Public Relations Representative 10000 \n",
"205 Shelley Higgins Accounting Manager 12000 \n",
"206 William Gietz Public Accountant 8300 \n",
"\n",
" hire_date department_name address postal_code \n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 \\\n",
"101 1999-09-21 Executive 2004 Charade Rd 98199 \n",
"102 2003-01-13 Executive 2004 Charade Rd 98199 \n",
"103 2000-01-03 IT 2014 Jabberwocky Rd 26192 \n",
"104 2001-05-21 IT 2014 Jabberwocky Rd 26192 \n",
"... ... ... ... ... \n",
"202 2007-08-17 Marketing 147 Spadina Ave M5V 2L7 \n",
"203 2004-06-07 Human Resources 8204 Arthur St NaN \n",
"204 2004-06-07 Public Relations Schwanthalerstr. 7031 80925 \n",
"205 2004-06-07 Accounting 2004 Charade Rd 98199 \n",
"206 2004-06-07 Accounting 2004 Charade Rd 98199 \n",
"\n",
" city country \n",
"employee_id \n",
"100 Seattle United States of America \n",
"101 Seattle United States of America \n",
"102 Seattle United States of America \n",
"103 Southlake United States of America \n",
"104 Southlake United States of America \n",
"... ... ... \n",
"202 Toronto Canada \n",
"203 London United Kingdom \n",
"204 Munich Germany \n",
"205 Seattle United States of America \n",
"206 Seattle United States of America \n",
"\n",
"[107 rows x 10 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c322c181",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"first_name object\n",
"last_name object\n",
"job_title object\n",
"salary int64\n",
"hire_date datetime64[ns]\n",
"department_name object\n",
"address object\n",
"postal_code object\n",
"city object\n",
"country object\n",
"dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.dtypes"
]
},
{
"cell_type": "markdown",
"id": "a1834896",
"metadata": {},
"source": [
"## str"
]
},
{
"cell_type": "markdown",
"id": "4bb1ac94",
"metadata": {},
"source": [
"When a column holds text data and we want to call a method defined in the `str` class:\n",
"\n",
"0. we cannot call the method directly on the column, because the column is a `Series`, not a `str`"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9a446169",
"metadata": {},
"outputs": [],
"source": [
"# raises AttributeError: a Series has no upper() method\n",
"# emps.first_name.upper()"
]
},
{
"cell_type": "markdown",
"id": "49879794",
"metadata": {},
"source": [
"1. in principle, we can use the general-purpose `apply` technique"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "56cf00f0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 STEVEN\n",
"101 NEENA\n",
"102 LEX\n",
"103 ALEXANDER\n",
"104 BRUCE\n",
" ... \n",
"202 PAT\n",
"203 SUSAN\n",
"204 HERMANN\n",
"205 SHELLEY\n",
"206 WILLIAM\n",
"Name: first_name, Length: 107, dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.first_name.apply(str.upper)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "41a767f3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 STEVEN\n",
"101 NEENA\n",
"102 LEX\n",
"103 ALEXANDER\n",
"104 BRUCE\n",
" ... \n",
"202 PAT\n",
"203 SUSAN\n",
"204 HERMANN\n",
"205 SHELLEY\n",
"206 WILLIAM\n",
"Name: first_name, Length: 107, dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.first_name.apply(lambda napis: napis.upper())"
]
},
{
"cell_type": "markdown",
"id": "5f7fe5d3",
"metadata": {},
"source": [
"2. we can use the `.str` attribute of the column (series) to obtain an accessor that exposes the methods of the `str` class for the whole column.\n",
"\n",
"Calling such a method, e.g. `upper`, operates on all values at once, across the entire series."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9186577b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<pandas.core.strings.accessor.StringMethods at 0x7fecd0790fd0>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.first_name.str"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "88104210",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 STEVEN\n",
"101 NEENA\n",
"102 LEX\n",
"103 ALEXANDER\n",
"104 BRUCE\n",
" ... \n",
"202 PAT\n",
"203 SUSAN\n",
"204 HERMANN\n",
"205 SHELLEY\n",
"206 WILLIAM\n",
"Name: first_name, Length: 107, dtype: object"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.first_name.str.upper()"
]
},
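{
"cell_type": "markdown",
"id": "5b7c3e92",
"metadata": {},
"source": [
"One practical difference between the two approaches shows up with missing values. The sketch below uses a small made-up series (`names` is not part of the `emps` data): the `.str` accessor leaves the missing entry missing, while `apply(str.upper)` fails because the missing entry is not a `str`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f2a6d14",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# hypothetical data with one missing value\n",
"names = pd.Series(['Steven', None, 'Lex'])\n",
"\n",
"# the accessor skips the missing entry and upper-cases the rest\n",
"print(names.str.upper())\n",
"\n",
"# apply calls str.upper on every element, including the missing one\n",
"try:\n",
"    names.apply(str.upper)\n",
"except TypeError as exc:\n",
"    print('apply failed:', exc)"
]
},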
{
"cell_type": "code",
"execution_count": 10,
"id": "7ba29572",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 King\n",
"101 KochhAr\n",
"102 De HAAn\n",
"103 Hunold\n",
"104 Ernst\n",
" ... \n",
"202 FAy\n",
"203 MAvris\n",
"204 BAer\n",
"205 Higgins\n",
"206 Gietz\n",
"Name: last_name, Length: 107, dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.last_name.str.replace('a', 'A')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "1333ee53",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 True\n",
"101 True\n",
"102 False\n",
"103 False\n",
"104 False\n",
" ... \n",
"202 False\n",
"203 False\n",
"204 False\n",
"205 False\n",
"206 False\n",
"Name: last_name, Length: 107, dtype: bool"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.last_name.str.startswith('K')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "9ccf2ee0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24000</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>115</th>\n",
" <td>Alexander</td>\n",
" <td>Khoo</td>\n",
" <td>Purchasing Clerk</td>\n",
" <td>3100</td>\n",
" <td>2005-05-18</td>\n",
" <td>Purchasing</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>122</th>\n",
" <td>Payam</td>\n",
" <td>Kaufling</td>\n",
" <td>Stock Manager</td>\n",
" <td>7900</td>\n",
" <td>2005-05-01</td>\n",
" <td>Shipping</td>\n",
" <td>2011 Interiors Blvd</td>\n",
" <td>99236</td>\n",
" <td>South San Francisco</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>156</th>\n",
" <td>Janette</td>\n",
" <td>King</td>\n",
" <td>Sales Representative</td>\n",
" <td>10000</td>\n",
" <td>2006-01-30</td>\n",
" <td>Sales</td>\n",
" <td>Magdalen Centre, The Oxford Science Park</td>\n",
" <td>OX9 9ZB</td>\n",
" <td>Oxford</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>173</th>\n",
" <td>Sundita</td>\n",
" <td>Kumar</td>\n",
" <td>Sales Representative</td>\n",
" <td>6100</td>\n",
" <td>2010-04-21</td>\n",
" <td>Sales</td>\n",
" <td>Magdalen Centre, The Oxford Science Park</td>\n",
" <td>OX9 9ZB</td>\n",
" <td>Oxford</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \n",
"employee_id \n",
"100 Steven King President 24000 \\\n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"115 Alexander Khoo Purchasing Clerk 3100 \n",
"122 Payam Kaufling Stock Manager 7900 \n",
"156 Janette King Sales Representative 10000 \n",
"173 Sundita Kumar Sales Representative 6100 \n",
"\n",
" hire_date department_name \n",
"employee_id \n",
"100 1997-06-17 Executive \\\n",
"101 1999-09-21 Executive \n",
"115 2005-05-18 Purchasing \n",
"122 2005-05-01 Shipping \n",
"156 2006-01-30 Sales \n",
"173 2010-04-21 Sales \n",
"\n",
" address postal_code \n",
"employee_id \n",
"100 2004 Charade Rd 98199 \\\n",
"101 2004 Charade Rd 98199 \n",
"115 2004 Charade Rd 98199 \n",
"122 2011 Interiors Blvd 99236 \n",
"156 Magdalen Centre, The Oxford Science Park OX9 9ZB \n",
"173 Magdalen Centre, The Oxford Science Park OX9 9ZB \n",
"\n",
" city country \n",
"employee_id \n",
"100 Seattle United States of America \n",
"101 Seattle United States of America \n",
"115 Seattle United States of America \n",
"122 South San Francisco United States of America \n",
"156 Oxford United Kingdom \n",
"173 Oxford United Kingdom "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps[emps.last_name.str.startswith('K')]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1a1864e9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 Kin\n",
"101 Koc\n",
"102 De \n",
"103 Hun\n",
"104 Ern\n",
" ... \n",
"202 Fay\n",
"203 Mav\n",
"204 Bae\n",
"205 Hig\n",
"206 Gie\n",
"Name: last_name, Length: 107, dtype: object"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.last_name.str[0:3]"
]
},
{
"cell_type": "markdown",
"id": "988e67fb",
"metadata": {},
"source": [
"## dt\n",
"\n",
"When a column is of a `datetime` type, the analogous `dt` attribute gives access to operations typical for dates and times."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "84cc371e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 1997\n",
"101 1999\n",
"102 2003\n",
"103 2000\n",
"104 2001\n",
" ... \n",
"202 2007\n",
"203 2004\n",
"204 2004\n",
"205 2004\n",
"206 2004\n",
"Name: hire_date, Length: 107, dtype: int32"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.hire_date.dt.year"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "dc00f638",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 Tuesday, 168 dzień roku 1997, 17.06\n",
"101 Tuesday, 264 dzień roku 1999, 21.09\n",
"102 Monday, 013 dzień roku 2003, 13.01\n",
"103 Monday, 003 dzień roku 2000, 03.01\n",
"104 Monday, 141 dzień roku 2001, 21.05\n",
" ... \n",
"202 Friday, 229 dzień roku 2007, 17.08\n",
"203 Monday, 159 dzień roku 2004, 07.06\n",
"204 Monday, 159 dzień roku 2004, 07.06\n",
"205 Monday, 159 dzień roku 2004, 07.06\n",
"206 Monday, 159 dzień roku 2004, 07.06\n",
"Name: hire_date, Length: 107, dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.hire_date.dt.strftime('%A, %j dzień roku %Y, %d.%m')"
]
},
{
"cell_type": "markdown",
"id": "65f97078",
"metadata": {},
"source": [
"Example use: group the employees by hire year and compute the average salary for each year."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "a9d58f2a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>salary</th>\n",
" </tr>\n",
" <tr>\n",
" <th>hire_date</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1987</th>\n",
" <td>4400.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1997</th>\n",
" <td>24000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1998</th>\n",
" <td>7800.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1999</th>\n",
" <td>17000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2000</th>\n",
" <td>6950.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2001</th>\n",
" <td>6000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2003</th>\n",
" <td>17000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2004</th>\n",
" <td>9828.571429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2005</th>\n",
" <td>4525.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2006</th>\n",
" <td>8600.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2007</th>\n",
" <td>6460.714286</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2008</th>\n",
" <td>4740.909091</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2009</th>\n",
" <td>4938.888889</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2010</th>\n",
" <td>4525.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2011</th>\n",
" <td>4200.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" salary\n",
"hire_date \n",
"1987 4400.000000\n",
"1997 24000.000000\n",
"1998 7800.000000\n",
"1999 17000.000000\n",
"2000 6950.000000\n",
"2001 6000.000000\n",
"2003 17000.000000\n",
"2004 9828.571429\n",
"2005 4525.000000\n",
"2006 8600.000000\n",
"2007 6460.714286\n",
"2008 4740.909091\n",
"2009 4938.888889\n",
"2010 4525.000000\n",
"2011 4200.000000"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.groupby(emps.hire_date.dt.year).agg({'salary': 'mean'})"
]
}
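,
{
"cell_type": "markdown",
"id": "3c9e7b65",
"metadata": {},
"source": [
"To get the number of employees per year next to the average, a list of aggregations can be passed for the column. A minimal sketch on made-up data (the `sample` frame is an assumption, not the `emps` file):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a4d1f83",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# hypothetical data: four employees hired in two different years\n",
"sample = pd.DataFrame({\n",
"    'salary': [24000, 17000, 9000, 6000],\n",
"    'hire_date': pd.to_datetime(['1997-06-17', '1999-09-21', '1999-01-03', '1997-05-21']),\n",
"})\n",
"\n",
"# group by the year taken from the dt accessor, then compute both aggregates\n",
"sample.groupby(sample.hire_date.dt.year).agg({'salary': ['count', 'mean']})"
]
}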
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "6b0ac66f",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b8eebdc0",
"metadata": {},
"outputs": [],
"source": [
"emps = pd.read_csv('emps.csv', sep=';', index_col='employee_id', parse_dates=['hire_date'])"
]
},
{
"cell_type": "markdown",
"id": "0d3b7157",
"metadata": {},
"source": [
"## `nsmallest` i `nlargest`"
]
},
{
"cell_type": "markdown",
"id": "a89b83a7",
"metadata": {},
"source": [
"The dedicated operations `nlargest` and `nsmallest` return a given number of the largest or smallest values. There is no need to sort the data first.\n",
"\n",
"For a series, they are called like this:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "13ebac18",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"132 2100\n",
"128 2200\n",
"136 2200\n",
"Name: salary, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary.nsmallest(3)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7adcf052",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"136 2011-02-06\n",
"Name: hire_date, dtype: datetime64[ns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.hire_date.nlargest(1)"
]
},
{
"cell_type": "markdown",
"id": "427e2715",
"metadata": {},
"source": [
"The result is a series, so getting at the element itself requires one more step:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6bd9e19e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Timestamp('2011-02-06 00:00:00')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.hire_date.nlargest(1).iloc[0]"
]
},
{
"cell_type": "markdown",
"id": "911b109f",
"metadata": {},
"source": [
"When we need a single value, there are also the `min` and `max` operations, which are simpler to use."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "4652ba91",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Timestamp('2011-02-06 00:00:00')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.hire_date.max()"
]
},
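{
"cell_type": "markdown",
"id": "1e5b8c76",
"metadata": {},
"source": [
"`max` returns only the value. When the index label of that value is needed as well, `idxmax` can be used. A small sketch on made-up numbers (the `salary` series here is an assumption for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d7f4a29",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# hypothetical salaries indexed by employee id\n",
"salary = pd.Series([2100, 9000, 2200], index=[132, 103, 128])\n",
"\n",
"print(salary.max())     # the largest value\n",
"print(salary.idxmax())  # the index label at which it occurs"
]
},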
{
"cell_type": "markdown",
"id": "602eeb75",
"metadata": {},
"source": [
"For a table, you also specify the ordering criterion, usually a column name."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f06d4003",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24000</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>2003-01-13</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>145</th>\n",
" <td>John</td>\n",
" <td>Russell</td>\n",
" <td>Sales Manager</td>\n",
" <td>14000</td>\n",
" <td>2006-10-01</td>\n",
" <td>Sales</td>\n",
" <td>Magdalen Centre, The Oxford Science Park</td>\n",
" <td>OX9 9ZB</td>\n",
" <td>Oxford</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td>Karen</td>\n",
" <td>Partners</td>\n",
" <td>Sales Manager</td>\n",
" <td>13500</td>\n",
" <td>2007-01-05</td>\n",
" <td>Sales</td>\n",
" <td>Magdalen Centre, The Oxford Science Park</td>\n",
" <td>OX9 9ZB</td>\n",
" <td>Oxford</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24000 \n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"102 Lex De Haan Administration Vice President 17000 \n",
"145 John Russell Sales Manager 14000 \n",
"146 Karen Partners Sales Manager 13500 \n",
"\n",
" hire_date department_name \\\n",
"employee_id \n",
"100 1997-06-17 Executive \n",
"101 1999-09-21 Executive \n",
"102 2003-01-13 Executive \n",
"145 2006-10-01 Sales \n",
"146 2007-01-05 Sales \n",
"\n",
" address postal_code city \\\n",
"employee_id \n",
"100 2004 Charade Rd 98199 Seattle \n",
"101 2004 Charade Rd 98199 Seattle \n",
"102 2004 Charade Rd 98199 Seattle \n",
"145 Magdalen Centre, The Oxford Science Park OX9 9ZB Oxford \n",
"146 Magdalen Centre, The Oxford Science Park OX9 9ZB Oxford \n",
"\n",
" country \n",
"employee_id \n",
"100 United States of America \n",
"101 United States of America \n",
"102 United States of America \n",
"145 United Kingdom \n",
"146 United Kingdom "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.nlargest(5, 'salary')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6e863554",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>132</th>\n",
" <td>TJ</td>\n",
" <td>Olson</td>\n",
" <td>Stock Clerk</td>\n",
" <td>2100</td>\n",
" <td>2009-04-10</td>\n",
" <td>Shipping</td>\n",
" <td>2011 Interiors Blvd</td>\n",
" <td>99236</td>\n",
" <td>South San Francisco</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary hire_date \\\n",
"employee_id \n",
"132 TJ Olson Stock Clerk 2100 2009-04-10 \n",
"\n",
" department_name address postal_code \\\n",
"employee_id \n",
"132 Shipping 2011 Interiors Blvd 99236 \n",
"\n",
" city country \n",
"employee_id \n",
"132 South San Francisco United States of America "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.nsmallest(1, 'salary')"
]
},
{
"cell_type": "markdown",
"id": "1015bd60",
"metadata": {},
"source": [
"By default the result contains exactly as many rows as the argument specifies, even if that means dropping some records that tie on the same value. (Lex De Haan also earns 17000.)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "f29ecfb2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24000</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24000 \n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"\n",
" hire_date department_name address postal_code city \\\n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 Seattle \n",
"101 1999-09-21 Executive 2004 Charade Rd 98199 Seattle \n",
"\n",
" country \n",
"employee_id \n",
"100 United States of America \n",
"101 United States of America "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.nlargest(2, 'salary')"
]
},
{
"cell_type": "markdown",
"id": "0ef0b7be",
"metadata": {},
"source": [
"This behaviour can be controlled with the `keep` parameter.\n",
"- `first` (the default) - keeps the tied records that come first in the original (pre-sort) order and drops the later, surplus ones\n",
"- `last` - keeps the tied records that come last and drops the earlier ones\n",
"- `all` - keeps every record tied on the same value, even if that returns more records than requested. Like an *ex aequo* result in sport."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "46dfa402",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24000</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24000 \n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"\n",
" hire_date department_name address postal_code city \\\n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 Seattle \n",
"101 1999-09-21 Executive 2004 Charade Rd 98199 Seattle \n",
"\n",
" country \n",
"employee_id \n",
"100 United States of America \n",
"101 United States of America "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.nlargest(2, 'salary', keep='first')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a532758d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24000</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>2003-01-13</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24000 \n",
"102 Lex De Haan Administration Vice President 17000 \n",
"\n",
" hire_date department_name address postal_code city \\\n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 Seattle \n",
"102 2003-01-13 Executive 2004 Charade Rd 98199 Seattle \n",
"\n",
" country \n",
"employee_id \n",
"100 United States of America \n",
"102 United States of America "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.nlargest(2, 'salary', keep='last')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "79da17eb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24000</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>2003-01-13</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24000 \n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"102 Lex De Haan Administration Vice President 17000 \n",
"\n",
" hire_date department_name address postal_code city \\\n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 Seattle \n",
"101 1999-09-21 Executive 2004 Charade Rd 98199 Seattle \n",
"102 2003-01-13 Executive 2004 Charade Rd 98199 Seattle \n",
"\n",
" country \n",
"employee_id \n",
"100 United States of America \n",
"101 United States of America \n",
"102 United States of America "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.nlargest(2, 'salary', keep='all')"
]
},
{
"cell_type": "markdown",
"id": "b7aa1f12",
"metadata": {},
"source": [
"## Ranking\n",
"\n",
"Ranking numbers the records in the order that sorting would produce. For example, the employee with the lowest salary gets rank 1 and the one with the highest salary gets rank 107.\n",
"\n",
"The `rank` method can be called on a single Series as well as on a whole DataFrame.\n",
"\n",
"It answers the question \"at which position would a given record (a value in the Series) end up if the data were sorted?\", but without actually sorting anything."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1d661266",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id\n",
"100 107.0\n",
"101 105.5\n",
"102 105.5\n",
"103 82.5\n",
"104 51.5\n",
" ... \n",
"202 51.5\n",
"203 57.5\n",
"204 90.5\n",
"205 100.0\n",
"206 77.0\n",
"Name: salary, Length: 107, dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.salary.rank()"
]
},
{
"cell_type": "markdown",
"id": "d93d2fb5",
"metadata": {},
"source": [
"Which place does each employee take with respect to column X? Calling `rank` on the whole DataFrame answers this for every column at once."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "26a9f7af",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>94.5</td>\n",
" <td>52.5</td>\n",
" <td>14.0</td>\n",
" <td>107.0</td>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>11.5</td>\n",
" <td>15.5</td>\n",
" <td>45.5</td>\n",
" <td>72.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>75.0</td>\n",
" <td>54.0</td>\n",
" <td>8.5</td>\n",
" <td>105.5</td>\n",
" <td>4.0</td>\n",
" <td>5.0</td>\n",
" <td>11.5</td>\n",
" <td>15.5</td>\n",
" <td>45.5</td>\n",
" <td>72.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>61.0</td>\n",
" <td>21.0</td>\n",
" <td>8.5</td>\n",
" <td>105.5</td>\n",
" <td>12.0</td>\n",
" <td>5.0</td>\n",
" <td>11.5</td>\n",
" <td>15.5</td>\n",
" <td>45.5</td>\n",
" <td>72.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>4.5</td>\n",
" <td>46.0</td>\n",
" <td>17.0</td>\n",
" <td>82.5</td>\n",
" <td>5.0</td>\n",
" <td>16.0</td>\n",
" <td>68.0</td>\n",
" <td>3.0</td>\n",
" <td>102.0</td>\n",
" <td>72.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>12.0</td>\n",
" <td>25.0</td>\n",
" <td>17.0</td>\n",
" <td>51.5</td>\n",
" <td>11.0</td>\n",
" <td>16.0</td>\n",
" <td>68.0</td>\n",
" <td>3.0</td>\n",
" <td>102.0</td>\n",
" <td>72.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>77.0</td>\n",
" <td>29.0</td>\n",
" <td>13.0</td>\n",
" <td>51.5</td>\n",
" <td>51.0</td>\n",
" <td>19.5</td>\n",
" <td>1.5</td>\n",
" <td>70.5</td>\n",
" <td>105.5</td>\n",
" <td>1.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>98.0</td>\n",
" <td>66.0</td>\n",
" <td>11.0</td>\n",
" <td>57.5</td>\n",
" <td>14.5</td>\n",
" <td>13.0</td>\n",
" <td>71.0</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>34.0</td>\n",
" <td>5.0</td>\n",
" <td>21.0</td>\n",
" <td>90.5</td>\n",
" <td>14.5</td>\n",
" <td>21.0</td>\n",
" <td>106.0</td>\n",
" <td>6.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>90.0</td>\n",
" <td>44.0</td>\n",
" <td>6.0</td>\n",
" <td>100.0</td>\n",
" <td>14.5</td>\n",
" <td>1.5</td>\n",
" <td>11.5</td>\n",
" <td>15.5</td>\n",
" <td>45.5</td>\n",
" <td>72.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>105.5</td>\n",
" <td>37.0</td>\n",
" <td>20.0</td>\n",
" <td>77.0</td>\n",
" <td>14.5</td>\n",
" <td>1.5</td>\n",
" <td>11.5</td>\n",
" <td>15.5</td>\n",
" <td>45.5</td>\n",
" <td>72.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>107 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary hire_date \\\n",
"employee_id \n",
"100 94.5 52.5 14.0 107.0 2.0 \n",
"101 75.0 54.0 8.5 105.5 4.0 \n",
"102 61.0 21.0 8.5 105.5 12.0 \n",
"103 4.5 46.0 17.0 82.5 5.0 \n",
"104 12.0 25.0 17.0 51.5 11.0 \n",
"... ... ... ... ... ... \n",
"202 77.0 29.0 13.0 51.5 51.0 \n",
"203 98.0 66.0 11.0 57.5 14.5 \n",
"204 34.0 5.0 21.0 90.5 14.5 \n",
"205 90.0 44.0 6.0 100.0 14.5 \n",
"206 105.5 37.0 20.0 77.0 14.5 \n",
"\n",
" department_name address postal_code city country \n",
"employee_id \n",
"100 5.0 11.5 15.5 45.5 72.5 \n",
"101 5.0 11.5 15.5 45.5 72.5 \n",
"102 5.0 11.5 15.5 45.5 72.5 \n",
"103 16.0 68.0 3.0 102.0 72.5 \n",
"104 16.0 68.0 3.0 102.0 72.5 \n",
"... ... ... ... ... ... \n",
"202 19.5 1.5 70.5 105.5 1.5 \n",
"203 13.0 71.0 NaN 1.0 21.0 \n",
"204 21.0 106.0 6.0 2.0 3.0 \n",
"205 1.5 11.5 15.5 45.5 72.5 \n",
"206 1.5 11.5 15.5 45.5 72.5 \n",
"\n",
"[107 rows x 10 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.rank()"
]
},
{
"cell_type": "markdown",
"id": "f60e0b1c",
"metadata": {},
"source": [
"To take a closer look at how ranking works and at its various options, we create a new table with fewer columns and add the ranking by `salary` as a new column."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8834e3a9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>salary</th>\n",
" <th>job_title</th>\n",
" <th>city</th>\n",
" <th>rank1</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>24000</td>\n",
" <td>President</td>\n",
" <td>Seattle</td>\n",
" <td>107.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>17000</td>\n",
" <td>Administration Vice President</td>\n",
" <td>Seattle</td>\n",
" <td>105.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>17000</td>\n",
" <td>Administration Vice President</td>\n",
" <td>Seattle</td>\n",
" <td>105.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>9000</td>\n",
" <td>Programmer</td>\n",
" <td>Southlake</td>\n",
" <td>82.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>6000</td>\n",
" <td>Programmer</td>\n",
" <td>Southlake</td>\n",
" <td>51.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>Pat</td>\n",
" <td>Fay</td>\n",
" <td>6000</td>\n",
" <td>Marketing Representative</td>\n",
" <td>Toronto</td>\n",
" <td>51.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>Susan</td>\n",
" <td>Mavris</td>\n",
" <td>6500</td>\n",
" <td>Human Resources Representative</td>\n",
" <td>London</td>\n",
" <td>57.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>Hermann</td>\n",
" <td>Baer</td>\n",
" <td>10000</td>\n",
" <td>Public Relations Representative</td>\n",
" <td>Munich</td>\n",
" <td>90.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>Shelley</td>\n",
" <td>Higgins</td>\n",
" <td>12000</td>\n",
" <td>Accounting Manager</td>\n",
" <td>Seattle</td>\n",
" <td>100.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>William</td>\n",
" <td>Gietz</td>\n",
" <td>8300</td>\n",
" <td>Public Accountant</td>\n",
" <td>Seattle</td>\n",
" <td>77.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>107 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" first_name last_name salary job_title \\\n",
"employee_id \n",
"100 Steven King 24000 President \n",
"101 Neena Kochhar 17000 Administration Vice President \n",
"102 Lex De Haan 17000 Administration Vice President \n",
"103 Alexander Hunold 9000 Programmer \n",
"104 Bruce Ernst 6000 Programmer \n",
"... ... ... ... ... \n",
"202 Pat Fay 6000 Marketing Representative \n",
"203 Susan Mavris 6500 Human Resources Representative \n",
"204 Hermann Baer 10000 Public Relations Representative \n",
"205 Shelley Higgins 12000 Accounting Manager \n",
"206 William Gietz 8300 Public Accountant \n",
"\n",
" city rank1 \n",
"employee_id \n",
"100 Seattle 107.0 \n",
"101 Seattle 105.5 \n",
"102 Seattle 105.5 \n",
"103 Southlake 82.5 \n",
"104 Southlake 51.5 \n",
"... ... ... \n",
"202 Toronto 51.5 \n",
"203 London 57.5 \n",
"204 Munich 90.5 \n",
"205 Seattle 100.0 \n",
"206 Seattle 77.0 \n",
"\n",
"[107 rows x 6 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps1 = emps[['first_name', 'last_name', 'salary', 'job_title', 'city']].copy()\n",
"emps1['rank1'] = emps1.salary.rank()\n",
"emps1"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "bb4680d6",
"metadata": {},
"outputs": [],
"source": [
"emps1['rank2'] = emps1.salary.rank(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "d130d6a6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>salary</th>\n",
" <th>job_title</th>\n",
" <th>city</th>\n",
" <th>rank1</th>\n",
" <th>rank2</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>24000</td>\n",
" <td>President</td>\n",
" <td>Seattle</td>\n",
" <td>107.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>17000</td>\n",
" <td>Administration Vice President</td>\n",
" <td>Seattle</td>\n",
" <td>105.5</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>17000</td>\n",
" <td>Administration Vice President</td>\n",
" <td>Seattle</td>\n",
" <td>105.5</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>9000</td>\n",
" <td>Programmer</td>\n",
" <td>Southlake</td>\n",
" <td>82.5</td>\n",
" <td>25.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>6000</td>\n",
" <td>Programmer</td>\n",
" <td>Southlake</td>\n",
" <td>51.5</td>\n",
" <td>56.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>Pat</td>\n",
" <td>Fay</td>\n",
" <td>6000</td>\n",
" <td>Marketing Representative</td>\n",
" <td>Toronto</td>\n",
" <td>51.5</td>\n",
" <td>56.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>Susan</td>\n",
" <td>Mavris</td>\n",
" <td>6500</td>\n",
" <td>Human Resources Representative</td>\n",
" <td>London</td>\n",
" <td>57.5</td>\n",
" <td>50.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>Hermann</td>\n",
" <td>Baer</td>\n",
" <td>10000</td>\n",
" <td>Public Relations Representative</td>\n",
" <td>Munich</td>\n",
" <td>90.5</td>\n",
" <td>17.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>Shelley</td>\n",
" <td>Higgins</td>\n",
" <td>12000</td>\n",
" <td>Accounting Manager</td>\n",
" <td>Seattle</td>\n",
" <td>100.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>William</td>\n",
" <td>Gietz</td>\n",
" <td>8300</td>\n",
" <td>Public Accountant</td>\n",
" <td>Seattle</td>\n",
" <td>77.0</td>\n",
" <td>31.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>107 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" first_name last_name salary job_title \\\n",
"employee_id \n",
"100 Steven King 24000 President \n",
"101 Neena Kochhar 17000 Administration Vice President \n",
"102 Lex De Haan 17000 Administration Vice President \n",
"103 Alexander Hunold 9000 Programmer \n",
"104 Bruce Ernst 6000 Programmer \n",
"... ... ... ... ... \n",
"202 Pat Fay 6000 Marketing Representative \n",
"203 Susan Mavris 6500 Human Resources Representative \n",
"204 Hermann Baer 10000 Public Relations Representative \n",
"205 Shelley Higgins 12000 Accounting Manager \n",
"206 William Gietz 8300 Public Accountant \n",
"\n",
" city rank1 rank2 \n",
"employee_id \n",
"100 Seattle 107.0 1.0 \n",
"101 Seattle 105.5 2.5 \n",
"102 Seattle 105.5 2.5 \n",
"103 Southlake 82.5 25.5 \n",
"104 Southlake 51.5 56.5 \n",
"... ... ... ... \n",
"202 Toronto 51.5 56.5 \n",
"203 London 57.5 50.5 \n",
"204 Munich 90.5 17.5 \n",
"205 Seattle 100.0 8.0 \n",
"206 Seattle 77.0 31.0 \n",
"\n",
"[107 rows x 7 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps1"
]
},
{
"cell_type": "markdown",
"id": "a8fbae65",
"metadata": {},
"source": [
"Computing a ranking does not require the data to be sorted.\n",
"\n",
"However, to make the example easier to read, I will now sort the table by salary in descending order."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "f3685d62",
"metadata": {},
"outputs": [],
"source": [
"emps3 = emps[['first_name', 'last_name', 'salary', 'job_title', 'city']].sort_values('salary', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "2f32b4cd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>salary</th>\n",
" <th>job_title</th>\n",
" <th>city</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>24000</td>\n",
" <td>President</td>\n",
" <td>Seattle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>17000</td>\n",
" <td>Administration Vice President</td>\n",
" <td>Seattle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>17000</td>\n",
" <td>Administration Vice President</td>\n",
" <td>Seattle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>145</th>\n",
" <td>John</td>\n",
" <td>Russell</td>\n",
" <td>14000</td>\n",
" <td>Sales Manager</td>\n",
" <td>Oxford</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td>Karen</td>\n",
" <td>Partners</td>\n",
" <td>13500</td>\n",
" <td>Sales Manager</td>\n",
" <td>Oxford</td>\n",
" </tr>\n",
" <tr>\n",
" <th>201</th>\n",
" <td>Michael</td>\n",
" <td>Hartstein</td>\n",
" <td>13000</td>\n",
" <td>Marketing Manager</td>\n",
" <td>Toronto</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td>Alberto</td>\n",
" <td>Errazuriz</td>\n",
" <td>12000</td>\n",
" <td>Sales Manager</td>\n",
" <td>Oxford</td>\n",
" </tr>\n",
" <tr>\n",
" <th>108</th>\n",
" <td>Nancy</td>\n",
" <td>Greenberg</td>\n",
" <td>12000</td>\n",
" <td>Finance Manager</td>\n",
" <td>Seattle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>Shelley</td>\n",
" <td>Higgins</td>\n",
" <td>12000</td>\n",
" <td>Accounting Manager</td>\n",
" <td>Seattle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>168</th>\n",
" <td>Lisa</td>\n",
" <td>Ozer</td>\n",
" <td>11500</td>\n",
" <td>Sales Representative</td>\n",
" <td>Oxford</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name salary job_title \\\n",
"employee_id \n",
"100 Steven King 24000 President \n",
"102 Lex De Haan 17000 Administration Vice President \n",
"101 Neena Kochhar 17000 Administration Vice President \n",
"145 John Russell 14000 Sales Manager \n",
"146 Karen Partners 13500 Sales Manager \n",
"201 Michael Hartstein 13000 Marketing Manager \n",
"147 Alberto Errazuriz 12000 Sales Manager \n",
"108 Nancy Greenberg 12000 Finance Manager \n",
"205 Shelley Higgins 12000 Accounting Manager \n",
"168 Lisa Ozer 11500 Sales Representative \n",
"\n",
" city \n",
"employee_id \n",
"100 Seattle \n",
"102 Seattle \n",
"101 Seattle \n",
"145 Oxford \n",
"146 Oxford \n",
"201 Toronto \n",
"147 Oxford \n",
"108 Seattle \n",
"205 Seattle \n",
"168 Oxford "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps3.head(10)"
]
},
{
"cell_type": "markdown",
"id": "c420bd25",
"metadata": {},
"source": [
"Metoda `rank` posiada parametr `method`, który zmienia strategię uzyskiwania wyniku w przypadku, gdy wiele rekordów ma taką samą wartość. Domyślna wartość `'average'`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "fa519380",
"metadata": {},
"outputs": [],
"source": [
"emps3['rank'] = emps3.salary.rank()\n",
"emps3['rank_avg'] = emps3.salary.rank(ascending=False)\n",
"emps3['rank_min'] = emps3.salary.rank(ascending=False, method='min')\n",
"emps3['rank_max'] = emps3.salary.rank(ascending=False, method='max')\n",
"emps3['rank_first'] = emps3.salary.rank(ascending=False, method='first')\n",
"emps3['rank_dense'] = emps3.salary.rank(ascending=False, method='dense')"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "861f9d01",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>salary</th>\n",
" <th>job_title</th>\n",
" <th>city</th>\n",
" <th>rank</th>\n",
" <th>rank_avg</th>\n",
" <th>rank_min</th>\n",
" <th>rank_max</th>\n",
" <th>rank_first</th>\n",
" <th>rank_dense</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>24000</td>\n",
" <td>President</td>\n",
" <td>Seattle</td>\n",
" <td>107.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>17000</td>\n",
" <td>Administration Vice President</td>\n",
" <td>Seattle</td>\n",
" <td>105.5</td>\n",
" <td>2.5</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>17000</td>\n",
" <td>Administration Vice President</td>\n",
" <td>Seattle</td>\n",
" <td>105.5</td>\n",
" <td>2.5</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>145</th>\n",
" <td>John</td>\n",
" <td>Russell</td>\n",
" <td>14000</td>\n",
" <td>Sales Manager</td>\n",
" <td>Oxford</td>\n",
" <td>104.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td>Karen</td>\n",
" <td>Partners</td>\n",
" <td>13500</td>\n",
" <td>Sales Manager</td>\n",
" <td>Oxford</td>\n",
" <td>103.0</td>\n",
" <td>5.0</td>\n",
" <td>5.0</td>\n",
" <td>5.0</td>\n",
" <td>5.0</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>201</th>\n",
" <td>Michael</td>\n",
" <td>Hartstein</td>\n",
" <td>13000</td>\n",
" <td>Marketing Manager</td>\n",
" <td>Toronto</td>\n",
" <td>102.0</td>\n",
" <td>6.0</td>\n",
" <td>6.0</td>\n",
" <td>6.0</td>\n",
" <td>6.0</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td>Alberto</td>\n",
" <td>Errazuriz</td>\n",
" <td>12000</td>\n",
" <td>Sales Manager</td>\n",
" <td>Oxford</td>\n",
" <td>100.0</td>\n",
" <td>8.0</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" <td>7.0</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>108</th>\n",
" <td>Nancy</td>\n",
" <td>Greenberg</td>\n",
" <td>12000</td>\n",
" <td>Finance Manager</td>\n",
" <td>Seattle</td>\n",
" <td>100.0</td>\n",
" <td>8.0</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" <td>8.0</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>Shelley</td>\n",
" <td>Higgins</td>\n",
" <td>12000</td>\n",
" <td>Accounting Manager</td>\n",
" <td>Seattle</td>\n",
" <td>100.0</td>\n",
" <td>8.0</td>\n",
" <td>7.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>168</th>\n",
" <td>Lisa</td>\n",
" <td>Ozer</td>\n",
" <td>11500</td>\n",
" <td>Sales Representative</td>\n",
" <td>Oxford</td>\n",
" <td>98.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>7.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" first_name last_name salary job_title \\\n",
"employee_id \n",
"100 Steven King 24000 President \n",
"102 Lex De Haan 17000 Administration Vice President \n",
"101 Neena Kochhar 17000 Administration Vice President \n",
"145 John Russell 14000 Sales Manager \n",
"146 Karen Partners 13500 Sales Manager \n",
"201 Michael Hartstein 13000 Marketing Manager \n",
"147 Alberto Errazuriz 12000 Sales Manager \n",
"108 Nancy Greenberg 12000 Finance Manager \n",
"205 Shelley Higgins 12000 Accounting Manager \n",
"168 Lisa Ozer 11500 Sales Representative \n",
"\n",
" city rank rank_avg rank_min rank_max rank_first \\\n",
"employee_id \n",
"100 Seattle 107.0 1.0 1.0 1.0 1.0 \n",
"102 Seattle 105.5 2.5 2.0 3.0 2.0 \n",
"101 Seattle 105.5 2.5 2.0 3.0 3.0 \n",
"145 Oxford 104.0 4.0 4.0 4.0 4.0 \n",
"146 Oxford 103.0 5.0 5.0 5.0 5.0 \n",
"201 Toronto 102.0 6.0 6.0 6.0 6.0 \n",
"147 Oxford 100.0 8.0 7.0 9.0 7.0 \n",
"108 Seattle 100.0 8.0 7.0 9.0 8.0 \n",
"205 Seattle 100.0 8.0 7.0 9.0 9.0 \n",
"168 Oxford 98.0 10.0 10.0 10.0 10.0 \n",
"\n",
" rank_dense \n",
"employee_id \n",
"100 1.0 \n",
"102 2.0 \n",
"101 2.0 \n",
"145 3.0 \n",
"146 4.0 \n",
"201 5.0 \n",
"147 6.0 \n",
"108 6.0 \n",
"205 6.0 \n",
"168 7.0 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps3.head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fe47e32",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "c6e7911e",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c994c616",
"metadata": {},
"outputs": [],
"source": [
"dane1 = {lit : [f'{lit}{cyf}' for cyf in range(0, 6)] for lit in ['A', 'B', 'C', 'D']}\n",
"dane2 = {lit : [f'{lit.lower()}{cyf}' for cyf in range(0, 6)] for lit in ['A', 'B', 'C', 'D']}\n",
"dane3 = {lit : [f'{lit.lower()}{cyf}' for cyf in range(2, 10, 2)] for lit in ['A', 'B', 'D', 'E']}"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1d1f0a5f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5'],\n",
" 'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5'],\n",
" 'C': ['C0', 'C1', 'C2', 'C3', 'C4', 'C5'],\n",
" 'D': ['D0', 'D1', 'D2', 'D3', 'D4', 'D5']}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dane1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "62957759",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>D2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A5</td>\n",
" <td>B5</td>\n",
" <td>C5</td>\n",
" <td>D5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D\n",
"0 A0 B0 C0 D0\n",
"1 A1 B1 C1 D1\n",
"2 A2 B2 C2 D2\n",
"3 A3 B3 C3 D3\n",
"4 A4 B4 C4 D4\n",
"5 A5 B5 C5 D5"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.DataFrame(dane1)\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "caeade3e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a0</td>\n",
" <td>b0</td>\n",
" <td>c0</td>\n",
" <td>d0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>a1</td>\n",
" <td>b1</td>\n",
" <td>c1</td>\n",
" <td>d1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>c2</td>\n",
" <td>d2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>a3</td>\n",
" <td>b3</td>\n",
" <td>c3</td>\n",
" <td>d3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>c4</td>\n",
" <td>d4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>a5</td>\n",
" <td>b5</td>\n",
" <td>c5</td>\n",
" <td>d5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D\n",
"0 a0 b0 c0 d0\n",
"1 a1 b1 c1 d1\n",
"2 a2 b2 c2 d2\n",
"3 a3 b3 c3 d3\n",
"4 a4 b4 c4 d4\n",
"5 a5 b5 c5 d5"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = pd.DataFrame(dane2)\n",
"df2"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e3464b88",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>D</th>\n",
" <th>E</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>d2</td>\n",
" <td>e2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>d4</td>\n",
" <td>e4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>a6</td>\n",
" <td>b6</td>\n",
" <td>d6</td>\n",
" <td>e6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>a8</td>\n",
" <td>b8</td>\n",
" <td>d8</td>\n",
" <td>e8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B D E\n",
"2 a2 b2 d2 e2\n",
"4 a4 b4 d4 e4\n",
"6 a6 b6 d6 e6\n",
"8 a8 b8 d8 e8"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3 = pd.DataFrame(dane3, index=range(2, 10, 2))\n",
"df3"
]
},
{
"cell_type": "markdown",
"id": "da1456ea",
"metadata": {},
"source": [
"## append\n",
"W starszych wersjach Pandas istniała operacja `append`, która dodawała kolejne wiersze na końcu za istniejącymi.\n",
"\n",
"`df1.append(df2)`"
]
},
{
"cell_type": "markdown",
"id": "65ca264b",
"metadata": {},
"source": [
"## concat\n",
"Sklejanie danych z wielu DF, tworząc w wyniku większy (dłuższy albo szerszy) DF."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7c0eaf56",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>D2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A5</td>\n",
" <td>B5</td>\n",
" <td>C5</td>\n",
" <td>D5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a0</td>\n",
" <td>b0</td>\n",
" <td>c0</td>\n",
" <td>d0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>a1</td>\n",
" <td>b1</td>\n",
" <td>c1</td>\n",
" <td>d1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>c2</td>\n",
" <td>d2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>a3</td>\n",
" <td>b3</td>\n",
" <td>c3</td>\n",
" <td>d3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>c4</td>\n",
" <td>d4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>a5</td>\n",
" <td>b5</td>\n",
" <td>c5</td>\n",
" <td>d5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D\n",
"0 A0 B0 C0 D0\n",
"1 A1 B1 C1 D1\n",
"2 A2 B2 C2 D2\n",
"3 A3 B3 C3 D3\n",
"4 A4 B4 C4 D4\n",
"5 A5 B5 C5 D5\n",
"0 a0 b0 c0 d0\n",
"1 a1 b1 c1 d1\n",
"2 a2 b2 c2 d2\n",
"3 a3 b3 c3 d3\n",
"4 a4 b4 c4 d4\n",
"5 a5 b5 c5 d5"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1, df2])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "61928f35",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" <th>E</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>D2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A5</td>\n",
" <td>B5</td>\n",
" <td>C5</td>\n",
" <td>D5</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a0</td>\n",
" <td>b0</td>\n",
" <td>c0</td>\n",
" <td>d0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>a1</td>\n",
" <td>b1</td>\n",
" <td>c1</td>\n",
" <td>d1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>c2</td>\n",
" <td>d2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>a3</td>\n",
" <td>b3</td>\n",
" <td>c3</td>\n",
" <td>d3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>c4</td>\n",
" <td>d4</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>a5</td>\n",
" <td>b5</td>\n",
" <td>c5</td>\n",
" <td>d5</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>NaN</td>\n",
" <td>d2</td>\n",
" <td>e2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>NaN</td>\n",
" <td>d4</td>\n",
" <td>e4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>a6</td>\n",
" <td>b6</td>\n",
" <td>NaN</td>\n",
" <td>d6</td>\n",
" <td>e6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>a8</td>\n",
" <td>b8</td>\n",
" <td>NaN</td>\n",
" <td>d8</td>\n",
" <td>e8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>D2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A5</td>\n",
" <td>B5</td>\n",
" <td>C5</td>\n",
" <td>D5</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D E\n",
"0 A0 B0 C0 D0 NaN\n",
"1 A1 B1 C1 D1 NaN\n",
"2 A2 B2 C2 D2 NaN\n",
"3 A3 B3 C3 D3 NaN\n",
"4 A4 B4 C4 D4 NaN\n",
"5 A5 B5 C5 D5 NaN\n",
"0 a0 b0 c0 d0 NaN\n",
"1 a1 b1 c1 d1 NaN\n",
"2 a2 b2 c2 d2 NaN\n",
"3 a3 b3 c3 d3 NaN\n",
"4 a4 b4 c4 d4 NaN\n",
"5 a5 b5 c5 d5 NaN\n",
"2 a2 b2 NaN d2 e2\n",
"4 a4 b4 NaN d4 e4\n",
"6 a6 b6 NaN d6 e6\n",
"8 a8 b8 NaN d8 e8\n",
"0 A0 B0 C0 D0 NaN\n",
"1 A1 B1 C1 D1 NaN\n",
"2 A2 B2 C2 D2 NaN\n",
"3 A3 B3 C3 D3 NaN\n",
"4 A4 B4 C4 D4 NaN\n",
"5 A5 B5 C5 D5 NaN"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Z tym, że tutaj można przekazać więcej elementów:\n",
"pd.concat([df1, df2, df3, df1])"
]
},
{
"cell_type": "markdown",
"id": "09f8f087",
"metadata": {},
"source": [
"Domyślną osią jest oś `0`, co oznacza, że dane są dołączane pod spodem."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "ed9b7d6f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>D2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A5</td>\n",
" <td>B5</td>\n",
" <td>C5</td>\n",
" <td>D5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a0</td>\n",
" <td>b0</td>\n",
" <td>c0</td>\n",
" <td>d0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>a1</td>\n",
" <td>b1</td>\n",
" <td>c1</td>\n",
" <td>d1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>c2</td>\n",
" <td>d2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>a3</td>\n",
" <td>b3</td>\n",
" <td>c3</td>\n",
" <td>d3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>c4</td>\n",
" <td>d4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>a5</td>\n",
" <td>b5</td>\n",
" <td>c5</td>\n",
" <td>d5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D\n",
"0 A0 B0 C0 D0\n",
"1 A1 B1 C1 D1\n",
"2 A2 B2 C2 D2\n",
"3 A3 B3 C3 D3\n",
"4 A4 B4 C4 D4\n",
"5 A5 B5 C5 D5\n",
"0 a0 b0 c0 d0\n",
"1 a1 b1 c1 d1\n",
"2 a2 b2 c2 d2\n",
"3 a3 b3 c3 d3\n",
"4 a4 b4 c4 d4\n",
"5 a5 b5 c5 d5"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1, df2], axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "9192ca48",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" <th>E</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>D2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A5</td>\n",
" <td>B5</td>\n",
" <td>C5</td>\n",
" <td>D5</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>NaN</td>\n",
" <td>d2</td>\n",
" <td>e2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>NaN</td>\n",
" <td>d4</td>\n",
" <td>e4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>a6</td>\n",
" <td>b6</td>\n",
" <td>NaN</td>\n",
" <td>d6</td>\n",
" <td>e6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>a8</td>\n",
" <td>b8</td>\n",
" <td>NaN</td>\n",
" <td>d8</td>\n",
" <td>e8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D E\n",
"0 A0 B0 C0 D0 NaN\n",
"1 A1 B1 C1 D1 NaN\n",
"2 A2 B2 C2 D2 NaN\n",
"3 A3 B3 C3 D3 NaN\n",
"4 A4 B4 C4 D4 NaN\n",
"5 A5 B5 C5 D5 NaN\n",
"2 a2 b2 NaN d2 e2\n",
"4 a4 b4 NaN d4 e4\n",
"6 a6 b6 NaN d6 e6\n",
"8 a8 b8 NaN d8 e8"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1, df3], axis=0)"
]
},
{
"cell_type": "markdown",
"id": "408fc196",
"metadata": {},
"source": [
"Gdy podamy `axis=1`, to dane są umieszczane obok.\n",
"Teraz indeksy wierszy decydują, gdzie które rekordy się dopasują."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "5a788e3c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" <td>a0</td>\n",
" <td>b0</td>\n",
" <td>c0</td>\n",
" <td>d0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" <td>a1</td>\n",
" <td>b1</td>\n",
" <td>c1</td>\n",
" <td>d1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>D2</td>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>c2</td>\n",
" <td>d2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" <td>a3</td>\n",
" <td>b3</td>\n",
" <td>c3</td>\n",
" <td>d3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>c4</td>\n",
" <td>d4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A5</td>\n",
" <td>B5</td>\n",
" <td>C5</td>\n",
" <td>D5</td>\n",
" <td>a5</td>\n",
" <td>b5</td>\n",
" <td>c5</td>\n",
" <td>d5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D A B C D\n",
"0 A0 B0 C0 D0 a0 b0 c0 d0\n",
"1 A1 B1 C1 D1 a1 b1 c1 d1\n",
"2 A2 B2 C2 D2 a2 b2 c2 d2\n",
"3 A3 B3 C3 D3 a3 b3 c3 d3\n",
"4 A4 B4 C4 D4 a4 b4 c4 d4\n",
"5 A5 B5 C5 D5 a5 b5 c5 d5"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1, df2], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "25386e54",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>D</th>\n",
" <th>E</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>D2</td>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>d2</td>\n",
" <td>e2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>d4</td>\n",
" <td>e4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A5</td>\n",
" <td>B5</td>\n",
" <td>C5</td>\n",
" <td>D5</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>a6</td>\n",
" <td>b6</td>\n",
" <td>d6</td>\n",
" <td>e6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>a8</td>\n",
" <td>b8</td>\n",
" <td>d8</td>\n",
" <td>e8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A B C D A B D E\n",
"0 A0 B0 C0 D0 NaN NaN NaN NaN\n",
"1 A1 B1 C1 D1 NaN NaN NaN NaN\n",
"2 A2 B2 C2 D2 a2 b2 d2 e2\n",
"3 A3 B3 C3 D3 NaN NaN NaN NaN\n",
"4 A4 B4 C4 D4 a4 b4 d4 e4\n",
"5 A5 B5 C5 D5 NaN NaN NaN NaN\n",
"6 NaN NaN NaN NaN a6 b6 d6 e6\n",
"8 NaN NaN NaN NaN a8 b8 d8 e8"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1, df3], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a835a43a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A_x</th>\n",
" <th>B_x</th>\n",
" <th>C</th>\n",
" <th>D_x</th>\n",
" <th>A_y</th>\n",
" <th>B_y</th>\n",
" <th>D_y</th>\n",
" <th>E</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A0</td>\n",
" <td>B0</td>\n",
" <td>C0</td>\n",
" <td>D0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A1</td>\n",
" <td>B1</td>\n",
" <td>C1</td>\n",
" <td>D1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A2</td>\n",
" <td>B2</td>\n",
" <td>C2</td>\n",
" <td>D2</td>\n",
" <td>a2</td>\n",
" <td>b2</td>\n",
" <td>d2</td>\n",
" <td>e2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A3</td>\n",
" <td>B3</td>\n",
" <td>C3</td>\n",
" <td>D3</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A4</td>\n",
" <td>B4</td>\n",
" <td>C4</td>\n",
" <td>D4</td>\n",
" <td>a4</td>\n",
" <td>b4</td>\n",
" <td>d4</td>\n",
" <td>e4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A5</td>\n",
" <td>B5</td>\n",
" <td>C5</td>\n",
" <td>D5</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>a6</td>\n",
" <td>b6</td>\n",
" <td>d6</td>\n",
" <td>e6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>a8</td>\n",
" <td>b8</td>\n",
" <td>d8</td>\n",
" <td>e8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" A_x B_x C D_x A_y B_y D_y E\n",
"0 A0 B0 C0 D0 NaN NaN NaN NaN\n",
"1 A1 B1 C1 D1 NaN NaN NaN NaN\n",
"2 A2 B2 C2 D2 a2 b2 d2 e2\n",
"3 A3 B3 C3 D3 NaN NaN NaN NaN\n",
"4 A4 B4 C4 D4 a4 b4 d4 e4\n",
"5 A5 B5 C5 D5 NaN NaN NaN NaN\n",
"6 NaN NaN NaN NaN a6 b6 d6 e6\n",
"8 NaN NaN NaN NaN a8 b8 d8 e8"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.merge(df3, left_index=True, right_index=True, how='outer')"
]
},
{
"cell_type": "markdown",
"id": "cb8edd95",
"metadata": {},
"source": [
"Jak widać, w opracji concat dane są dopasowywane na podstawie indeksów (axis=1) lub nazw kolumn (axis=0)"
]
},
{
"cell_type": "markdown",
"id": "fb005fdf",
"metadata": {},
"source": [
"Złączenie krzyżowe / iloczyn kartezjański, w SQL CROSS JOIN - tworzy wszystkie możliwe kombinacje rekordów z lewej i prawej tabeli."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b0cacdb1",
"metadata": {},
"outputs": [],
"source": [
"a = pd.DataFrame({\n",
" \"litera\": ['A','B','C','D'],\n",
" \"imie\": ['Ala','Basia','Celina','Dorota'],\n",
"}, index=[1,2,3,4])\n",
"b = pd.DataFrame({\n",
" \"litera\": ['M','N','P'],\n",
" \"imie\": ['Michał','Norbert','Patryk'],\n",
"}, index=[11,12,13])"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "9edea132",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>litera</th>\n",
" <th>imie</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A</td>\n",
" <td>Ala</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>B</td>\n",
" <td>Basia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>C</td>\n",
" <td>Celina</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>D</td>\n",
" <td>Dorota</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" litera imie\n",
"1 A Ala\n",
"2 B Basia\n",
"3 C Celina\n",
"4 D Dorota"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "e3c6914a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>litera</th>\n",
" <th>imie</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>M</td>\n",
" <td>Michał</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>N</td>\n",
" <td>Norbert</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>P</td>\n",
" <td>Patryk</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" litera imie\n",
"11 M Michał\n",
"12 N Norbert\n",
"13 P Patryk"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "a34723c5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>litera_x</th>\n",
" <th>imie_x</th>\n",
" <th>litera_y</th>\n",
" <th>imie_y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>Ala</td>\n",
" <td>M</td>\n",
" <td>Michał</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A</td>\n",
" <td>Ala</td>\n",
" <td>N</td>\n",
" <td>Norbert</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A</td>\n",
" <td>Ala</td>\n",
" <td>P</td>\n",
" <td>Patryk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>B</td>\n",
" <td>Basia</td>\n",
" <td>M</td>\n",
" <td>Michał</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>B</td>\n",
" <td>Basia</td>\n",
" <td>N</td>\n",
" <td>Norbert</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>B</td>\n",
" <td>Basia</td>\n",
" <td>P</td>\n",
" <td>Patryk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>C</td>\n",
" <td>Celina</td>\n",
" <td>M</td>\n",
" <td>Michał</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>C</td>\n",
" <td>Celina</td>\n",
" <td>N</td>\n",
" <td>Norbert</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>C</td>\n",
" <td>Celina</td>\n",
" <td>P</td>\n",
" <td>Patryk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>D</td>\n",
" <td>Dorota</td>\n",
" <td>M</td>\n",
" <td>Michał</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>D</td>\n",
" <td>Dorota</td>\n",
" <td>N</td>\n",
" <td>Norbert</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>D</td>\n",
" <td>Dorota</td>\n",
" <td>P</td>\n",
" <td>Patryk</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" litera_x imie_x litera_y imie_y\n",
"0 A Ala M Michał\n",
"1 A Ala N Norbert\n",
"2 A Ala P Patryk\n",
"3 B Basia M Michał\n",
"4 B Basia N Norbert\n",
"5 B Basia P Patryk\n",
"6 C Celina M Michał\n",
"7 C Celina N Norbert\n",
"8 C Celina P Patryk\n",
"9 D Dorota M Michał\n",
"10 D Dorota N Norbert\n",
"11 D Dorota P Patryk"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.merge(b, how='cross')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d820be7d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
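The two ideas that close the notebook above (label alignment in `concat` and the cross join) can be checked with a minimal sketch. The small frames and the `_left`/`_right` suffixes below are illustrative assumptions, not data from the notebook; `how='cross'` requires pandas 1.2 or later.

```python
import pandas as pd

# Two small illustrative frames (hypothetical data modelled on the notebook's a and b)
a = pd.DataFrame({"litera": ["A", "B"], "imie": ["Ala", "Basia"]}, index=[1, 2])
b = pd.DataFrame(
    {"litera": ["M", "N", "P"], "imie": ["Michał", "Norbert", "Patryk"]},
    index=[11, 12, 13],
)

# Cross join: every row of a paired with every row of b.
# The original indexes are discarded and a fresh 0..n-1 index is created.
cross = a.merge(b, how="cross", suffixes=("_left", "_right"))
print(cross.shape)  # rows = len(a) * len(b)

# concat aligns on labels rather than on position
df1 = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df2 = pd.DataFrame({"E": ["e1", "e2"]}, index=[1, 2])

side_by_side = pd.concat([df1, df2], axis=1)  # rows matched by index label
stacked = pd.concat([df1, df2], axis=0)       # columns matched by name

print(side_by_side)
print(stacked)
```

In both `concat` results the cells with no matching label are filled with `NaN`, which is the same behaviour the outer merge on indexes shows above.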
This source diff could not be displayed because it is too large. You can view the blob instead.
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "b513146a",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from datetime import datetime, date"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e00dfe57",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"datetime.date(2023, 5, 30)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date.today()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "15a50eef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-05-30\n"
]
}
],
"source": [
"print(date.today())"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1bafeaf9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"datetime.datetime(2023, 5, 30, 11, 41, 33, 423933)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"datetime.now()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "deaed5fe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-05-30 11:41:33.524192\n"
]
}
],
"source": [
"print(datetime.now())"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ba0d0479",
"metadata": {},
"outputs": [],
"source": [
"dt = datetime.now()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "91d549e7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"41"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt.minute"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "92f86fde",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'30.05.2023 11:41:33'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt.strftime('%d.%m.%Y %H:%M:%S')"
]
},
{
"cell_type": "markdown",
"id": "67d95001",
"metadata": {},
"source": [
"Przypomnijmy sobie, że w języku Python mamy konstrukcję `range`, a w Numpy mamy `arange` - kontrukcje służące do generowania serii liczb.`"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "593e7394",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([10. , 10.5, 11. , 11.5, 12. , 12.5, 13. , 13.5, 14. , 14.5])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.arange(10, 15, 0.5)"
]
},
{
"cell_type": "markdown",
"id": "763f9368",
"metadata": {},
"source": [
"Pandas oferuje analogiczne rozwiązanie do generowania \"punktów w czasie\".\n",
"Różnica względem `range` - przedział `date_range` jest obustronnie domknięty."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a3ec6293",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-07-01', '2022-07-02', '2022-07-03', '2022-07-04',\n",
" '2022-07-05', '2022-07-06', '2022-07-07', '2022-07-08',\n",
" '2022-07-09', '2022-07-10', '2022-07-11', '2022-07-12',\n",
" '2022-07-13', '2022-07-14', '2022-07-15', '2022-07-16',\n",
" '2022-07-17', '2022-07-18', '2022-07-19', '2022-07-20',\n",
" '2022-07-21', '2022-07-22', '2022-07-23', '2022-07-24',\n",
" '2022-07-25', '2022-07-26', '2022-07-27', '2022-07-28',\n",
" '2022-07-29', '2022-07-30', '2022-07-31'],\n",
" dtype='datetime64[ns]', freq='D')"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-07-01', '2022-07-31')"
]
},
{
"cell_type": "markdown",
"id": "f45df065",
"metadata": {},
"source": [
"Domyślnie stosowana jest częstotliwość `D` czyli jednego dnia. Ale można użyć innych, do czego służą specjalne symbole.\n",
"\n",
"Inne literki: `D M W Q Y H` https://stackoverflow.com/questions/35339139/what-values-are-valid-in-pandas-freq-tags"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a31ad36d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',\n",
" '2022-05-31', '2022-06-30', '2022-07-31', '2022-08-31',\n",
" '2022-09-30', '2022-10-31', '2022-11-30', '2022-12-31'],\n",
" dtype='datetime64[ns]', freq='M')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-01-01', '2022-12-31', freq='M')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7288f27d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',\n",
" '2022-05-01', '2022-06-01', '2022-07-01', '2022-08-01',\n",
" '2022-09-01', '2022-10-01', '2022-11-01', '2022-12-01'],\n",
" dtype='datetime64[ns]', freq='MS')"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-01-01', '2022-12-31', freq='MS')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "4e1f48bf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',\n",
" '2022-05-31', '2022-06-30', '2022-07-31', '2022-08-31',\n",
" '2022-09-30', '2022-10-31', '2022-11-30'],\n",
" dtype='datetime64[ns]', freq='M')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-01-15', '2022-12-15', freq='M')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0b86b8d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-02-01', '2022-03-01', '2022-04-01', '2022-05-01',\n",
" '2022-06-01', '2022-07-01', '2022-08-01', '2022-09-01',\n",
" '2022-10-01', '2022-11-01', '2022-12-01'],\n",
" dtype='datetime64[ns]', freq='MS')"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-01-15', '2022-12-15', freq='MS')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "aef3049d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-01-02', '2022-01-09', '2022-01-16', '2022-01-23',\n",
" '2022-01-30', '2022-02-06', '2022-02-13', '2022-02-20',\n",
" '2022-02-27', '2022-03-06', '2022-03-13', '2022-03-20',\n",
" '2022-03-27', '2022-04-03', '2022-04-10', '2022-04-17',\n",
" '2022-04-24', '2022-05-01', '2022-05-08', '2022-05-15',\n",
" '2022-05-22', '2022-05-29', '2022-06-05', '2022-06-12',\n",
" '2022-06-19', '2022-06-26', '2022-07-03', '2022-07-10',\n",
" '2022-07-17', '2022-07-24', '2022-07-31', '2022-08-07',\n",
" '2022-08-14', '2022-08-21', '2022-08-28', '2022-09-04',\n",
" '2022-09-11', '2022-09-18', '2022-09-25', '2022-10-02',\n",
" '2022-10-09', '2022-10-16', '2022-10-23', '2022-10-30',\n",
" '2022-11-06', '2022-11-13', '2022-11-20', '2022-11-27',\n",
" '2022-12-04', '2022-12-11', '2022-12-18', '2022-12-25'],\n",
" dtype='datetime64[ns]', freq='W-SUN')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-01-01', '2022-12-31', freq='W')"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "62da9f7b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-01-03', '2022-01-10', '2022-01-17', '2022-01-24',\n",
" '2022-01-31', '2022-02-07', '2022-02-14', '2022-02-21',\n",
" '2022-02-28', '2022-03-07', '2022-03-14', '2022-03-21',\n",
" '2022-03-28', '2022-04-04', '2022-04-11', '2022-04-18',\n",
" '2022-04-25', '2022-05-02', '2022-05-09', '2022-05-16',\n",
" '2022-05-23', '2022-05-30', '2022-06-06', '2022-06-13',\n",
" '2022-06-20', '2022-06-27', '2022-07-04', '2022-07-11',\n",
" '2022-07-18', '2022-07-25', '2022-08-01', '2022-08-08',\n",
" '2022-08-15', '2022-08-22', '2022-08-29', '2022-09-05',\n",
" '2022-09-12', '2022-09-19', '2022-09-26', '2022-10-03',\n",
" '2022-10-10', '2022-10-17', '2022-10-24', '2022-10-31',\n",
" '2022-11-07', '2022-11-14', '2022-11-21', '2022-11-28',\n",
" '2022-12-05', '2022-12-12', '2022-12-19', '2022-12-26'],\n",
" dtype='datetime64[ns]', freq='W-MON')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-01-01', '2022-12-31', freq='W-MON')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "ed4fafa2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-03-31', '2022-06-30', '2022-09-30', '2022-12-31'], dtype='datetime64[ns]', freq='Q-DEC')"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-01-01', '2022-12-31', freq='Q')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "fe836e03",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-01-01', '2022-04-01', '2022-07-01', '2022-10-01'], dtype='datetime64[ns]', freq='QS-JAN')"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-01-01', '2022-12-31', freq='QS')"
]
},
{
"cell_type": "markdown",
"id": "4d53bda8",
"metadata": {},
"source": [
"Zamiast daty końcowej można podać oczekiwanę liczbę elementów."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "58e61194",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',\n",
" '2022-05-01', '2022-06-01'],\n",
" dtype='datetime64[ns]', freq='MS')"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-01-01', periods=6, freq='MS')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "5641c244",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2023-05-30', '2023-05-31', '2023-06-01', '2023-06-02',\n",
" '2023-06-03', '2023-06-04', '2023-06-05', '2023-06-06',\n",
" '2023-06-07', '2023-06-08'],\n",
" dtype='datetime64[ns]', freq='D')"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range(date.today(), periods=10)"
]
},
{
"cell_type": "markdown",
"id": "3f39623e",
"metadata": {},
"source": [
"Dziesięć niedziel od dzisiaj:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "5f47eabb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2023-06-04', '2023-06-11', '2023-06-18', '2023-06-25',\n",
" '2023-07-02', '2023-07-09', '2023-07-16', '2023-07-23',\n",
" '2023-07-30', '2023-08-06'],\n",
" dtype='datetime64[ns]', freq='W-SUN')"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range(date.today(), periods=10, freq='W')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "1ca5af2e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2022-05-01 00:00:00', '2022-05-01 01:00:00',\n",
" '2022-05-01 02:00:00', '2022-05-01 03:00:00',\n",
" '2022-05-01 04:00:00', '2022-05-01 05:00:00',\n",
" '2022-05-01 06:00:00', '2022-05-01 07:00:00',\n",
" '2022-05-01 08:00:00', '2022-05-01 09:00:00',\n",
" '2022-05-01 10:00:00', '2022-05-01 11:00:00',\n",
" '2022-05-01 12:00:00', '2022-05-01 13:00:00',\n",
" '2022-05-01 14:00:00', '2022-05-01 15:00:00',\n",
" '2022-05-01 16:00:00', '2022-05-01 17:00:00',\n",
" '2022-05-01 18:00:00', '2022-05-01 19:00:00',\n",
" '2022-05-01 20:00:00', '2022-05-01 21:00:00',\n",
" '2022-05-01 22:00:00', '2022-05-01 23:00:00',\n",
" '2022-05-02 00:00:00', '2022-05-02 01:00:00',\n",
" '2022-05-02 02:00:00', '2022-05-02 03:00:00',\n",
" '2022-05-02 04:00:00', '2022-05-02 05:00:00',\n",
" '2022-05-02 06:00:00', '2022-05-02 07:00:00',\n",
" '2022-05-02 08:00:00', '2022-05-02 09:00:00',\n",
" '2022-05-02 10:00:00', '2022-05-02 11:00:00',\n",
" '2022-05-02 12:00:00', '2022-05-02 13:00:00',\n",
" '2022-05-02 14:00:00', '2022-05-02 15:00:00',\n",
" '2022-05-02 16:00:00', '2022-05-02 17:00:00',\n",
" '2022-05-02 18:00:00', '2022-05-02 19:00:00',\n",
" '2022-05-02 20:00:00', '2022-05-02 21:00:00',\n",
" '2022-05-02 22:00:00', '2022-05-02 23:00:00',\n",
" '2022-05-03 00:00:00'],\n",
" dtype='datetime64[ns]', freq='H')"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.date_range('2022-05-01', '2022-05-03', freq='H')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "b78c1e27",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2023-05-30 11:41:35.224442', '2023-05-30 12:11:35.224442',\n",
" '2023-05-30 12:41:35.224442', '2023-05-30 13:11:35.224442',\n",
" '2023-05-30 13:41:35.224442', '2023-05-30 14:11:35.224442',\n",
" '2023-05-30 14:41:35.224442', '2023-05-30 15:11:35.224442',\n",
" '2023-05-30 15:41:35.224442', '2023-05-30 16:11:35.224442',\n",
" '2023-05-30 16:41:35.224442', '2023-05-30 17:11:35.224442',\n",
" '2023-05-30 17:41:35.224442', '2023-05-30 18:11:35.224442',\n",
" '2023-05-30 18:41:35.224442', '2023-05-30 19:11:35.224442'],\n",
" dtype='datetime64[ns]', freq='30T')"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 30 minut\n",
"pd.date_range(datetime.now(), periods=16, freq='30T')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
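The behaviour of `date_range` described in the notebook above can be verified with a short sketch. It assumes only pandas itself and uses the aliases `MS` and `W-MON`, which are not affected by the alias renames in newer pandas versions.

```python
import pandas as pd

# Both endpoints are included, unlike Python's range
daily = pd.date_range("2022-07-01", "2022-07-31")
print(len(daily))         # number of days in July
print(len(range(1, 31)))  # range stops before its end value

# With an anchored frequency only matching dates inside the interval are produced:
# 'MS' (month start) skips January here because 2022-01-01 is before the start date
month_starts = pd.date_range("2022-01-15", "2022-12-15", freq="MS")
print(month_starts[0], month_starts[-1])

# periods replaces the end date: the first three Mondays on or after the start
mondays = pd.date_range("2022-01-01", periods=3, freq="W-MON")
print(list(mondays.strftime("%Y-%m-%d")))
```

Exactly three of `start`, `end`, `periods` and `freq` must be given; the fourth is derived from the others.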
This source diff could not be displayed because it is too large. You can view the blob instead.
{
"cells": [
{
"cell_type": "markdown",
"id": "f9eebfea",
"metadata": {},
"source": [
"# Dostęp do baz danych SQL w Pythonie\n",
"\n",
"\n",
"## Moduł sqlite3\n",
"\n",
"Jest to element biblioteki standardowej Pythona. Pozwala korzystać z baz danych SQLite. Zbiór danych SQLite zapisany jest w formie zwykłego pliku. Nie trzeba instalować niczego dodatkowego na swoim komputerze, a można wykonywać zapytania SQL."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e1de17a4",
"metadata": {},
"outputs": [],
"source": [
"import sqlite3 as sql"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a5abd4fc",
"metadata": {},
"outputs": [],
"source": [
"connection = sql.connect('hr.db')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a11857e9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<sqlite3.Connection at 0x7fc1ba894740>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"connection"
]
},
{
"cell_type": "markdown",
"id": "bab3a1ba",
"metadata": {},
"source": [
"Najłatwiej wykonać zapytanie bezpośrednio na obiekcie connection. Wynikiem tego polecenia jest obiekt kursora."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f1612439",
"metadata": {},
"outputs": [],
"source": [
"kursor = connection.execute('SELECT * FROM employees')"
]
},
{
"cell_type": "markdown",
"id": "813afc5f",
"metadata": {},
"source": [
"Za pomocą kursora można odczytywać rekordy pojedynczo, porcjami, albo od razu wszystkie.\n",
"\n",
"Odczyt wszystkich wyników w postaci listy. Każdy rekord na tej liście jest zapisany jako tupla."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "0f6704f0",
"metadata": {},
"outputs": [],
"source": [
"lista = kursor.fetchall()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2b242430",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(100,\n",
" 'Steven',\n",
" 'King',\n",
" 'SKING',\n",
" '515.123.4567',\n",
" '1987-06-17',\n",
" 'AD_PRES',\n",
" 24000,\n",
" None,\n",
" None,\n",
" 90),\n",
" (101,\n",
" 'Neena',\n",
" 'Kochhar',\n",
" 'NKOCHHAR',\n",
" '515.123.4568',\n",
" '1989-09-21',\n",
" 'AD_VP',\n",
" 17000,\n",
" None,\n",
" 100,\n",
" 90),\n",
" (102,\n",
" 'Lex',\n",
" 'De Haan',\n",
" 'LDEHAAN',\n",
" '515.123.4569',\n",
" '1993-01-13',\n",
" 'AD_VP',\n",
" 17000,\n",
" None,\n",
" 100,\n",
" 90),\n",
" (103,\n",
" 'Alexander',\n",
" 'Hunold',\n",
" 'AHUNOLD',\n",
" '590.423.4567',\n",
" '1990-01-03',\n",
" 'IT_PROG',\n",
" 9000,\n",
" None,\n",
" 102,\n",
" 60),\n",
" (104,\n",
" 'Bruce',\n",
" 'Ernst',\n",
" 'BERNST',\n",
" '590.423.4568',\n",
" '1991-05-21',\n",
" 'IT_PROG',\n",
" 6000,\n",
" None,\n",
" 103,\n",
" 60)]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lista[0:5]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "26a3ba59",
"metadata": {},
"outputs": [],
"source": [
"rekord = lista[0]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "07d836ac",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tuple"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(rekord)"
]
},
{
"cell_type": "markdown",
"id": "8d89ca70",
"metadata": {},
"source": [
"Pojedynczy rekord jest tuplą. Dostęp do pól wygląda tak:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "22a5c746",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Steven'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rekord[1]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1ca115ca",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"24000"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rekord[7]"
]
},
{
"cell_type": "markdown",
"id": "cd60daca",
"metadata": {},
"source": [
"W przypadku dużej liczby rekordów bardziej wydajne jest pobieranie rekordów wynikowych w pętli bezpośrednio z kursora, bez tworzenia listy wszystkich rekordów na raz."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "acb11cae",
"metadata": {},
"outputs": [],
"source": [
"zapytanie = '''SELECT first_name, last_name, job_title, salary, department_name, city\n",
"FROM employees\n",
" LEFT JOIN jobs USING(job_id)\n",
" LEFT JOIN departments USING(department_id)\n",
" LEFT JOIN locations USING(location_id)\n",
"ORDER BY salary DESC'''\n",
"kursor = connection.execute(zapytanie)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "c49bb272",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pracownik Steven King (President) zarabia 24000. Departament Executive w mieście Seattle.\n",
"Pracownik Neena Kochhar (Administration Vice President) zarabia 17000. Departament Executive w mieście Seattle.\n",
"Pracownik Lex De Haan (Administration Vice President) zarabia 17000. Departament Executive w mieście Seattle.\n",
"Pracownik John Russell (Sales Manager) zarabia 14000. Departament Sales w mieście Oxford.\n",
"Pracownik Karen Partners (Sales Manager) zarabia 13500. Departament Sales w mieście Oxford.\n",
"Pracownik Michael Hartstein (Marketing Manager) zarabia 13000. Departament Marketing w mieście Toronto.\n",
"Pracownik Nancy Greenberg (Finance Manager) zarabia 12000. Departament Finance w mieście Seattle.\n",
"Pracownik Alberto Errazuriz (Sales Manager) zarabia 12000. Departament Sales w mieście Oxford.\n",
"Pracownik Shelley Higgins (Accounting Manager) zarabia 12000. Departament Accounting w mieście Seattle.\n",
"Pracownik Lisa Ozer (Sales Representative) zarabia 11500. Departament Sales w mieście Oxford.\n",
"Pracownik Den Raphaely (Purchasing Manager) zarabia 11000. Departament Purchasing w mieście Seattle.\n",
"Pracownik Gerald Cambrault (Sales Manager) zarabia 11000. Departament Sales w mieście Oxford.\n",
"Pracownik Ellen Abel (Sales Representative) zarabia 11000. Departament Sales w mieście Oxford.\n",
"Pracownik Eleni Zlotkey (Sales Manager) zarabia 10500. Departament Sales w mieście Oxford.\n",
"Pracownik Clara Vishney (Sales Representative) zarabia 10500. Departament Sales w mieście Oxford.\n",
"Pracownik Peter Tucker (Sales Representative) zarabia 10000. Departament Sales w mieście Oxford.\n",
"Pracownik Janette King (Sales Representative) zarabia 10000. Departament Sales w mieście Oxford.\n",
"Pracownik Harrison Bloom (Sales Representative) zarabia 10000. Departament Sales w mieście Oxford.\n",
"Pracownik Hermann Baer (Public Relations Representative) zarabia 10000. Departament Public Relations w mieście Munich.\n",
"Pracownik Tayler Fox (Sales Representative) zarabia 9600. Departament Sales w mieście Oxford.\n",
"Pracownik David Bernstein (Sales Representative) zarabia 9500. Departament Sales w mieście Oxford.\n",
"Pracownik Patrick Sully (Sales Representative) zarabia 9500. Departament Sales w mieście Oxford.\n",
"Pracownik Danielle Greene (Sales Representative) zarabia 9500. Departament Sales w mieście Oxford.\n",
"Pracownik Alexander Hunold (Programmer) zarabia 9000. Departament IT w mieście Southlake.\n",
"Pracownik Daniel Faviet (Accountant) zarabia 9000. Departament Finance w mieście Seattle.\n",
"Pracownik Peter Hall (Sales Representative) zarabia 9000. Departament Sales w mieście Oxford.\n",
"Pracownik Allan McEwen (Sales Representative) zarabia 9000. Departament Sales w mieście Oxford.\n",
"Pracownik Alyssa Hutton (Sales Representative) zarabia 8800. Departament Sales w mieście Oxford.\n",
"Pracownik Jonathon Taylor (Sales Representative) zarabia 8600. Departament Sales w mieście Oxford.\n",
"Pracownik Jack Livingston (Sales Representative) zarabia 8400. Departament Sales w mieście Oxford.\n",
"Pracownik William Gietz (Public Accountant) zarabia 8300. Departament Accounting w mieście Seattle.\n",
"Pracownik John Chen (Accountant) zarabia 8200. Departament Finance w mieście Seattle.\n",
"Pracownik Adam Fripp (Stock Manager) zarabia 8200. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Matthew Weiss (Stock Manager) zarabia 8000. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Christopher Olsen (Sales Representative) zarabia 8000. Departament Sales w mieście Oxford.\n",
"Pracownik Lindsey Smith (Sales Representative) zarabia 8000. Departament Sales w mieście Oxford.\n",
"Pracownik Payam Kaufling (Stock Manager) zarabia 7900. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Jose Manuel Urman (Accountant) zarabia 7800. Departament Finance w mieście Seattle.\n",
"Pracownik Ismael Sciarra (Accountant) zarabia 7700. Departament Finance w mieście Seattle.\n",
"Pracownik Nanette Cambrault (Sales Representative) zarabia 7500. Departament Sales w mieście Oxford.\n",
"Pracownik Louise Doran (Sales Representative) zarabia 7500. Departament Sales w mieście Oxford.\n",
"Pracownik William Smith (Sales Representative) zarabia 7400. Departament Sales w mieście Oxford.\n",
"Pracownik Elizabeth Bates (Sales Representative) zarabia 7300. Departament Sales w mieście Oxford.\n",
"Pracownik Mattea Marvins (Sales Representative) zarabia 7200. Departament Sales w mieście Oxford.\n",
"Pracownik Oliver Tuvault (Sales Representative) zarabia 7000. Departament Sales w mieście Oxford.\n",
"Pracownik Sarath Sewall (Sales Representative) zarabia 7000. Departament Sales w mieście Oxford.\n",
"Pracownik Kimberely Grant (Sales Representative) zarabia 7000. Departament None w mieście None.\n",
"Pracownik Luis Popp (Accountant) zarabia 6900. Departament Finance w mieście Seattle.\n",
"Pracownik David Lee (Sales Representative) zarabia 6800. Departament Sales w mieście Oxford.\n",
"Pracownik Shanta Vollman (Stock Manager) zarabia 6500. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Susan Mavris (Human Resources Representative) zarabia 6500. Departament Human Resources w mieście London.\n",
"Pracownik Sundar Ande (Sales Representative) zarabia 6400. Departament Sales w mieście Oxford.\n",
"Pracownik Amit Banda (Sales Representative) zarabia 6200. Departament Sales w mieście Oxford.\n",
"Pracownik Charles Johnson (Sales Representative) zarabia 6200. Departament Sales w mieście Oxford.\n",
"Pracownik Sundita Kumar (Sales Representative) zarabia 6100. Departament Sales w mieście Oxford.\n",
"Pracownik Bruce Ernst (Programmer) zarabia 6000. Departament IT w mieście Southlake.\n",
"Pracownik Pat Fay (Marketing Representative) zarabia 6000. Departament Marketing w mieście Toronto.\n",
"Pracownik Kevin Mourgos (Stock Manager) zarabia 5800. Departament Shipping w mieście South San Francisco.\n",
"Pracownik David Austin (Programmer) zarabia 4800. Departament IT w mieście Southlake.\n",
"Pracownik Valli Pataballa (Programmer) zarabia 4800. Departament IT w mieście Southlake.\n",
"Pracownik Jennifer Whalen (Administration Assistant) zarabia 4400. Departament Administration w mieście Seattle.\n",
"Pracownik Diana Lorentz (Programmer) zarabia 4200. Departament IT w mieście Southlake.\n",
"Pracownik Nandita Sarchand (Shipping Clerk) zarabia 4200. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Alexis Bull (Shipping Clerk) zarabia 4100. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Sarah Bell (Shipping Clerk) zarabia 4000. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Britney Everett (Shipping Clerk) zarabia 3900. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Kelly Chung (Shipping Clerk) zarabia 3800. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Renske Ladwig (Stock Clerk) zarabia 3600. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Jennifer Dilly (Shipping Clerk) zarabia 3600. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Trenna Rajs (Stock Clerk) zarabia 3500. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Julia Dellinger (Shipping Clerk) zarabia 3400. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Laura Bissot (Stock Clerk) zarabia 3300. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Jason Mallin (Stock Clerk) zarabia 3300. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Julia Nayer (Stock Clerk) zarabia 3200. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Stephen Stiles (Stock Clerk) zarabia 3200. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Winston Taylor (Shipping Clerk) zarabia 3200. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Samuel McCain (Shipping Clerk) zarabia 3200. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Alexander Khoo (Purchasing Clerk) zarabia 3100. Departament Purchasing w mieście Seattle.\n",
"Pracownik Curtis Davies (Stock Clerk) zarabia 3100. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Jean Fleaur (Shipping Clerk) zarabia 3100. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Alana Walsh (Shipping Clerk) zarabia 3100. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Anthony Cabrio (Shipping Clerk) zarabia 3000. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Kevin Feeney (Shipping Clerk) zarabia 3000. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Shelli Baida (Purchasing Clerk) zarabia 2900. Departament Purchasing w mieście Seattle.\n",
"Pracownik Michael Rogers (Stock Clerk) zarabia 2900. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Timothy Gates (Shipping Clerk) zarabia 2900. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Sigal Tobias (Purchasing Clerk) zarabia 2800. Departament Purchasing w mieście Seattle.\n",
"Pracownik Mozhe Atkinson (Stock Clerk) zarabia 2800. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Girard Geoni (Shipping Clerk) zarabia 2800. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Vance Jones (Shipping Clerk) zarabia 2800. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Irene Mikkilineni (Stock Clerk) zarabia 2700. Departament Shipping w mieście South San Francisco.\n",
"Pracownik John Seo (Stock Clerk) zarabia 2700. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Guy Himuro (Purchasing Clerk) zarabia 2600. Departament Purchasing w mieście Seattle.\n",
"Pracownik Randall Matos (Stock Clerk) zarabia 2600. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Donald OConnell (Shipping Clerk) zarabia 2600. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Douglas Grant (Shipping Clerk) zarabia 2600. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Karen Colmenares (Purchasing Clerk) zarabia 2500. Departament Purchasing w mieście Seattle.\n",
"Pracownik James Marlow (Stock Clerk) zarabia 2500. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Joshua Patel (Stock Clerk) zarabia 2500. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Peter Vargas (Stock Clerk) zarabia 2500. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Martha Sullivan (Shipping Clerk) zarabia 2500. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Randall Perkins (Shipping Clerk) zarabia 2500. Departament Shipping w mieście South San Francisco.\n",
"Pracownik James Landry (Stock Clerk) zarabia 2400. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Ki Gee (Stock Clerk) zarabia 2400. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Steven Markle (Stock Clerk) zarabia 2200. Departament Shipping w mieście South San Francisco.\n",
"Pracownik Hazel Philtanker (Stock Clerk) zarabia 2200. Departament Shipping w mieście South San Francisco.\n",
"Pracownik TJ Olson (Stock Clerk) zarabia 2100. Departament Shipping w mieście South San Francisco.\n"
]
}
],
"source": [
"for rekord in kursor:\n",
" print(f'Pracownik {rekord[0]} {rekord[1]} ({rekord[2]}) zarabia {rekord[3]}. Departament {rekord[4]} w mieście {rekord[5]}.')"
]
},
{
"cell_type": "markdown",
"id": "ec86e166",
"metadata": {},
"source": [
"### Query parameters\n",
"\n",
"When an SQL query needs a condition, e.g. to list only the employees with a given job:\n",
"\n",
"- We should not load every row of the table and check the condition on the application side (in Python), because that is inefficient. The database should evaluate the condition and return only the matching rows; otherwise we would not benefit from indexes and the other optimizations a database provides. With a server-based database (such as Oracle, MySQL, ...) we would also transfer a lot of data over the network.\n",
"- We should **never** build the SQL query ourselves by concatenating text fragments when some of the values are entered by the user. Doing so exposes the program to an \"SQL injection\" attack."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "14777f13",
"metadata": {},
"outputs": [],
"source": [
"# assume this value may be supplied by the user\n",
"szukany_job = 'IT_PROG'"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "9641f70b",
"metadata": {},
"outputs": [],
"source": [
"# These approaches are WRONG; they expose the program to SQL injection:\n",
"# connection.execute(\"SELECT * FROM employees WHERE job_id = '\" + szukany_job + \"'\")\n",
"# connection.execute(f\"SELECT * FROM employees WHERE job_id = '{szukany_job}'\")"
]
},
{
"cell_type": "markdown",
"id": "7f30da72",
"metadata": {},
"source": [
"In a correctly written program, the values for an SQL query are passed as an additional argument (a tuple) to `execute`, and question marks `?` in the SQL text mark the places where those values are to be bound:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c51b5d46",
"metadata": {},
"outputs": [],
"source": [
"kursor = connection.execute(\"SELECT * FROM employees WHERE job_id = ?\", (szukany_job,))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "a071d07a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(103,\n",
" 'Alexander',\n",
" 'Hunold',\n",
" 'AHUNOLD',\n",
" '590.423.4567',\n",
" '1990-01-03',\n",
" 'IT_PROG',\n",
" 9000,\n",
" None,\n",
" 102,\n",
" 60),\n",
" (104,\n",
" 'Bruce',\n",
" 'Ernst',\n",
" 'BERNST',\n",
" '590.423.4568',\n",
" '1991-05-21',\n",
" 'IT_PROG',\n",
" 6000,\n",
" None,\n",
" 103,\n",
" 60),\n",
" (105,\n",
" 'David',\n",
" 'Austin',\n",
" 'DAUSTIN',\n",
" '590.423.4569',\n",
" '1997-06-25',\n",
" 'IT_PROG',\n",
" 4800,\n",
" None,\n",
" 103,\n",
" 60),\n",
" (106,\n",
" 'Valli',\n",
" 'Pataballa',\n",
" 'VPATABAL',\n",
" '590.423.4560',\n",
" '1998-02-05',\n",
" 'IT_PROG',\n",
" 4800,\n",
" None,\n",
" 103,\n",
" 60),\n",
" (107,\n",
" 'Diana',\n",
" 'Lorentz',\n",
" 'DLORENTZ',\n",
" '590.423.5567',\n",
" '1999-02-07',\n",
" 'IT_PROG',\n",
" 4200,\n",
" None,\n",
" 103,\n",
" 60)]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kursor.fetchall()"
]
},
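{
"cell_type": "markdown",
"id": "c0ffee01",
"metadata": {},
"source": [
"A minimal sketch of why this matters, assuming a throwaway in-memory table (not the real `hr.db`) and a hypothetical malicious input `user_input`. It compares the concatenated query with the parameterized one:\n",
"\n",
"```python\n",
"import sqlite3\n",
"\n",
"# Hypothetical in-memory table standing in for hr.db\n",
"conn = sqlite3.connect(':memory:')\n",
"conn.execute('CREATE TABLE employees (last_name TEXT, job_id TEXT)')\n",
"conn.executemany('INSERT INTO employees VALUES (?, ?)',\n",
"                 [('Hunold', 'IT_PROG'), ('King', 'AD_PRES')])\n",
"\n",
"# Input crafted to widen the WHERE clause\n",
"user_input = \"x' OR '1'='1\"\n",
"\n",
"# Unsafe: the input becomes part of the SQL text\n",
"unsafe = conn.execute(\n",
"    \"SELECT last_name FROM employees WHERE job_id = '\" + user_input + \"'\"\n",
").fetchall()\n",
"\n",
"# Safe: the input is bound as a value and only compared with job_id\n",
"safe = conn.execute(\n",
"    'SELECT last_name FROM employees WHERE job_id = ?', (user_input,)\n",
").fetchall()\n",
"\n",
"print(len(unsafe), len(safe))\n",
"conn.close()\n",
"```\n",
"\n",
"The concatenated query returns every row, because the input rewrites the WHERE clause; the parameterized one returns none, since no `job_id` equals that literal string."
]
},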
{
"cell_type": "markdown",
"id": "b4954aba",
"metadata": {},
"source": [
"### Modifying (writing) data\n",
"\n",
"An example of UPDATE, i.e. changing existing records."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "9b2b0faa",
"metadata": {},
"outputs": [],
"source": [
"podwyzka = 500\n",
"job = 'IT_PROG'"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "5a6b0d17",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<sqlite3.Cursor at 0x7fc1ba9020c0>"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"connection.execute('UPDATE employees SET salary = salary + ? WHERE job_id = ?', (podwyzka, job))"
]
},
{
"cell_type": "markdown",
"id": "0ace0cc9",
"metadata": {},
"source": [
"An example of a single INSERT."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "25f81c4a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<sqlite3.Cursor at 0x7fc1ba902340>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql1 = \"INSERT INTO countries(country_id, country_name, region_id) VALUES (?, ?, ?)\"\n",
"connection.execute(sql1, ('PL', 'Poland', 1))"
]
},
{
"cell_type": "markdown",
"id": "031ff0de",
"metadata": {},
"source": [
"An example of several INSERTs in a series. Calling `executemany` pays off more than a hand-written loop."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "23558cf9",
"metadata": {},
"outputs": [],
"source": [
"kraje = [\n",
" ('CZ', 'Czechia', 1),\n",
" ('SK', 'Slovakia', 1),\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "56199b6d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Series of INSERTs...\n"
]
},
{
"data": {
"text/plain": [
"<sqlite3.Cursor at 0x7fc1ba9025c0>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('Series of INSERTs...')\n",
"sql2 = \"INSERT INTO countries(country_id, country_name, region_id) VALUES (?, ?, ?)\"\n",
"connection.executemany(sql2, kraje)"
]
},
{
"cell_type": "markdown",
"id": "241a39d2",
"metadata": {},
"source": [
"SQL databases support transactions.\n",
"\n",
"In the case of SQLite, the connection was opened in a mode in which changes are not stored permanently until we execute `commit`. So far the file on disk has not changed."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "adabaf4e",
"metadata": {},
"outputs": [],
"source": [
"# This is how the changes would be committed:\n",
"# connection.commit()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "9b8e0d37",
"metadata": {},
"outputs": [],
"source": [
"# And this is how changes are rolled back: our application returns to the state from the moment the connection was opened (or from the previous commit).\n",
"connection.rollback()\n",
"# Even without the rollback, the changes would not be saved to the database, because we never commit them."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "1259c778",
"metadata": {},
"outputs": [],
"source": [
"# Connections should generally be closed:\n",
"connection.close()"
]
},
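{
"cell_type": "markdown",
"id": "c0ffee02",
"metadata": {},
"source": [
"A hedged sketch of the same commit/rollback behaviour using the connection as a context manager, which the standard `sqlite3` module supports. The in-memory `countries` table and the duplicate key are assumptions made only for illustration:\n",
"\n",
"```python\n",
"import sqlite3\n",
"\n",
"conn = sqlite3.connect(':memory:')\n",
"conn.execute('CREATE TABLE countries (country_id TEXT PRIMARY KEY, country_name TEXT)')\n",
"conn.commit()\n",
"\n",
"try:\n",
"    # The with-block commits on success and rolls back if an exception escapes it\n",
"    with conn:\n",
"        conn.execute('INSERT INTO countries VALUES (?, ?)', ('PL', 'Poland'))\n",
"        # A duplicate primary key raises sqlite3.IntegrityError\n",
"        conn.execute('INSERT INTO countries VALUES (?, ?)', ('PL', 'Polska'))\n",
"except sqlite3.IntegrityError:\n",
"    pass\n",
"\n",
"# Both inserts were rolled back together, so the table is still empty\n",
"count = conn.execute('SELECT count(*) FROM countries').fetchone()[0]\n",
"print(count)\n",
"conn.close()\n",
"```\n",
"\n",
"Note that the `with` block only manages the transaction; it does not close the connection, so `close()` is still needed."
]
},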
{
"cell_type": "markdown",
"id": "b6bcca09",
"metadata": {},
"source": [
"# Loading SQL data into Pandas\n",
"\n",
"Regardless of the database type, if we have a connection object that follows Python's database API conventions (DB-API), Pandas can run an SQL query through that connection and fetch its results as a DataFrame.\n",
"\n",
"It does not have to be SQLite; it could just as well be a connection to Oracle, SQL Server, MySQL or PostgreSQL. Note that for databases other than SQLite the Pandas documentation recommends passing an SQLAlchemy connectable rather than a raw driver connection."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "962f96f2",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "85065f89",
"metadata": {},
"outputs": [],
"source": [
"connection = sql.connect('hr.db')"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "9120e8bf",
"metadata": {},
"outputs": [],
"source": [
"employees = pd.read_sql('SELECT * FROM employees', connection)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "4df89c59",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>employee_id</th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>email</th>\n",
" <th>phone_number</th>\n",
" <th>hire_date</th>\n",
" <th>job_id</th>\n",
" <th>salary</th>\n",
" <th>commission_pct</th>\n",
" <th>manager_id</th>\n",
" <th>department_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>100</td>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>SKING</td>\n",
" <td>515.123.4567</td>\n",
" <td>1987-06-17</td>\n",
" <td>AD_PRES</td>\n",
" <td>24000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>90.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>101</td>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>NKOCHHAR</td>\n",
" <td>515.123.4568</td>\n",
" <td>1989-09-21</td>\n",
" <td>AD_VP</td>\n",
" <td>17000</td>\n",
" <td>NaN</td>\n",
" <td>100.0</td>\n",
" <td>90.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>102</td>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>LDEHAAN</td>\n",
" <td>515.123.4569</td>\n",
" <td>1993-01-13</td>\n",
" <td>AD_VP</td>\n",
" <td>17000</td>\n",
" <td>NaN</td>\n",
" <td>100.0</td>\n",
" <td>90.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>103</td>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>AHUNOLD</td>\n",
" <td>590.423.4567</td>\n",
" <td>1990-01-03</td>\n",
" <td>IT_PROG</td>\n",
" <td>9000</td>\n",
" <td>NaN</td>\n",
" <td>102.0</td>\n",
" <td>60.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>104</td>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>BERNST</td>\n",
" <td>590.423.4568</td>\n",
" <td>1991-05-21</td>\n",
" <td>IT_PROG</td>\n",
" <td>6000</td>\n",
" <td>NaN</td>\n",
" <td>103.0</td>\n",
" <td>60.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>202</td>\n",
" <td>Pat</td>\n",
" <td>Fay</td>\n",
" <td>PFAY</td>\n",
" <td>603.123.6666</td>\n",
" <td>1997-08-17</td>\n",
" <td>MK_REP</td>\n",
" <td>6000</td>\n",
" <td>NaN</td>\n",
" <td>201.0</td>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>203</td>\n",
" <td>Susan</td>\n",
" <td>Mavris</td>\n",
" <td>SMAVRIS</td>\n",
" <td>515.123.7777</td>\n",
" <td>1994-06-07</td>\n",
" <td>HR_REP</td>\n",
" <td>6500</td>\n",
" <td>NaN</td>\n",
" <td>101.0</td>\n",
" <td>40.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>204</td>\n",
" <td>Hermann</td>\n",
" <td>Baer</td>\n",
" <td>HBAER</td>\n",
" <td>515.123.8888</td>\n",
" <td>1994-06-07</td>\n",
" <td>PR_REP</td>\n",
" <td>10000</td>\n",
" <td>NaN</td>\n",
" <td>101.0</td>\n",
" <td>70.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>105</th>\n",
" <td>205</td>\n",
" <td>Shelley</td>\n",
" <td>Higgins</td>\n",
" <td>SHIGGINS</td>\n",
" <td>515.123.8080</td>\n",
" <td>1994-06-07</td>\n",
" <td>AC_MGR</td>\n",
" <td>12000</td>\n",
" <td>NaN</td>\n",
" <td>101.0</td>\n",
" <td>110.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106</th>\n",
" <td>206</td>\n",
" <td>William</td>\n",
" <td>Gietz</td>\n",
" <td>WGIETZ</td>\n",
" <td>515.123.8181</td>\n",
" <td>1994-06-07</td>\n",
" <td>AC_ACCOUNT</td>\n",
" <td>8300</td>\n",
" <td>NaN</td>\n",
" <td>205.0</td>\n",
" <td>110.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>107 rows × 11 columns</p>\n",
"</div>"
],
"text/plain": [
" employee_id first_name last_name email phone_number hire_date \n",
"0 100 Steven King SKING 515.123.4567 1987-06-17 \\\n",
"1 101 Neena Kochhar NKOCHHAR 515.123.4568 1989-09-21 \n",
"2 102 Lex De Haan LDEHAAN 515.123.4569 1993-01-13 \n",
"3 103 Alexander Hunold AHUNOLD 590.423.4567 1990-01-03 \n",
"4 104 Bruce Ernst BERNST 590.423.4568 1991-05-21 \n",
".. ... ... ... ... ... ... \n",
"102 202 Pat Fay PFAY 603.123.6666 1997-08-17 \n",
"103 203 Susan Mavris SMAVRIS 515.123.7777 1994-06-07 \n",
"104 204 Hermann Baer HBAER 515.123.8888 1994-06-07 \n",
"105 205 Shelley Higgins SHIGGINS 515.123.8080 1994-06-07 \n",
"106 206 William Gietz WGIETZ 515.123.8181 1994-06-07 \n",
"\n",
" job_id salary commission_pct manager_id department_id \n",
"0 AD_PRES 24000 NaN NaN 90.0 \n",
"1 AD_VP 17000 NaN 100.0 90.0 \n",
"2 AD_VP 17000 NaN 100.0 90.0 \n",
"3 IT_PROG 9000 NaN 102.0 60.0 \n",
"4 IT_PROG 6000 NaN 103.0 60.0 \n",
".. ... ... ... ... ... \n",
"102 MK_REP 6000 NaN 201.0 20.0 \n",
"103 HR_REP 6500 NaN 101.0 40.0 \n",
"104 PR_REP 10000 NaN 101.0 70.0 \n",
"105 AC_MGR 12000 NaN 101.0 110.0 \n",
"106 AC_ACCOUNT 8300 NaN 205.0 110.0 \n",
"\n",
"[107 rows x 11 columns]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"employees"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "a851941a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"employee_id int64\n",
"first_name object\n",
"last_name object\n",
"email object\n",
"phone_number object\n",
"hire_date object\n",
"job_id object\n",
"salary int64\n",
"commission_pct float64\n",
"manager_id float64\n",
"department_id float64\n",
"dtype: object"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"employees.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "b7c6b1aa",
"metadata": {},
"outputs": [],
"source": [
"kod = 'SELECT * FROM employees LEFT JOIN departments USING(department_id) LEFT JOIN locations USING(location_id) ORDER BY employee_id'\n",
"employees = pd.read_sql(kod, connection, index_col='employee_id', parse_dates=['hire_date'])"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "769781a8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>email</th>\n",
" <th>phone_number</th>\n",
" <th>hire_date</th>\n",
" <th>job_id</th>\n",
" <th>salary</th>\n",
" <th>commission_pct</th>\n",
" <th>manager_id</th>\n",
" <th>department_id</th>\n",
" <th>department_name</th>\n",
" <th>location_id</th>\n",
" <th>street_address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>state_province</th>\n",
" <th>country_id</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>SKING</td>\n",
" <td>515.123.4567</td>\n",
" <td>1987-06-17</td>\n",
" <td>AD_PRES</td>\n",
" <td>24000</td>\n",
" <td>NaN</td>\n",
" <td>100.0</td>\n",
" <td>90.0</td>\n",
" <td>Executive</td>\n",
" <td>1700.0</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>Washington</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>NKOCHHAR</td>\n",
" <td>515.123.4568</td>\n",
" <td>1989-09-21</td>\n",
" <td>AD_VP</td>\n",
" <td>17000</td>\n",
" <td>NaN</td>\n",
" <td>100.0</td>\n",
" <td>90.0</td>\n",
" <td>Executive</td>\n",
" <td>1700.0</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>Washington</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>LDEHAAN</td>\n",
" <td>515.123.4569</td>\n",
" <td>1993-01-13</td>\n",
" <td>AD_VP</td>\n",
" <td>17000</td>\n",
" <td>NaN</td>\n",
" <td>100.0</td>\n",
" <td>90.0</td>\n",
" <td>Executive</td>\n",
" <td>1700.0</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>Washington</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>AHUNOLD</td>\n",
" <td>590.423.4567</td>\n",
" <td>1990-01-03</td>\n",
" <td>IT_PROG</td>\n",
" <td>9000</td>\n",
" <td>NaN</td>\n",
" <td>103.0</td>\n",
" <td>60.0</td>\n",
" <td>IT</td>\n",
" <td>1400.0</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>Texas</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>BERNST</td>\n",
" <td>590.423.4568</td>\n",
" <td>1991-05-21</td>\n",
" <td>IT_PROG</td>\n",
" <td>6000</td>\n",
" <td>NaN</td>\n",
" <td>103.0</td>\n",
" <td>60.0</td>\n",
" <td>IT</td>\n",
" <td>1400.0</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>Texas</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>Pat</td>\n",
" <td>Fay</td>\n",
" <td>PFAY</td>\n",
" <td>603.123.6666</td>\n",
" <td>1997-08-17</td>\n",
" <td>MK_REP</td>\n",
" <td>6000</td>\n",
" <td>NaN</td>\n",
" <td>201.0</td>\n",
" <td>20.0</td>\n",
" <td>Marketing</td>\n",
" <td>1800.0</td>\n",
" <td>147 Spadina Ave</td>\n",
" <td>M5V 2L7</td>\n",
" <td>Toronto</td>\n",
" <td>Ontario</td>\n",
" <td>CA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>Susan</td>\n",
" <td>Mavris</td>\n",
" <td>SMAVRIS</td>\n",
" <td>515.123.7777</td>\n",
" <td>1994-06-07</td>\n",
" <td>HR_REP</td>\n",
" <td>6500</td>\n",
" <td>NaN</td>\n",
" <td>203.0</td>\n",
" <td>40.0</td>\n",
" <td>Human Resources</td>\n",
" <td>2400.0</td>\n",
" <td>8204 Arthur St</td>\n",
" <td>None</td>\n",
" <td>London</td>\n",
" <td>None</td>\n",
" <td>UK</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>Hermann</td>\n",
" <td>Baer</td>\n",
" <td>HBAER</td>\n",
" <td>515.123.8888</td>\n",
" <td>1994-06-07</td>\n",
" <td>PR_REP</td>\n",
" <td>10000</td>\n",
" <td>NaN</td>\n",
" <td>204.0</td>\n",
" <td>70.0</td>\n",
" <td>Public Relations</td>\n",
" <td>2700.0</td>\n",
" <td>Schwanthalerstr. 7031</td>\n",
" <td>80925</td>\n",
" <td>Munich</td>\n",
" <td>Bavaria</td>\n",
" <td>DE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>Shelley</td>\n",
" <td>Higgins</td>\n",
" <td>SHIGGINS</td>\n",
" <td>515.123.8080</td>\n",
" <td>1994-06-07</td>\n",
" <td>AC_MGR</td>\n",
" <td>12000</td>\n",
" <td>NaN</td>\n",
" <td>205.0</td>\n",
" <td>110.0</td>\n",
" <td>Accounting</td>\n",
" <td>1700.0</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>Washington</td>\n",
" <td>US</td>\n",
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>William</td>\n",
" <td>Gietz</td>\n",
" <td>WGIETZ</td>\n",
" <td>515.123.8181</td>\n",
" <td>1994-06-07</td>\n",
" <td>AC_ACCOUNT</td>\n",
" <td>8300</td>\n",
" <td>NaN</td>\n",
" <td>205.0</td>\n",
" <td>110.0</td>\n",
" <td>Accounting</td>\n",
" <td>1700.0</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>Washington</td>\n",
" <td>US</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>107 rows × 17 columns</p>\n",
"</div>"
],
"text/plain": [
" first_name last_name email phone_number hire_date \n",
"employee_id \n",
"100 Steven King SKING 515.123.4567 1987-06-17 \\\n",
"101 Neena Kochhar NKOCHHAR 515.123.4568 1989-09-21 \n",
"102 Lex De Haan LDEHAAN 515.123.4569 1993-01-13 \n",
"103 Alexander Hunold AHUNOLD 590.423.4567 1990-01-03 \n",
"104 Bruce Ernst BERNST 590.423.4568 1991-05-21 \n",
"... ... ... ... ... ... \n",
"202 Pat Fay PFAY 603.123.6666 1997-08-17 \n",
"203 Susan Mavris SMAVRIS 515.123.7777 1994-06-07 \n",
"204 Hermann Baer HBAER 515.123.8888 1994-06-07 \n",
"205 Shelley Higgins SHIGGINS 515.123.8080 1994-06-07 \n",
"206 William Gietz WGIETZ 515.123.8181 1994-06-07 \n",
"\n",
" job_id salary commission_pct manager_id department_id \n",
"employee_id \n",
"100 AD_PRES 24000 NaN 100.0 90.0 \\\n",
"101 AD_VP 17000 NaN 100.0 90.0 \n",
"102 AD_VP 17000 NaN 100.0 90.0 \n",
"103 IT_PROG 9000 NaN 103.0 60.0 \n",
"104 IT_PROG 6000 NaN 103.0 60.0 \n",
"... ... ... ... ... ... \n",
"202 MK_REP 6000 NaN 201.0 20.0 \n",
"203 HR_REP 6500 NaN 203.0 40.0 \n",
"204 PR_REP 10000 NaN 204.0 70.0 \n",
"205 AC_MGR 12000 NaN 205.0 110.0 \n",
"206 AC_ACCOUNT 8300 NaN 205.0 110.0 \n",
"\n",
" department_name location_id street_address postal_code \n",
"employee_id \n",
"100 Executive 1700.0 2004 Charade Rd 98199 \\\n",
"101 Executive 1700.0 2004 Charade Rd 98199 \n",
"102 Executive 1700.0 2004 Charade Rd 98199 \n",
"103 IT 1400.0 2014 Jabberwocky Rd 26192 \n",
"104 IT 1400.0 2014 Jabberwocky Rd 26192 \n",
"... ... ... ... ... \n",
"202 Marketing 1800.0 147 Spadina Ave M5V 2L7 \n",
"203 Human Resources 2400.0 8204 Arthur St None \n",
"204 Public Relations 2700.0 Schwanthalerstr. 7031 80925 \n",
"205 Accounting 1700.0 2004 Charade Rd 98199 \n",
"206 Accounting 1700.0 2004 Charade Rd 98199 \n",
"\n",
" city state_province country_id \n",
"employee_id \n",
"100 Seattle Washington US \n",
"101 Seattle Washington US \n",
"102 Seattle Washington US \n",
"103 Southlake Texas US \n",
"104 Southlake Texas US \n",
"... ... ... ... \n",
"202 Toronto Ontario CA \n",
"203 London None UK \n",
"204 Munich Bavaria DE \n",
"205 Seattle Washington US \n",
"206 Seattle Washington US \n",
"\n",
"[107 rows x 17 columns]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"employees"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "e8daacd3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"first_name object\n",
"last_name object\n",
"email object\n",
"phone_number object\n",
"hire_date datetime64[ns]\n",
"job_id object\n",
"salary int64\n",
"commission_pct float64\n",
"manager_id float64\n",
"department_id float64\n",
"department_name object\n",
"location_id float64\n",
"street_address object\n",
"postal_code object\n",
"city object\n",
"state_province object\n",
"country_id object\n",
"dtype: object"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"employees.dtypes"
]
},
{
"cell_type": "markdown",
"id": "aae0fa3e",
"metadata": {},
"source": [
"Any SQL query can be executed, including, for example, one with grouping."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "1aeaddaf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count(employee_id)</th>\n",
" <th>avg(salary)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>department_name</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>None</th>\n",
" <td>1</td>\n",
" <td>7000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Accounting</th>\n",
" <td>2</td>\n",
" <td>10150.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Administration</th>\n",
" <td>1</td>\n",
" <td>4400.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Executive</th>\n",
" <td>3</td>\n",
" <td>19333.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Finance</th>\n",
" <td>6</td>\n",
" <td>8600.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Human Resources</th>\n",
" <td>1</td>\n",
" <td>6500.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>IT</th>\n",
" <td>5</td>\n",
" <td>5760.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Marketing</th>\n",
" <td>2</td>\n",
" <td>9500.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Public Relations</th>\n",
" <td>1</td>\n",
" <td>10000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Purchasing</th>\n",
" <td>6</td>\n",
" <td>4150.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sales</th>\n",
" <td>34</td>\n",
" <td>8955.882353</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Shipping</th>\n",
" <td>45</td>\n",
" <td>3475.555556</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" count(employee_id) avg(salary)\n",
"department_name \n",
"None 1 7000.000000\n",
"Accounting 2 10150.000000\n",
"Administration 1 4400.000000\n",
"Executive 3 19333.333333\n",
"Finance 6 8600.000000\n",
"Human Resources 1 6500.000000\n",
"IT 5 5760.000000\n",
"Marketing 2 9500.000000\n",
"Public Relations 1 10000.000000\n",
"Purchasing 6 4150.000000\n",
"Sales 34 8955.882353\n",
"Shipping 45 3475.555556"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"deps = pd.read_sql('SELECT department_name, count(employee_id), avg(salary) FROM employees '\n",
" 'LEFT JOIN departments USING(department_id) '\n",
" 'GROUP BY department_id, department_name '\n",
" 'ORDER BY department_name', connection, index_col='department_name')\n",
"deps"
]
},
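{
"cell_type": "markdown",
"id": "c0ffee03",
"metadata": {},
"source": [
"The placeholder mechanism of `execute` is also available in Pandas: `read_sql` accepts a `params` argument. A minimal sketch, assuming a small in-memory table rather than the real `hr.db`:\n",
"\n",
"```python\n",
"import sqlite3\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample data, used only for this illustration\n",
"conn = sqlite3.connect(':memory:')\n",
"conn.execute('CREATE TABLE employees (last_name TEXT, job_id TEXT, salary INTEGER)')\n",
"conn.executemany('INSERT INTO employees VALUES (?, ?, ?)',\n",
"                 [('Hunold', 'IT_PROG', 9000),\n",
"                  ('Ernst', 'IT_PROG', 6000),\n",
"                  ('King', 'AD_PRES', 24000)])\n",
"\n",
"# The value is bound by the driver, not pasted into the SQL text\n",
"programmers = pd.read_sql('SELECT * FROM employees WHERE job_id = ?',\n",
"                          conn, params=('IT_PROG',))\n",
"print(programmers)\n",
"conn.close()\n",
"```\n",
"\n",
"The placeholder style (`?` here) depends on the database driver; other drivers may expect a different marker."
]
},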
{
"cell_type": "markdown",
"id": "5777a6ab",
"metadata": {},
"source": [
"Analytical and computational tasks can be performed either in SQL or in memory using Pandas operations.\n",
"\n",
"The advantage of the second approach: we load the data only once, keep it in memory, and can then run various operations on it repeatedly. On the other hand, pre-processing the data in the SQL query can help with large tables.\n",
"\n",
"Expressions written in Pandas (`table[condition]`, `table.groupby('key')`, ...) **are not** automatically translated into SQL queries. Pandas works on a simpler principle: first you load the data into memory, and only then can you operate on it."
]
},
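{
"cell_type": "markdown",
"id": "c0ffee04",
"metadata": {},
"source": [
"To illustrate the two approaches, here is a small sketch that computes the average salary per job once in SQL and once in Pandas. The in-memory table and its three rows are made up for the example:\n",
"\n",
"```python\n",
"import sqlite3\n",
"import pandas as pd\n",
"\n",
"conn = sqlite3.connect(':memory:')\n",
"conn.execute('CREATE TABLE employees (job_id TEXT, salary INTEGER)')\n",
"conn.executemany('INSERT INTO employees VALUES (?, ?)',\n",
"                 [('IT_PROG', 9000), ('IT_PROG', 6000), ('AD_VP', 17000)])\n",
"\n",
"# Approach 1: the database aggregates, Pandas receives only the summary\n",
"in_sql = pd.read_sql('SELECT job_id, avg(salary) AS avg_salary FROM employees '\n",
"                     'GROUP BY job_id ORDER BY job_id',\n",
"                     conn, index_col='job_id')['avg_salary']\n",
"\n",
"# Approach 2: load the raw rows once, then aggregate in memory\n",
"raw = pd.read_sql('SELECT job_id, salary FROM employees', conn)\n",
"in_pandas = raw.groupby('job_id')['salary'].mean()\n",
"\n",
"print(in_sql)\n",
"print(in_pandas)\n",
"conn.close()\n",
"```\n",
"\n",
"Both variants yield the same averages; the difference is where the work happens and how much data has to be transferred."
]
},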
{
"cell_type": "code",
"execution_count": 34,
"id": "46df69e5",
"metadata": {},
"outputs": [],
"source": [
"connection.close()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "markdown",
"id": "ce796b23",
"metadata": {},
"source": [
"JSON (JavaScript Object Notation) is a format that has recently become very popular for exchanging data with web services. Many services available on the internet, e.g. online payments or access to social networks, expose a programming interface (\"API\") based on JSON requests.\n",
"\n",
"Data is also often available in this format. Pandas (and Python in general) make it possible to fetch such data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "23a6855a",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "82c818a7",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_json(\"https://api.nbp.pl/api/exchangerates/rates/A/EUR/2021-10-01/2021-12-31/?format=json\")"
]
},
{
"cell_type": "markdown",
"id": "a8683274",
"metadata": {},
"source": [
"However, the object-like / \"tree\" structure of JSON data does not always fit well into flat tables such as a DataFrame...\n",
"\n",
"In the `rates` column the values are Python dictionaries."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "de1714fc",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>table</th>\n",
" <th>currency</th>\n",
" <th>code</th>\n",
" <th>rates</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '191/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '192/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '193/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '194/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '195/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '196/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '197/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '198/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '199/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>A</td>\n",
" <td>euro</td>\n",
" <td>EUR</td>\n",
" <td>{'no': '200/A/NBP/2021', 'effectiveDate': '202...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" table currency code rates\n",
"0 A euro EUR {'no': '191/A/NBP/2021', 'effectiveDate': '202...\n",
"1 A euro EUR {'no': '192/A/NBP/2021', 'effectiveDate': '202...\n",
"2 A euro EUR {'no': '193/A/NBP/2021', 'effectiveDate': '202...\n",
"3 A euro EUR {'no': '194/A/NBP/2021', 'effectiveDate': '202...\n",
"4 A euro EUR {'no': '195/A/NBP/2021', 'effectiveDate': '202...\n",
"5 A euro EUR {'no': '196/A/NBP/2021', 'effectiveDate': '202...\n",
"6 A euro EUR {'no': '197/A/NBP/2021', 'effectiveDate': '202...\n",
"7 A euro EUR {'no': '198/A/NBP/2021', 'effectiveDate': '202...\n",
"8 A euro EUR {'no': '199/A/NBP/2021', 'effectiveDate': '202...\n",
"9 A euro EUR {'no': '200/A/NBP/2021', 'effectiveDate': '202..."
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1f99410f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"table object\n",
"currency object\n",
"code object\n",
"rates object\n",
"dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b58e9299",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'no': '191/A/NBP/2021', 'effectiveDate': '2021-10-01', 'mid': 4.5941}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[0, 'rates']"
]
},
{
"cell_type": "markdown",
"id": "8aec9fe6",
"metadata": {},
"source": [
"Przygotwuję nowy DF, w którym wartościami są pola mid z tych słowników, a indeksem są wartości pól effectiveDate"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "89be33ed",
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame(df.rates.apply(lambda rekord: rekord[\"mid\"]))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8d83f47f",
"metadata": {},
"outputs": [],
"source": [
"df2.index = df.rates.apply(lambda rekord: rekord[\"effectiveDate\"])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "e6172f22",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>rates</th>\n",
" </tr>\n",
" <tr>\n",
" <th>rates</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2021-10-01</th>\n",
" <td>4.5941</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-04</th>\n",
" <td>4.5716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-05</th>\n",
" <td>4.6034</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-06</th>\n",
" <td>4.6203</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-07</th>\n",
" <td>4.5472</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-27</th>\n",
" <td>4.6239</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-28</th>\n",
" <td>4.6028</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-29</th>\n",
" <td>4.5997</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-30</th>\n",
" <td>4.5915</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-31</th>\n",
" <td>4.5994</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>64 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" rates\n",
"rates \n",
"2021-10-01 4.5941\n",
"2021-10-04 4.5716\n",
"2021-10-05 4.6034\n",
"2021-10-06 4.6203\n",
"2021-10-07 4.5472\n",
"... ...\n",
"2021-12-27 4.6239\n",
"2021-12-28 4.6028\n",
"2021-12-29 4.5997\n",
"2021-12-30 4.5915\n",
"2021-12-31 4.5994\n",
"\n",
"[64 rows x 1 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8263ce65",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data</th>\n",
" <th>nr_tabeli</th>\n",
" <th>kurs</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2021-10-01</td>\n",
" <td>191/A/NBP/2021</td>\n",
" <td>4.5941</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2021-10-04</td>\n",
" <td>192/A/NBP/2021</td>\n",
" <td>4.5716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2021-10-05</td>\n",
" <td>193/A/NBP/2021</td>\n",
" <td>4.6034</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2021-10-06</td>\n",
" <td>194/A/NBP/2021</td>\n",
" <td>4.6203</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2021-10-07</td>\n",
" <td>195/A/NBP/2021</td>\n",
" <td>4.5472</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>2021-12-27</td>\n",
" <td>250/A/NBP/2021</td>\n",
" <td>4.6239</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>2021-12-28</td>\n",
" <td>251/A/NBP/2021</td>\n",
" <td>4.6028</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>2021-12-29</td>\n",
" <td>252/A/NBP/2021</td>\n",
" <td>4.5997</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>2021-12-30</td>\n",
" <td>253/A/NBP/2021</td>\n",
" <td>4.5915</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>2021-12-31</td>\n",
" <td>254/A/NBP/2021</td>\n",
" <td>4.5994</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>64 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" data nr_tabeli kurs\n",
"0 2021-10-01 191/A/NBP/2021 4.5941\n",
"1 2021-10-04 192/A/NBP/2021 4.5716\n",
"2 2021-10-05 193/A/NBP/2021 4.6034\n",
"3 2021-10-06 194/A/NBP/2021 4.6203\n",
"4 2021-10-07 195/A/NBP/2021 4.5472\n",
".. ... ... ...\n",
"59 2021-12-27 250/A/NBP/2021 4.6239\n",
"60 2021-12-28 251/A/NBP/2021 4.6028\n",
"61 2021-12-29 252/A/NBP/2021 4.5997\n",
"62 2021-12-30 253/A/NBP/2021 4.5915\n",
"63 2021-12-31 254/A/NBP/2021 4.5994\n",
"\n",
"[64 rows x 3 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3 = pd.DataFrame({\"data\" : df.rates.apply(lambda rekord: rekord[\"effectiveDate\"]),\n",
" \"nr_tabeli\": df.rates.apply(lambda rekord: rekord[\"no\"]),\n",
" \"kurs\" : df.rates.apply(lambda rekord: rekord[\"mid\"]),\n",
" })\n",
"df3"
]
},
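{
"cell_type": "markdown",
"id": "c4d1a7e2",
"metadata": {},
"source": [
"The repeated `apply` calls above can also be replaced by `pd.json_normalize`, which turns a list of dictionaries into columns. This is only a minimal sketch: `payload` is a hand-written sample that mimics the shape of the response shown earlier (values copied from the outputs above), and `rates_flat` is a name introduced just for this illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample shaped like the API response used in this notebook.\n",
"payload = {\n",
"    'table': 'A', 'currency': 'euro', 'code': 'EUR',\n",
"    'rates': [\n",
"        {'no': '191/A/NBP/2021', 'effectiveDate': '2021-10-01', 'mid': 4.5941},\n",
"        {'no': '192/A/NBP/2021', 'effectiveDate': '2021-10-04', 'mid': 4.5716},\n",
"    ],\n",
"}\n",
"\n",
"# One row per dictionary in the list, one column per key.\n",
"rates_flat = pd.json_normalize(payload['rates'])\n",
"\n",
"# Use the parsed dates as the index, as done step by step in this notebook.\n",
"rates_flat['effectiveDate'] = pd.to_datetime(rates_flat['effectiveDate'])\n",
"rates_flat = rates_flat.set_index('effectiveDate')\n",
"print(rates_flat)\n",
"```\n",
"\n",
"With the real data frame the same idea would be `pd.json_normalize(df['rates'].tolist())`."
]
},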
{
"cell_type": "code",
"execution_count": 10,
"id": "26d7332f",
"metadata": {},
"outputs": [],
"source": [
"df3.index = pd.to_datetime(df3.data)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "58fd9bcd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data</th>\n",
" <th>nr_tabeli</th>\n",
" <th>kurs</th>\n",
" </tr>\n",
" <tr>\n",
" <th>data</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2021-10-01</th>\n",
" <td>2021-10-01</td>\n",
" <td>191/A/NBP/2021</td>\n",
" <td>4.5941</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-04</th>\n",
" <td>2021-10-04</td>\n",
" <td>192/A/NBP/2021</td>\n",
" <td>4.5716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-05</th>\n",
" <td>2021-10-05</td>\n",
" <td>193/A/NBP/2021</td>\n",
" <td>4.6034</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-06</th>\n",
" <td>2021-10-06</td>\n",
" <td>194/A/NBP/2021</td>\n",
" <td>4.6203</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-07</th>\n",
" <td>2021-10-07</td>\n",
" <td>195/A/NBP/2021</td>\n",
" <td>4.5472</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-27</th>\n",
" <td>2021-12-27</td>\n",
" <td>250/A/NBP/2021</td>\n",
" <td>4.6239</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-28</th>\n",
" <td>2021-12-28</td>\n",
" <td>251/A/NBP/2021</td>\n",
" <td>4.6028</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-29</th>\n",
" <td>2021-12-29</td>\n",
" <td>252/A/NBP/2021</td>\n",
" <td>4.5997</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-30</th>\n",
" <td>2021-12-30</td>\n",
" <td>253/A/NBP/2021</td>\n",
" <td>4.5915</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-31</th>\n",
" <td>2021-12-31</td>\n",
" <td>254/A/NBP/2021</td>\n",
" <td>4.5994</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>64 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" data nr_tabeli kurs\n",
"data \n",
"2021-10-01 2021-10-01 191/A/NBP/2021 4.5941\n",
"2021-10-04 2021-10-04 192/A/NBP/2021 4.5716\n",
"2021-10-05 2021-10-05 193/A/NBP/2021 4.6034\n",
"2021-10-06 2021-10-06 194/A/NBP/2021 4.6203\n",
"2021-10-07 2021-10-07 195/A/NBP/2021 4.5472\n",
"... ... ... ...\n",
"2021-12-27 2021-12-27 250/A/NBP/2021 4.6239\n",
"2021-12-28 2021-12-28 251/A/NBP/2021 4.6028\n",
"2021-12-29 2021-12-29 252/A/NBP/2021 4.5997\n",
"2021-12-30 2021-12-30 253/A/NBP/2021 4.5915\n",
"2021-12-31 2021-12-31 254/A/NBP/2021 4.5994\n",
"\n",
"[64 rows x 3 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "34992328",
"metadata": {},
"outputs": [],
"source": [
"df3.drop(columns='data', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "57c70395",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['2021-10-01', '2021-10-04', '2021-10-05', '2021-10-06',\n",
" '2021-10-07', '2021-10-08', '2021-10-11', '2021-10-12',\n",
" '2021-10-13', '2021-10-14', '2021-10-15', '2021-10-18',\n",
" '2021-10-19', '2021-10-20', '2021-10-21', '2021-10-22',\n",
" '2021-10-25', '2021-10-26', '2021-10-27', '2021-10-28',\n",
" '2021-10-29', '2021-11-02', '2021-11-03', '2021-11-04',\n",
" '2021-11-05', '2021-11-08', '2021-11-09', '2021-11-10',\n",
" '2021-11-12', '2021-11-15', '2021-11-16', '2021-11-17',\n",
" '2021-11-18', '2021-11-19', '2021-11-22', '2021-11-23',\n",
" '2021-11-24', '2021-11-25', '2021-11-26', '2021-11-29',\n",
" '2021-11-30', '2021-12-01', '2021-12-02', '2021-12-03',\n",
" '2021-12-06', '2021-12-07', '2021-12-08', '2021-12-09',\n",
" '2021-12-10', '2021-12-13', '2021-12-14', '2021-12-15',\n",
" '2021-12-16', '2021-12-17', '2021-12-20', '2021-12-21',\n",
" '2021-12-22', '2021-12-23', '2021-12-24', '2021-12-27',\n",
" '2021-12-28', '2021-12-29', '2021-12-30', '2021-12-31'],\n",
" dtype='datetime64[ns]', name='data', freq=None)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3.index"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "624eab03",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>nr_tabeli</th>\n",
" <th>kurs</th>\n",
" </tr>\n",
" <tr>\n",
" <th>data</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2021-10-01</th>\n",
" <td>191/A/NBP/2021</td>\n",
" <td>4.5941</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-04</th>\n",
" <td>192/A/NBP/2021</td>\n",
" <td>4.5716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-05</th>\n",
" <td>193/A/NBP/2021</td>\n",
" <td>4.6034</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-06</th>\n",
" <td>194/A/NBP/2021</td>\n",
" <td>4.6203</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-10-07</th>\n",
" <td>195/A/NBP/2021</td>\n",
" <td>4.5472</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-27</th>\n",
" <td>250/A/NBP/2021</td>\n",
" <td>4.6239</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-28</th>\n",
" <td>251/A/NBP/2021</td>\n",
" <td>4.6028</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-29</th>\n",
" <td>252/A/NBP/2021</td>\n",
" <td>4.5997</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-30</th>\n",
" <td>253/A/NBP/2021</td>\n",
" <td>4.5915</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2021-12-31</th>\n",
" <td>254/A/NBP/2021</td>\n",
" <td>4.5994</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>64 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" nr_tabeli kurs\n",
"data \n",
"2021-10-01 191/A/NBP/2021 4.5941\n",
"2021-10-04 192/A/NBP/2021 4.5716\n",
"2021-10-05 193/A/NBP/2021 4.6034\n",
"2021-10-06 194/A/NBP/2021 4.6203\n",
"2021-10-07 195/A/NBP/2021 4.5472\n",
"... ... ...\n",
"2021-12-27 250/A/NBP/2021 4.6239\n",
"2021-12-28 251/A/NBP/2021 4.6028\n",
"2021-12-29 252/A/NBP/2021 4.5997\n",
"2021-12-30 253/A/NBP/2021 4.5915\n",
"2021-12-31 254/A/NBP/2021 4.5994\n",
"\n",
"[64 rows x 2 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "4f53324a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Axes: xlabel='data'>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1200x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df3.kurs.plot(figsize=(12,6))"
]
},
{
"cell_type": "markdown",
"id": "9b89e1cd",
"metadata": {},
"source": [
"## XML\n",
"\n",
"Analogicznie do JSON, ale tylko w najnowszych wersjach Pandas (od 1.3)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "61bcb65f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'2.0.1'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.__version__"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "e4e6539a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting lxml\n",
" Downloading lxml-4.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (7.1 MB)\n",
"\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.1/7.1 MB\u001b[0m \u001b[31m17.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0mm eta \u001b[36m0:00:01\u001b[0m0:01\u001b[0m02\u001b[0m\n",
"\u001b[?25hInstalling collected packages: lxml\n",
"Successfully installed lxml-4.9.2\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install lxml"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "03c3b8e8",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_xml(\"https://api.nbp.pl/api/exchangerates/rates/A/EUR/2021-10-01/2021-12-31/?format=xml\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "12a4b66c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Table</th>\n",
" <th>Currency</th>\n",
" <th>Code</th>\n",
" <th>Rate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>None</td>\n",
" <td>euro</td>\n",
" <td>None</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>EUR</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Table Currency Code Rate\n",
"0 A None None NaN\n",
"1 None euro None NaN\n",
"2 None None EUR NaN\n",
"3 None None None NaN"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"id": "716d74ba",
"metadata": {},
"source": [
"Nie do końca o to chodziło - nie ma jak dostać się do danych. Struktura wczytywanego XML-a była zbyt głęboka."
]
}
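,
{
"cell_type": "markdown",
"id": "d92b6f40",
"metadata": {},
"source": [
"One possible way around this is the `xpath` parameter of `read_xml`, which selects the elements that become rows. The sketch below runs on a hand-written document rather than the live service. Its element names are an assumption inferred from the column names in the output above and from the JSON field names, so the real response may differ. `xml_text` and `rates_xml` are names introduced only for this illustration.\n",
"\n",
"```python\n",
"from io import StringIO\n",
"\n",
"import pandas as pd\n",
"\n",
"# Hypothetical document; the element names are assumed, not taken from the API.\n",
"xml_text = '''<ExchangeRatesSeries>\n",
"  <Table>A</Table>\n",
"  <Currency>euro</Currency>\n",
"  <Code>EUR</Code>\n",
"  <Rates>\n",
"    <Rate><No>191/A/NBP/2021</No><EffectiveDate>2021-10-01</EffectiveDate><Mid>4.5941</Mid></Rate>\n",
"    <Rate><No>192/A/NBP/2021</No><EffectiveDate>2021-10-04</EffectiveDate><Mid>4.5716</Mid></Rate>\n",
"  </Rates>\n",
"</ExchangeRatesSeries>'''\n",
"\n",
"# Select the nested Rate elements instead of the direct children of the root.\n",
"# parser='etree' relies on the standard library, so lxml is not needed here.\n",
"rates_xml = pd.read_xml(StringIO(xml_text), xpath='.//Rate', parser='etree')\n",
"print(rates_xml)\n",
"```\n",
"\n",
"Each `Rate` element becomes one row, and its child elements become the columns."
]
}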
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "7be6e5fa",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "cff5bf82",
"metadata": {},
"source": [
"Wczytajmy plik `sales-records` używając domyślnych ustawień. Zmierzymy przy tym czas działania."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1d883757",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.58 s, sys: 226 ms, total: 1.8 s\n",
"Wall time: 1.8 s\n"
]
}
],
"source": [
"%time df = pd.read_csv('sales-records.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0676ce00",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Region</th>\n",
" <th>Country</th>\n",
" <th>Item Type</th>\n",
" <th>Sales Channel</th>\n",
" <th>Order Priority</th>\n",
" <th>Order Date</th>\n",
" <th>Order ID</th>\n",
" <th>Ship Date</th>\n",
" <th>Units Sold</th>\n",
" <th>Unit Price</th>\n",
" <th>Unit Cost</th>\n",
" <th>Total Revenue</th>\n",
" <th>Total Cost</th>\n",
" <th>Total Profit</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>South Africa</td>\n",
" <td>Fruits</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>7/27/2012</td>\n",
" <td>443368995</td>\n",
" <td>7/28/2012</td>\n",
" <td>1593</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>14862.69</td>\n",
" <td>11023.56</td>\n",
" <td>3839.13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Morocco</td>\n",
" <td>Clothes</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>9/14/2013</td>\n",
" <td>667593514</td>\n",
" <td>10/19/2013</td>\n",
" <td>4611</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>503890.08</td>\n",
" <td>165258.24</td>\n",
" <td>338631.84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Australia and Oceania</td>\n",
" <td>Papua New Guinea</td>\n",
" <td>Meat</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>5/15/2015</td>\n",
" <td>940995585</td>\n",
" <td>6/4/2015</td>\n",
" <td>360</td>\n",
" <td>421.89</td>\n",
" <td>364.69</td>\n",
" <td>151880.40</td>\n",
" <td>131288.40</td>\n",
" <td>20592.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Djibouti</td>\n",
" <td>Clothes</td>\n",
" <td>Offline</td>\n",
" <td>H</td>\n",
" <td>5/17/2017</td>\n",
" <td>880811536</td>\n",
" <td>7/2/2017</td>\n",
" <td>562</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>61415.36</td>\n",
" <td>20142.08</td>\n",
" <td>41273.28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Europe</td>\n",
" <td>Slovakia</td>\n",
" <td>Beverages</td>\n",
" <td>Offline</td>\n",
" <td>L</td>\n",
" <td>10/26/2016</td>\n",
" <td>174590194</td>\n",
" <td>12/4/2016</td>\n",
" <td>3973</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" <td>188518.85</td>\n",
" <td>126301.67</td>\n",
" <td>62217.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999995</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Senegal</td>\n",
" <td>Baby Food</td>\n",
" <td>Offline</td>\n",
" <td>L</td>\n",
" <td>11/6/2010</td>\n",
" <td>575470578</td>\n",
" <td>12/11/2010</td>\n",
" <td>3387</td>\n",
" <td>255.28</td>\n",
" <td>159.42</td>\n",
" <td>864633.36</td>\n",
" <td>539955.54</td>\n",
" <td>324677.82</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999996</th>\n",
" <td>Central America and the Caribbean</td>\n",
" <td>Panama</td>\n",
" <td>Office Supplies</td>\n",
" <td>Offline</td>\n",
" <td>C</td>\n",
" <td>1/12/2015</td>\n",
" <td>766942107</td>\n",
" <td>3/1/2015</td>\n",
" <td>4068</td>\n",
" <td>651.21</td>\n",
" <td>524.96</td>\n",
" <td>2649122.28</td>\n",
" <td>2135537.28</td>\n",
" <td>513585.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999997</th>\n",
" <td>Europe</td>\n",
" <td>Norway</td>\n",
" <td>Office Supplies</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>10/25/2011</td>\n",
" <td>685472047</td>\n",
" <td>12/5/2011</td>\n",
" <td>5266</td>\n",
" <td>651.21</td>\n",
" <td>524.96</td>\n",
" <td>3429271.86</td>\n",
" <td>2764439.36</td>\n",
" <td>664832.50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999998</th>\n",
" <td>Europe</td>\n",
" <td>Montenegro</td>\n",
" <td>Beverages</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>10/31/2010</td>\n",
" <td>946734225</td>\n",
" <td>12/8/2010</td>\n",
" <td>8551</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" <td>405744.95</td>\n",
" <td>271836.29</td>\n",
" <td>133908.66</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999999</th>\n",
" <td>Central America and the Caribbean</td>\n",
" <td>Nicaragua</td>\n",
" <td>Meat</td>\n",
" <td>Online</td>\n",
" <td>C</td>\n",
" <td>3/17/2015</td>\n",
" <td>886714971</td>\n",
" <td>4/8/2015</td>\n",
" <td>7519</td>\n",
" <td>421.89</td>\n",
" <td>364.69</td>\n",
" <td>3172190.91</td>\n",
" <td>2742104.11</td>\n",
" <td>430086.80</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1000000 rows × 14 columns</p>\n",
"</div>"
],
"text/plain": [
" Region Country Item Type \n",
"0 Sub-Saharan Africa South Africa Fruits \\\n",
"1 Middle East and North Africa Morocco Clothes \n",
"2 Australia and Oceania Papua New Guinea Meat \n",
"3 Sub-Saharan Africa Djibouti Clothes \n",
"4 Europe Slovakia Beverages \n",
"... ... ... ... \n",
"999995 Sub-Saharan Africa Senegal Baby Food \n",
"999996 Central America and the Caribbean Panama Office Supplies \n",
"999997 Europe Norway Office Supplies \n",
"999998 Europe Montenegro Beverages \n",
"999999 Central America and the Caribbean Nicaragua Meat \n",
"\n",
" Sales Channel Order Priority Order Date Order ID Ship Date \n",
"0 Offline M 7/27/2012 443368995 7/28/2012 \\\n",
"1 Online M 9/14/2013 667593514 10/19/2013 \n",
"2 Offline M 5/15/2015 940995585 6/4/2015 \n",
"3 Offline H 5/17/2017 880811536 7/2/2017 \n",
"4 Offline L 10/26/2016 174590194 12/4/2016 \n",
"... ... ... ... ... ... \n",
"999995 Offline L 11/6/2010 575470578 12/11/2010 \n",
"999996 Offline C 1/12/2015 766942107 3/1/2015 \n",
"999997 Online M 10/25/2011 685472047 12/5/2011 \n",
"999998 Offline M 10/31/2010 946734225 12/8/2010 \n",
"999999 Online C 3/17/2015 886714971 4/8/2015 \n",
"\n",
" Units Sold Unit Price Unit Cost Total Revenue Total Cost \n",
"0 1593 9.33 6.92 14862.69 11023.56 \\\n",
"1 4611 109.28 35.84 503890.08 165258.24 \n",
"2 360 421.89 364.69 151880.40 131288.40 \n",
"3 562 109.28 35.84 61415.36 20142.08 \n",
"4 3973 47.45 31.79 188518.85 126301.67 \n",
"... ... ... ... ... ... \n",
"999995 3387 255.28 159.42 864633.36 539955.54 \n",
"999996 4068 651.21 524.96 2649122.28 2135537.28 \n",
"999997 5266 651.21 524.96 3429271.86 2764439.36 \n",
"999998 8551 47.45 31.79 405744.95 271836.29 \n",
"999999 7519 421.89 364.69 3172190.91 2742104.11 \n",
"\n",
" Total Profit \n",
"0 3839.13 \n",
"1 338631.84 \n",
"2 20592.00 \n",
"3 41273.28 \n",
"4 62217.18 \n",
"... ... \n",
"999995 324677.82 \n",
"999996 513585.00 \n",
"999997 664832.50 \n",
"999998 133908.66 \n",
"999999 430086.80 \n",
"\n",
"[1000000 rows x 14 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a74f670a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Region object\n",
"Country object\n",
"Item Type object\n",
"Sales Channel object\n",
"Order Priority object\n",
"Order Date object\n",
"Order ID int64\n",
"Ship Date object\n",
"Units Sold int64\n",
"Unit Price float64\n",
"Unit Cost float64\n",
"Total Revenue float64\n",
"Total Cost float64\n",
"Total Profit float64\n",
"dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"id": "1668b33d",
"metadata": {},
"source": [
"Zmierzmy zajętość pamięci."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9b7feef2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"Region 8000000\n",
"Country 8000000\n",
"Item Type 8000000\n",
"Sales Channel 8000000\n",
"Order Priority 8000000\n",
"Order Date 8000000\n",
"Order ID 8000000\n",
"Ship Date 8000000\n",
"Units Sold 8000000\n",
"Unit Price 8000000\n",
"Unit Cost 8000000\n",
"Total Revenue 8000000\n",
"Total Cost 8000000\n",
"Total Profit 8000000\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage()"
]
},
{
"cell_type": "markdown",
"id": "658fe4d2",
"metadata": {},
"source": [
"Domyślnie `memory_udage` podaje pamięć zajmowaną **bezpośrednio** przez kolumnę. W przypadku typów liczbowych to jest poprawne, ale w przypadku typu `object`, stosowanego m.in. dla napisów, to nie obejmuje całej zajętej pamięci. Gdyż kolumna zawiera tylko wskaźniki, które prowadzą do obiektów znajdujących się poza tablicą.\n",
"\n",
"Aby policzyć również obiekty wskazywane z kolumn, trzeba dopisać `deep=True`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9acfb726",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"Region 72845056\n",
"Country 65904370\n",
"Item Type 65583558\n",
"Sales Channel 63500249\n",
"Order Priority 58000000\n",
"Order Date 65936701\n",
"Order ID 8000000\n",
"Ship Date 65936993\n",
"Units Sold 8000000\n",
"Unit Price 8000000\n",
"Unit Cost 8000000\n",
"Total Revenue 8000000\n",
"Total Cost 8000000\n",
"Total Profit 8000000\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e5b89e79",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"513707055"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True).sum()"
]
},
{
"cell_type": "markdown",
"id": "66cb0b85",
"metadata": {},
"source": [
"## Sposoby na oszczędzanie pamięci\n",
"\n",
"Stosując odpowiednie techniki możemy wczytywać znacznie większe zestawy danych, niż bylibyśmy w stanie przy uzyciu domyslnych opcji.\n",
"\n",
"W tym pliku zajmiemy się kwestią wyboru odp. typu dla kolumny. W następnym pliku dowiemy się, jak czytać tylko część danych."
]
},
{
"cell_type": "markdown",
"id": "d3f63fa5",
"metadata": {},
"source": [
"### Typy liczbowe\n",
"\n",
"Liczby są przechowywane bezpośrednio w komórkach i cała akolumna musi być jednakowego typu. Jeśli użyjemy typu o mniejszym zakresie (np. `int16` zamiast `int64`) to kolumna zajmie mniej pamięci, ale będzie obsługiwać mniejszy zakres liczbowy.\n",
"\n",
"Domyślnie Pandas używa domyślnie typów 64-bitowych.\n",
"Patrząc na dane, jakie mamy w kolumnie, możemy ustalić dla konkretnych danych jaki typ będzie potrzebny."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "2d55dcd0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(100001180, 999999892)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Order ID'].min(), df['Order ID'].max()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "fad5635e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1, 10000)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Units Sold'].min(), df['Units Sold'].max()"
]
},
{
"cell_type": "markdown",
"id": "09a3b4c6",
"metadata": {},
"source": [
"Wystarczą typy `int32` (zakres do 2 mld) oraz `int16` (zakres 32 tys)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "67c30312",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2147483647, 32767)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"2**31-1 , 2**15-1"
]
},
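{
"cell_type": "markdown",
"id": "5e0b83a9",
"metadata": {},
"source": [
"The limits can also be looked up with `np.iinfo`, and `pd.to_numeric` with `downcast='integer'` can choose the smallest fitting type automatically. A minimal sketch, assuming two tiny series (`units`, `order_ids`, both hypothetical) that hold only the extremes found above:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Value ranges supported by the candidate integer types.\n",
"for int_type in (np.int16, np.int32, np.int64):\n",
"    info = np.iinfo(int_type)\n",
"    print(int_type.__name__, info.min, info.max)\n",
"\n",
"# Hypothetical samples holding the extremes seen in the two columns above.\n",
"units = pd.Series([1, 10000])\n",
"order_ids = pd.Series([100001180, 999999892])\n",
"\n",
"# downcast='integer' picks the smallest signed integer type that fits the data.\n",
"print(pd.to_numeric(units, downcast='integer').dtype)      # int16\n",
"print(pd.to_numeric(order_ids, downcast='integer').dtype)  # int32\n",
"```\n",
"\n",
"The downcast looks only at the values currently present, so a type chosen this way may turn out too narrow for data appended later."
]
},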
{
"cell_type": "markdown",
"id": "11a48f59",
"metadata": {},
"source": [
"Operacja `astype` dokonuje konwersji całej serii (kolumny) na podany typ.\n",
"Typ można podać bezpośrednio (z biblioteki numpy) albo jako 'nazwę'."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "9cb8c8d3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 443368995\n",
"1 667593514\n",
"2 940995585\n",
"3 880811536\n",
"4 174590194\n",
" ... \n",
"999995 575470578\n",
"999996 766942107\n",
"999997 685472047\n",
"999998 946734225\n",
"999999 886714971\n",
"Name: Order ID, Length: 1000000, dtype: int32"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Order ID'].astype('int32')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "18c794eb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 443368995\n",
"1 667593514\n",
"2 940995585\n",
"3 880811536\n",
"4 174590194\n",
" ... \n",
"999995 575470578\n",
"999996 766942107\n",
"999997 685472047\n",
"999998 946734225\n",
"999999 886714971\n",
"Name: Order ID, Length: 1000000, dtype: int32"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Order ID'].astype(np.int32)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ffaa3d67",
"metadata": {},
"outputs": [],
"source": [
"df['Order ID'] = df['Order ID'].astype('int32')\n",
"df['Units Sold'] = df['Units Sold'].astype('int16')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "077c26b2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Region object\n",
"Country object\n",
"Item Type object\n",
"Sales Channel object\n",
"Order Priority object\n",
"Order Date object\n",
"Order ID int32\n",
"Ship Date object\n",
"Units Sold int16\n",
"Unit Price float64\n",
"Unit Cost float64\n",
"Total Revenue float64\n",
"Total Cost float64\n",
"Total Profit float64\n",
"dtype: object"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
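{
"cell_type": "markdown",
"id": "8cbee249",
"metadata": {},
"source": [
"For floating-point numbers, the size of the type affects precision, i.e. the number of significant digits: `float32` keeps about 7 significant decimal digits, `float64` about 15-16."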
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "7243af12",
"metadata": {},
"outputs": [],
"source": [
"df['Unit Price'] = df['Unit Price'].astype('float32')\n",
"df['Unit Cost'] = df['Unit Cost'].astype('float32')"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "87dc47d5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Region object\n",
"Country object\n",
"Item Type object\n",
"Sales Channel object\n",
"Order Priority object\n",
"Order Date object\n",
"Order ID int32\n",
"Ship Date object\n",
"Units Sold int16\n",
"Unit Price float32\n",
"Unit Cost float32\n",
"Total Revenue float64\n",
"Total Cost float64\n",
"Total Profit float64\n",
"dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "7e1b70fe",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"495707055"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True).sum()"
]
},
{
"cell_type": "markdown",
"id": "000ac737",
"metadata": {},
"source": [
"### Date type\n",
"\n",
"A properly parsed date takes up less memory than a string holding that date. In addition, the date type gives us access to the individual fields and to dedicated operations.\n",
"\n",
"When we already have a series of text data and only now want to convert it to dates, we can use `pd.to_datetime`."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "2dcffe2e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.89 s, sys: 14.3 ms, total: 2.91 s\n",
"Wall time: 2.9 s\n"
]
},
{
"data": {
"text/plain": [
"0 2012-07-27\n",
"1 2013-09-14\n",
"2 2015-05-15\n",
"3 2017-05-17\n",
"4 2016-10-26\n",
" ... \n",
"999995 2010-11-06\n",
"999996 2015-01-12\n",
"999997 2011-10-25\n",
"999998 2010-10-31\n",
"999999 2015-03-17\n",
"Name: Order Date, Length: 1000000, dtype: datetime64[ns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time pd.to_datetime(df['Order Date'])"
]
},
{
"cell_type": "markdown",
"id": "0f52b5a3",
"metadata": {},
"source": [
"This used to take a long time (**in older versions of Pandas**; in the newer version we can see that it is not bad at all). In those older versions it runs much faster when we tell pandas which date format to use.\n",
"\n",
"Passing the format explicitly still makes sense, because it helps avoid mistakes, such as confusing day-first and month-first dates."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "cb15ddd8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.72 s, sys: 12.9 ms, total: 2.73 s\n",
"Wall time: 2.72 s\n"
]
},
{
"data": {
"text/plain": [
"0 2012-07-27\n",
"1 2013-09-14\n",
"2 2015-05-15\n",
"3 2017-05-17\n",
"4 2016-10-26\n",
" ... \n",
"999995 2010-11-06\n",
"999996 2015-01-12\n",
"999997 2011-10-25\n",
"999998 2010-10-31\n",
"999999 2015-03-17\n",
"Name: Order Date, Length: 1000000, dtype: datetime64[ns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time pd.to_datetime(df['Order Date'], format='%m/%d/%Y')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "328128e8",
"metadata": {},
"outputs": [],
"source": [
"df['Order Date'] = pd.to_datetime(df['Order Date'], format='%m/%d/%Y')\n",
"df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='%m/%d/%Y')"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "775a0daa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Region object\n",
"Country object\n",
"Item Type object\n",
"Sales Channel object\n",
"Order Priority object\n",
"Order Date datetime64[ns]\n",
"Order ID int32\n",
"Ship Date datetime64[ns]\n",
"Units Sold int16\n",
"Unit Price float32\n",
"Unit Cost float32\n",
"Total Revenue float64\n",
"Total Cost float64\n",
"Total Profit float64\n",
"dtype: object"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "372d2256",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"379833361"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True).sum()"
]
},
{
"cell_type": "markdown",
"id": "50729610",
"metadata": {},
"source": [
"This not only reduced memory usage, but also gave us access to the individual components of the dates."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "fc3d427d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"176178734459.91003"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['Order Date'].dt.year == 2013]['Total Revenue'].sum()"
]
},
{
"cell_type": "markdown",
"id": "286a5902",
"metadata": {},
"source": [
"And if we wanted to go the other way and obtain text from a date, we can do it like this."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "7ec23eaf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Friday, 27.07.2012\n",
"1 Saturday, 14.09.2013\n",
"2 Friday, 15.05.2015\n",
"3 Wednesday, 17.05.2017\n",
"4 Wednesday, 26.10.2016\n",
" ... \n",
"999995 Saturday, 06.11.2010\n",
"999996 Monday, 12.01.2015\n",
"999997 Tuesday, 25.10.2011\n",
"999998 Sunday, 31.10.2010\n",
"999999 Tuesday, 17.03.2015\n",
"Name: Order Date, Length: 1000000, dtype: object"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Order Date'].dt.strftime('%A, %d.%m.%Y')"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "56626444",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"Region 72845056\n",
"Country 65904370\n",
"Item Type 65583558\n",
"Sales Channel 63500249\n",
"Order Priority 58000000\n",
"Order Date 8000000\n",
"Order ID 4000000\n",
"Ship Date 8000000\n",
"Units Sold 2000000\n",
"Unit Price 4000000\n",
"Unit Cost 4000000\n",
"Total Revenue 8000000\n",
"Total Cost 8000000\n",
"Total Profit 8000000\n",
"dtype: int64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True)"
]
},
{
"cell_type": "markdown",
"id": "81fa6172",
"metadata": {},
"source": [
"### Categorical type\n",
"\n",
"If a column contains repeating values, especially repeating text (such as product category, country, region, ...), then:\n",
"\n",
"- by default each occurrence of the value is stored in memory as a separate object (a separate string),\n",
"- but when we change the type from `object` to `category`, each unique value is stored in memory only once, and each field in the column holds only the number (code) of that value. Pandas does this internally, and working with the column still looks the same.\n",
"\n",
"Take the region as an example. There are only 7 distinct values across a million records.\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "a9e840a4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Sub-Saharan Africa', 'Middle East and North Africa',\n",
" 'Australia and Oceania', 'Europe', 'Asia',\n",
" 'Central America and the Caribbean', 'North America'], dtype=object)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Region'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "d483c175",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"72845184"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Region'].memory_usage(deep=True)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "79380acb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1000950"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Region'].astype('category').memory_usage(deep=True)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "56ea9351",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"72.7760467555822"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"72845184 / 1000950"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "c0141053",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.986259215159646"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1.0 - 1000950 / 72845184"
]
},
{
"cell_type": "markdown",
"id": "ffbde741",
"metadata": {},
"source": [
"Each field in the column now physically takes only 1 byte, because that is enough to store a number in the range 0-6.\n",
"\n",
"In addition, the strings themselves are stored in a separate dictionary. Altogether this takes about 1 MB, roughly 73 times less than the same text stored as `object`.\n",
"\n",
"We apply this conversion to all text columns:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "efc1b769",
"metadata": {},
"outputs": [],
"source": [
"df['Region'] = df['Region'].astype('category')\n",
"df['Country'] = df['Country'].astype('category')\n",
"df['Item Type'] = df['Item Type'].astype('category')\n",
"df['Sales Channel'] = df['Sales Channel'].astype('category')\n",
"df['Order Priority'] = df['Order Priority'].astype('category')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "3cca5b88",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"Region 1000822\n",
"Country 2016360\n",
"Item Type 1001087\n",
"Sales Channel 1000235\n",
"Order Priority 1000404\n",
"Order Date 8000000\n",
"Order ID 4000000\n",
"Ship Date 8000000\n",
"Units Sold 2000000\n",
"Unit Price 4000000\n",
"Unit Cost 4000000\n",
"Total Revenue 8000000\n",
"Total Cost 8000000\n",
"Total Profit 8000000\n",
"dtype: int64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "4d1fc4d5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"60019036"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True).sum()"
]
},
{
"cell_type": "markdown",
"id": "46b69778",
"metadata": {},
"source": [
"## Specifying the right types already while reading\n",
"\n",
"When we know up front which types to use, we can prepare a dictionary with the type information and pass it to `read_csv`.\n",
"\n",
"Date columns are not listed in the dictionary; we specify them separately with the `parse_dates` option. Additionally, in older versions of Pandas `infer_datetime_format=True` speeds up loading. As far as I understand it, Pandas then determines the date format for the whole column from the initial values."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "7b0dc3a7",
"metadata": {},
"outputs": [],
"source": [
"typy = {\n",
" \"Region\": 'category',\n",
" \"Country\": 'category',\n",
" \"Item Type\": 'category',\n",
" \"Sales Channel\": 'category',\n",
" \"Order Priority\": 'category',\n",
" \"Order ID\": 'int32',\n",
" \"Units Sold\": 'int16',\n",
" \"Unit Price\": 'float32',\n",
" \"Unit Cost\": 'float32',\n",
" \"Total Revenue\": 'float64',\n",
" \"Total Cost\": 'float64',\n",
" \"Total Profit\": 'float64',\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "2c27b889",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.18 s, sys: 88.2 ms, total: 7.27 s\n",
"Wall time: 7.25 s\n"
]
}
],
"source": [
"%time df = pd.read_csv('sales-records.csv', dtype=typy, parse_dates=[\"Order Date\", \"Ship Date\"])"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "b4680422",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"60018928"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.memory_usage(deep=True).sum()"
]
},
{
"cell_type": "markdown",
"id": "aad88aae",
"metadata": {},
"source": [
"In older versions of Pandas (before 2.0) you had to pass `infer_datetime_format=True` to significantly speed up loading of dates. Since 2.0 this is done by default anyway, and the parameter itself is deprecated."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "948af779",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "9e055f4e",
"metadata": {},
"source": [
"## Loading only part of the rows\n",
"\n",
"If we want to load only part of the rows from a large file as a one-off, we can use the `nrows` and `skiprows` parameters.\n",
"\n",
"If, however, we want to load all the data, but split into parts (\"read in chunks\"), it is better to use `chunksize` and consume the resulting \"generator\" in a loop.\n",
"\n",
"`nrows` - how many rows to read from the file\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "569e3b3a",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('sales-records.csv', nrows=1000)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b94a8e96",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Region</th>\n",
" <th>Country</th>\n",
" <th>Item Type</th>\n",
" <th>Sales Channel</th>\n",
" <th>Order Priority</th>\n",
" <th>Order Date</th>\n",
" <th>Order ID</th>\n",
" <th>Ship Date</th>\n",
" <th>Units Sold</th>\n",
" <th>Unit Price</th>\n",
" <th>Unit Cost</th>\n",
" <th>Total Revenue</th>\n",
" <th>Total Cost</th>\n",
" <th>Total Profit</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>South Africa</td>\n",
" <td>Fruits</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>7/27/2012</td>\n",
" <td>443368995</td>\n",
" <td>7/28/2012</td>\n",
" <td>1593</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>14862.69</td>\n",
" <td>11023.56</td>\n",
" <td>3839.13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Morocco</td>\n",
" <td>Clothes</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>9/14/2013</td>\n",
" <td>667593514</td>\n",
" <td>10/19/2013</td>\n",
" <td>4611</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>503890.08</td>\n",
" <td>165258.24</td>\n",
" <td>338631.84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Australia and Oceania</td>\n",
" <td>Papua New Guinea</td>\n",
" <td>Meat</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>5/15/2015</td>\n",
" <td>940995585</td>\n",
" <td>6/4/2015</td>\n",
" <td>360</td>\n",
" <td>421.89</td>\n",
" <td>364.69</td>\n",
" <td>151880.40</td>\n",
" <td>131288.40</td>\n",
" <td>20592.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Djibouti</td>\n",
" <td>Clothes</td>\n",
" <td>Offline</td>\n",
" <td>H</td>\n",
" <td>5/17/2017</td>\n",
" <td>880811536</td>\n",
" <td>7/2/2017</td>\n",
" <td>562</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>61415.36</td>\n",
" <td>20142.08</td>\n",
" <td>41273.28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Europe</td>\n",
" <td>Slovakia</td>\n",
" <td>Beverages</td>\n",
" <td>Offline</td>\n",
" <td>L</td>\n",
" <td>10/26/2016</td>\n",
" <td>174590194</td>\n",
" <td>12/4/2016</td>\n",
" <td>3973</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" <td>188518.85</td>\n",
" <td>126301.67</td>\n",
" <td>62217.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>995</th>\n",
" <td>Asia</td>\n",
" <td>Thailand</td>\n",
" <td>Clothes</td>\n",
" <td>Online</td>\n",
" <td>H</td>\n",
" <td>6/11/2012</td>\n",
" <td>768737256</td>\n",
" <td>7/7/2012</td>\n",
" <td>5292</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>578309.76</td>\n",
" <td>189665.28</td>\n",
" <td>388644.48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>996</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Iraq</td>\n",
" <td>Vegetables</td>\n",
" <td>Offline</td>\n",
" <td>H</td>\n",
" <td>12/11/2011</td>\n",
" <td>492740523</td>\n",
" <td>1/3/2012</td>\n",
" <td>1725</td>\n",
" <td>154.06</td>\n",
" <td>90.93</td>\n",
" <td>265753.50</td>\n",
" <td>156854.25</td>\n",
" <td>108899.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>997</th>\n",
" <td>Europe</td>\n",
" <td>Kosovo</td>\n",
" <td>Beverages</td>\n",
" <td>Offline</td>\n",
" <td>L</td>\n",
" <td>12/16/2014</td>\n",
" <td>552005326</td>\n",
" <td>2/3/2015</td>\n",
" <td>9498</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" <td>450680.10</td>\n",
" <td>301941.42</td>\n",
" <td>148738.68</td>\n",
" </tr>\n",
" <tr>\n",
" <th>998</th>\n",
" <td>North America</td>\n",
" <td>Canada</td>\n",
" <td>Fruits</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>2/19/2013</td>\n",
" <td>774705493</td>\n",
" <td>2/26/2013</td>\n",
" <td>1426</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>13304.58</td>\n",
" <td>9867.92</td>\n",
" <td>3436.66</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Turkey</td>\n",
" <td>Snacks</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>5/3/2014</td>\n",
" <td>609327352</td>\n",
" <td>5/19/2014</td>\n",
" <td>2359</td>\n",
" <td>152.58</td>\n",
" <td>97.44</td>\n",
" <td>359936.22</td>\n",
" <td>229860.96</td>\n",
" <td>130075.26</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1000 rows × 14 columns</p>\n",
"</div>"
],
"text/plain": [
" Region Country Item Type Sales Channel \\\n",
"0 Sub-Saharan Africa South Africa Fruits Offline \n",
"1 Middle East and North Africa Morocco Clothes Online \n",
"2 Australia and Oceania Papua New Guinea Meat Offline \n",
"3 Sub-Saharan Africa Djibouti Clothes Offline \n",
"4 Europe Slovakia Beverages Offline \n",
".. ... ... ... ... \n",
"995 Asia Thailand Clothes Online \n",
"996 Middle East and North Africa Iraq Vegetables Offline \n",
"997 Europe Kosovo Beverages Offline \n",
"998 North America Canada Fruits Online \n",
"999 Middle East and North Africa Turkey Snacks Offline \n",
"\n",
" Order Priority Order Date Order ID Ship Date Units Sold Unit Price \\\n",
"0 M 7/27/2012 443368995 7/28/2012 1593 9.33 \n",
"1 M 9/14/2013 667593514 10/19/2013 4611 109.28 \n",
"2 M 5/15/2015 940995585 6/4/2015 360 421.89 \n",
"3 H 5/17/2017 880811536 7/2/2017 562 109.28 \n",
"4 L 10/26/2016 174590194 12/4/2016 3973 47.45 \n",
".. ... ... ... ... ... ... \n",
"995 H 6/11/2012 768737256 7/7/2012 5292 109.28 \n",
"996 H 12/11/2011 492740523 1/3/2012 1725 154.06 \n",
"997 L 12/16/2014 552005326 2/3/2015 9498 47.45 \n",
"998 M 2/19/2013 774705493 2/26/2013 1426 9.33 \n",
"999 M 5/3/2014 609327352 5/19/2014 2359 152.58 \n",
"\n",
" Unit Cost Total Revenue Total Cost Total Profit \n",
"0 6.92 14862.69 11023.56 3839.13 \n",
"1 35.84 503890.08 165258.24 338631.84 \n",
"2 364.69 151880.40 131288.40 20592.00 \n",
"3 35.84 61415.36 20142.08 41273.28 \n",
"4 31.79 188518.85 126301.67 62217.18 \n",
".. ... ... ... ... \n",
"995 35.84 578309.76 189665.28 388644.48 \n",
"996 90.93 265753.50 156854.25 108899.25 \n",
"997 31.79 450680.10 301941.42 148738.68 \n",
"998 6.92 13304.58 9867.92 3436.66 \n",
"999 97.44 359936.22 229860.96 130075.26 \n",
"\n",
"[1000 rows x 14 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
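{
"cell_type": "markdown",
"id": "d4c265a6",
"metadata": {},
"source": [
"`skiprows` - how many initial lines of the file to skip.\n",
"\n",
"Unfortunately, this way we lose the header row: the header is one of the skipped lines, so the first remaining data row is taken as the column names. Passing a list-like of line numbers instead, e.g. `skiprows=range(1, 5001)`, skips data rows while keeping the header line (line 0)."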
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "89d8ba23",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('sales-records.csv', skiprows=5000, nrows=1000)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5bc5e0a3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sub-Saharan Africa</th>\n",
" <th>Republic of the Congo</th>\n",
" <th>Fruits</th>\n",
" <th>Offline</th>\n",
" <th>L</th>\n",
" <th>12/8/2010</th>\n",
" <th>261239278</th>\n",
" <th>1/12/2011</th>\n",
" <th>5362</th>\n",
" <th>9.33</th>\n",
" <th>6.92</th>\n",
" <th>50027.46</th>\n",
" <th>37105.04</th>\n",
" <th>12922.42</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>North America</td>\n",
" <td>Canada</td>\n",
" <td>Personal Care</td>\n",
" <td>Offline</td>\n",
" <td>C</td>\n",
" <td>8/2/2011</td>\n",
" <td>981317985</td>\n",
" <td>9/2/2011</td>\n",
" <td>2654</td>\n",
" <td>81.73</td>\n",
" <td>56.67</td>\n",
" <td>216911.42</td>\n",
" <td>150402.18</td>\n",
" <td>66509.24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Europe</td>\n",
" <td>Bosnia and Herzegovina</td>\n",
" <td>Fruits</td>\n",
" <td>Offline</td>\n",
" <td>L</td>\n",
" <td>4/18/2016</td>\n",
" <td>941541039</td>\n",
" <td>5/28/2016</td>\n",
" <td>1772</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>16532.76</td>\n",
" <td>12262.24</td>\n",
" <td>4270.52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Senegal</td>\n",
" <td>Fruits</td>\n",
" <td>Online</td>\n",
" <td>L</td>\n",
" <td>3/12/2017</td>\n",
" <td>701332271</td>\n",
" <td>4/22/2017</td>\n",
" <td>8659</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>80788.47</td>\n",
" <td>59920.28</td>\n",
" <td>20868.19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Togo</td>\n",
" <td>Fruits</td>\n",
" <td>Online</td>\n",
" <td>H</td>\n",
" <td>1/12/2015</td>\n",
" <td>440428006</td>\n",
" <td>2/13/2015</td>\n",
" <td>1289</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>12026.37</td>\n",
" <td>8919.88</td>\n",
" <td>3106.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Central America and the Caribbean</td>\n",
" <td>Antigua and Barbuda</td>\n",
" <td>Baby Food</td>\n",
" <td>Online</td>\n",
" <td>L</td>\n",
" <td>6/4/2015</td>\n",
" <td>408877074</td>\n",
" <td>7/18/2015</td>\n",
" <td>4667</td>\n",
" <td>255.28</td>\n",
" <td>159.42</td>\n",
" <td>1191391.76</td>\n",
" <td>744013.14</td>\n",
" <td>447378.62</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>995</th>\n",
" <td>North America</td>\n",
" <td>Mexico</td>\n",
" <td>Baby Food</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>3/3/2017</td>\n",
" <td>507650601</td>\n",
" <td>4/4/2017</td>\n",
" <td>3949</td>\n",
" <td>255.28</td>\n",
" <td>159.42</td>\n",
" <td>1008100.72</td>\n",
" <td>629549.58</td>\n",
" <td>378551.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>996</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Sierra Leone</td>\n",
" <td>Fruits</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>11/19/2012</td>\n",
" <td>793824279</td>\n",
" <td>12/5/2012</td>\n",
" <td>9099</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>84893.67</td>\n",
" <td>62965.08</td>\n",
" <td>21928.59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>997</th>\n",
" <td>Asia</td>\n",
" <td>Myanmar</td>\n",
" <td>Clothes</td>\n",
" <td>Offline</td>\n",
" <td>C</td>\n",
" <td>8/9/2010</td>\n",
" <td>153150117</td>\n",
" <td>9/19/2010</td>\n",
" <td>9378</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>1024827.84</td>\n",
" <td>336107.52</td>\n",
" <td>688720.32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>998</th>\n",
" <td>Asia</td>\n",
" <td>Singapore</td>\n",
" <td>Personal Care</td>\n",
" <td>Offline</td>\n",
" <td>L</td>\n",
" <td>7/2/2010</td>\n",
" <td>776301944</td>\n",
" <td>7/16/2010</td>\n",
" <td>3851</td>\n",
" <td>81.73</td>\n",
" <td>56.67</td>\n",
" <td>314742.23</td>\n",
" <td>218236.17</td>\n",
" <td>96506.06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Cameroon</td>\n",
" <td>Personal Care</td>\n",
" <td>Online</td>\n",
" <td>C</td>\n",
" <td>10/19/2013</td>\n",
" <td>874266088</td>\n",
" <td>10/29/2013</td>\n",
" <td>3618</td>\n",
" <td>81.73</td>\n",
" <td>56.67</td>\n",
" <td>295699.14</td>\n",
" <td>205032.06</td>\n",
" <td>90667.08</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1000 rows × 14 columns</p>\n",
"</div>"
],
"text/plain": [
" Sub-Saharan Africa Republic of the Congo Fruits \\\n",
"0 North America Canada Personal Care \n",
"1 Europe Bosnia and Herzegovina Fruits \n",
"2 Sub-Saharan Africa Senegal Fruits \n",
"3 Sub-Saharan Africa Togo Fruits \n",
"4 Central America and the Caribbean Antigua and Barbuda Baby Food \n",
".. ... ... ... \n",
"995 North America Mexico Baby Food \n",
"996 Sub-Saharan Africa Sierra Leone Fruits \n",
"997 Asia Myanmar Clothes \n",
"998 Asia Singapore Personal Care \n",
"999 Sub-Saharan Africa Cameroon Personal Care \n",
"\n",
" Offline L 12/8/2010 261239278 1/12/2011 5362 9.33 6.92 \\\n",
"0 Offline C 8/2/2011 981317985 9/2/2011 2654 81.73 56.67 \n",
"1 Offline L 4/18/2016 941541039 5/28/2016 1772 9.33 6.92 \n",
"2 Online L 3/12/2017 701332271 4/22/2017 8659 9.33 6.92 \n",
"3 Online H 1/12/2015 440428006 2/13/2015 1289 9.33 6.92 \n",
"4 Online L 6/4/2015 408877074 7/18/2015 4667 255.28 159.42 \n",
".. ... .. ... ... ... ... ... ... \n",
"995 Offline M 3/3/2017 507650601 4/4/2017 3949 255.28 159.42 \n",
"996 Online M 11/19/2012 793824279 12/5/2012 9099 9.33 6.92 \n",
"997 Offline C 8/9/2010 153150117 9/19/2010 9378 109.28 35.84 \n",
"998 Offline L 7/2/2010 776301944 7/16/2010 3851 81.73 56.67 \n",
"999 Online C 10/19/2013 874266088 10/29/2013 3618 81.73 56.67 \n",
"\n",
" 50027.46 37105.04 12922.42 \n",
"0 216911.42 150402.18 66509.24 \n",
"1 16532.76 12262.24 4270.52 \n",
"2 80788.47 59920.28 20868.19 \n",
"3 12026.37 8919.88 3106.49 \n",
"4 1191391.76 744013.14 447378.62 \n",
".. ... ... ... \n",
"995 1008100.72 629549.58 378551.14 \n",
"996 84893.67 62965.08 21928.59 \n",
"997 1024827.84 336107.52 688720.32 \n",
"998 314742.23 218236.17 96506.06 \n",
"999 295699.14 205032.06 90667.08 \n",
"\n",
"[1000 rows x 14 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"id": "9952e946",
"metadata": {},
"source": [
"## Restricting the set of columns"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "95ee720a",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('sales-records.csv', usecols=[0, 1, 2, 3, 8, 9, 10])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7c8fa26a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Region</th>\n",
" <th>Country</th>\n",
" <th>Item Type</th>\n",
" <th>Sales Channel</th>\n",
" <th>Units Sold</th>\n",
" <th>Unit Price</th>\n",
" <th>Unit Cost</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>South Africa</td>\n",
" <td>Fruits</td>\n",
" <td>Offline</td>\n",
" <td>1593</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Morocco</td>\n",
" <td>Clothes</td>\n",
" <td>Online</td>\n",
" <td>4611</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Australia and Oceania</td>\n",
" <td>Papua New Guinea</td>\n",
" <td>Meat</td>\n",
" <td>Offline</td>\n",
" <td>360</td>\n",
" <td>421.89</td>\n",
" <td>364.69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Djibouti</td>\n",
" <td>Clothes</td>\n",
" <td>Offline</td>\n",
" <td>562</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Europe</td>\n",
" <td>Slovakia</td>\n",
" <td>Beverages</td>\n",
" <td>Offline</td>\n",
" <td>3973</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999995</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Senegal</td>\n",
" <td>Baby Food</td>\n",
" <td>Offline</td>\n",
" <td>3387</td>\n",
" <td>255.28</td>\n",
" <td>159.42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999996</th>\n",
" <td>Central America and the Caribbean</td>\n",
" <td>Panama</td>\n",
" <td>Office Supplies</td>\n",
" <td>Offline</td>\n",
" <td>4068</td>\n",
" <td>651.21</td>\n",
" <td>524.96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999997</th>\n",
" <td>Europe</td>\n",
" <td>Norway</td>\n",
" <td>Office Supplies</td>\n",
" <td>Online</td>\n",
" <td>5266</td>\n",
" <td>651.21</td>\n",
" <td>524.96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999998</th>\n",
" <td>Europe</td>\n",
" <td>Montenegro</td>\n",
" <td>Beverages</td>\n",
" <td>Offline</td>\n",
" <td>8551</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999999</th>\n",
" <td>Central America and the Caribbean</td>\n",
" <td>Nicaragua</td>\n",
" <td>Meat</td>\n",
" <td>Online</td>\n",
" <td>7519</td>\n",
" <td>421.89</td>\n",
" <td>364.69</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1000000 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" Region Country Item Type \\\n",
"0 Sub-Saharan Africa South Africa Fruits \n",
"1 Middle East and North Africa Morocco Clothes \n",
"2 Australia and Oceania Papua New Guinea Meat \n",
"3 Sub-Saharan Africa Djibouti Clothes \n",
"4 Europe Slovakia Beverages \n",
"... ... ... ... \n",
"999995 Sub-Saharan Africa Senegal Baby Food \n",
"999996 Central America and the Caribbean Panama Office Supplies \n",
"999997 Europe Norway Office Supplies \n",
"999998 Europe Montenegro Beverages \n",
"999999 Central America and the Caribbean Nicaragua Meat \n",
"\n",
" Sales Channel Units Sold Unit Price Unit Cost \n",
"0 Offline 1593 9.33 6.92 \n",
"1 Online 4611 109.28 35.84 \n",
"2 Offline 360 421.89 364.69 \n",
"3 Offline 562 109.28 35.84 \n",
"4 Offline 3973 47.45 31.79 \n",
"... ... ... ... ... \n",
"999995 Offline 3387 255.28 159.42 \n",
"999996 Offline 4068 651.21 524.96 \n",
"999997 Online 5266 651.21 524.96 \n",
"999998 Offline 8551 47.45 31.79 \n",
"999999 Online 7519 421.89 364.69 \n",
"\n",
"[1000000 rows x 7 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "94b2072a",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('sales-records.csv',\n",
" usecols=['Region', 'Country', 'Item Type', 'Sales Channel', 'Units Sold', 'Unit Price', 'Unit Cost'])"
]
},
{
"cell_type": "markdown",
"id": "43eaedf3",
"metadata": {},
"source": [
"When we already have a dictionary that maps column names to their types, we can also pass it as `usecols` to select the columns to load; its keys serve as the list of column names."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "d8521d50",
"metadata": {},
"outputs": [],
"source": [
"typy2 = {\n",
" \"Region\": 'category',\n",
" \"Country\": 'category',\n",
" \"Item Type\": 'category',\n",
" \"Order ID\": 'int32',\n",
" \"Units Sold\": 'int16',\n",
" \"Unit Price\": 'float32',\n",
" \"Unit Cost\": 'float32',\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "9af89450",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('sales-records.csv', dtype=typy2, usecols=typy2)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "6129793b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Region</th>\n",
" <th>Country</th>\n",
" <th>Item Type</th>\n",
" <th>Order ID</th>\n",
" <th>Units Sold</th>\n",
" <th>Unit Price</th>\n",
" <th>Unit Cost</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>South Africa</td>\n",
" <td>Fruits</td>\n",
" <td>443368995</td>\n",
" <td>1593</td>\n",
" <td>9.330000</td>\n",
" <td>6.920000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Morocco</td>\n",
" <td>Clothes</td>\n",
" <td>667593514</td>\n",
" <td>4611</td>\n",
" <td>109.279999</td>\n",
" <td>35.840000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Australia and Oceania</td>\n",
" <td>Papua New Guinea</td>\n",
" <td>Meat</td>\n",
" <td>940995585</td>\n",
" <td>360</td>\n",
" <td>421.890015</td>\n",
" <td>364.690002</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Djibouti</td>\n",
" <td>Clothes</td>\n",
" <td>880811536</td>\n",
" <td>562</td>\n",
" <td>109.279999</td>\n",
" <td>35.840000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Europe</td>\n",
" <td>Slovakia</td>\n",
" <td>Beverages</td>\n",
" <td>174590194</td>\n",
" <td>3973</td>\n",
" <td>47.450001</td>\n",
" <td>31.790001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999995</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Senegal</td>\n",
" <td>Baby Food</td>\n",
" <td>575470578</td>\n",
" <td>3387</td>\n",
" <td>255.279999</td>\n",
" <td>159.419998</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999996</th>\n",
" <td>Central America and the Caribbean</td>\n",
" <td>Panama</td>\n",
" <td>Office Supplies</td>\n",
" <td>766942107</td>\n",
" <td>4068</td>\n",
" <td>651.210022</td>\n",
" <td>524.960022</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999997</th>\n",
" <td>Europe</td>\n",
" <td>Norway</td>\n",
" <td>Office Supplies</td>\n",
" <td>685472047</td>\n",
" <td>5266</td>\n",
" <td>651.210022</td>\n",
" <td>524.960022</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999998</th>\n",
" <td>Europe</td>\n",
" <td>Montenegro</td>\n",
" <td>Beverages</td>\n",
" <td>946734225</td>\n",
" <td>8551</td>\n",
" <td>47.450001</td>\n",
" <td>31.790001</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999999</th>\n",
" <td>Central America and the Caribbean</td>\n",
" <td>Nicaragua</td>\n",
" <td>Meat</td>\n",
" <td>886714971</td>\n",
" <td>7519</td>\n",
" <td>421.890015</td>\n",
" <td>364.690002</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1000000 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" Region Country Item Type \\\n",
"0 Sub-Saharan Africa South Africa Fruits \n",
"1 Middle East and North Africa Morocco Clothes \n",
"2 Australia and Oceania Papua New Guinea Meat \n",
"3 Sub-Saharan Africa Djibouti Clothes \n",
"4 Europe Slovakia Beverages \n",
"... ... ... ... \n",
"999995 Sub-Saharan Africa Senegal Baby Food \n",
"999996 Central America and the Caribbean Panama Office Supplies \n",
"999997 Europe Norway Office Supplies \n",
"999998 Europe Montenegro Beverages \n",
"999999 Central America and the Caribbean Nicaragua Meat \n",
"\n",
" Order ID Units Sold Unit Price Unit Cost \n",
"0 443368995 1593 9.330000 6.920000 \n",
"1 667593514 4611 109.279999 35.840000 \n",
"2 940995585 360 421.890015 364.690002 \n",
"3 880811536 562 109.279999 35.840000 \n",
"4 174590194 3973 47.450001 31.790001 \n",
"... ... ... ... ... \n",
"999995 575470578 3387 255.279999 159.419998 \n",
"999996 766942107 4068 651.210022 524.960022 \n",
"999997 685472047 5266 651.210022 524.960022 \n",
"999998 946734225 8551 47.450001 31.790001 \n",
"999999 886714971 7519 421.890015 364.690002 \n",
"\n",
"[1000000 rows x 7 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"id": "22495c4d",
"metadata": {},
"source": [
"## Reading data in chunks\n",
"\n",
"When we want to process an entire file that does not fit in memory, we can read the data in chunks and process those chunks one by one in a loop.\n",
"\n",
"When we pass the `chunksize` parameter to `read_csv`, the function does not return a table (`DataFrame`) but a reader object from which successive chunks can be fetched. It behaves like an iterator, much as a generator does: each chunk is produced only when requested."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "cf82701d",
"metadata": {},
"outputs": [],
"source": [
"maszyna_wczytujaca = pd.read_csv('sales-records.csv', chunksize=1000)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "5a2a6495",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<pandas.io.parsers.readers.TextFileReader at 0x1f767ed4d00>"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"maszyna_wczytujaca"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "7d851ba2",
"metadata": {},
"outputs": [],
"source": [
"chunk1 = maszyna_wczytujaca.get_chunk()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "ba9e6caf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Region</th>\n",
" <th>Country</th>\n",
" <th>Item Type</th>\n",
" <th>Sales Channel</th>\n",
" <th>Order Priority</th>\n",
" <th>Order Date</th>\n",
" <th>Order ID</th>\n",
" <th>Ship Date</th>\n",
" <th>Units Sold</th>\n",
" <th>Unit Price</th>\n",
" <th>Unit Cost</th>\n",
" <th>Total Revenue</th>\n",
" <th>Total Cost</th>\n",
" <th>Total Profit</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>South Africa</td>\n",
" <td>Fruits</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>7/27/2012</td>\n",
" <td>443368995</td>\n",
" <td>7/28/2012</td>\n",
" <td>1593</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>14862.69</td>\n",
" <td>11023.56</td>\n",
" <td>3839.13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Morocco</td>\n",
" <td>Clothes</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>9/14/2013</td>\n",
" <td>667593514</td>\n",
" <td>10/19/2013</td>\n",
" <td>4611</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>503890.08</td>\n",
" <td>165258.24</td>\n",
" <td>338631.84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Australia and Oceania</td>\n",
" <td>Papua New Guinea</td>\n",
" <td>Meat</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>5/15/2015</td>\n",
" <td>940995585</td>\n",
" <td>6/4/2015</td>\n",
" <td>360</td>\n",
" <td>421.89</td>\n",
" <td>364.69</td>\n",
" <td>151880.40</td>\n",
" <td>131288.40</td>\n",
" <td>20592.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Djibouti</td>\n",
" <td>Clothes</td>\n",
" <td>Offline</td>\n",
" <td>H</td>\n",
" <td>5/17/2017</td>\n",
" <td>880811536</td>\n",
" <td>7/2/2017</td>\n",
" <td>562</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>61415.36</td>\n",
" <td>20142.08</td>\n",
" <td>41273.28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Europe</td>\n",
" <td>Slovakia</td>\n",
" <td>Beverages</td>\n",
" <td>Offline</td>\n",
" <td>L</td>\n",
" <td>10/26/2016</td>\n",
" <td>174590194</td>\n",
" <td>12/4/2016</td>\n",
" <td>3973</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" <td>188518.85</td>\n",
" <td>126301.67</td>\n",
" <td>62217.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>995</th>\n",
" <td>Asia</td>\n",
" <td>Thailand</td>\n",
" <td>Clothes</td>\n",
" <td>Online</td>\n",
" <td>H</td>\n",
" <td>6/11/2012</td>\n",
" <td>768737256</td>\n",
" <td>7/7/2012</td>\n",
" <td>5292</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>578309.76</td>\n",
" <td>189665.28</td>\n",
" <td>388644.48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>996</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Iraq</td>\n",
" <td>Vegetables</td>\n",
" <td>Offline</td>\n",
" <td>H</td>\n",
" <td>12/11/2011</td>\n",
" <td>492740523</td>\n",
" <td>1/3/2012</td>\n",
" <td>1725</td>\n",
" <td>154.06</td>\n",
" <td>90.93</td>\n",
" <td>265753.50</td>\n",
" <td>156854.25</td>\n",
" <td>108899.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>997</th>\n",
" <td>Europe</td>\n",
" <td>Kosovo</td>\n",
" <td>Beverages</td>\n",
" <td>Offline</td>\n",
" <td>L</td>\n",
" <td>12/16/2014</td>\n",
" <td>552005326</td>\n",
" <td>2/3/2015</td>\n",
" <td>9498</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" <td>450680.10</td>\n",
" <td>301941.42</td>\n",
" <td>148738.68</td>\n",
" </tr>\n",
" <tr>\n",
" <th>998</th>\n",
" <td>North America</td>\n",
" <td>Canada</td>\n",
" <td>Fruits</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>2/19/2013</td>\n",
" <td>774705493</td>\n",
" <td>2/26/2013</td>\n",
" <td>1426</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>13304.58</td>\n",
" <td>9867.92</td>\n",
" <td>3436.66</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Turkey</td>\n",
" <td>Snacks</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>5/3/2014</td>\n",
" <td>609327352</td>\n",
" <td>5/19/2014</td>\n",
" <td>2359</td>\n",
" <td>152.58</td>\n",
" <td>97.44</td>\n",
" <td>359936.22</td>\n",
" <td>229860.96</td>\n",
" <td>130075.26</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1000 rows × 14 columns</p>\n",
"</div>"
],
"text/plain": [
" Region Country Item Type Sales Channel \\\n",
"0 Sub-Saharan Africa South Africa Fruits Offline \n",
"1 Middle East and North Africa Morocco Clothes Online \n",
"2 Australia and Oceania Papua New Guinea Meat Offline \n",
"3 Sub-Saharan Africa Djibouti Clothes Offline \n",
"4 Europe Slovakia Beverages Offline \n",
".. ... ... ... ... \n",
"995 Asia Thailand Clothes Online \n",
"996 Middle East and North Africa Iraq Vegetables Offline \n",
"997 Europe Kosovo Beverages Offline \n",
"998 North America Canada Fruits Online \n",
"999 Middle East and North Africa Turkey Snacks Offline \n",
"\n",
" Order Priority Order Date Order ID Ship Date Units Sold Unit Price \\\n",
"0 M 7/27/2012 443368995 7/28/2012 1593 9.33 \n",
"1 M 9/14/2013 667593514 10/19/2013 4611 109.28 \n",
"2 M 5/15/2015 940995585 6/4/2015 360 421.89 \n",
"3 H 5/17/2017 880811536 7/2/2017 562 109.28 \n",
"4 L 10/26/2016 174590194 12/4/2016 3973 47.45 \n",
".. ... ... ... ... ... ... \n",
"995 H 6/11/2012 768737256 7/7/2012 5292 109.28 \n",
"996 H 12/11/2011 492740523 1/3/2012 1725 154.06 \n",
"997 L 12/16/2014 552005326 2/3/2015 9498 47.45 \n",
"998 M 2/19/2013 774705493 2/26/2013 1426 9.33 \n",
"999 M 5/3/2014 609327352 5/19/2014 2359 152.58 \n",
"\n",
" Unit Cost Total Revenue Total Cost Total Profit \n",
"0 6.92 14862.69 11023.56 3839.13 \n",
"1 35.84 503890.08 165258.24 338631.84 \n",
"2 364.69 151880.40 131288.40 20592.00 \n",
"3 35.84 61415.36 20142.08 41273.28 \n",
"4 31.79 188518.85 126301.67 62217.18 \n",
".. ... ... ... ... \n",
"995 35.84 578309.76 189665.28 388644.48 \n",
"996 90.93 265753.50 156854.25 108899.25 \n",
"997 31.79 450680.10 301941.42 148738.68 \n",
"998 6.92 13304.58 9867.92 3436.66 \n",
"999 97.44 359936.22 229860.96 130075.26 \n",
"\n",
"[1000 rows x 14 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the first thousand records\n",
"chunk1"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "1caff85e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Region</th>\n",
" <th>Country</th>\n",
" <th>Item Type</th>\n",
" <th>Sales Channel</th>\n",
" <th>Order Priority</th>\n",
" <th>Order Date</th>\n",
" <th>Order ID</th>\n",
" <th>Ship Date</th>\n",
" <th>Units Sold</th>\n",
" <th>Unit Price</th>\n",
" <th>Unit Cost</th>\n",
" <th>Total Revenue</th>\n",
" <th>Total Cost</th>\n",
" <th>Total Profit</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1000</th>\n",
" <td>Central America and the Caribbean</td>\n",
" <td>Guatemala</td>\n",
" <td>Fruits</td>\n",
" <td>Offline</td>\n",
" <td>H</td>\n",
" <td>2/14/2016</td>\n",
" <td>474669730</td>\n",
" <td>3/16/2016</td>\n",
" <td>4176</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>38962.08</td>\n",
" <td>28897.92</td>\n",
" <td>10064.16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1001</th>\n",
" <td>Europe</td>\n",
" <td>United Kingdom</td>\n",
" <td>Beverages</td>\n",
" <td>Offline</td>\n",
" <td>M</td>\n",
" <td>2/12/2017</td>\n",
" <td>409843957</td>\n",
" <td>3/2/2017</td>\n",
" <td>789</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" <td>37438.05</td>\n",
" <td>25082.31</td>\n",
" <td>12355.74</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1002</th>\n",
" <td>Europe</td>\n",
" <td>Cyprus</td>\n",
" <td>Cosmetics</td>\n",
" <td>Online</td>\n",
" <td>H</td>\n",
" <td>3/26/2013</td>\n",
" <td>524273860</td>\n",
" <td>5/1/2013</td>\n",
" <td>3141</td>\n",
" <td>437.20</td>\n",
" <td>263.33</td>\n",
" <td>1373245.20</td>\n",
" <td>827119.53</td>\n",
" <td>546125.67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1003</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Qatar</td>\n",
" <td>Fruits</td>\n",
" <td>Online</td>\n",
" <td>C</td>\n",
" <td>8/1/2016</td>\n",
" <td>547695767</td>\n",
" <td>8/4/2016</td>\n",
" <td>4203</td>\n",
" <td>9.33</td>\n",
" <td>6.92</td>\n",
" <td>39213.99</td>\n",
" <td>29084.76</td>\n",
" <td>10129.23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1004</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Sao Tome and Principe</td>\n",
" <td>Office Supplies</td>\n",
" <td>Online</td>\n",
" <td>L</td>\n",
" <td>11/7/2013</td>\n",
" <td>280158507</td>\n",
" <td>12/7/2013</td>\n",
" <td>3983</td>\n",
" <td>651.21</td>\n",
" <td>524.96</td>\n",
" <td>2593769.43</td>\n",
" <td>2090915.68</td>\n",
" <td>502853.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1995</th>\n",
" <td>Asia</td>\n",
" <td>Malaysia</td>\n",
" <td>Personal Care</td>\n",
" <td>Offline</td>\n",
" <td>L</td>\n",
" <td>8/1/2014</td>\n",
" <td>291480863</td>\n",
" <td>8/9/2014</td>\n",
" <td>4519</td>\n",
" <td>81.73</td>\n",
" <td>56.67</td>\n",
" <td>369337.87</td>\n",
" <td>256091.73</td>\n",
" <td>113246.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1996</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>Sierra Leone</td>\n",
" <td>Clothes</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>12/16/2010</td>\n",
" <td>791668212</td>\n",
" <td>1/17/2011</td>\n",
" <td>1071</td>\n",
" <td>109.28</td>\n",
" <td>35.84</td>\n",
" <td>117038.88</td>\n",
" <td>38384.64</td>\n",
" <td>78654.24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1997</th>\n",
" <td>Middle East and North Africa</td>\n",
" <td>Egypt</td>\n",
" <td>Baby Food</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>1/23/2012</td>\n",
" <td>294695222</td>\n",
" <td>1/29/2012</td>\n",
" <td>5720</td>\n",
" <td>255.28</td>\n",
" <td>159.42</td>\n",
" <td>1460201.60</td>\n",
" <td>911882.40</td>\n",
" <td>548319.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1998</th>\n",
" <td>Europe</td>\n",
" <td>Iceland</td>\n",
" <td>Beverages</td>\n",
" <td>Online</td>\n",
" <td>M</td>\n",
" <td>4/11/2014</td>\n",
" <td>541235721</td>\n",
" <td>4/29/2014</td>\n",
" <td>2532</td>\n",
" <td>47.45</td>\n",
" <td>31.79</td>\n",
" <td>120143.40</td>\n",
" <td>80492.28</td>\n",
" <td>39651.12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1999</th>\n",
" <td>Sub-Saharan Africa</td>\n",
" <td>The Gambia</td>\n",
" <td>Snacks</td>\n",
" <td>Online</td>\n",
" <td>L</td>\n",
" <td>1/16/2015</td>\n",
" <td>192276036</td>\n",
" <td>1/29/2015</td>\n",
" <td>7607</td>\n",
" <td>152.58</td>\n",
" <td>97.44</td>\n",
" <td>1160676.06</td>\n",
" <td>741226.08</td>\n",
" <td>419449.98</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1000 rows × 14 columns</p>\n",
"</div>"
],
"text/plain": [
" Region Country \\\n",
"1000 Central America and the Caribbean Guatemala \n",
"1001 Europe United Kingdom \n",
"1002 Europe Cyprus \n",
"1003 Middle East and North Africa Qatar \n",
"1004 Sub-Saharan Africa Sao Tome and Principe \n",
"... ... ... \n",
"1995 Asia Malaysia \n",
"1996 Sub-Saharan Africa Sierra Leone \n",
"1997 Middle East and North Africa Egypt \n",
"1998 Europe Iceland \n",
"1999 Sub-Saharan Africa The Gambia \n",
"\n",
" Item Type Sales Channel Order Priority Order Date Order ID \\\n",
"1000 Fruits Offline H 2/14/2016 474669730 \n",
"1001 Beverages Offline M 2/12/2017 409843957 \n",
"1002 Cosmetics Online H 3/26/2013 524273860 \n",
"1003 Fruits Online C 8/1/2016 547695767 \n",
"1004 Office Supplies Online L 11/7/2013 280158507 \n",
"... ... ... ... ... ... \n",
"1995 Personal Care Offline L 8/1/2014 291480863 \n",
"1996 Clothes Online M 12/16/2010 791668212 \n",
"1997 Baby Food Online M 1/23/2012 294695222 \n",
"1998 Beverages Online M 4/11/2014 541235721 \n",
"1999 Snacks Online L 1/16/2015 192276036 \n",
"\n",
" Ship Date Units Sold Unit Price Unit Cost Total Revenue Total Cost \\\n",
"1000 3/16/2016 4176 9.33 6.92 38962.08 28897.92 \n",
"1001 3/2/2017 789 47.45 31.79 37438.05 25082.31 \n",
"1002 5/1/2013 3141 437.20 263.33 1373245.20 827119.53 \n",
"1003 8/4/2016 4203 9.33 6.92 39213.99 29084.76 \n",
"1004 12/7/2013 3983 651.21 524.96 2593769.43 2090915.68 \n",
"... ... ... ... ... ... ... \n",
"1995 8/9/2014 4519 81.73 56.67 369337.87 256091.73 \n",
"1996 1/17/2011 1071 109.28 35.84 117038.88 38384.64 \n",
"1997 1/29/2012 5720 255.28 159.42 1460201.60 911882.40 \n",
"1998 4/29/2014 2532 47.45 31.79 120143.40 80492.28 \n",
"1999 1/29/2015 7607 152.58 97.44 1160676.06 741226.08 \n",
"\n",
" Total Profit \n",
"1000 10064.16 \n",
"1001 12355.74 \n",
"1002 546125.67 \n",
"1003 10129.23 \n",
"1004 502853.75 \n",
"... ... \n",
"1995 113246.14 \n",
"1996 78654.24 \n",
"1997 548319.20 \n",
"1998 39651.12 \n",
"1999 419449.98 \n",
"\n",
"[1000 rows x 14 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chunk2 = maszyna_wczytujaca.get_chunk()\n",
"chunk2"
]
},
{
"cell_type": "markdown",
"id": "31365b39",
"metadata": {},
"source": [
"And so on...\n",
"\n",
"Instead of calling `get_chunk` repeatedly, we can fetch all the chunks in a `for` loop, because the reader object is iterable."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "d31091c3",
"metadata": {},
"outputs": [],
"source": [
"typy = {\n",
" \"Region\": 'category',\n",
" \"Country\": 'category',\n",
" \"Item Type\": 'category',\n",
" \"Sales Channel\": 'category',\n",
" \"Order Priority\": 'category',\n",
" \"Order ID\": 'int32',\n",
" \"Units Sold\": 'int16',\n",
" \"Unit Price\": 'float32',\n",
" \"Unit Cost\": 'float32',\n",
" \"Total Revenue\": 'float64',\n",
" \"Total Cost\": 'float64',\n",
" \"Total Profit\": 'float64',\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "25fc9200",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Chunk sum = 39277763055.299995 , running total = 39277763055.299995\n",
"Chunk sum = 39100160089.25 , running total = 78377923144.54999\n",
"Chunk sum = 39326230975.14 , running total = 117704154119.68999\n",
"Chunk sum = 39356150158.33001 , running total = 157060304278.02\n",
"Chunk sum = 39179678016.200005 , running total = 196239982294.22\n",
"Chunk sum = 39264451692.43 , running total = 235504433986.65\n",
"Chunk sum = 39263680732.28 , running total = 274768114718.93\n",
"Chunk sum = 39205522643.08 , running total = 313973637362.01\n",
"Chunk sum = 39159852119.67 , running total = 353133489481.68\n",
"Chunk sum = 39162072167.27 , running total = 392295561648.95\n",
"Final total: 392 295 561 648.95\n"
]
}
],
"source": [
"# read only the needed columns, 100 000 rows at a time\n",
"maszyna = pd.read_csv('sales-records.csv', chunksize=100000,\n",
"                      dtype=typy, usecols=['Region', 'Country', 'Total Profit'])\n",
"suma = 0.0\n",
"for porcja in maszyna:\n",
"    suma_porcji = porcja[\"Total Profit\"].sum()\n",
"    suma += suma_porcji\n",
"    print(f'Chunk sum = {suma_porcji} , running total = {suma}')\n",
"\n",
"maszyna.close()\n",
"# format with thousands separators, then swap the commas for spaces\n",
"print(f'Final total: {suma:,.2f}'.replace(',', ' '))"
]
},
{
"cell_type": "markdown",
"id": "bc42c524",
"metadata": {},
"source": [
"Reading in chunks primarily saves memory. CPU usage, on the other hand, may even be higher.\n",
"\n",
"The chunk size should be chosen so that a chunk fits comfortably in memory, but without an unnecessarily large number of loop iterations.\n",
"\n",
"The final version of this example, with printing kept to a minimum. We also use the `with` statement, so the file is closed automatically without an explicit call to `close`."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "794b0046",
"metadata": {},
"outputs": [],
"source": [
"suma = 0.0\n",
"with pd.read_csv('sales-records.csv', chunksize=10000, dtype=typy, usecols=['Region', 'Country', 'Total Profit']) as reader:\n",
"    for chunk in reader:\n",
"        suma += chunk[\"Total Profit\"].sum()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "91df80cc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"392295561648.95"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"suma"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
...@@ -2441,8 +2441,28 @@ ...@@ -2441,8 +2441,28 @@
] ]
}, },
{ {
"cell_type": "markdown",
"id": "65096c60-0917-4939-8aa0-95de69a4eafd",
"metadata": {},
"source": [
"The loaded table has the columns `cena` (unit price) and `sztuk` (number of units), but only their product gives the value of a transaction.\n",
"\n",
"We will add a new column, `wartosc`, to the table to hold that product."
]
},
{
"cell_type": "code", "cell_type": "code",
"execution_count": 65, "execution_count": 66,
"id": "a335695f-9e7f-4977-94be-484404276027",
"metadata": {},
"outputs": [],
"source": [
"sprzedaz['wartosc'] = sprzedaz.cena * sprzedaz.sztuk"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "cfa4857f-99bb-4e89-a7e4-6c916efdcb11", "id": "cfa4857f-99bb-4e89-a7e4-6c916efdcb11",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
...@@ -2474,6 +2494,7 @@ ...@@ -2474,6 +2494,7 @@
" <th>towar</th>\n", " <th>towar</th>\n",
" <th>cena</th>\n", " <th>cena</th>\n",
" <th>sztuk</th>\n", " <th>sztuk</th>\n",
" <th>wartosc</th>\n",
" </tr>\n", " </tr>\n",
" </thead>\n", " </thead>\n",
" <tbody>\n", " <tbody>\n",
...@@ -2486,6 +2507,7 @@ ...@@ -2486,6 +2507,7 @@
" <td>biurko</td>\n", " <td>biurko</td>\n",
" <td>149.99</td>\n", " <td>149.99</td>\n",
" <td>4</td>\n", " <td>4</td>\n",
" <td>599.96</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>1</th>\n", " <th>1</th>\n",
...@@ -2496,6 +2518,7 @@ ...@@ -2496,6 +2518,7 @@
" <td>tablica</td>\n", " <td>tablica</td>\n",
" <td>590.00</td>\n", " <td>590.00</td>\n",
" <td>2</td>\n", " <td>2</td>\n",
" <td>1180.00</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>2</th>\n", " <th>2</th>\n",
...@@ -2506,6 +2529,7 @@ ...@@ -2506,6 +2529,7 @@
" <td>flamaster</td>\n", " <td>flamaster</td>\n",
" <td>0.99</td>\n", " <td>0.99</td>\n",
" <td>51</td>\n", " <td>51</td>\n",
" <td>50.49</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>3</th>\n", " <th>3</th>\n",
...@@ -2516,6 +2540,7 @@ ...@@ -2516,6 +2540,7 @@
" <td>gąbka</td>\n", " <td>gąbka</td>\n",
" <td>4.00</td>\n", " <td>4.00</td>\n",
" <td>250</td>\n", " <td>250</td>\n",
" <td>1000.00</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>4</th>\n", " <th>4</th>\n",
...@@ -2526,6 +2551,7 @@ ...@@ -2526,6 +2551,7 @@
" <td>biurko</td>\n", " <td>biurko</td>\n",
" <td>149.99</td>\n", " <td>149.99</td>\n",
" <td>9</td>\n", " <td>9</td>\n",
" <td>1349.91</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>...</th>\n", " <th>...</th>\n",
...@@ -2536,6 +2562,7 @@ ...@@ -2536,6 +2562,7 @@
" <td>...</td>\n", " <td>...</td>\n",
" <td>...</td>\n", " <td>...</td>\n",
" <td>...</td>\n", " <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>9995</th>\n", " <th>9995</th>\n",
...@@ -2546,6 +2573,7 @@ ...@@ -2546,6 +2573,7 @@
" <td>dziurkacz</td>\n", " <td>dziurkacz</td>\n",
" <td>7.50</td>\n", " <td>7.50</td>\n",
" <td>178</td>\n", " <td>178</td>\n",
" <td>1335.00</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>9996</th>\n", " <th>9996</th>\n",
...@@ -2556,6 +2584,7 @@ ...@@ -2556,6 +2584,7 @@
" <td>biurko</td>\n", " <td>biurko</td>\n",
" <td>149.99</td>\n", " <td>149.99</td>\n",
" <td>7</td>\n", " <td>7</td>\n",
" <td>1049.93</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>9997</th>\n", " <th>9997</th>\n",
...@@ -2566,6 +2595,7 @@ ...@@ -2566,6 +2595,7 @@
" <td>długopis</td>\n", " <td>długopis</td>\n",
" <td>1.49</td>\n", " <td>1.49</td>\n",
" <td>87</td>\n", " <td>87</td>\n",
" <td>129.63</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>9998</th>\n", " <th>9998</th>\n",
...@@ -2576,6 +2606,7 @@ ...@@ -2576,6 +2606,7 @@
" <td>biurko</td>\n", " <td>biurko</td>\n",
" <td>149.99</td>\n", " <td>149.99</td>\n",
" <td>10</td>\n", " <td>10</td>\n",
" <td>1499.90</td>\n",
" </tr>\n", " </tr>\n",
" <tr>\n", " <tr>\n",
" <th>9999</th>\n", " <th>9999</th>\n",
...@@ -2586,10 +2617,11 @@ ...@@ -2586,10 +2617,11 @@
" <td>gąbka</td>\n", " <td>gąbka</td>\n",
" <td>4.00</td>\n", " <td>4.00</td>\n",
" <td>152</td>\n", " <td>152</td>\n",
" <td>608.00</td>\n",
" </tr>\n", " </tr>\n",
" </tbody>\n", " </tbody>\n",
"</table>\n", "</table>\n",
"<p>10000 rows × 7 columns</p>\n", "<p>10000 rows × 8 columns</p>\n",
"</div>" "</div>"
], ],
"text/plain": [ "text/plain": [
...@@ -2606,23 +2638,23 @@ ...@@ -2606,23 +2638,23 @@
"9998 2015-05-01 Kraków Kozłowski meble biurko 149.99 \n", "9998 2015-05-01 Kraków Kozłowski meble biurko 149.99 \n",
"9999 2016-08-26 Kraków Kozłowski wyposażenie szkolne gąbka 4.00 \n", "9999 2016-08-26 Kraków Kozłowski wyposażenie szkolne gąbka 4.00 \n",
"\n", "\n",
" sztuk \n", " sztuk wartosc \n",
"0 4 \n", "0 4 599.96 \n",
"1 2 \n", "1 2 1180.00 \n",
"2 51 \n", "2 51 50.49 \n",
"3 250 \n", "3 250 1000.00 \n",
"4 9 \n", "4 9 1349.91 \n",
"... ... \n", "... ... ... \n",
"9995 178 \n", "9995 178 1335.00 \n",
"9996 7 \n", "9996 7 1049.93 \n",
"9997 87 \n", "9997 87 129.63 \n",
"9998 10 \n", "9998 10 1499.90 \n",
"9999 152 \n", "9999 152 608.00 \n",
"\n", "\n",
"[10000 rows x 7 columns]" "[10000 rows x 8 columns]"
] ]
}, },
"execution_count": 65, "execution_count": 67,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
...@@ -2632,9 +2664,168 @@ ...@@ -2632,9 +2664,168 @@
] ]
}, },
{ {
"cell_type": "markdown",
"id": "baf91a7e-573e-4335-91b2-041cd147df3f",
"metadata": {},
"source": [
"### Exercises:\n",
"1. Compute the total transaction value for the whole file"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "a8dc418a-d810-4cdc-8d01-4ba730318706",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8049567.3"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz.wartosc.sum()"
]
},
{
"cell_type": "markdown",
"id": "505a1ff6-1ce6-4d6a-b818-8ae3b31b00c1",
"metadata": {},
"source": [
"2. Compute the total transaction value in Katowice"
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "b284f678-9e34-4cfe-a215-92527a366ed9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1456316.08"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz[sprzedaz.miasto == 'Katowice'].wartosc.sum()"
]
},
{
"cell_type": "markdown",
"id": "dd64c472-980a-488d-b529-1103b78f92c4",
"metadata": {},
"source": [
"3. Compute the number of transactions and the total value (and, if you can, the total number of units) for the product `biurko` in Katowice"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "bf3ce0ea-4885-4e87-aaac-abc30aaf2b29",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 248.0\n",
"sum 391473.9\n",
"Name: wartosc, dtype: float64"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz[(sprzedaz.towar == 'biurko') & (sprzedaz.miasto == 'Katowice')].wartosc.agg(['count', 'sum'])"
]
},
{
"cell_type": "markdown",
"id": "753d3b1d-64b7-4d46-8498-dffe519e2819",
"metadata": {},
"source": [
"The `agg` operation can also be applied to a `DataFrame` by passing a **dictionary** that specifies which functions to compute for which columns. Function and column combinations that were not requested show up as `NaN` in the result."
]
},
{
"cell_type": "code",
"execution_count": 75,
"id": "a77019c4-8e4d-4065-b99b-3495e5da0ca5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sztuk</th>\n",
" <th>wartosc</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>sum</th>\n",
" <td>2610.0</td>\n",
" <td>391473.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>NaN</td>\n",
" <td>248.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sztuk wartosc\n",
"sum 2610.0 391473.9\n",
"count NaN 248.0"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz[(sprzedaz.towar == 'biurko') & (sprzedaz.miasto == 'Katowice')].agg({'sztuk': ['sum'], 'wartosc': ['count', 'sum']})"
]
},
{
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"id": "6554eb49-a75f-4a99-bee3-22077bc36162", "id": "147e601a-49b2-4c20-96d3-e65b98acbca7",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [] "source": []
......