Commit adcbf13b by Patryk Czarnik

start of data analysis

parent 36cce81b
{
"cells": [
{
"cell_type": "markdown",
"id": "413742af-1218-433a-bea3-6684adb0517a",
"metadata": {},
"source": [
"For \"business data analysis\" we will use the `pandas` library. We also import `numpy`, although it is usually not needed. By convention, the imported modules are given the short aliases `pd` and `np`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "833db353-396e-46a5-ba91-695ad955799a",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "176ff729-bf4e-4c94-a002-a557fc26ed4f",
"metadata": {},
"source": [
"## Loading data\n",
"\n",
"We usually start by loading data from an external source. Most often this is a CSV file, but pandas can also read Excel files and many other formats, and fetch data from SQL databases."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "518a1c12-6c15-40aa-984d-96723ce0a770",
"metadata": {},
"outputs": [],
"source": [
"# this loads the data in the basic way\n",
"# emps = pd.read_csv('pliki/emps.csv', sep=';')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f24a4023-5669-4411-8ffe-528d1cbf6276",
"metadata": {},
"outputs": [],
"source": [
"# now we add further options: employee_id becomes the index, hire_date is parsed as a date\n",
"emps = pd.read_csv('pliki/emps.csv', sep=';', index_col='employee_id', parse_dates=['hire_date'])"
]
},
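{
"cell_type": "markdown",
"id": "added-sketch-read-sql-note",
"metadata": {},
"source": [
"The introduction to this section mentions that pandas can also fetch data from SQL databases. Below is a minimal sketch of that, assuming an in-memory SQLite database. The table `demo`, its columns and its two rows are made up for illustration and are not part of the course files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-sketch-read-sql-code",
"metadata": {},
"outputs": [],
"source": [
"import sqlite3\n",
"import pandas as pd\n",
"\n",
"# hypothetical in-memory database standing in for a real SQL source\n",
"conn = sqlite3.connect(':memory:')\n",
"conn.execute('CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)')\n",
"conn.executemany('INSERT INTO demo VALUES (?, ?, ?)', [(1, 'Ann', 5000), (2, 'Bob', 6000)])\n",
"conn.commit()\n",
"\n",
"# read the query result into a DataFrame, using the id column as the index\n",
"demo = pd.read_sql_query('SELECT * FROM demo', conn, index_col='id')\n",
"conn.close()\n",
"demo"
]
},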
{
"cell_type": "code",
"execution_count": 7,
"id": "d1df7179-f3f2-4c0d-b26b-5cad84a0097a",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>first_name</th>\n",
" <th>last_name</th>\n",
" <th>job_title</th>\n",
" <th>salary</th>\n",
" <th>hire_date</th>\n",
" <th>department_name</th>\n",
" <th>address</th>\n",
" <th>postal_code</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" </tr>\n",
" <tr>\n",
" <th>employee_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>Steven</td>\n",
" <td>King</td>\n",
" <td>President</td>\n",
" <td>24000</td>\n",
" <td>1997-06-17</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>Neena</td>\n",
" <td>Kochhar</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>1999-09-21</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>Lex</td>\n",
" <td>De Haan</td>\n",
" <td>Administration Vice President</td>\n",
" <td>17000</td>\n",
" <td>2003-01-13</td>\n",
" <td>Executive</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>Alexander</td>\n",
" <td>Hunold</td>\n",
" <td>Programmer</td>\n",
" <td>9000</td>\n",
" <td>2000-01-03</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>Bruce</td>\n",
" <td>Ernst</td>\n",
" <td>Programmer</td>\n",
" <td>6000</td>\n",
" <td>2001-05-21</td>\n",
" <td>IT</td>\n",
" <td>2014 Jabberwocky Rd</td>\n",
" <td>26192</td>\n",
" <td>Southlake</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>Pat</td>\n",
" <td>Fay</td>\n",
" <td>Marketing Representative</td>\n",
" <td>6000</td>\n",
" <td>2007-08-17</td>\n",
" <td>Marketing</td>\n",
" <td>147 Spadina Ave</td>\n",
" <td>M5V 2L7</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>Susan</td>\n",
" <td>Mavris</td>\n",
" <td>Human Resources Representative</td>\n",
" <td>6500</td>\n",
" <td>2004-06-07</td>\n",
" <td>Human Resources</td>\n",
" <td>8204 Arthur St</td>\n",
" <td>NaN</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>Hermann</td>\n",
" <td>Baer</td>\n",
" <td>Public Relations Representative</td>\n",
" <td>10000</td>\n",
" <td>2004-06-07</td>\n",
" <td>Public Relations</td>\n",
" <td>Schwanthalerstr. 7031</td>\n",
" <td>80925</td>\n",
" <td>Munich</td>\n",
" <td>Germany</td>\n",
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>Shelley</td>\n",
" <td>Higgins</td>\n",
" <td>Accounting Manager</td>\n",
" <td>12000</td>\n",
" <td>2004-06-07</td>\n",
" <td>Accounting</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>William</td>\n",
" <td>Gietz</td>\n",
" <td>Public Accountant</td>\n",
" <td>8300</td>\n",
" <td>2004-06-07</td>\n",
" <td>Accounting</td>\n",
" <td>2004 Charade Rd</td>\n",
" <td>98199</td>\n",
" <td>Seattle</td>\n",
" <td>United States of America</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>107 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" first_name last_name job_title salary \\\n",
"employee_id \n",
"100 Steven King President 24000 \n",
"101 Neena Kochhar Administration Vice President 17000 \n",
"102 Lex De Haan Administration Vice President 17000 \n",
"103 Alexander Hunold Programmer 9000 \n",
"104 Bruce Ernst Programmer 6000 \n",
"... ... ... ... ... \n",
"202 Pat Fay Marketing Representative 6000 \n",
"203 Susan Mavris Human Resources Representative 6500 \n",
"204 Hermann Baer Public Relations Representative 10000 \n",
"205 Shelley Higgins Accounting Manager 12000 \n",
"206 William Gietz Public Accountant 8300 \n",
"\n",
" hire_date department_name address postal_code \\\n",
"employee_id \n",
"100 1997-06-17 Executive 2004 Charade Rd 98199 \n",
"101 1999-09-21 Executive 2004 Charade Rd 98199 \n",
"102 2003-01-13 Executive 2004 Charade Rd 98199 \n",
"103 2000-01-03 IT 2014 Jabberwocky Rd 26192 \n",
"104 2001-05-21 IT 2014 Jabberwocky Rd 26192 \n",
"... ... ... ... ... \n",
"202 2007-08-17 Marketing 147 Spadina Ave M5V 2L7 \n",
"203 2004-06-07 Human Resources 8204 Arthur St NaN \n",
"204 2004-06-07 Public Relations Schwanthalerstr. 7031 80925 \n",
"205 2004-06-07 Accounting 2004 Charade Rd 98199 \n",
"206 2004-06-07 Accounting 2004 Charade Rd 98199 \n",
"\n",
" city country \n",
"employee_id \n",
"100 Seattle United States of America \n",
"101 Seattle United States of America \n",
"102 Seattle United States of America \n",
"103 Southlake United States of America \n",
"104 Southlake United States of America \n",
"... ... ... \n",
"202 Toronto Canada \n",
"203 London United Kingdom \n",
"204 Munich Germany \n",
"205 Seattle United States of America \n",
"206 Seattle United States of America \n",
"\n",
"[107 rows x 10 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps"
]
},
{
"cell_type": "markdown",
"id": "806bcbce-a21a-49a2-8348-e6f4f3462756",
"metadata": {},
"source": [
"Let's load a second file as well..."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d5947655-9698-42ad-996f-23d4271792c6",
"metadata": {},
"outputs": [],
"source": [
"sprzedaz = pd.read_csv('pliki/sprzedaz.csv', sep=',', parse_dates=['data'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "30fc9f3c-630b-4014-881c-de1aec07cb6a",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>data</th>\n",
" <th>miasto</th>\n",
" <th>sklep</th>\n",
" <th>kategoria</th>\n",
" <th>towar</th>\n",
" <th>cena</th>\n",
" <th>sztuk</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2014-11-23</td>\n",
" <td>Łódź</td>\n",
" <td>Wdowiak</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2017-05-07</td>\n",
" <td>Radom</td>\n",
" <td>Czarnecki</td>\n",
" <td>wyposażenie szkolne</td>\n",
" <td>tablica</td>\n",
" <td>590.00</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2017-05-05</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>szkolno-biurowe</td>\n",
" <td>flamaster</td>\n",
" <td>0.99</td>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016-10-19</td>\n",
" <td>Kraków</td>\n",
" <td>Wróbel</td>\n",
" <td>wyposażenie szkolne</td>\n",
" <td>gąbka</td>\n",
" <td>4.00</td>\n",
" <td>250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016-04-08</td>\n",
" <td>Poznań</td>\n",
" <td>Borowik</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9995</th>\n",
" <td>2016-05-22</td>\n",
" <td>Katowice</td>\n",
" <td>Gaińska</td>\n",
" <td>szkolno-biurowe</td>\n",
" <td>dziurkacz</td>\n",
" <td>7.50</td>\n",
" <td>178</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9996</th>\n",
" <td>2016-11-19</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9997</th>\n",
" <td>2016-09-30</td>\n",
" <td>Łódź</td>\n",
" <td>Wdowiak</td>\n",
" <td>szkolno-biurowe</td>\n",
" <td>długopis</td>\n",
" <td>1.49</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9998</th>\n",
" <td>2015-05-01</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>meble</td>\n",
" <td>biurko</td>\n",
" <td>149.99</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9999</th>\n",
" <td>2016-08-26</td>\n",
" <td>Kraków</td>\n",
" <td>Kozłowski</td>\n",
" <td>wyposażenie szkolne</td>\n",
" <td>gąbka</td>\n",
" <td>4.00</td>\n",
" <td>152</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10000 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" data miasto sklep kategoria towar cena \\\n",
"0 2014-11-23 Łódź Wdowiak meble biurko 149.99 \n",
"1 2017-05-07 Radom Czarnecki wyposażenie szkolne tablica 590.00 \n",
"2 2017-05-05 Kraków Kozłowski szkolno-biurowe flamaster 0.99 \n",
"3 2016-10-19 Kraków Wróbel wyposażenie szkolne gąbka 4.00 \n",
"4 2016-04-08 Poznań Borowik meble biurko 149.99 \n",
"... ... ... ... ... ... ... \n",
"9995 2016-05-22 Katowice Gaińska szkolno-biurowe dziurkacz 7.50 \n",
"9996 2016-11-19 Kraków Kozłowski meble biurko 149.99 \n",
"9997 2016-09-30 Łódź Wdowiak szkolno-biurowe długopis 1.49 \n",
"9998 2015-05-01 Kraków Kozłowski meble biurko 149.99 \n",
"9999 2016-08-26 Kraków Kozłowski wyposażenie szkolne gąbka 4.00 \n",
"\n",
" sztuk \n",
"0 4 \n",
"1 2 \n",
"2 51 \n",
"3 250 \n",
"4 9 \n",
"... ... \n",
"9995 178 \n",
"9996 7 \n",
"9997 87 \n",
"9998 10 \n",
"9999 152 \n",
"\n",
"[10000 rows x 7 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz"
]
},
{
"cell_type": "markdown",
"id": "6e55dad7-dc62-43a2-90e0-3199a86bba22",
"metadata": {},
"source": [
"## Data types etc.\n",
"\n",
"A table of data, the counterpart of an Excel sheet or a database table, is a `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "32594f3d-21a1-4bcd-8c53-f0ad1c09926b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(emps)"
]
},
{
"cell_type": "markdown",
"id": "30dad998-30cd-4f10-8787-9b00e4d667b8",
"metadata": {},
"source": [
"A single column, a \"data series\", is of type `Series`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "68dab7ee-4ab8-44c0-a432-9576cfba60ec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(emps.last_name)"
]
},
{
"cell_type": "markdown",
"id": "dc3bb090-1949-4c02-9b28-8d03bdd9358d",
"metadata": {},
"source": [
"What types do the columns have?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "3e3d2466-d010-4c0a-8808-c68986e531cc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"data datetime64[ns]\n",
"miasto object\n",
"sklep object\n",
"kategoria object\n",
"towar object\n",
"cena float64\n",
"sztuk int64\n",
"dtype: object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz.dtypes"
]
},
{
"cell_type": "markdown",
"id": "3ecee7dc-6b81-4d06-9b33-86d26d35d3aa",
"metadata": {},
"source": [
"Column names:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "0bca2b80-503a-4ecd-8bcf-ca0424b8c56d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['data', 'miasto', 'sklep', 'kategoria', 'towar', 'cena', 'sztuk'], dtype='object')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz.columns"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b55585f4-c7bb-4b8a-abab-dc381049770e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'sklep'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sprzedaz.columns[2]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "af56bc84-5d80-44a8-88d4-72d79e8453fa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(107, 10)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.shape"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "3a411a55-e813-4037-9f25-7482c72a1c4f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1070"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emps.size"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "56e657a0-5c17-4741-bcb7-41cddc570993",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"107"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(emps)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94ae90c7-4bad-4f46-bb32-cbc3fcc97d8a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}