{ "cells": [ { "cell_type": "markdown", "id": "7ee4ca52", "metadata": {}, "source": [ "# Pandas常用操作\n", "\n", "上一章的內容主要介紹如何擷取資料的基本資訊，例如資料的維度、大略地看資料內容、統計值、遺漏值、欄位有無異常值等等。\n", "\n", "本節開始要介紹對DataFrame或是欄位的一些操作及轉換。\n", "\n", "同樣使用相同的資料，首先讀入資料：" ] }, { "cell_type": "code", "execution_count": 1, "id": "ce23df29", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('./data/credit_customers.csv')" ] }, { "cell_type": "markdown", "id": "4eff1ad3", "metadata": {}, "source": [ "## 基本操作\n", "\n", "**改變欄位名稱**\n", "\n", "由於其中一個欄位名稱\"class\"與python關鍵字相同，故建議是更改名稱，避免後許使用的困擾。\n", "\n", "此外，剛好該欄位是該資料集用來預測是否違約的標籤，因此可以命名為\"label\"。" ] }, { "cell_type": "code", "execution_count": 2, "id": "43ac8314", "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	...	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	label
0	<0	6.0	critical/other existing credit	radio/tv	1169.0	no known savings	>=7	4.0	male single	NaN	...	real estate	67.0	NaN	own	2.0	skilled	1.0	yes	yes	good
1	0<=X<200	48.0	existing paid	radio/tv	5951.0	<100	1<=X<4	2.0	female div/dep/mar	NaN	...	real estate	22.0	NaN	own	1.0	skilled	1.0	NaN	yes	bad
2	no checking	12.0	critical/other existing credit	education	2096.0	<100	4<=X<7	2.0	male single	NaN	...	real estate	49.0	NaN	own	1.0	unskilled resident	2.0	NaN	yes	good
3	<0	42.0	existing paid	furniture/equipment	7882.0	<100	4<=X<7	2.0	male single	guarantor	...	life insurance	45.0	NaN	for free	1.0	skilled	2.0	NaN	yes	good
4	<0	24.0	delayed previously	new car	4870.0	<100	1<=X<4	3.0	male single	NaN	...	no known property	53.0	NaN	for free	2.0	skilled	2.0	NaN	yes	bad
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	no checking	12.0	existing paid	furniture/equipment	1736.0	<100	4<=X<7	3.0	female div/dep/mar	NaN	...	real estate	31.0	NaN	own	1.0	unskilled resident	1.0	NaN	yes	good
996	<0	30.0	existing paid	used car	3857.0	<100	1<=X<4	4.0	male div/sep	NaN	...	life insurance	40.0	NaN	own	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
997	no checking	12.0	existing paid	radio/tv	804.0	<100	>=7	4.0	male single	NaN	...	car	38.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
998	<0	45.0	existing paid	radio/tv	1845.0	<100	1<=X<4	4.0	male single	NaN	...	no known property	23.0	NaN	for free	1.0	skilled	1.0	yes	yes	bad
999	0<=X<200	45.0	critical/other existing credit	used car	4576.0	100<=X<500	unemployed	3.0	male single	NaN	...	car	27.0	NaN	own	1.0	skilled	1.0	NaN	yes	good

\n", "

1000 rows × 21 columns

\n", "

" ], "text/plain": [ " checking_status duration credit_history \\\n", "0 <0 6.0 critical/other existing credit \n", "1 0<=X<200 48.0 existing paid \n", "2 no checking 12.0 critical/other existing credit \n", "3 <0 42.0 existing paid \n", "4 <0 24.0 delayed previously \n", ".. ... ... ... \n", "995 no checking 12.0 existing paid \n", "996 <0 30.0 existing paid \n", "997 no checking 12.0 existing paid \n", "998 <0 45.0 existing paid \n", "999 0<=X<200 45.0 critical/other existing credit \n", "\n", " purpose credit_amount savings_status employment \\\n", "0 radio/tv 1169.0 no known savings >=7 \n", "1 radio/tv 5951.0 <100 1<=X<4 \n", "2 education 2096.0 <100 4<=X<7 \n", "3 furniture/equipment 7882.0 <100 4<=X<7 \n", "4 new car 4870.0 <100 1<=X<4 \n", ".. ... ... ... ... \n", "995 furniture/equipment 1736.0 <100 4<=X<7 \n", "996 used car 3857.0 <100 1<=X<4 \n", "997 radio/tv 804.0 <100 >=7 \n", "998 radio/tv 1845.0 <100 1<=X<4 \n", "999 used car 4576.0 100<=X<500 unemployed \n", "\n", " installment_commitment personal_status other_parties ... \\\n", "0 4.0 male single NaN ... \n", "1 2.0 female div/dep/mar NaN ... \n", "2 2.0 male single NaN ... \n", "3 2.0 male single guarantor ... \n", "4 3.0 male single NaN ... \n", ".. ... ... ... ... \n", "995 3.0 female div/dep/mar NaN ... \n", "996 4.0 male div/sep NaN ... \n", "997 4.0 male single NaN ... \n", "998 4.0 male single NaN ... \n", "999 3.0 male single NaN ... \n", "\n", " property_magnitude age other_payment_plans housing existing_credits \\\n", "0 real estate 67.0 NaN own 2.0 \n", "1 real estate 22.0 NaN own 1.0 \n", "2 real estate 49.0 NaN own 1.0 \n", "3 life insurance 45.0 NaN for free 1.0 \n", "4 no known property 53.0 NaN for free 2.0 \n", ".. ... ... ... ... ... \n", "995 real estate 31.0 NaN own 1.0 \n", "996 life insurance 40.0 NaN own 1.0 \n", "997 car 38.0 NaN own 1.0 \n", "998 no known property 23.0 NaN for free 1.0 \n", "999 car 27.0 NaN own 1.0 \n", "\n", " job num_dependents own_telephone foreign_worker \\\n", "0 skilled 1.0 yes yes \n", "1 skilled 1.0 NaN yes \n", "2 unskilled resident 2.0 NaN yes \n", "3 skilled 2.0 NaN yes \n", "4 skilled 2.0 NaN yes \n", ".. ... ... ... ... \n", "995 unskilled resident 1.0 NaN yes \n", "996 high qualif/self emp/mgmt 1.0 yes yes \n", "997 skilled 1.0 NaN yes \n", "998 skilled 1.0 yes yes \n", "999 skilled 1.0 NaN yes \n", "\n", " label \n", "0 good \n", "1 bad \n", "2 good \n", "3 good \n", "4 bad \n", ".. ... \n", "995 good \n", "996 good \n", "997 good \n", "998 bad \n", "999 good \n", "\n", "[1000 rows x 21 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.rename(columns={'class': 'label'})" ] }, { "cell_type": "markdown", "id": "41dde3d4", "metadata": {}, "source": [ "注意到，這邊結果是回傳一個DataFrame，並不會改變原本DataFrame的內容：" ] }, { "cell_type": "code", "execution_count": 3, "id": "eb55b774", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['checking_status', 'duration', 'credit_history', 'purpose',\n", " 'credit_amount', 'savings_status', 'employment',\n", " 'installment_commitment', 'personal_status', 'other_parties',\n", " 'residence_since', 'property_magnitude', 'age', 'other_payment_plans',\n", " 'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',\n", " 'foreign_worker', 'class'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 該欄位的名稱仍為\"class\"\n", "df.columns" ] }, { "cell_type": "markdown", "id": "5c01f312", "metadata": {}, "source": [ "如果要直接修改的話必須帶```inplace=True```參數：" ] }, { "cell_type": "code", "execution_count": 4, "id": "da16c61b", "metadata": {}, "outputs": [], "source": [ "df.rename(columns={'class': 'label'}, inplace=True)" ] }, { "cell_type": "markdown", "id": "36170e51", "metadata": {}, "source": [ "發現到若加上```inplace=True```參數後，執行該語法並不會回傳任何東西，而是對DataFrame的內容直接做修改。" ] }, { "cell_type": "code", "execution_count": 5, "id": "8348fda5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['checking_status', 'duration', 'credit_history', 'purpose',\n", " 'credit_amount', 'savings_status', 'employment',\n", " 'installment_commitment', 'personal_status', 'other_parties',\n", " 'residence_since', 'property_magnitude', 'age', 'other_payment_plans',\n", " 'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',\n", " 'foreign_worker', 'label'],\n", " dtype='object')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "id": "8085cb22", "metadata": {}, "source": [ "```inplace=True```在很多DataFrame的方法中都有。\n", "\n", "然而，實務上並不建議直接使用```inplace=True```，原因是這會直接修改資料，若要復原操作，就要反向再執行一次，或是重新讀入DataFrame，造成時間上會付出較大的代價。建議是在有明確理由的情況下使用。\n", "\n", "比較好的做法是，將結果賦值到新變數中。一來程式的易讀性較高（賦值的動作明確表達出有新資料產出），二來若想修改為其他名稱也只需要在這個步驟重來就好。" ] }, { "cell_type": "code", "execution_count": 6, "id": "0b209262", "metadata": {}, "outputs": [], "source": [ "df_processed = df.rename(columns={'class': 'label'})" ] }, { "cell_type": "markdown", "id": "ed7a669b", "metadata": {}, "source": [ "但缺點是會造成記憶體空間佔用以及需要命名新的物件，所以需要一般會是配合幾個相關的處理需求，利用chaining或是```.pipe()```方法一起執行(後面章節會再詳細說明)。\n", "\n", "**填補遺漏值**\n", "\n", "可以用以下語法填補資料的遺漏值，但須注意資料格式，建議用相同格式：" ] }, { "cell_type": "code", "execution_count": 7, "id": "9beda6cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 other\n", "1 other\n", "2 other\n", "3 other\n", "4 other\n", " ... \n", "995 other\n", "996 other\n", "997 other\n", "998 other\n", "999 other\n", "Name: other_payment_plans, Length: 1000, dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['other_payment_plans'].fillna('other')" ] }, { "cell_type": "markdown", "id": "f6b375ff", "metadata": {}, "source": [ "一樣，要加上```inplace=True```才會直接改變資料內容：" ] }, { "cell_type": "code", "execution_count": 8, "id": "32ef9cd3", "metadata": {}, "outputs": [], "source": [ "df['other_payment_plans'].fillna('other', inplace=True)" ] }, { "cell_type": "code", "execution_count": 9, "id": "7c2358aa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['other_payment_plans'].isna().sum()" ] }, { "cell_type": "markdown", "id": "d065e648", "metadata": {}, "source": [ "**欄位值轉換**\n", "\n", "後續若要進行建模，通常會需要把文字的欄位值轉換爲數字才能夠丟給模型。\n", "\n", "例如我們想把label的值，從good/bad改為0/1，可以把轉換的對應關係儲存成一個字典，然後搭配```.map()```方法：" ] }, { "cell_type": "code", "execution_count": 10, "id": "5c48d323", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 0\n", "3 0\n", "4 1\n", " ..\n", "995 0\n", "996 0\n", "997 0\n", "998 1\n", "999 0\n", "Name: label, Length: 1000, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mapper = {\n", " 'good': 0,\n", " 'bad': 1\n", "}\n", "\n", "df['label'].map(mapper)" ] }, { "cell_type": "markdown", "id": "7b73dae5", "metadata": {}, "source": [ "注意，如果是mapper沒定義到的欄位值，```.map()```之後會轉為空值。" ] }, { "cell_type": "code", "execution_count": 11, "id": "23ba6b45", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 1.0\n", "2 NaN\n", "3 NaN\n", "4 1.0\n", " ... \n", "995 NaN\n", "996 NaN\n", "997 NaN\n", "998 1.0\n", "999 NaN\n", "Name: label, Length: 1000, dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mapper = {\n", " 'bad': 1\n", "}\n", "\n", "df['label'].map(mapper)" ] }, { "cell_type": "markdown", "id": "7c3231ea", "metadata": {}, "source": [ "```.map()```方法是回傳一個DataFrame，但沒有修改原始的DataFrame。而```.map()```方法也沒有```inplace```參數，通常的做法會是新建一個欄位。\n", "\n", "如何新建一個欄位？下一章會詳細說明，這邊先劇透其中一個方法：" ] }, { "cell_type": "code", "execution_count": 12, "id": "7e8f03b4", "metadata": {}, "outputs": [], "source": [ "df.loc[:, \"label_new\"] = df['label'].map(mapper)" ] }, { "cell_type": "code", "execution_count": 13, "id": "890e21a1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

	label	label_new
0	good	NaN
1	bad	1.0
2	good	NaN
3	good	NaN
4	bad	1.0
...	...	...
995	good	NaN
996	good	NaN
997	good	NaN
998	bad	1.0
999	good	NaN

\n", "

1000 rows × 2 columns

\n", "

" ], "text/plain": [ " label label_new\n", "0 good NaN\n", "1 bad 1.0\n", "2 good NaN\n", "3 good NaN\n", "4 bad 1.0\n", ".. ... ...\n", "995 good NaN\n", "996 good NaN\n", "997 good NaN\n", "998 bad 1.0\n", "999 good NaN\n", "\n", "[1000 rows x 2 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.filter(like='label')" ] }, { "cell_type": "markdown", "id": "9afa0d2a", "metadata": {}, "source": [ "## 數值欄位操作\n", "\n", "針對數值型的欄位，可能會有一些組合運算，接下來介紹一些方法。\n", "\n", "**欄位運算**\n", "\n", "**加法**" ] }, { "cell_type": "code", "execution_count": 14, "id": "011dc1a0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 68.0\n", "1 23.0\n", "2 50.0\n", "3 46.0\n", "4 54.0\n", " ... \n", "995 32.0\n", "996 41.0\n", "997 39.0\n", "998 24.0\n", "999 28.0\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'] + 1" ] }, { "cell_type": "code", "execution_count": 15, "id": "973fc2c8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 68.0\n", "1 23.0\n", "2 50.0\n", "3 46.0\n", "4 54.0\n", " ... \n", "995 32.0\n", "996 41.0\n", "997 39.0\n", "998 24.0\n", "999 28.0\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'].add(1)" ] }, { "cell_type": "markdown", "id": "3a30a6cc", "metadata": {}, "source": [ "**減法**" ] }, { "cell_type": "code", "execution_count": 16, "id": "19ed14e6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 66.0\n", "1 21.0\n", "2 48.0\n", "3 44.0\n", "4 52.0\n", " ... \n", "995 30.0\n", "996 39.0\n", "997 37.0\n", "998 22.0\n", "999 26.0\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'] - 1" ] }, { "cell_type": "code", "execution_count": 17, "id": "c64435f8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 66.0\n", "1 21.0\n", "2 48.0\n", "3 44.0\n", "4 52.0\n", " ... \n", "995 30.0\n", "996 39.0\n", "997 37.0\n", "998 22.0\n", "999 26.0\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'].sub(1)" ] }, { "cell_type": "markdown", "id": "36848dfe", "metadata": {}, "source": [ "**乘法**" ] }, { "cell_type": "code", "execution_count": 18, "id": "00ff2765", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.04\n", "1 0.02\n", "2 0.02\n", "3 0.02\n", "4 0.03\n", " ... \n", "995 0.03\n", "996 0.04\n", "997 0.04\n", "998 0.04\n", "999 0.03\n", "Name: installment_commitment, Length: 1000, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['installment_commitment'] * 0.01" ] }, { "cell_type": "code", "execution_count": 19, "id": "fca4b393", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.04\n", "1 0.02\n", "2 0.02\n", "3 0.02\n", "4 0.03\n", " ... \n", "995 0.03\n", "996 0.04\n", "997 0.04\n", "998 0.04\n", "999 0.03\n", "Name: installment_commitment, Length: 1000, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['installment_commitment'].mul(0.01)" ] }, { "cell_type": "markdown", "id": "48d478c3", "metadata": {}, "source": [ "**除法**" ] }, { "cell_type": "code", "execution_count": 20, "id": "5759302c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.089552\n", "1 2.181818\n", "2 0.244898\n", "3 0.933333\n", "4 0.452830\n", " ... \n", "995 0.387097\n", "996 0.750000\n", "997 0.315789\n", "998 1.956522\n", "999 1.666667\n", "Length: 1000, dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['duration'] / df['age']" ] }, { "cell_type": "code", "execution_count": 21, "id": "749a8e74", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.089552\n", "1 2.181818\n", "2 0.244898\n", "3 0.933333\n", "4 0.452830\n", " ... \n", "995 0.387097\n", "996 0.750000\n", "997 0.315789\n", "998 1.956522\n", "999 1.666667\n", "Length: 1000, dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['duration'].div(df['age'])" ] }, { "cell_type": "markdown", "id": "0ba6cf78", "metadata": {}, "source": [ "**取商數**" ] }, { "cell_type": "code", "execution_count": 22, "id": "f2c1f839", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5.0\n", "1 1.0\n", "2 4.0\n", "3 3.0\n", "4 4.0\n", " ... \n", "995 2.0\n", "996 3.0\n", "997 3.0\n", "998 1.0\n", "999 2.0\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'] // 12" ] }, { "cell_type": "code", "execution_count": 23, "id": "3e7c68c8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5.0\n", "1 1.0\n", "2 4.0\n", "3 3.0\n", "4 4.0\n", " ... \n", "995 2.0\n", "996 3.0\n", "997 3.0\n", "998 1.0\n", "999 2.0\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'].floordiv(12)" ] }, { "cell_type": "markdown", "id": "89d346d2", "metadata": {}, "source": [ "**取餘數**" ] }, { "cell_type": "code", "execution_count": 24, "id": "9a43ad16", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 7.0\n", "1 10.0\n", "2 1.0\n", "3 9.0\n", "4 5.0\n", " ... \n", "995 7.0\n", "996 4.0\n", "997 2.0\n", "998 11.0\n", "999 3.0\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'] % 12" ] }, { "cell_type": "code", "execution_count": 25, "id": "5af3bf41", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 7.0\n", "1 10.0\n", "2 1.0\n", "3 9.0\n", "4 5.0\n", " ... \n", "995 7.0\n", "996 4.0\n", "997 2.0\n", "998 11.0\n", "999 3.0\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'].mod(12)" ] }, { "cell_type": "markdown", "id": "59a4a8fa", "metadata": {}, "source": [ "**進位運算**\n", "\n", "括弧內的數字代表進位到第幾位。\n", "\n", "注意到這邊rounding的行為是\"四捨五入到最接近的偶數\"，5.5捨去小數點到整數會是6，4.5捨去小數點到整數會是4。這種rounding又稱作是round-to-even。\n", "\n", "參考：\n", "\n", "1. [python - Strange behavior of numpy.round - Stack Overflow](https://stackoverflow.com/questions/45021268/strange-behavior-of-numpy-round)。\n", "2. [算錢學問大 | iThome](https://www.ithome.com.tw/voice/112663)" ] }, { "cell_type": "code", "execution_count": 26, "id": "befcf469", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.1\n", "1 2.2\n", "2 0.2\n", "3 0.9\n", "4 0.5\n", " ... \n", "995 0.4\n", "996 0.8\n", "997 0.3\n", "998 2.0\n", "999 1.7\n", "Name: relative_duration, Length: 1000, dtype: float64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:, 'relative_duration'] = df['duration'] / df['age']\n", "\n", "df['relative_duration'].round(1)" ] }, { "cell_type": "markdown", "id": "2b6da2d7", "metadata": {}, "source": [ "**取次方**\n", "\n", "括弧內可指定次方項。" ] }, { "cell_type": "code", "execution_count": 27, "id": "917d0c8e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.008020\n", "1 4.760331\n", "2 0.059975\n", "3 0.871111\n", "4 0.205055\n", " ... \n", "995 0.149844\n", "996 0.562500\n", "997 0.099723\n", "998 3.827977\n", "999 2.777778\n", "Name: relative_duration, Length: 1000, dtype: float64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 取平方\n", "\n", "df['relative_duration'].pow(2)" ] }, { "cell_type": "markdown", "id": "ebf39cab", "metadata": {}, "source": [ "**取絕對值**" ] }, { "cell_type": "code", "execution_count": 28, "id": "c153abe8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 31.454\n", "1 -13.546\n", "2 13.454\n", "3 9.454\n", "4 17.454\n", " ... \n", "995 -4.546\n", "996 4.454\n", "997 2.454\n", "998 -12.546\n", "999 -8.546\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'].sub(df['age'].mean())" ] }, { "cell_type": "code", "execution_count": 29, "id": "b2cd80f7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 31.454\n", "1 13.546\n", "2 13.454\n", "3 9.454\n", "4 17.454\n", " ... \n", "995 4.546\n", "996 4.454\n", "997 2.454\n", "998 12.546\n", "999 8.546\n", "Name: age, Length: 1000, dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'].sub(df['age'].mean()).abs()" ] }, { "cell_type": "markdown", "id": "038e680c", "metadata": {}, "source": [ "**限定值的範圍**\n", "\n", "第一個參數是下界，第二個參數是上界。" ] }, { "cell_type": "code", "execution_count": 30, "id": "64c71076", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.089552\n", "1 2.000000\n", "2 0.244898\n", "3 0.933333\n", "4 0.452830\n", " ... \n", "995 0.387097\n", "996 0.750000\n", "997 0.315789\n", "998 1.956522\n", "999 1.666667\n", "Name: relative_duration, Length: 1000, dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['relative_duration'].clip(0.05, 2)" ] }, { "cell_type": "markdown", "id": "4f981e4c", "metadata": {}, "source": [ "**數值離散化**\n", "\n", "有時要將數值型變數切割，會比較方便做一些描述統計或視覺化分析。\n", "\n", "其中參數```bins```是據以切割的數值，```right=False```代表左邊是```[```，而右邊是```)```。\n", "\n", "以下面的例子來看，第一個bin會是：0 ≤ age < 20，第二個bin則是：20 ≤ age < 40……以此類推。" ] }, { "cell_type": "code", "execution_count": 31, "id": "f2f97b88", "metadata": {}, "outputs": [], "source": [ "df.loc[:, \"age_bins\"] = pd.cut(df['age'], bins=[0, 20, 40, 60, 80], right=False)" ] }, { "cell_type": "code", "execution_count": 32, "id": "774e3f1a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age_bins\n", "[0, 20) 2\n", "[20, 40) 699\n", "[40, 60) 248\n", "[60, 80) 51\n", "Name: count, dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age_bins'].value_counts().sort_index()" ] }, { "cell_type": "markdown", "id": "3060973a", "metadata": {}, "source": [ "另外一種方式則是直接透過資料的百分位數來做切割。" ] }, { "cell_type": "code", "execution_count": 33, "id": "1442f069", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (52.0, 75.0]\n", "1 (18.999, 23.0]\n", "2 (45.0, 52.0]\n", "3 (39.0, 45.0]\n", "4 (52.0, 75.0]\n", " ... \n", "995 (30.0, 33.0]\n", "996 (39.0, 45.0]\n", "997 (36.0, 39.0]\n", "998 (18.999, 23.0]\n", "999 (26.0, 28.0]\n", "Name: age, Length: 1000, dtype: category\n", "Categories (10, interval[float64, right]): [(18.999, 23.0] < (23.0, 26.0] < (26.0, 28.0] < (28.0, 30.0] ... (36.0, 39.0] < (39.0, 45.0] < (45.0, 52.0] < (52.0, 75.0]]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.qcut(df['age'], q=10)" ] }, { "cell_type": "markdown", "id": "61d39710", "metadata": {}, "source": [ "## 字串欄位操作\n", "\n", "以下範例來自pandas官方API文件：\n", "\n", "**字串截取**" ] }, { "cell_type": "code", "execution_count": 34, "id": "af30fbe8", "metadata": {}, "outputs": [], "source": [ "s = pd.Series([\"koala\", \"dog\", \"chameleon\"])" ] }, { "cell_type": "code", "execution_count": 35, "id": "7ae7a0c7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 oala\n", "1 og\n", "2 hameleon\n", "dtype: object" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.str.slice(start=1) # = s.str[1:]" ] }, { "cell_type": "code", "execution_count": 36, "id": "95359e52", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 g\n", "2 n\n", "dtype: object" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.str.slice(start=-1) # = s.str[-1:]" ] }, { "cell_type": "code", "execution_count": 37, "id": "0df329f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 ko\n", "1 do\n", "2 ch\n", "dtype: object" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.str.slice(stop=2) # = s.str[:2]" ] }, { "cell_type": "code", "execution_count": 38, "id": "507edf2d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 kaa\n", "1 dg\n", "2 caeen\n", "dtype: object" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.str.slice(step=2) # = s.str[::2]" ] }, { "cell_type": "code", "execution_count": 39, "id": "b5fe8bc9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 kl\n", "1 d\n", "2 cm\n", "dtype: object" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.str.slice(start=0, stop=5, step=3) # = s.str[0:5:3]" ] }, { "cell_type": "markdown", "id": "9ec67a54", "metadata": {}, "source": [ "**判斷字串是否存在**" ] }, { "cell_type": "code", "execution_count": 40, "id": "4093691d", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.nan])" ] }, { "cell_type": "markdown", "id": "7f6b07ec", "metadata": {}, "source": [ "判斷特定字串是否包含在值當中" ] }, { "cell_type": "code", "execution_count": 41, "id": "358e56f7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 False\n", "4 NaN\n", "dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.str.contains('og')" ] }, { "cell_type": "markdown", "id": "94ac426c", "metadata": {}, "source": [ "設定參數```na```代表當遇到空值要填入什麼值，下面設定填入```False```" ] }, { "cell_type": "code", "execution_count": 42, "id": "828aa279", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 False\n", "4 False\n", "dtype: bool" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.str.contains('og', na=False)" ] }, { "cell_type": "markdown", "id": "a6e2ae1e", "metadata": {}, "source": [ "以下用法稱作**正規表達式（regular expression）**，正規表達式專門處理字串，但內容頗多，附上資源供自行參考。" ] }, { "cell_type": "code", "execution_count": 43, "id": "6e2c309f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 True\n", "3 False\n", "4 NaN\n", "dtype: object" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.str.contains('house|dog', regex=True)" ] }, { "cell_type": "code", "execution_count": 44, "id": "9f329699", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 True\n", "4 NaN\n", "dtype: object" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.str.contains('\\\\d', regex=True)" ] }, { "cell_type": "markdown", "id": "1bc49e54", "metadata": {}, "source": [ "參考：\n", "\n", "[The Ultimate Guide to using the Python regex module](https://towardsdatascience.com/the-ultimate-guide-to-using-the-python-regex-module-69aad9e9ba56)\n", "\n", "[Regular Expressions: Regexes in Python (Part 1) – Real Python](https://realpython.com/regex-python/)\n", "\n", "[Regular Expressions: Regexes in Python (Part 2) – Real Python](https://realpython.com/regex-python-part-2/)\n", "\n", "**字串取代**" ] }, { "cell_type": "code", "execution_count": 45, "id": "d0bc7c78", "metadata": {}, "outputs": [], "source": [ "s = pd.Series(['foo', 'fuz', np.nan]).str.replace('f.', 'ba', regex=True)" ] }, { "cell_type": "markdown", "id": "30271c07", "metadata": {}, "source": [ "可以單純將字串替換成別的字串。" ] }, { "cell_type": "code", "execution_count": 46, "id": "7efc7feb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 bao\n", "1 baz\n", "2 NaN\n", "dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.str.replace('f', 'b')" ] }, { "cell_type": "markdown", "id": "e9f39edd", "metadata": {}, "source": [ "也可以使用正規表達式，```.```代表任一字元，所以\"f\" 以及\"f\"後面1個字元被取代成\"ba\"。" ] }, { "cell_type": "code", "execution_count": 47, "id": "d0fa6a8f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 bao\n", "1 baz\n", "2 NaN\n", "dtype: object" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.str.replace('f.', 'ba', regex=True)" ] }, { "cell_type": "markdown", "id": "b1a12d41", "metadata": {}, "source": [ "## 時間欄位操作\n", "\n", "**生成時間序列**\n", "\n", "可以透過指定起始與結束日期來建立連續的日期序列：" ] }, { "cell_type": "code", "execution_count": 48, "id": "1cbc60a9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2023-11-29', '2023-11-30', '2023-12-01', '2023-12-02',\n", " '2023-12-03', '2023-12-04', '2023-12-05', '2023-12-06',\n", " '2023-12-07'],\n", " dtype='datetime64[ns]', freq='D')" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.date_range(start='20231129', end='20231207')" ] }, { "cell_type": "markdown", "id": "079791f4", "metadata": {}, "source": [ "也可以只指定起始日期，然後利用```periods```參數指定長度：" ] }, { "cell_type": "code", "execution_count": 49, "id": "8301b531", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2023-11-29', '2023-11-30', '2023-12-01', '2023-12-02',\n", " '2023-12-03', '2023-12-04', '2023-12-05', '2023-12-06',\n", " '2023-12-07'],\n", " dtype='datetime64[ns]', freq='D')" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.date_range(start='20231129', periods=9)" ] }, { "cell_type": "markdown", "id": "4cbd1786", "metadata": {}, "source": [ "也可以使用參數```freq=\"M\"```：" ] }, { "cell_type": "code", "execution_count": 50, "id": "9637104c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2023-01-31', '2023-02-28', '2023-03-31', '2023-04-30',\n", " '2023-05-31', '2023-06-30', '2023-07-31', '2023-08-31',\n", " '2023-09-30', '2023-10-31', '2023-11-30', '2023-12-31'],\n", " dtype='datetime64[ns]', freq='M')" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.date_range(start='20230101', periods=12, freq='M')" ] }, { "cell_type": "markdown", "id": "160f0a83", "metadata": {}, "source": [ "但結果會是月底。可改用```DateOffset```物件來處理：" ] }, { "cell_type": "code", "execution_count": 51, "id": "087a8d18", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01',\n", " '2023-05-01', '2023-06-01', '2023-07-01', '2023-08-01',\n", " '2023-09-01', '2023-10-01', '2023-11-01', '2023-12-01'],\n", " dtype='datetime64[ns]', freq='')" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pandas.tseries.offsets import DateOffset\n", "\n", "pd.date_range(start='20230101', periods=12, freq=DateOffset(months=1))" ] }, { "cell_type": "markdown", "id": "479f021a", "metadata": {}, "source": [ "若是要固定隔n天(或n週, n秒等等)，可以用```Timedelta```物件來處理：" ] }, { "cell_type": "code", "execution_count": 52, "id": "d53e7f2d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2023-01-01', '2023-01-16', '2023-01-31', '2023-02-15',\n", " '2023-03-02', '2023-03-17', '2023-04-01', '2023-04-16',\n", " '2023-05-01', '2023-05-16', '2023-05-31', '2023-06-15',\n", " '2023-06-30', '2023-07-15', '2023-07-30', '2023-08-14',\n", " '2023-08-29', '2023-09-13', '2023-09-28', '2023-10-13',\n", " '2023-10-28', '2023-11-12', '2023-11-27', '2023-12-12',\n", " '2023-12-27'],\n", " dtype='datetime64[ns]', freq='15D')" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.date_range(start='20230101', end='20231231', freq=pd.Timedelta(days=15))" ] }, { "cell_type": "markdown", "id": "7dd0ab45", "metadata": {}, "source": [ "**計算日期**\n", "\n", "利用```Timedelta```物件來處理：" ] }, { "cell_type": "code", "execution_count": 53, "id": "99a366fb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('1990-11-30 00:00:00')" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"19930103\", format='%Y%m%d') - pd.Timedelta(days=765)" ] }, { "cell_type": "markdown", "id": "f3c6c7ef", "metadata": {}, "source": [ "```DateOffset```物件同樣可以做到：" ] }, { "cell_type": "code", "execution_count": 54, "id": "c04a504a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timestamp('1990-11-30 00:00:00')" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(\"19930103\", format='%Y%m%d') - DateOffset(days=765)" ] }, { "cell_type": "markdown", "id": "25a60925", "metadata": {}, "source": [ "**計算兩個日期之間的天數**\n", "\n", "其實就是直接相減就可以了：" ] }, { "cell_type": "code", "execution_count": 55, "id": "cf1258ff", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1390" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pd.to_datetime(\"20240314\", format='%Y%m%d') - \n", " pd.to_datetime(\"20200524\", format='%Y%m%d')).days" ] }, { "cell_type": "markdown", "id": "1f1c2b93", "metadata": {}, "source": [ "**時間轉文字**\n", "\n", "首先透過```pd.date_range```產生日期的時間序列。" ] }, { "cell_type": "code", "execution_count": 56, "id": "7ebdcd0a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DatetimeIndex(['2023-11-29', '2023-11-30', '2023-12-01', '2023-12-02',\n", " '2023-12-03', '2023-12-04', '2023-12-05', '2023-12-06',\n", " '2023-12-07'],\n", " dtype='datetime64[ns]', freq='D')\n" ] } ], "source": [ "dates = pd.date_range(start='20231129', end='20231207')\n", "\n", "print(dates)" ] }, { "cell_type": "markdown", "id": "0d0846bd", "metadata": {}, "source": [ "可以透過以下方法將時間轉成文字格式" ] }, { "cell_type": "code", "execution_count": 57, "id": "6d9a6882", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['20231129', '20231130', '20231201', '20231202', '20231203', '20231204',\n", " '20231205', '20231206', '20231207'],\n", " dtype='object')" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dates.strftime('%Y%m%d')" ] }, { "cell_type": "markdown", "id": "c2f2f502", "metadata": {}, "source": [ "**文字轉時間**\n", "\n", "先將上一步的結果借來用。" ] }, { "cell_type": "code", "execution_count": 58, "id": "2980943d", "metadata": {}, "outputs": [], "source": [ "dates = dates.strftime('%Y%m%d')" ] }, { "cell_type": "markdown", "id": "b414dcdf", "metadata": {}, "source": [ "需要使用的方法是```pd.to_datetime()```" ] }, { "cell_type": "code", "execution_count": 59, "id": "0814ee04", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2023-11-29', '2023-11-30', '2023-12-01', '2023-12-02',\n", " '2023-12-03', '2023-12-04', '2023-12-05', '2023-12-06',\n", " '2023-12-07'],\n", " dtype='datetime64[ns]', freq=None)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(dates, format='%Y%m%d')" ] }, { "cell_type": "markdown", "id": "71595234", "metadata": {}, "source": [ "這邊時間日期的表示法與python內建的表示法相同，可參考前面的章節。" ] } ], "metadata": { "jupytext": { "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.16.0" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" }, "source_map": [ 13, 23, 27, 37, 39, 43, 46, 50, 52, 56, 58, 66, 68, 76, 78, 82, 86, 88, 96, 103, 107, 113, 119, 123, 125, 135, 139, 141, 145, 149, 151, 155, 159, 161, 165, 169, 171, 175, 179, 181, 185, 189, 191, 204, 208, 214, 218, 222, 226, 228, 234, 236, 246, 250, 252, 256, 258, 266, 270, 274, 278, 282, 286, 288, 292, 296, 300, 302, 306, 308, 312, 316, 318, 330, 332, 336, 338, 342, 344, 352, 354, 358, 360, 364, 366, 370, 374, 378, 380, 386, 388, 392, 394, 400, 403, 409, 413, 417, 419, 425, 427, 431, 433 ] }, "nbformat": 4, "nbformat_minor": 5 }