Intro to Pandas#
pandas 是使用Python進行資料分析,或是快速進行一些資料處理時必用的套件。
在小型專案中,可能也適合使用pandas來建立資料流程的管線。
但在較大型的專案中,資料量大或是資料處理流程的複雜度高的情況下,搭配使用其他工具可能就比較適合了(例如spark或是dbt),端看使用的情境以及專案團隊或組織的資源。
總而言之,pandas是使用Python進行資料分析必學的套件,就讓我們開始吧!
以下假設你已經建立好jupyter notebook的環境(也可以使用google colab),並且安裝好pandas套件。
什麼是 DataFrame?#
首先先來瞭解什麼是DataFrame。
DataFrame是pandas讀取資料表後的形式,跟excel資料表非常相似。
首先第一步是把pandas套件import進來。
import pandas as pd
定義一個DataFrame
# pd.Timestamp 是pandas建立時間資料的方法
# pd.Categorical 是pandas建立類別資料的方法。類別資料與字串不同,具有其他特性,之後會說明。
df = pd.DataFrame(
{
"col_1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"col_2": [i/10 for i in range(1,11)],
"col_3": ['a', 'i', 'u', 'e', 'o', 'a', 'e', 'i', 'o', 'u'],
"col_4": pd.Timestamp("20240101"),
"col_5": [True, False] * 5,
"col_6": pd.Categorical(["low"] * 5 + ["high"] * 5)
}
)
直接執行它,就會印出DataFrame的內容。
df # or display(df)
col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | |
---|---|---|---|---|---|---|
0 | 1 | 0.1 | a | 2024-01-01 | True | low |
1 | 2 | 0.2 | i | 2024-01-01 | False | low |
2 | 3 | 0.3 | u | 2024-01-01 | True | low |
3 | 4 | 0.4 | e | 2024-01-01 | False | low |
4 | 5 | 0.5 | o | 2024-01-01 | True | low |
5 | 6 | 0.6 | a | 2024-01-01 | False | high |
6 | 7 | 0.7 | e | 2024-01-01 | True | high |
7 | 8 | 0.8 | i | 2024-01-01 | False | high |
8 | 9 | 0.9 | o | 2024-01-01 | True | high |
9 | 10 | 1.0 | u | 2024-01-01 | False | high |
具有以下組成部分:
索引(index): 最左側數字0~1的部分
df.index
RangeIndex(start=0, stop=10, step=1)
欄位(column):上方col_1 ~ col_6的部分
df.columns
Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6'], dtype='object')
資料(data):資料內容本身。
df.values
array([[1, 0.1, 'a', Timestamp('2024-01-01 00:00:00'), True, 'low'],
[2, 0.2, 'i', Timestamp('2024-01-01 00:00:00'), False, 'low'],
[3, 0.3, 'u', Timestamp('2024-01-01 00:00:00'), True, 'low'],
[4, 0.4, 'e', Timestamp('2024-01-01 00:00:00'), False, 'low'],
[5, 0.5, 'o', Timestamp('2024-01-01 00:00:00'), True, 'low'],
[6, 0.6, 'a', Timestamp('2024-01-01 00:00:00'), False, 'high'],
[7, 0.7, 'e', Timestamp('2024-01-01 00:00:00'), True, 'high'],
[8, 0.8, 'i', Timestamp('2024-01-01 00:00:00'), False, 'high'],
[9, 0.9, 'o', Timestamp('2024-01-01 00:00:00'), True, 'high'],
[10, 1.0, 'u', Timestamp('2024-01-01 00:00:00'), False, 'high']],
dtype=object)
可以注意到印出的結果有array
的字樣,其實pandas是基於另外一個套件numpy而延伸出來的工具。
numpy也是一個python套件,提供大量的陣列運算功能,而且速度非常快。
numpy的基本資料結構被稱作array
。
事實上numpy array是一個可以容納多維的陣列,稱作ndarray
(n-dimensional array)。
type(df.values)
numpy.ndarray
建立一個numpy array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
Pandas DataFrame的每一欄具有特定的資料型態。
df.dtypes
col_1 int64
col_2 float64
col_3 object
col_4 datetime64[s]
col_5 bool
col_6 category
dtype: object
pandas資料型態大致如下:
Int64: integer
float64: float
object: strings (注意當資料型態混用的時候也會是object)
category: pandas特殊的資料型態,基本上也是string,但具有其他方便的特性。
bool: 布林值
datetime64[ns]: 時間
認識你的資料#
以下的資料內容取自在kaggle找到的資料集:credit_risk_customers (kaggle.com)。可以自行下載成csv檔。
這是一個跟銀行客戶信用風險相關的資料集,資料內容說明請見:OpenML。
首先透過以下語法讀入該csv檔:
import pandas as pd
df = pd.read_csv('./data/credit_customers.csv')
在進行資料分析的一開始,通常會先大致瞭解一下資料的長相。
快速查看DataFrame內容#
以下語法可以快速查看DataFrame的相關資訊:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 checking_status 1000 non-null object
1 duration 1000 non-null float64
2 credit_history 1000 non-null object
3 purpose 1000 non-null object
4 credit_amount 1000 non-null float64
5 savings_status 1000 non-null object
6 employment 1000 non-null object
7 installment_commitment 1000 non-null float64
8 personal_status 1000 non-null object
9 other_parties 93 non-null object
10 residence_since 1000 non-null float64
11 property_magnitude 1000 non-null object
12 age 1000 non-null float64
13 other_payment_plans 186 non-null object
14 housing 1000 non-null object
15 existing_credits 1000 non-null float64
16 job 1000 non-null object
17 num_dependents 1000 non-null float64
18 own_telephone 404 non-null object
19 foreign_worker 1000 non-null object
20 class 1000 non-null object
dtypes: float64(7), object(14)
memory usage: 164.2+ KB
查看DataFrame前幾筆資料,可以使用df.head()
,預設是顯示5筆,但也可指定筆數。
df.head(3)
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | NaN | ... | real estate | 67.0 | NaN | own | 2.0 | skilled | 1.0 | yes | yes | good |
1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | NaN | ... | real estate | 22.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | bad |
2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | ... | real estate | 49.0 | NaN | own | 1.0 | unskilled resident | 2.0 | NaN | yes | good |
3 rows × 21 columns
反過來就是:
df.tail()
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
995 | no checking | 12.0 | existing paid | furniture/equipment | 1736.0 | <100 | 4<=X<7 | 3.0 | female div/dep/mar | NaN | ... | real estate | 31.0 | NaN | own | 1.0 | unskilled resident | 1.0 | NaN | yes | good |
996 | <0 | 30.0 | existing paid | used car | 3857.0 | <100 | 1<=X<4 | 4.0 | male div/sep | NaN | ... | life insurance | 40.0 | NaN | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
997 | no checking | 12.0 | existing paid | radio/tv | 804.0 | <100 | >=7 | 4.0 | male single | NaN | ... | car | 38.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
998 | <0 | 45.0 | existing paid | radio/tv | 1845.0 | <100 | 1<=X<4 | 4.0 | male single | NaN | ... | no known property | 23.0 | NaN | for free | 1.0 | skilled | 1.0 | yes | yes | bad |
999 | 0<=X<200 | 45.0 | critical/other existing credit | used car | 4576.0 | 100<=X<500 | unemployed | 3.0 | male single | NaN | ... | car | 27.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
5 rows × 21 columns
若要隨機查看的話:
df.sample(3)
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
754 | no checking | 12.0 | delayed previously | repairs | 1555.0 | >=1000 | >=7 | 4.0 | male single | NaN | ... | no known property | 55.0 | NaN | for free | 2.0 | skilled | 2.0 | NaN | yes | bad |
815 | 0<=X<200 | 36.0 | delayed previously | new car | 7432.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | NaN | ... | life insurance | 54.0 | NaN | rent | 1.0 | skilled | 1.0 | NaN | yes | good |
366 | no checking | 18.0 | critical/other existing credit | used car | 3850.0 | <100 | 4<=X<7 | 3.0 | male single | NaN | ... | car | 27.0 | NaN | own | 2.0 | skilled | 1.0 | NaN | yes | good |
3 rows × 21 columns
注意到,由於資料的欄位數較多,實際顯示出來的欄位數較少,中間會以”…”的方式略過。
若想要查看完整的欄位,可以使用以下語法:
pd.set_option("display.max_columns", None)
意思是告訴pandas要顯示出全部的欄位數:
df.head()
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | NaN | 4.0 | real estate | 67.0 | NaN | own | 2.0 | skilled | 1.0 | yes | yes | good |
1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | NaN | 2.0 | real estate | 22.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | bad |
2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | 3.0 | real estate | 49.0 | NaN | own | 1.0 | unskilled resident | 2.0 | NaN | yes | good |
3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | 4.0 | life insurance | 45.0 | NaN | for free | 1.0 | skilled | 2.0 | NaN | yes | good |
4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | NaN | 4.0 | no known property | 53.0 | NaN | for free | 2.0 | skilled | 2.0 | NaN | yes | bad |
查看資料的維度:
df.shape
(1000, 21)
回傳的是一個tuple,第1個元素是index的長度,就是列(row)的數量,第2個元素是column的長度,就是欄(column)的數量。
若要快速得知資料的統計資訊:
df.describe()
duration | credit_amount | installment_commitment | residence_since | age | existing_credits | num_dependents | |
---|---|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 20.903000 | 3271.258000 | 2.973000 | 2.845000 | 35.546000 | 1.407000 | 1.155000 |
std | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086 |
min | 4.000000 | 250.000000 | 1.000000 | 1.000000 | 19.000000 | 1.000000 | 1.000000 |
25% | 12.000000 | 1365.500000 | 2.000000 | 2.000000 | 27.000000 | 1.000000 | 1.000000 |
50% | 18.000000 | 2319.500000 | 3.000000 | 3.000000 | 33.000000 | 1.000000 | 1.000000 |
75% | 24.000000 | 3972.250000 | 4.000000 | 4.000000 | 42.000000 | 2.000000 | 1.000000 |
max | 72.000000 | 18424.000000 | 4.000000 | 4.000000 | 75.000000 | 4.000000 | 2.000000 |
注意到只有數值型的資料才會列出。若要納入非數值型的資料的話:
df.describe(include='all')
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1000 | 1000.000000 | 1000 | 1000 | 1000.000000 | 1000 | 1000 | 1000.000000 | 1000 | 93 | 1000.000000 | 1000 | 1000.000000 | 186 | 1000 | 1000.000000 | 1000 | 1000.000000 | 404 | 1000 | 1000 |
unique | 4 | NaN | 5 | 10 | NaN | 5 | 5 | NaN | 4 | 2 | NaN | 4 | NaN | 2 | 3 | NaN | 4 | NaN | 1 | 2 | 2 |
top | no checking | NaN | existing paid | radio/tv | NaN | <100 | 1<=X<4 | NaN | male single | guarantor | NaN | car | NaN | bank | own | NaN | skilled | NaN | yes | yes | good |
freq | 394 | NaN | 530 | 280 | NaN | 603 | 339 | NaN | 548 | 52 | NaN | 332 | NaN | 139 | 713 | NaN | 630 | NaN | 404 | 963 | 700 |
mean | NaN | 20.903000 | NaN | NaN | 3271.258000 | NaN | NaN | 2.973000 | NaN | NaN | 2.845000 | NaN | 35.546000 | NaN | NaN | 1.407000 | NaN | 1.155000 | NaN | NaN | NaN |
std | NaN | 12.058814 | NaN | NaN | 2822.736876 | NaN | NaN | 1.118715 | NaN | NaN | 1.103718 | NaN | 11.375469 | NaN | NaN | 0.577654 | NaN | 0.362086 | NaN | NaN | NaN |
min | NaN | 4.000000 | NaN | NaN | 250.000000 | NaN | NaN | 1.000000 | NaN | NaN | 1.000000 | NaN | 19.000000 | NaN | NaN | 1.000000 | NaN | 1.000000 | NaN | NaN | NaN |
25% | NaN | 12.000000 | NaN | NaN | 1365.500000 | NaN | NaN | 2.000000 | NaN | NaN | 2.000000 | NaN | 27.000000 | NaN | NaN | 1.000000 | NaN | 1.000000 | NaN | NaN | NaN |
50% | NaN | 18.000000 | NaN | NaN | 2319.500000 | NaN | NaN | 3.000000 | NaN | NaN | 3.000000 | NaN | 33.000000 | NaN | NaN | 1.000000 | NaN | 1.000000 | NaN | NaN | NaN |
75% | NaN | 24.000000 | NaN | NaN | 3972.250000 | NaN | NaN | 4.000000 | NaN | NaN | 4.000000 | NaN | 42.000000 | NaN | NaN | 2.000000 | NaN | 1.000000 | NaN | NaN | NaN |
max | NaN | 72.000000 | NaN | NaN | 18424.000000 | NaN | NaN | 4.000000 | NaN | NaN | 4.000000 | NaN | 75.000000 | NaN | NaN | 4.000000 | NaN | 2.000000 | NaN | NaN | NaN |
檢查遺漏值#
大致瞭解資料長相之後,可以開始進一步確認資料的細節。資料是否有遺漏值?欄位值有哪些取值?欄位的最大最小值等統計資訊?等等。
首先先來看跟遺漏值相關的語法:
df.isna()
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | True | False | False |
2 | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | True | False | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | True | False | False |
4 | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | True | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | True | False | False |
996 | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False |
997 | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | True | False | False |
998 | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False |
999 | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | True | False | False |
1000 rows × 21 columns
以上這個語法會回傳每個欄位每一列是否為遺漏值的DataFrame,False代表該欄位值沒有遺漏值。True代表有遺漏值。
但通常我們關心的是哪些欄位有遺漏值?有多少筆資料有遺漏值?
所以會想要在把上面的資料根據欄位「加總」,因為True代表1,False代表0,加總後就可以知道遺漏值的數量。
可以這樣寫:
df.isna().sum()
checking_status 0
duration 0
credit_history 0
purpose 0
credit_amount 0
savings_status 0
employment 0
installment_commitment 0
personal_status 0
other_parties 907
residence_since 0
property_magnitude 0
age 0
other_payment_plans 814
housing 0
existing_credits 0
job 0
num_dependents 0
own_telephone 596
foreign_worker 0
class 0
dtype: int64
上面的語法結構值得說明一下,由於df.isnull()
回傳的是一個DataFrame,所以我們可以直接接著call其他DataFrame的方法。
而.sum()
是一個把DataFrame資料加總的方法,預設是by欄位加總。可以藉由調整參數``axis```來改變加總的方向。
不指定axis參數的話,預設是axis=0
,代表沿著axis 0, 把值一個一個加起來,所以是by欄位加總。
axis=1
代表by列加總,就可以知道哪幾筆資料有遺漏值。
df.isna().sum(axis=1)
0 2
1 3
2 3
3 2
4 3
..
995 3
996 2
997 3
998 2
999 3
Length: 1000, dtype: int64
如果不關心筆數,也可以使用.any()
,代表只要有一筆資料為True就會輸出True:
df.isna().any()
checking_status False
duration False
credit_history False
purpose False
credit_amount False
savings_status False
employment False
installment_commitment False
personal_status False
other_parties True
residence_since False
property_magnitude False
age False
other_payment_plans True
housing False
existing_credits False
job False
num_dependents False
own_telephone True
foreign_worker False
class False
dtype: bool
如何確認資料是否有任何遺漏值?
df.isna().any().any()
True
反向的語法是:
df.notna().sum()
checking_status 1000
duration 1000
credit_history 1000
purpose 1000
credit_amount 1000
savings_status 1000
employment 1000
installment_commitment 1000
personal_status 1000
other_parties 93
residence_since 1000
property_magnitude 1000
age 1000
other_payment_plans 186
housing 1000
existing_credits 1000
job 1000
num_dependents 1000
own_telephone 404
foreign_worker 1000
class 1000
dtype: int64
確認欄位值#
如果想知道特定欄位有哪些取值,通常是要初步看一下資料的內容,以及看欄位有沒有什麼異常值。
例如想知道客戶來申辦貸款的目的有哪些?可以對purpose這個欄位使用以下操作:
df['purpose'].unique()
array(['radio/tv', 'education', 'furniture/equipment', 'new car',
'used car', 'business', 'domestic appliance', 'repairs', 'other',
'retraining'], dtype=object)
就可以把資料中purpose實際上出現哪些不同值給列出來。
其中,df['purpose']
是篩選出欄位資料的語法。也可以使用:
df.purpose
0 radio/tv
1 radio/tv
2 education
3 furniture/equipment
4 new car
...
995 furniture/equipment
996 used car
997 radio/tv
998 radio/tv
999 used car
Name: purpose, Length: 1000, dtype: object
注意到.purpose
的語法事實上是把欄位當作是DataFrame的「屬性(attribute)」來取用,所以若欄位名稱與DataFrame原本就有的屬性重複的話會無法使用。
補充說明一下,這裡回傳的東西,其實是pandas Series。Series是pandas的另一個資料結構,DataFrame中每一個欄位都儲存為Series。
type(df.purpose)
pandas.core.series.Series
可以把Series理解為單一維度的numpy array,與numpy array不同的地方在於Series內含具有標籤的索引(index)。
s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
print(s)
print(s.index)
a 1
b 2
c 3
d 4
e 5
dtype: int64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
除了看哪些欄位取值,還可以直接算出該欄位有幾種不同取值:
df['purpose'].nunique()
10
或是每個取值有多少筆資料:
df['purpose'].value_counts()
purpose
radio/tv 280
new car 234
furniture/equipment 181
used car 103
business 97
education 50
repairs 22
domestic appliance 12
other 12
retraining 9
Name: count, dtype: int64
進一步看佔比:
df['purpose'].value_counts(normalize=True)
purpose
radio/tv 0.280
new car 0.234
furniture/equipment 0.181
used car 0.103
business 0.097
education 0.050
repairs 0.022
domestic appliance 0.012
other 0.012
retraining 0.009
Name: proportion, dtype: float64
注意到value_counts()
會忽略空值:
df['other_parties'].value_counts() # 明顯不到1000筆資料
other_parties
guarantor 52
co applicant 41
Name: count, dtype: int64
可以使用dropna=False
參數:
df['other_parties'].value_counts(dropna=False)
other_parties
NaN 907
guarantor 52
co applicant 41
Name: count, dtype: int64
針對數值型欄位可以使用bins
參數,放入要切分的組數。
這邊組別的切分是依據欄位值的全距去切指定的等份。
df['age'].value_counts(bins=7)
(27.0, 35.0] 297
(18.942999999999998, 27.0] 291
(35.0, 43.0] 194
(43.0, 51.0] 113
(51.0, 59.0] 54
(59.0, 67.0] 41
(67.0, 75.0] 10
Name: count, dtype: int64
如果要看個別欄位的前幾大或倒數前幾大的值是哪些(參數可指定列數):
df['duration'].nlargest()
677 72.0
29 60.0
134 60.0
255 60.0
332 60.0
Name: duration, dtype: float64
df['duration'].nsmallest()
234 4.0
314 4.0
527 4.0
734 4.0
837 4.0
Name: duration, dtype: float64
或是可以用排序的方法:
df['duration'].sort_values(ascending=False).head()
677 72.0
973 60.0
685 60.0
672 60.0
714 60.0
Name: duration, dtype: float64
df['duration'].sort_values().head()
314 4.0
734 4.0
527 4.0
943 4.0
837 4.0
Name: duration, dtype: float64
資料的統計資訊#
除了使用.describe()
快速查看資料統計數據外,還可以使用各種方法來依照需求做計算。
df['duration'].min()
4.0
df['duration'].max()
72.0
df['duration'].mean()
20.903
df['duration'].median()
18.0
df['duration'].var()
145.41500600600602
df['duration'].std()
12.058814452756375
df['duration'].quantile(0.8)
30.0
df['duration'].quantile([0.1,0.3,0.45,0.6,0.8,1])
0.10 9.0
0.30 12.0
0.45 18.0
0.60 24.0
0.80 30.0
1.00 72.0
Name: duration, dtype: float64
還可以很方便算多個欄位的共變數、相關係數:
df[['age', 'duration', 'credit_amount']].cov()
age | duration | credit_amount | |
---|---|---|---|
age | 129.401285 | -4.956995 | 1.050523e+03 |
duration | -4.956995 | 145.415006 | 2.127375e+04 |
credit_amount | 1050.522655 | 21273.749776 | 7.967843e+06 |
df[['age', 'duration', 'credit_amount']].corr()
age | duration | credit_amount | |
---|---|---|---|
age | 1.000000 | -0.036136 | 0.032716 |
duration | -0.036136 | 1.000000 | 0.624984 |
credit_amount | 0.032716 | 0.624984 | 1.000000 |
查看你的資料#
有了對資料的大致認識後,可能會想對資料的部分欄位或部份樣本做一些後續處理,例如填補遺漏值、對欄位做加工等等。
但在此之前需要先知道怎麼擷取你想要的特定資料。
抓特定欄位的資料#
首先第一種方法已經介紹過了,就是直接在DataFrame後面加[想要抓的欄位]
:
df['class']
0 good
1 bad
2 good
3 good
4 bad
...
995 good
996 good
997 good
998 bad
999 good
Name: class, Length: 1000, dtype: object
當要一次抓取多個欄位時,必須放入一個欄位名稱的list:
df[['age', 'duration', 'credit_amount']]
age | duration | credit_amount | |
---|---|---|---|
0 | 67.0 | 6.0 | 1169.0 |
1 | 22.0 | 48.0 | 5951.0 |
2 | 49.0 | 12.0 | 2096.0 |
3 | 45.0 | 42.0 | 7882.0 |
4 | 53.0 | 24.0 | 4870.0 |
... | ... | ... | ... |
995 | 31.0 | 12.0 | 1736.0 |
996 | 40.0 | 30.0 | 3857.0 |
997 | 38.0 | 12.0 | 804.0 |
998 | 23.0 | 45.0 | 1845.0 |
999 | 27.0 | 45.0 | 4576.0 |
1000 rows × 3 columns
另一種方法是使用.loc[]
,使用方式如下:
df.loc[:, 'class']
0 good
1 bad
2 good
3 good
4 bad
...
995 good
996 good
997 good
998 bad
999 good
Name: class, Length: 1000, dtype: object
多個欄位的話一樣是要放list:
df.loc[:, ['age', 'duration', 'credit_amount']]
age | duration | credit_amount | |
---|---|---|---|
0 | 67.0 | 6.0 | 1169.0 |
1 | 22.0 | 48.0 | 5951.0 |
2 | 49.0 | 12.0 | 2096.0 |
3 | 45.0 | 42.0 | 7882.0 |
4 | 53.0 | 24.0 | 4870.0 |
... | ... | ... | ... |
995 | 31.0 | 12.0 | 1736.0 |
996 | 40.0 | 30.0 | 3857.0 |
997 | 38.0 | 12.0 | 804.0 |
998 | 23.0 | 45.0 | 1845.0 |
999 | 27.0 | 45.0 | 4576.0 |
1000 rows × 3 columns
與.loc[]
類似的,iloc[]
也可以做到篩選,只是要指定的是「第幾個」欄位:
df.iloc[:, [12, 1, 4]]
age | duration | credit_amount | |
---|---|---|---|
0 | 67.0 | 6.0 | 1169.0 |
1 | 22.0 | 48.0 | 5951.0 |
2 | 49.0 | 12.0 | 2096.0 |
3 | 45.0 | 42.0 | 7882.0 |
4 | 53.0 | 24.0 | 4870.0 |
... | ... | ... | ... |
995 | 31.0 | 12.0 | 1736.0 |
996 | 40.0 | 30.0 | 3857.0 |
997 | 38.0 | 12.0 | 804.0 |
998 | 23.0 | 45.0 | 1845.0 |
999 | 27.0 | 45.0 | 4576.0 |
1000 rows × 3 columns
注意到:
的使用,可以一次指定多個欄位,使用方式跟list的index一樣:
df.iloc[:, 0:4]
checking_status | duration | credit_history | purpose | |
---|---|---|---|---|
0 | <0 | 6.0 | critical/other existing credit | radio/tv |
1 | 0<=X<200 | 48.0 | existing paid | radio/tv |
2 | no checking | 12.0 | critical/other existing credit | education |
3 | <0 | 42.0 | existing paid | furniture/equipment |
4 | <0 | 24.0 | delayed previously | new car |
... | ... | ... | ... | ... |
995 | no checking | 12.0 | existing paid | furniture/equipment |
996 | <0 | 30.0 | existing paid | used car |
997 | no checking | 12.0 | existing paid | radio/tv |
998 | <0 | 45.0 | existing paid | radio/tv |
999 | 0<=X<200 | 45.0 | critical/other existing credit | used car |
1000 rows × 4 columns
.loc[]
也可以使用:
,但就必須符合原本的欄位順序:
print(df.columns)
# 順序錯了,選不到東西
df.loc[:, "purpose":"duration"]
Index(['checking_status', 'duration', 'credit_history', 'purpose',
'credit_amount', 'savings_status', 'employment',
'installment_commitment', 'personal_status', 'other_parties',
'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
'foreign_worker', 'class'],
dtype='object')
0 |
---|
1 |
2 |
3 |
4 |
... |
995 |
996 |
997 |
998 |
999 |
1000 rows × 0 columns
df.loc[:, "duration":"purpose"]
duration | credit_history | purpose | |
---|---|---|---|
0 | 6.0 | critical/other existing credit | radio/tv |
1 | 48.0 | existing paid | radio/tv |
2 | 12.0 | critical/other existing credit | education |
3 | 42.0 | existing paid | furniture/equipment |
4 | 24.0 | delayed previously | new car |
... | ... | ... | ... |
995 | 12.0 | existing paid | furniture/equipment |
996 | 30.0 | existing paid | used car |
997 | 12.0 | existing paid | radio/tv |
998 | 45.0 | existing paid | radio/tv |
999 | 45.0 | critical/other existing credit | used car |
1000 rows × 3 columns
再來一種方法是使用.filter()
搭配items=[欄位名稱]
的參數:
df.filter(items=['age', 'duration', 'credit_amount'])
age | duration | credit_amount | |
---|---|---|---|
0 | 67.0 | 6.0 | 1169.0 |
1 | 22.0 | 48.0 | 5951.0 |
2 | 49.0 | 12.0 | 2096.0 |
3 | 45.0 | 42.0 | 7882.0 |
4 | 53.0 | 24.0 | 4870.0 |
... | ... | ... | ... |
995 | 31.0 | 12.0 | 1736.0 |
996 | 40.0 | 30.0 | 3857.0 |
997 | 38.0 | 12.0 | 804.0 |
998 | 23.0 | 45.0 | 1845.0 |
999 | 27.0 | 45.0 | 4576.0 |
1000 rows × 3 columns
根據資料型態抓欄位:
df.select_dtypes(include='float')
duration | credit_amount | installment_commitment | residence_since | age | existing_credits | num_dependents | |
---|---|---|---|---|---|---|---|
0 | 6.0 | 1169.0 | 4.0 | 4.0 | 67.0 | 2.0 | 1.0 |
1 | 48.0 | 5951.0 | 2.0 | 2.0 | 22.0 | 1.0 | 1.0 |
2 | 12.0 | 2096.0 | 2.0 | 3.0 | 49.0 | 1.0 | 2.0 |
3 | 42.0 | 7882.0 | 2.0 | 4.0 | 45.0 | 1.0 | 2.0 |
4 | 24.0 | 4870.0 | 3.0 | 4.0 | 53.0 | 2.0 | 2.0 |
... | ... | ... | ... | ... | ... | ... | ... |
995 | 12.0 | 1736.0 | 3.0 | 4.0 | 31.0 | 1.0 | 1.0 |
996 | 30.0 | 3857.0 | 4.0 | 4.0 | 40.0 | 1.0 | 1.0 |
997 | 12.0 | 804.0 | 4.0 | 4.0 | 38.0 | 1.0 | 1.0 |
998 | 45.0 | 1845.0 | 4.0 | 4.0 | 23.0 | 1.0 | 1.0 |
999 | 45.0 | 4576.0 | 3.0 | 4.0 | 27.0 | 1.0 | 1.0 |
1000 rows × 7 columns
根據欄位名稱的一些特徵來抓資料:
df.filter(like='_')
checking_status | credit_history | credit_amount | savings_status | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | other_payment_plans | existing_credits | num_dependents | own_telephone | foreign_worker | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | <0 | critical/other existing credit | 1169.0 | no known savings | 4.0 | male single | NaN | 4.0 | real estate | NaN | 2.0 | 1.0 | yes | yes |
1 | 0<=X<200 | existing paid | 5951.0 | <100 | 2.0 | female div/dep/mar | NaN | 2.0 | real estate | NaN | 1.0 | 1.0 | NaN | yes |
2 | no checking | critical/other existing credit | 2096.0 | <100 | 2.0 | male single | NaN | 3.0 | real estate | NaN | 1.0 | 2.0 | NaN | yes |
3 | <0 | existing paid | 7882.0 | <100 | 2.0 | male single | guarantor | 4.0 | life insurance | NaN | 1.0 | 2.0 | NaN | yes |
4 | <0 | delayed previously | 4870.0 | <100 | 3.0 | male single | NaN | 4.0 | no known property | NaN | 2.0 | 2.0 | NaN | yes |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | no checking | existing paid | 1736.0 | <100 | 3.0 | female div/dep/mar | NaN | 4.0 | real estate | NaN | 1.0 | 1.0 | NaN | yes |
996 | <0 | existing paid | 3857.0 | <100 | 4.0 | male div/sep | NaN | 4.0 | life insurance | NaN | 1.0 | 1.0 | yes | yes |
997 | no checking | existing paid | 804.0 | <100 | 4.0 | male single | NaN | 4.0 | car | NaN | 1.0 | 1.0 | NaN | yes |
998 | <0 | existing paid | 1845.0 | <100 | 4.0 | male single | NaN | 4.0 | no known property | NaN | 1.0 | 1.0 | yes | yes |
999 | 0<=X<200 | critical/other existing credit | 4576.0 | 100<=X<500 | 3.0 | male single | NaN | 4.0 | car | NaN | 1.0 | 1.0 | NaN | yes |
1000 rows × 14 columns
若要把DataFrame欄位根據指定的順序做排序的話,就必須先設定好欄位順序在list內,然後再把資料抓出來:
ordered_cols = ['class', 'credit_amount', 'age', 'duration', 'existing_credits']
df[ordered_cols]
class | credit_amount | age | duration | existing_credits | |
---|---|---|---|---|---|
0 | good | 1169.0 | 67.0 | 6.0 | 2.0 |
1 | bad | 5951.0 | 22.0 | 48.0 | 1.0 |
2 | good | 2096.0 | 49.0 | 12.0 | 1.0 |
3 | good | 7882.0 | 45.0 | 42.0 | 1.0 |
4 | bad | 4870.0 | 53.0 | 24.0 | 2.0 |
... | ... | ... | ... | ... | ... |
995 | good | 1736.0 | 31.0 | 12.0 | 1.0 |
996 | good | 3857.0 | 40.0 | 30.0 | 1.0 |
997 | good | 804.0 | 38.0 | 12.0 | 1.0 |
998 | bad | 1845.0 | 23.0 | 45.0 | 1.0 |
999 | good | 4576.0 | 27.0 | 45.0 | 1.0 |
1000 rows × 5 columns
抓特定列的資料#
直接根據列名稱去抓資料的情境雖然比較少,但一樣可以透過.loc[]
或.iloc[]
做到:
df.loc[7:10, :]
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 0<=X<200 | 36.0 | existing paid | used car | 6948.0 | <100 | 1<=X<4 | 2.0 | male single | NaN | 2.0 | car | 35.0 | NaN | rent | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
8 | no checking | 12.0 | existing paid | radio/tv | 3059.0 | >=1000 | 4<=X<7 | 2.0 | male div/sep | NaN | 4.0 | real estate | 61.0 | NaN | own | 1.0 | unskilled resident | 1.0 | NaN | yes | good |
9 | 0<=X<200 | 30.0 | critical/other existing credit | new car | 5234.0 | <100 | unemployed | 4.0 | male mar/wid | NaN | 2.0 | car | 28.0 | NaN | own | 2.0 | high qualif/self emp/mgmt | 1.0 | NaN | yes | bad |
10 | 0<=X<200 | 12.0 | existing paid | new car | 1295.0 | <100 | <1 | 3.0 | female div/dep/mar | NaN | 1.0 | car | 25.0 | NaN | rent | 1.0 | skilled | 1.0 | NaN | yes | bad |
在這邊,注意到因為index的順序跟label名稱一樣,但.loc[]
的行為是判斷label名稱,.iloc[]
則是類似list做indexing的概念,並不會包含到指定的最後一個index:
df.iloc[7:10, :]
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 0<=X<200 | 36.0 | existing paid | used car | 6948.0 | <100 | 1<=X<4 | 2.0 | male single | NaN | 2.0 | car | 35.0 | NaN | rent | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
8 | no checking | 12.0 | existing paid | radio/tv | 3059.0 | >=1000 | 4<=X<7 | 2.0 | male div/sep | NaN | 4.0 | real estate | 61.0 | NaN | own | 1.0 | unskilled resident | 1.0 | NaN | yes | good |
9 | 0<=X<200 | 30.0 | critical/other existing credit | new car | 5234.0 | <100 | unemployed | 4.0 | male mar/wid | NaN | 2.0 | car | 28.0 | NaN | own | 2.0 | high qualif/self emp/mgmt | 1.0 | NaN | yes | bad |
依條件判斷抓取資料#
最直接的方法,例如要抓age > 40以上的資料:
df[df['age'] > 40]
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | NaN | 4.0 | real estate | 67.0 | NaN | own | 2.0 | skilled | 1.0 | yes | yes | good |
2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | 3.0 | real estate | 49.0 | NaN | own | 1.0 | unskilled resident | 2.0 | NaN | yes | good |
3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | 4.0 | life insurance | 45.0 | NaN | for free | 1.0 | skilled | 2.0 | NaN | yes | good |
4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | NaN | 4.0 | no known property | 53.0 | NaN | for free | 2.0 | skilled | 2.0 | NaN | yes | bad |
6 | no checking | 24.0 | existing paid | furniture/equipment | 2835.0 | 500<=X<1000 | >=7 | 3.0 | male single | NaN | 4.0 | life insurance | 53.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
978 | no checking | 24.0 | delayed previously | new car | 2538.0 | <100 | >=7 | 4.0 | male single | NaN | 4.0 | car | 47.0 | NaN | own | 2.0 | unskilled resident | 2.0 | NaN | yes | bad |
980 | 0<=X<200 | 30.0 | critical/other existing credit | furniture/equipment | 8386.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | 2.0 | life insurance | 49.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | bad |
987 | no checking | 13.0 | existing paid | radio/tv | 1409.0 | 100<=X<500 | unemployed | 2.0 | female div/dep/mar | NaN | 4.0 | real estate | 64.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
989 | 0<=X<200 | 24.0 | critical/other existing credit | radio/tv | 1743.0 | <100 | >=7 | 4.0 | male single | NaN | 2.0 | life insurance | 48.0 | NaN | own | 2.0 | unskilled resident | 1.0 | NaN | yes | good |
994 | no checking | 12.0 | existing paid | new car | 2390.0 | no known savings | >=7 | 4.0 | male single | NaN | 3.0 | car | 50.0 | NaN | own | 1.0 | skilled | 1.0 | yes | yes | good |
274 rows × 21 columns
也可以利用.loc[]
:
df.loc[df['age'] > 40, :]
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | NaN | 4.0 | real estate | 67.0 | NaN | own | 2.0 | skilled | 1.0 | yes | yes | good |
2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | 3.0 | real estate | 49.0 | NaN | own | 1.0 | unskilled resident | 2.0 | NaN | yes | good |
3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | 4.0 | life insurance | 45.0 | NaN | for free | 1.0 | skilled | 2.0 | NaN | yes | good |
4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | NaN | 4.0 | no known property | 53.0 | NaN | for free | 2.0 | skilled | 2.0 | NaN | yes | bad |
6 | no checking | 24.0 | existing paid | furniture/equipment | 2835.0 | 500<=X<1000 | >=7 | 3.0 | male single | NaN | 4.0 | life insurance | 53.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
978 | no checking | 24.0 | delayed previously | new car | 2538.0 | <100 | >=7 | 4.0 | male single | NaN | 4.0 | car | 47.0 | NaN | own | 2.0 | unskilled resident | 2.0 | NaN | yes | bad |
980 | 0<=X<200 | 30.0 | critical/other existing credit | furniture/equipment | 8386.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | 2.0 | life insurance | 49.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | bad |
987 | no checking | 13.0 | existing paid | radio/tv | 1409.0 | 100<=X<500 | unemployed | 2.0 | female div/dep/mar | NaN | 4.0 | real estate | 64.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
989 | 0<=X<200 | 24.0 | critical/other existing credit | radio/tv | 1743.0 | <100 | >=7 | 4.0 | male single | NaN | 2.0 | life insurance | 48.0 | NaN | own | 2.0 | unskilled resident | 1.0 | NaN | yes | good |
994 | no checking | 12.0 | existing paid | new car | 2390.0 | no known savings | >=7 | 4.0 | male single | NaN | 3.0 | car | 50.0 | NaN | own | 1.0 | skilled | 1.0 | yes | yes | good |
274 rows × 21 columns
或是使用.query()
方法:
df.query('age > 40')
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | NaN | 4.0 | real estate | 67.0 | NaN | own | 2.0 | skilled | 1.0 | yes | yes | good |
2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | 3.0 | real estate | 49.0 | NaN | own | 1.0 | unskilled resident | 2.0 | NaN | yes | good |
3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | 4.0 | life insurance | 45.0 | NaN | for free | 1.0 | skilled | 2.0 | NaN | yes | good |
4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | NaN | 4.0 | no known property | 53.0 | NaN | for free | 2.0 | skilled | 2.0 | NaN | yes | bad |
6 | no checking | 24.0 | existing paid | furniture/equipment | 2835.0 | 500<=X<1000 | >=7 | 3.0 | male single | NaN | 4.0 | life insurance | 53.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
978 | no checking | 24.0 | delayed previously | new car | 2538.0 | <100 | >=7 | 4.0 | male single | NaN | 4.0 | car | 47.0 | NaN | own | 2.0 | unskilled resident | 2.0 | NaN | yes | bad |
980 | 0<=X<200 | 30.0 | critical/other existing credit | furniture/equipment | 8386.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | 2.0 | life insurance | 49.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | bad |
987 | no checking | 13.0 | existing paid | radio/tv | 1409.0 | 100<=X<500 | unemployed | 2.0 | female div/dep/mar | NaN | 4.0 | real estate | 64.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
989 | 0<=X<200 | 24.0 | critical/other existing credit | radio/tv | 1743.0 | <100 | >=7 | 4.0 | male single | NaN | 2.0 | life insurance | 48.0 | NaN | own | 2.0 | unskilled resident | 1.0 | NaN | yes | good |
994 | no checking | 12.0 | existing paid | new car | 2390.0 | no known savings | >=7 | 4.0 | male single | NaN | 3.0 | car | 50.0 | NaN | own | 1.0 | skilled | 1.0 | yes | yes | good |
274 rows × 21 columns
邏輯條件可以放多個:
df[(df['age'] > 40) & (df['duration'] > 50)]
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | <0 | 60.0 | delayed previously | business | 6836.0 | <100 | >=7 | 3.0 | male single | NaN | 4.0 | no known property | 63.0 | NaN | own | 2.0 | skilled | 1.0 | yes | yes | bad |
95 | 0<=X<200 | 54.0 | no credits/all paid | business | 15945.0 | <100 | <1 | 3.0 | male single | NaN | 4.0 | no known property | 58.0 | NaN | rent | 1.0 | skilled | 1.0 | yes | yes | bad |
373 | no checking | 60.0 | critical/other existing credit | new car | 13756.0 | no known savings | >=7 | 2.0 | male single | NaN | 4.0 | no known property | 63.0 | bank | for free | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
374 | 0<=X<200 | 60.0 | all paid | other | 14782.0 | 100<=X<500 | >=7 | 3.0 | female div/dep/mar | NaN | 4.0 | no known property | 60.0 | bank | for free | 2.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | bad |
672 | no checking | 60.0 | existing paid | new car | 10366.0 | <100 | >=7 | 2.0 | male single | NaN | 4.0 | life insurance | 42.0 | NaN | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
938 | 0<=X<200 | 60.0 | existing paid | education | 6288.0 | <100 | 1<=X<4 | 4.0 | male single | NaN | 4.0 | no known property | 42.0 | NaN | for free | 1.0 | skilled | 1.0 | NaN | yes | bad |
df.query('age > 40 & duration > 50')
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | <0 | 60.0 | delayed previously | business | 6836.0 | <100 | >=7 | 3.0 | male single | NaN | 4.0 | no known property | 63.0 | NaN | own | 2.0 | skilled | 1.0 | yes | yes | bad |
95 | 0<=X<200 | 54.0 | no credits/all paid | business | 15945.0 | <100 | <1 | 3.0 | male single | NaN | 4.0 | no known property | 58.0 | NaN | rent | 1.0 | skilled | 1.0 | yes | yes | bad |
373 | no checking | 60.0 | critical/other existing credit | new car | 13756.0 | no known savings | >=7 | 2.0 | male single | NaN | 4.0 | no known property | 63.0 | bank | for free | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
374 | 0<=X<200 | 60.0 | all paid | other | 14782.0 | 100<=X<500 | >=7 | 3.0 | female div/dep/mar | NaN | 4.0 | no known property | 60.0 | bank | for free | 2.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | bad |
672 | no checking | 60.0 | existing paid | new car | 10366.0 | <100 | >=7 | 2.0 | male single | NaN | 4.0 | life insurance | 42.0 | NaN | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
938 | 0<=X<200 | 60.0 | existing paid | education | 6288.0 | <100 | 1<=X<4 | 4.0 | male single | NaN | 4.0 | no known property | 42.0 | NaN | for free | 1.0 | skilled | 1.0 | NaN | yes | bad |
為了避免條件太多,造成閱讀困難,可以做一些前處理:
c1 = df['age'] > 40
c2 = df['duration'] > 50
c3 = df['credit_amount'] > 10000
cond = c1 & c2 & c3
df[cond]
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
95 | 0<=X<200 | 54.0 | no credits/all paid | business | 15945.0 | <100 | <1 | 3.0 | male single | NaN | 4.0 | no known property | 58.0 | NaN | rent | 1.0 | skilled | 1.0 | yes | yes | bad |
373 | no checking | 60.0 | critical/other existing credit | new car | 13756.0 | no known savings | >=7 | 2.0 | male single | NaN | 4.0 | no known property | 63.0 | bank | for free | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
374 | 0<=X<200 | 60.0 | all paid | other | 14782.0 | 100<=X<500 | >=7 | 3.0 | female div/dep/mar | NaN | 4.0 | no known property | 60.0 | bank | for free | 2.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | bad |
672 | no checking | 60.0 | existing paid | new car | 10366.0 | <100 | >=7 | 2.0 | male single | NaN | 4.0 | life insurance | 42.0 | NaN | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
或是搭配f-string以及.query()
:
c1 = 'age > 40'
c2 = 'duration > 50'
c3 = 'credit_amount > 10000'
cond = f'{c1} & {c2} & {c3}'
df.query(cond)
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
95 | 0<=X<200 | 54.0 | no credits/all paid | business | 15945.0 | <100 | <1 | 3.0 | male single | NaN | 4.0 | no known property | 58.0 | NaN | rent | 1.0 | skilled | 1.0 | yes | yes | bad |
373 | no checking | 60.0 | critical/other existing credit | new car | 13756.0 | no known savings | >=7 | 2.0 | male single | NaN | 4.0 | no known property | 63.0 | bank | for free | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
374 | 0<=X<200 | 60.0 | all paid | other | 14782.0 | 100<=X<500 | >=7 | 3.0 | female div/dep/mar | NaN | 4.0 | no known property | 60.0 | bank | for free | 2.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | bad |
672 | no checking | 60.0 | existing paid | new car | 10366.0 | <100 | >=7 | 2.0 | male single | NaN | 4.0 | life insurance | 42.0 | NaN | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
其他根據欄位值篩選,常用的方法有:
df[df['age'].between(30, 35)]
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | no checking | 36.0 | existing paid | education | 9055.0 | no known savings | 1<=X<4 | 2.0 | male single | NaN | 4.0 | no known property | 35.0 | NaN | for free | 1.0 | unskilled resident | 2.0 | yes | yes | good |
7 | 0<=X<200 | 36.0 | existing paid | used car | 6948.0 | <100 | 1<=X<4 | 2.0 | male single | NaN | 2.0 | car | 35.0 | NaN | rent | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
15 | <0 | 24.0 | existing paid | radio/tv | 1282.0 | 100<=X<500 | 1<=X<4 | 4.0 | female div/dep/mar | NaN | 2.0 | car | 32.0 | NaN | own | 1.0 | unskilled resident | 1.0 | NaN | yes | bad |
19 | no checking | 24.0 | existing paid | radio/tv | 3430.0 | 500<=X<1000 | >=7 | 3.0 | male single | NaN | 2.0 | car | 31.0 | NaN | own | 1.0 | skilled | 2.0 | yes | yes | good |
28 | 0<=X<200 | 7.0 | existing paid | radio/tv | 2415.0 | <100 | 1<=X<4 | 3.0 | male single | guarantor | 2.0 | real estate | 34.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
984 | no checking | 24.0 | critical/other existing credit | furniture/equipment | 2028.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | 2.0 | life insurance | 30.0 | NaN | own | 2.0 | unskilled resident | 1.0 | NaN | yes | good |
986 | >=200 | 42.0 | no credits/all paid | business | 6289.0 | <100 | <1 | 2.0 | male div/sep | NaN | 1.0 | life insurance | 33.0 | NaN | own | 2.0 | skilled | 1.0 | NaN | yes | good |
991 | no checking | 15.0 | all paid | radio/tv | 1569.0 | 100<=X<500 | >=7 | 4.0 | male single | NaN | 4.0 | car | 34.0 | bank | own | 1.0 | unskilled resident | 2.0 | NaN | yes | good |
993 | <0 | 36.0 | existing paid | furniture/equipment | 3959.0 | <100 | unemployed | 4.0 | male single | NaN | 3.0 | life insurance | 30.0 | NaN | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
995 | no checking | 12.0 | existing paid | furniture/equipment | 1736.0 | <100 | 4<=X<7 | 3.0 | female div/dep/mar | NaN | 4.0 | real estate | 31.0 | NaN | own | 1.0 | unskilled resident | 1.0 | NaN | yes | good |
217 rows × 21 columns
或是根據欄位值清單篩選:
df[df['purpose'].isin(['new car', 'used card'])]
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | NaN | 4.0 | no known property | 53.0 | NaN | for free | 2.0 | skilled | 2.0 | NaN | yes | bad |
9 | 0<=X<200 | 30.0 | critical/other existing credit | new car | 5234.0 | <100 | unemployed | 4.0 | male mar/wid | NaN | 2.0 | car | 28.0 | NaN | own | 2.0 | high qualif/self emp/mgmt | 1.0 | NaN | yes | bad |
10 | 0<=X<200 | 12.0 | existing paid | new car | 1295.0 | <100 | <1 | 3.0 | female div/dep/mar | NaN | 1.0 | car | 25.0 | NaN | rent | 1.0 | skilled | 1.0 | NaN | yes | bad |
13 | <0 | 24.0 | critical/other existing credit | new car | 1199.0 | <100 | >=7 | 4.0 | male single | NaN | 4.0 | car | 60.0 | NaN | own | 2.0 | unskilled resident | 1.0 | NaN | yes | bad |
14 | <0 | 15.0 | existing paid | new car | 1403.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | NaN | 4.0 | car | 28.0 | NaN | rent | 1.0 | skilled | 1.0 | NaN | yes | good |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
972 | <0 | 24.0 | all paid | new car | 1193.0 | <100 | unemployed | 1.0 | female div/dep/mar | co applicant | 4.0 | no known property | 29.0 | NaN | rent | 2.0 | unemp/unskilled non res | 1.0 | NaN | yes | bad |
978 | no checking | 24.0 | delayed previously | new car | 2538.0 | <100 | >=7 | 4.0 | male single | NaN | 4.0 | car | 47.0 | NaN | own | 2.0 | unskilled resident | 2.0 | NaN | yes | bad |
979 | 0<=X<200 | 15.0 | all paid | new car | 1264.0 | 100<=X<500 | 1<=X<4 | 2.0 | male mar/wid | NaN | 2.0 | life insurance | 25.0 | NaN | rent | 1.0 | skilled | 1.0 | NaN | yes | bad |
982 | >=200 | 21.0 | existing paid | new car | 2923.0 | 100<=X<500 | 1<=X<4 | 1.0 | female div/dep/mar | NaN | 1.0 | car | 28.0 | bank | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | good |
994 | no checking | 12.0 | existing paid | new car | 2390.0 | no known savings | >=7 | 4.0 | male single | NaN | 3.0 | car | 50.0 | NaN | own | 1.0 | skilled | 1.0 | yes | yes | good |
234 rows × 21 columns
排序資料#
要排序資料可以使用前面用過的.sort_values()
語法,參數by
指定欄位,ascending
指定方向:
df.sort_values(by='credit_amount', ascending=False)
checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
915 | 0<=X<200 | 48.0 | no credits/all paid | other | 18424.0 | <100 | 1<=X<4 | 1.0 | female div/dep/mar | NaN | 2.0 | life insurance | 32.0 | bank | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | no | bad |
95 | 0<=X<200 | 54.0 | no credits/all paid | business | 15945.0 | <100 | <1 | 3.0 | male single | NaN | 4.0 | no known property | 58.0 | NaN | rent | 1.0 | skilled | 1.0 | yes | yes | bad |
818 | <0 | 36.0 | existing paid | other | 15857.0 | <100 | unemployed | 2.0 | male div/sep | co applicant | 3.0 | car | 43.0 | NaN | own | 1.0 | high qualif/self emp/mgmt | 1.0 | NaN | yes | good |
887 | 0<=X<200 | 48.0 | existing paid | business | 15672.0 | <100 | 1<=X<4 | 2.0 | male single | NaN | 2.0 | car | 23.0 | NaN | own | 1.0 | skilled | 1.0 | yes | yes | bad |
637 | no checking | 60.0 | delayed previously | radio/tv | 15653.0 | <100 | 4<=X<7 | 2.0 | male single | NaN | 4.0 | car | 21.0 | NaN | own | 2.0 | skilled | 1.0 | yes | yes | good |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
458 | <0 | 6.0 | existing paid | domestic appliance | 343.0 | <100 | <1 | 4.0 | female div/dep/mar | NaN | 1.0 | real estate | 27.0 | NaN | own | 1.0 | skilled | 1.0 | NaN | yes | good |
157 | <0 | 12.0 | all paid | retraining | 339.0 | <100 | >=7 | 4.0 | male mar/wid | NaN | 1.0 | car | 45.0 | bank | own | 1.0 | unskilled resident | 1.0 | NaN | yes | good |
177 | <0 | 6.0 | critical/other existing credit | radio/tv | 338.0 | 500<=X<1000 | >=7 | 4.0 | male single | NaN | 4.0 | car | 52.0 | NaN | own | 2.0 | skilled | 1.0 | NaN | yes | good |
309 | 0<=X<200 | 9.0 | existing paid | new car | 276.0 | <100 | 1<=X<4 | 4.0 | male mar/wid | NaN | 4.0 | real estate | 22.0 | NaN | rent | 1.0 | unskilled resident | 1.0 | NaN | yes | good |
725 | no checking | 6.0 | critical/other existing credit | new car | 250.0 | >=1000 | 1<=X<4 | 2.0 | female div/dep/mar | NaN | 2.0 | real estate | 41.0 | bank | own | 2.0 | unskilled resident | 1.0 | NaN | yes | good |
1000 rows × 21 columns
另外,若要根據index排序,可以使用.sort_index()
。例如,想將資料根據年齡統計筆數後,查看年紀最小的幾個樣本數:
df['age'].value_counts().sort_index().head(10)
age
19.0 2
20.0 14
21.0 14
22.0 27
23.0 48
24.0 44
25.0 41
26.0 50
27.0 51
28.0 43
Name: count, dtype: int64