Intro to Pandas

Intro to Pandas#

pandas 是使用Python進行資料分析，或是快速進行一些資料處理時必用的套件。

在小型專案中，可能也適合使用pandas來建立資料流程的管線。

但在較大型的專案中，資料量大或是資料處理流程的複雜度高的情況下，搭配使用其他工具可能就比較適合了（例如spark或是dbt），端看使用的情境以及專案團隊或組織的資源。

總而言之，pandas是使用Python進行資料分析必學的套件，就讓我們開始吧！

以下假設你已經建立好jupyter notebook的環境（也可以使用google colab），並且安裝好pandas套件。

什麼是 DataFrame？#

首先先來瞭解什麼是DataFrame。

DataFrame是pandas讀取資料表後的形式，跟excel資料表非常相似。

首先第一步是把pandas套件import進來。

import pandas as pd

定義一個DataFrame

# pd.Timestamp 是pandas建立時間資料的方法
# pd.Categorical 是pandas建立類別資料的方法。類別資料與字串不同，具有其他特性，之後會說明。

df = pd.DataFrame(
    {
        "col_1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "col_2": [i/10 for i in range(1,11)],
        "col_3": ['a', 'i', 'u', 'e', 'o', 'a', 'e', 'i', 'o', 'u'],
        "col_4": pd.Timestamp("20240101"),
        "col_5": [True, False] * 5,
        "col_6": pd.Categorical(["low"] * 5 + ["high"] * 5)
    }
)

直接執行它，就會印出DataFrame的內容。

df # or display(df)

	col_1	col_2	col_3	col_4	col_5	col_6
0	1	0.1	a	2024-01-01	True	low
1	2	0.2	i	2024-01-01	False	low
2	3	0.3	u	2024-01-01	True	low
3	4	0.4	e	2024-01-01	False	low
4	5	0.5	o	2024-01-01	True	low
5	6	0.6	a	2024-01-01	False	high
6	7	0.7	e	2024-01-01	True	high
7	8	0.8	i	2024-01-01	False	high
8	9	0.9	o	2024-01-01	True	high
9	10	1.0	u	2024-01-01	False	high

具有以下組成部分：

索引(index)： 最左側數字0~1的部分

    df.index

RangeIndex(start=0, stop=10, step=1)

欄位(column)：上方col_1 ~ col_6的部分

df.columns

Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6'], dtype='object')

資料(data)：資料內容本身。

df.values

array([[1, 0.1, 'a', Timestamp('2024-01-01 00:00:00'), True, 'low'],
       [2, 0.2, 'i', Timestamp('2024-01-01 00:00:00'), False, 'low'],
       [3, 0.3, 'u', Timestamp('2024-01-01 00:00:00'), True, 'low'],
       [4, 0.4, 'e', Timestamp('2024-01-01 00:00:00'), False, 'low'],
       [5, 0.5, 'o', Timestamp('2024-01-01 00:00:00'), True, 'low'],
       [6, 0.6, 'a', Timestamp('2024-01-01 00:00:00'), False, 'high'],
       [7, 0.7, 'e', Timestamp('2024-01-01 00:00:00'), True, 'high'],
       [8, 0.8, 'i', Timestamp('2024-01-01 00:00:00'), False, 'high'],
       [9, 0.9, 'o', Timestamp('2024-01-01 00:00:00'), True, 'high'],
       [10, 1.0, 'u', Timestamp('2024-01-01 00:00:00'), False, 'high']],
      dtype=object)

可以注意到印出的結果有array的字樣，其實pandas是基於另外一個套件numpy而延伸出來的工具。

numpy也是一個python套件，提供大量的陣列運算功能，而且速度非常快。

numpy的基本資料結構被稱作array。

事實上numpy array是一個可以容納多維的陣列，稱作ndarray (n-dimensional array)。

type(df.values)

numpy.ndarray

建立一個numpy array：

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

Pandas DataFrame的每一欄具有特定的資料型態。

df.dtypes

col_1            int64
col_2          float64
col_3           object
col_4    datetime64[s]
col_5             bool
col_6         category
dtype: object

pandas資料型態大致如下：

Int64: integer
float64: float
object: strings (注意當資料型態混用的時候也會是object)
category: pandas特殊的資料型態，基本上也是string，但具有其他方便的特性。
bool: 布林值
datetime64[ns]: 時間

認識你的資料#

以下的資料內容取自在kaggle找到的資料集：credit_risk_customers (kaggle.com)。可以自行下載成csv檔。

這是一個跟銀行客戶信用風險相關的資料集，資料內容說明請見：OpenML。

首先透過以下語法讀入該csv檔：

import pandas as pd

df = pd.read_csv('./data/credit_customers.csv')

在進行資料分析的一開始，通常會先大致瞭解一下資料的長相。

快速查看DataFrame內容#

以下語法可以快速查看DataFrame的相關資訊：

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 checking_status         1000 non-null   object 
 duration                1000 non-null   float64
 credit_history          1000 non-null   object 
 purpose                 1000 non-null   object 
 credit_amount           1000 non-null   float64
 savings_status          1000 non-null   object 
 employment              1000 non-null   object 
 installment_commitment  1000 non-null   float64
 personal_status         1000 non-null   object 
 other_parties           93 non-null     object 
residence_since         1000 non-null   float64
property_magnitude      1000 non-null   object 
age                     1000 non-null   float64
other_payment_plans     186 non-null    object 
housing                 1000 non-null   object 
existing_credits        1000 non-null   float64
job                     1000 non-null   object 
num_dependents          1000 non-null   float64
own_telephone           404 non-null    object 
foreign_worker          1000 non-null   object 
class                   1000 non-null   object 
dtypes: float64(7), object(14)
memory usage: 164.2+ KB

查看DataFrame前幾筆資料，可以使用df.head()，預設是顯示5筆，但也可指定筆數。

df.head(3)

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	...	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
0	<0	6.0	critical/other existing credit	radio/tv	1169.0	no known savings	>=7	4.0	male single	NaN	...	real estate	67.0	NaN	own	2.0	skilled	1.0	yes	yes	good
1	0<=X<200	48.0	existing paid	radio/tv	5951.0	<100	1<=X<4	2.0	female div/dep/mar	NaN	...	real estate	22.0	NaN	own	1.0	skilled	1.0	NaN	yes	bad
2	no checking	12.0	critical/other existing credit	education	2096.0	<100	4<=X<7	2.0	male single	NaN	...	real estate	49.0	NaN	own	1.0	unskilled resident	2.0	NaN	yes	good

3 rows × 21 columns

反過來就是：

df.tail()

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	...	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
995	no checking	12.0	existing paid	furniture/equipment	1736.0	<100	4<=X<7	3.0	female div/dep/mar	NaN	...	real estate	31.0	NaN	own	1.0	unskilled resident	1.0	NaN	yes	good
996	<0	30.0	existing paid	used car	3857.0	<100	1<=X<4	4.0	male div/sep	NaN	...	life insurance	40.0	NaN	own	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
997	no checking	12.0	existing paid	radio/tv	804.0	<100	>=7	4.0	male single	NaN	...	car	38.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
998	<0	45.0	existing paid	radio/tv	1845.0	<100	1<=X<4	4.0	male single	NaN	...	no known property	23.0	NaN	for free	1.0	skilled	1.0	yes	yes	bad
999	0<=X<200	45.0	critical/other existing credit	used car	4576.0	100<=X<500	unemployed	3.0	male single	NaN	...	car	27.0	NaN	own	1.0	skilled	1.0	NaN	yes	good

5 rows × 21 columns

若要隨機查看的話：

df.sample(3)

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	...	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
754	no checking	12.0	delayed previously	repairs	1555.0	>=1000	>=7	4.0	male single	NaN	...	no known property	55.0	NaN	for free	2.0	skilled	2.0	NaN	yes	bad
815	0<=X<200	36.0	delayed previously	new car	7432.0	<100	1<=X<4	2.0	female div/dep/mar	NaN	...	life insurance	54.0	NaN	rent	1.0	skilled	1.0	NaN	yes	good
366	no checking	18.0	critical/other existing credit	used car	3850.0	<100	4<=X<7	3.0	male single	NaN	...	car	27.0	NaN	own	2.0	skilled	1.0	NaN	yes	good

3 rows × 21 columns

注意到，由於資料的欄位數較多，實際顯示出來的欄位數較少，中間會以”…”的方式略過。

若想要查看完整的欄位，可以使用以下語法：

pd.set_option("display.max_columns", None)

意思是告訴pandas要顯示出全部的欄位數：

df.head()

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
0	<0	6.0	critical/other existing credit	radio/tv	1169.0	no known savings	>=7	4.0	male single	NaN	4.0	real estate	67.0	NaN	own	2.0	skilled	1.0	yes	yes	good
1	0<=X<200	48.0	existing paid	radio/tv	5951.0	<100	1<=X<4	2.0	female div/dep/mar	NaN	2.0	real estate	22.0	NaN	own	1.0	skilled	1.0	NaN	yes	bad
2	no checking	12.0	critical/other existing credit	education	2096.0	<100	4<=X<7	2.0	male single	NaN	3.0	real estate	49.0	NaN	own	1.0	unskilled resident	2.0	NaN	yes	good
3	<0	42.0	existing paid	furniture/equipment	7882.0	<100	4<=X<7	2.0	male single	guarantor	4.0	life insurance	45.0	NaN	for free	1.0	skilled	2.0	NaN	yes	good
4	<0	24.0	delayed previously	new car	4870.0	<100	1<=X<4	3.0	male single	NaN	4.0	no known property	53.0	NaN	for free	2.0	skilled	2.0	NaN	yes	bad

查看資料的維度：

df.shape

(1000, 21)

回傳的是一個tuple，第1個元素是index的長度，就是列(row)的數量，第2個元素是column的長度，就是欄(column)的數量。

若要快速得知資料的統計資訊：

df.describe()

	duration	credit_amount	installment_commitment	residence_since	age	existing_credits	num_dependents
count	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000
mean	20.903000	3271.258000	2.973000	2.845000	35.546000	1.407000	1.155000
std	12.058814	2822.736876	1.118715	1.103718	11.375469	0.577654	0.362086
min	4.000000	250.000000	1.000000	1.000000	19.000000	1.000000	1.000000
25%	12.000000	1365.500000	2.000000	2.000000	27.000000	1.000000	1.000000
50%	18.000000	2319.500000	3.000000	3.000000	33.000000	1.000000	1.000000
75%	24.000000	3972.250000	4.000000	4.000000	42.000000	2.000000	1.000000
max	72.000000	18424.000000	4.000000	4.000000	75.000000	4.000000	2.000000

注意到只有數值型的資料才會列出。若要納入非數值型的資料的話：

df.describe(include='all')

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
count	1000	1000.000000	1000	1000	1000.000000	1000	1000	1000.000000	1000	93	1000.000000	1000	1000.000000	186	1000	1000.000000	1000	1000.000000	404	1000	1000
unique	4	NaN	5	10	NaN	5	5	NaN	4	2	NaN	4	NaN	2	3	NaN	4	NaN	1	2	2
top	no checking	NaN	existing paid	radio/tv	NaN	<100	1<=X<4	NaN	male single	guarantor	NaN	car	NaN	bank	own	NaN	skilled	NaN	yes	yes	good
freq	394	NaN	530	280	NaN	603	339	NaN	548	52	NaN	332	NaN	139	713	NaN	630	NaN	404	963	700
mean	NaN	20.903000	NaN	NaN	3271.258000	NaN	NaN	2.973000	NaN	NaN	2.845000	NaN	35.546000	NaN	NaN	1.407000	NaN	1.155000	NaN	NaN	NaN
std	NaN	12.058814	NaN	NaN	2822.736876	NaN	NaN	1.118715	NaN	NaN	1.103718	NaN	11.375469	NaN	NaN	0.577654	NaN	0.362086	NaN	NaN	NaN
min	NaN	4.000000	NaN	NaN	250.000000	NaN	NaN	1.000000	NaN	NaN	1.000000	NaN	19.000000	NaN	NaN	1.000000	NaN	1.000000	NaN	NaN	NaN
25%	NaN	12.000000	NaN	NaN	1365.500000	NaN	NaN	2.000000	NaN	NaN	2.000000	NaN	27.000000	NaN	NaN	1.000000	NaN	1.000000	NaN	NaN	NaN
50%	NaN	18.000000	NaN	NaN	2319.500000	NaN	NaN	3.000000	NaN	NaN	3.000000	NaN	33.000000	NaN	NaN	1.000000	NaN	1.000000	NaN	NaN	NaN
75%	NaN	24.000000	NaN	NaN	3972.250000	NaN	NaN	4.000000	NaN	NaN	4.000000	NaN	42.000000	NaN	NaN	2.000000	NaN	1.000000	NaN	NaN	NaN
max	NaN	72.000000	NaN	NaN	18424.000000	NaN	NaN	4.000000	NaN	NaN	4.000000	NaN	75.000000	NaN	NaN	4.000000	NaN	2.000000	NaN	NaN	NaN

檢查遺漏值#

大致瞭解資料長相之後，可以開始進一步確認資料的細節。資料是否有遺漏值？欄位值有哪些取值？欄位的最大最小值等統計資訊？等等。

首先先來看跟遺漏值相關的語法：

df.isna()

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
0	False	False	False	False	False	False	False	False	False	True	False	False	False	True	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	True	False	False	False	True	False	False	False	False	True	False	False
2	False	False	False	False	False	False	False	False	False	True	False	False	False	True	False	False	False	False	True	False	False
3	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False	True	False	False
4	False	False	False	False	False	False	False	False	False	True	False	False	False	True	False	False	False	False	True	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	False	False	False	False	False	False	False	False	False	True	False	False	False	True	False	False	False	False	True	False	False
996	False	False	False	False	False	False	False	False	False	True	False	False	False	True	False	False	False	False	False	False	False
997	False	False	False	False	False	False	False	False	False	True	False	False	False	True	False	False	False	False	True	False	False
998	False	False	False	False	False	False	False	False	False	True	False	False	False	True	False	False	False	False	False	False	False
999	False	False	False	False	False	False	False	False	False	True	False	False	False	True	False	False	False	False	True	False	False

1000 rows × 21 columns

以上這個語法會回傳每個欄位每一列是否為遺漏值的DataFrame，False代表該欄位值沒有遺漏值。True代表有遺漏值。

但通常我們關心的是哪些欄位有遺漏值？有多少筆資料有遺漏值？

所以會想要在把上面的資料根據欄位「加總」，因為True代表1，False代表0，加總後就可以知道遺漏值的數量。

可以這樣寫：

df.isna().sum()

checking_status             0
duration                    0
credit_history              0
purpose                     0
credit_amount               0
savings_status              0
employment                  0
installment_commitment      0
personal_status             0
other_parties             907
residence_since             0
property_magnitude          0
age                         0
other_payment_plans       814
housing                     0
existing_credits            0
job                         0
num_dependents              0
own_telephone             596
foreign_worker              0
class                       0
dtype: int64

上面的語法結構值得說明一下，由於df.isnull()回傳的是一個DataFrame，所以我們可以直接接著call其他DataFrame的方法。

而.sum()是一個把DataFrame資料加總的方法，預設是by欄位加總。可以藉由調整參數``axis```來改變加總的方向。

不指定axis參數的話，預設是axis=0，代表沿著axis 0，把值一個一個加起來，所以是by欄位加總。

axis=1代表by列加總，就可以知道哪幾筆資料有遺漏值。

df.isna().sum(axis=1)

    2
    3
    3
    2
    3
      ..
  3
  2
  3
  2
  3
Length: 1000, dtype: int64

如果不關心筆數，也可以使用.any()，代表只要有一筆資料為True就會輸出True：

df.isna().any()

checking_status           False
duration                  False
credit_history            False
purpose                   False
credit_amount             False
savings_status            False
employment                False
installment_commitment    False
personal_status           False
other_parties              True
residence_since           False
property_magnitude        False
age                       False
other_payment_plans        True
housing                   False
existing_credits          False
job                       False
num_dependents            False
own_telephone              True
foreign_worker            False
class                     False
dtype: bool

如何確認資料是否有任何遺漏值？

df.isna().any().any()

True

反向的語法是：

df.notna().sum()

checking_status           1000
duration                  1000
credit_history            1000
purpose                   1000
credit_amount             1000
savings_status            1000
employment                1000
installment_commitment    1000
personal_status           1000
other_parties               93
residence_since           1000
property_magnitude        1000
age                       1000
other_payment_plans        186
housing                   1000
existing_credits          1000
job                       1000
num_dependents            1000
own_telephone              404
foreign_worker            1000
class                     1000
dtype: int64

確認欄位值#

如果想知道特定欄位有哪些取值，通常是要初步看一下資料的內容，以及看欄位有沒有什麼異常值。

例如想知道客戶來申辦貸款的目的有哪些？可以對purpose這個欄位使用以下操作：

df['purpose'].unique()

array(['radio/tv', 'education', 'furniture/equipment', 'new car',
       'used car', 'business', 'domestic appliance', 'repairs', 'other',
       'retraining'], dtype=object)

就可以把資料中purpose實際上出現哪些不同值給列出來。

其中，df['purpose']是篩選出欄位資料的語法。也可以使用：

df.purpose

               radio/tv
               radio/tv
              education
    furniture/equipment
                new car
              ...         
  furniture/equipment
             used car
             radio/tv
             radio/tv
             used car
Name: purpose, Length: 1000, dtype: object

注意到.purpose的語法事實上是把欄位當作是DataFrame的「屬性(attribute)」來取用，所以若欄位名稱與DataFrame原本就有的屬性重複的話會無法使用。

補充說明一下，這裡回傳的東西，其實是pandas Series。Series是pandas的另一個資料結構，DataFrame中每一個欄位都儲存為Series。

type(df.purpose)

pandas.core.series.Series

可以把Series理解為單一維度的numpy array，與numpy array不同的地方在於Series內含具有標籤的索引(index)。

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

print(s)
print(s.index)

a    1
b    2
c    3
d    4
e    5
dtype: int64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

除了看哪些欄位取值，還可以直接算出該欄位有幾種不同取值：

df['purpose'].nunique()

或是每個取值有多少筆資料：

df['purpose'].value_counts()

purpose
radio/tv               280
new car                234
furniture/equipment    181
used car               103
business                97
education               50
repairs                 22
domestic appliance      12
other                   12
retraining               9
Name: count, dtype: int64

進一步看佔比：

df['purpose'].value_counts(normalize=True)

purpose
radio/tv               0.280
new car                0.234
furniture/equipment    0.181
used car               0.103
business               0.097
education              0.050
repairs                0.022
domestic appliance     0.012
other                  0.012
retraining             0.009
Name: proportion, dtype: float64

注意到value_counts()會忽略空值：

df['other_parties'].value_counts() # 明顯不到1000筆資料

other_parties
guarantor       52
co applicant    41
Name: count, dtype: int64

可以使用dropna=False參數：

df['other_parties'].value_counts(dropna=False)

other_parties
NaN             907
guarantor        52
co applicant     41
Name: count, dtype: int64

針對數值型欄位可以使用bins參數，放入要切分的組數。這邊組別的切分是依據欄位值的全距去切指定的等份。

df['age'].value_counts(bins=7)

(27.0, 35.0]                  297
(18.942999999999998, 27.0]    291
(35.0, 43.0]                  194
(43.0, 51.0]                  113
(51.0, 59.0]                   54
(59.0, 67.0]                   41
(67.0, 75.0]                   10
Name: count, dtype: int64

如果要看個別欄位的前幾大或倒數前幾大的值是哪些（參數可指定列數）：

df['duration'].nlargest()

  72.0
   60.0
  60.0
  60.0
  60.0
Name: duration, dtype: float64

df['duration'].nsmallest()

  4.0
  4.0
  4.0
  4.0
  4.0
Name: duration, dtype: float64

或是可以用排序的方法：

df['duration'].sort_values(ascending=False).head()

  72.0
  60.0
  60.0
  60.0
  60.0
Name: duration, dtype: float64

df['duration'].sort_values().head()

  4.0
  4.0
  4.0
  4.0
  4.0
Name: duration, dtype: float64

資料的統計資訊#

除了使用.describe()快速查看資料統計數據外，還可以使用各種方法來依照需求做計算。

df['duration'].min()

4.0

df['duration'].max()

72.0

df['duration'].mean()

20.903

df['duration'].median()

18.0

df['duration'].var()

145.41500600600602

df['duration'].std()

12.058814452756375

df['duration'].quantile(0.8)

30.0

df['duration'].quantile([0.1,0.3,0.45,0.6,0.8,1])

10     9.0
30    12.0
45    18.0
60    24.0
80    30.0
00    72.0
Name: duration, dtype: float64

還可以很方便算多個欄位的共變數、相關係數：

df[['age', 'duration', 'credit_amount']].cov()

	age	duration	credit_amount
age	129.401285	-4.956995	1.050523e+03
duration	-4.956995	145.415006	2.127375e+04
credit_amount	1050.522655	21273.749776	7.967843e+06

df[['age', 'duration', 'credit_amount']].corr()

	age	duration	credit_amount
age	1.000000	-0.036136	0.032716
duration	-0.036136	1.000000	0.624984
credit_amount	0.032716	0.624984	1.000000

查看你的資料#

有了對資料的大致認識後，可能會想對資料的部分欄位或部份樣本做一些後續處理，例如填補遺漏值、對欄位做加工等等。

但在此之前需要先知道怎麼擷取你想要的特定資料。

抓特定欄位的資料#

首先第一種方法已經介紹過了，就是直接在DataFrame後面加[想要抓的欄位]：

df['class']

    good
     bad
    good
    good
     bad
       ... 
  good
  good
  good
   bad
  good
Name: class, Length: 1000, dtype: object

當要一次抓取多個欄位時，必須放入一個欄位名稱的list：

df[['age', 'duration', 'credit_amount']]

	age	duration	credit_amount
0	67.0	6.0	1169.0
1	22.0	48.0	5951.0
2	49.0	12.0	2096.0
3	45.0	42.0	7882.0
4	53.0	24.0	4870.0
...	...	...	...
995	31.0	12.0	1736.0
996	40.0	30.0	3857.0
997	38.0	12.0	804.0
998	23.0	45.0	1845.0
999	27.0	45.0	4576.0

1000 rows × 3 columns

另一種方法是使用.loc[]，使用方式如下：

df.loc[:, 'class']

    good
     bad
    good
    good
     bad
       ... 
  good
  good
  good
   bad
  good
Name: class, Length: 1000, dtype: object

多個欄位的話一樣是要放list：

df.loc[:, ['age', 'duration', 'credit_amount']]

	age	duration	credit_amount
0	67.0	6.0	1169.0
1	22.0	48.0	5951.0
2	49.0	12.0	2096.0
3	45.0	42.0	7882.0
4	53.0	24.0	4870.0
...	...	...	...
995	31.0	12.0	1736.0
996	40.0	30.0	3857.0
997	38.0	12.0	804.0
998	23.0	45.0	1845.0
999	27.0	45.0	4576.0

1000 rows × 3 columns

與.loc[]類似的，iloc[]也可以做到篩選，只是要指定的是「第幾個」欄位：

df.iloc[:, [12, 1, 4]]

	age	duration	credit_amount
0	67.0	6.0	1169.0
1	22.0	48.0	5951.0
2	49.0	12.0	2096.0
3	45.0	42.0	7882.0
4	53.0	24.0	4870.0
...	...	...	...
995	31.0	12.0	1736.0
996	40.0	30.0	3857.0
997	38.0	12.0	804.0
998	23.0	45.0	1845.0
999	27.0	45.0	4576.0

1000 rows × 3 columns

注意到:的使用，可以一次指定多個欄位，使用方式跟list的index一樣：

df.iloc[:, 0:4]

	checking_status	duration	credit_history	purpose
0	<0	6.0	critical/other existing credit	radio/tv
1	0<=X<200	48.0	existing paid	radio/tv
2	no checking	12.0	critical/other existing credit	education
3	<0	42.0	existing paid	furniture/equipment
4	<0	24.0	delayed previously	new car
...	...	...	...	...
995	no checking	12.0	existing paid	furniture/equipment
996	<0	30.0	existing paid	used car
997	no checking	12.0	existing paid	radio/tv
998	<0	45.0	existing paid	radio/tv
999	0<=X<200	45.0	critical/other existing credit	used car

1000 rows × 4 columns

.loc[]也可以使用:，但就必須符合原本的欄位順序：

print(df.columns)

# 順序錯了，選不到東西
df.loc[:, "purpose":"duration"]

Index(['checking_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings_status', 'employment',
       'installment_commitment', 'personal_status', 'other_parties',
       'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
       'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
       'foreign_worker', 'class'],
      dtype='object')


0
1
2
3
4
...
995
996
997
998
999

1000 rows × 0 columns

df.loc[:, "duration":"purpose"]

	duration	credit_history	purpose
0	6.0	critical/other existing credit	radio/tv
1	48.0	existing paid	radio/tv
2	12.0	critical/other existing credit	education
3	42.0	existing paid	furniture/equipment
4	24.0	delayed previously	new car
...	...	...	...
995	12.0	existing paid	furniture/equipment
996	30.0	existing paid	used car
997	12.0	existing paid	radio/tv
998	45.0	existing paid	radio/tv
999	45.0	critical/other existing credit	used car

1000 rows × 3 columns

再來一種方法是使用.filter()搭配items=[欄位名稱]的參數：

df.filter(items=['age', 'duration', 'credit_amount'])

	age	duration	credit_amount
0	67.0	6.0	1169.0
1	22.0	48.0	5951.0
2	49.0	12.0	2096.0
3	45.0	42.0	7882.0
4	53.0	24.0	4870.0
...	...	...	...
995	31.0	12.0	1736.0
996	40.0	30.0	3857.0
997	38.0	12.0	804.0
998	23.0	45.0	1845.0
999	27.0	45.0	4576.0

1000 rows × 3 columns

根據資料型態抓欄位：

df.select_dtypes(include='float')

	duration	credit_amount	installment_commitment	residence_since	age	existing_credits	num_dependents
0	6.0	1169.0	4.0	4.0	67.0	2.0	1.0
1	48.0	5951.0	2.0	2.0	22.0	1.0	1.0
2	12.0	2096.0	2.0	3.0	49.0	1.0	2.0
3	42.0	7882.0	2.0	4.0	45.0	1.0	2.0
4	24.0	4870.0	3.0	4.0	53.0	2.0	2.0
...	...	...	...	...	...	...	...
995	12.0	1736.0	3.0	4.0	31.0	1.0	1.0
996	30.0	3857.0	4.0	4.0	40.0	1.0	1.0
997	12.0	804.0	4.0	4.0	38.0	1.0	1.0
998	45.0	1845.0	4.0	4.0	23.0	1.0	1.0
999	45.0	4576.0	3.0	4.0	27.0	1.0	1.0

1000 rows × 7 columns

根據欄位名稱的一些特徵來抓資料：

df.filter(like='_')

	checking_status	credit_history	credit_amount	savings_status	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	other_payment_plans	existing_credits	num_dependents	own_telephone	foreign_worker
0	<0	critical/other existing credit	1169.0	no known savings	4.0	male single	NaN	4.0	real estate	NaN	2.0	1.0	yes	yes
1	0<=X<200	existing paid	5951.0	<100	2.0	female div/dep/mar	NaN	2.0	real estate	NaN	1.0	1.0	NaN	yes
2	no checking	critical/other existing credit	2096.0	<100	2.0	male single	NaN	3.0	real estate	NaN	1.0	2.0	NaN	yes
3	<0	existing paid	7882.0	<100	2.0	male single	guarantor	4.0	life insurance	NaN	1.0	2.0	NaN	yes
4	<0	delayed previously	4870.0	<100	3.0	male single	NaN	4.0	no known property	NaN	2.0	2.0	NaN	yes
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	no checking	existing paid	1736.0	<100	3.0	female div/dep/mar	NaN	4.0	real estate	NaN	1.0	1.0	NaN	yes
996	<0	existing paid	3857.0	<100	4.0	male div/sep	NaN	4.0	life insurance	NaN	1.0	1.0	yes	yes
997	no checking	existing paid	804.0	<100	4.0	male single	NaN	4.0	car	NaN	1.0	1.0	NaN	yes
998	<0	existing paid	1845.0	<100	4.0	male single	NaN	4.0	no known property	NaN	1.0	1.0	yes	yes
999	0<=X<200	critical/other existing credit	4576.0	100<=X<500	3.0	male single	NaN	4.0	car	NaN	1.0	1.0	NaN	yes

1000 rows × 14 columns

若要把DataFrame欄位根據指定的順序做排序的話，就必須先設定好欄位順序在list內，然後再把資料抓出來：

ordered_cols = ['class', 'credit_amount', 'age', 'duration', 'existing_credits']

df[ordered_cols]

	class	credit_amount	age	duration	existing_credits
0	good	1169.0	67.0	6.0	2.0
1	bad	5951.0	22.0	48.0	1.0
2	good	2096.0	49.0	12.0	1.0
3	good	7882.0	45.0	42.0	1.0
4	bad	4870.0	53.0	24.0	2.0
...	...	...	...	...	...
995	good	1736.0	31.0	12.0	1.0
996	good	3857.0	40.0	30.0	1.0
997	good	804.0	38.0	12.0	1.0
998	bad	1845.0	23.0	45.0	1.0
999	good	4576.0	27.0	45.0	1.0

1000 rows × 5 columns

抓特定列的資料#

直接根據列名稱去抓資料的情境雖然比較少，但一樣可以透過.loc[]或.iloc[]做到：

df.loc[7:10, :]

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
7	0<=X<200	36.0	existing paid	used car	6948.0	<100	1<=X<4	2.0	male single	NaN	2.0	car	35.0	NaN	rent	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
8	no checking	12.0	existing paid	radio/tv	3059.0	>=1000	4<=X<7	2.0	male div/sep	NaN	4.0	real estate	61.0	NaN	own	1.0	unskilled resident	1.0	NaN	yes	good
9	0<=X<200	30.0	critical/other existing credit	new car	5234.0	<100	unemployed	4.0	male mar/wid	NaN	2.0	car	28.0	NaN	own	2.0	high qualif/self emp/mgmt	1.0	NaN	yes	bad
10	0<=X<200	12.0	existing paid	new car	1295.0	<100	<1	3.0	female div/dep/mar	NaN	1.0	car	25.0	NaN	rent	1.0	skilled	1.0	NaN	yes	bad

在這邊，注意到因為index的順序跟label名稱一樣，但.loc[]的行為是判斷label名稱，.iloc[]則是類似list做indexing的概念，並不會包含到指定的最後一個index：

df.iloc[7:10, :]

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
7	0<=X<200	36.0	existing paid	used car	6948.0	<100	1<=X<4	2.0	male single	NaN	2.0	car	35.0	NaN	rent	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
8	no checking	12.0	existing paid	radio/tv	3059.0	>=1000	4<=X<7	2.0	male div/sep	NaN	4.0	real estate	61.0	NaN	own	1.0	unskilled resident	1.0	NaN	yes	good
9	0<=X<200	30.0	critical/other existing credit	new car	5234.0	<100	unemployed	4.0	male mar/wid	NaN	2.0	car	28.0	NaN	own	2.0	high qualif/self emp/mgmt	1.0	NaN	yes	bad

依條件判斷抓取資料#

最直接的方法，例如要抓age > 40以上的資料：

df[df['age'] > 40]

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
0	<0	6.0	critical/other existing credit	radio/tv	1169.0	no known savings	>=7	4.0	male single	NaN	4.0	real estate	67.0	NaN	own	2.0	skilled	1.0	yes	yes	good
2	no checking	12.0	critical/other existing credit	education	2096.0	<100	4<=X<7	2.0	male single	NaN	3.0	real estate	49.0	NaN	own	1.0	unskilled resident	2.0	NaN	yes	good
3	<0	42.0	existing paid	furniture/equipment	7882.0	<100	4<=X<7	2.0	male single	guarantor	4.0	life insurance	45.0	NaN	for free	1.0	skilled	2.0	NaN	yes	good
4	<0	24.0	delayed previously	new car	4870.0	<100	1<=X<4	3.0	male single	NaN	4.0	no known property	53.0	NaN	for free	2.0	skilled	2.0	NaN	yes	bad
6	no checking	24.0	existing paid	furniture/equipment	2835.0	500<=X<1000	>=7	3.0	male single	NaN	4.0	life insurance	53.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
978	no checking	24.0	delayed previously	new car	2538.0	<100	>=7	4.0	male single	NaN	4.0	car	47.0	NaN	own	2.0	unskilled resident	2.0	NaN	yes	bad
980	0<=X<200	30.0	critical/other existing credit	furniture/equipment	8386.0	<100	4<=X<7	2.0	male single	NaN	2.0	life insurance	49.0	NaN	own	1.0	skilled	1.0	NaN	yes	bad
987	no checking	13.0	existing paid	radio/tv	1409.0	100<=X<500	unemployed	2.0	female div/dep/mar	NaN	4.0	real estate	64.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
989	0<=X<200	24.0	critical/other existing credit	radio/tv	1743.0	<100	>=7	4.0	male single	NaN	2.0	life insurance	48.0	NaN	own	2.0	unskilled resident	1.0	NaN	yes	good
994	no checking	12.0	existing paid	new car	2390.0	no known savings	>=7	4.0	male single	NaN	3.0	car	50.0	NaN	own	1.0	skilled	1.0	yes	yes	good

274 rows × 21 columns

也可以利用.loc[]：

df.loc[df['age'] > 40, :]

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
0	<0	6.0	critical/other existing credit	radio/tv	1169.0	no known savings	>=7	4.0	male single	NaN	4.0	real estate	67.0	NaN	own	2.0	skilled	1.0	yes	yes	good
2	no checking	12.0	critical/other existing credit	education	2096.0	<100	4<=X<7	2.0	male single	NaN	3.0	real estate	49.0	NaN	own	1.0	unskilled resident	2.0	NaN	yes	good
3	<0	42.0	existing paid	furniture/equipment	7882.0	<100	4<=X<7	2.0	male single	guarantor	4.0	life insurance	45.0	NaN	for free	1.0	skilled	2.0	NaN	yes	good
4	<0	24.0	delayed previously	new car	4870.0	<100	1<=X<4	3.0	male single	NaN	4.0	no known property	53.0	NaN	for free	2.0	skilled	2.0	NaN	yes	bad
6	no checking	24.0	existing paid	furniture/equipment	2835.0	500<=X<1000	>=7	3.0	male single	NaN	4.0	life insurance	53.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
978	no checking	24.0	delayed previously	new car	2538.0	<100	>=7	4.0	male single	NaN	4.0	car	47.0	NaN	own	2.0	unskilled resident	2.0	NaN	yes	bad
980	0<=X<200	30.0	critical/other existing credit	furniture/equipment	8386.0	<100	4<=X<7	2.0	male single	NaN	2.0	life insurance	49.0	NaN	own	1.0	skilled	1.0	NaN	yes	bad
987	no checking	13.0	existing paid	radio/tv	1409.0	100<=X<500	unemployed	2.0	female div/dep/mar	NaN	4.0	real estate	64.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
989	0<=X<200	24.0	critical/other existing credit	radio/tv	1743.0	<100	>=7	4.0	male single	NaN	2.0	life insurance	48.0	NaN	own	2.0	unskilled resident	1.0	NaN	yes	good
994	no checking	12.0	existing paid	new car	2390.0	no known savings	>=7	4.0	male single	NaN	3.0	car	50.0	NaN	own	1.0	skilled	1.0	yes	yes	good

274 rows × 21 columns

或是使用.query()方法：

df.query('age > 40')

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
0	<0	6.0	critical/other existing credit	radio/tv	1169.0	no known savings	>=7	4.0	male single	NaN	4.0	real estate	67.0	NaN	own	2.0	skilled	1.0	yes	yes	good
2	no checking	12.0	critical/other existing credit	education	2096.0	<100	4<=X<7	2.0	male single	NaN	3.0	real estate	49.0	NaN	own	1.0	unskilled resident	2.0	NaN	yes	good
3	<0	42.0	existing paid	furniture/equipment	7882.0	<100	4<=X<7	2.0	male single	guarantor	4.0	life insurance	45.0	NaN	for free	1.0	skilled	2.0	NaN	yes	good
4	<0	24.0	delayed previously	new car	4870.0	<100	1<=X<4	3.0	male single	NaN	4.0	no known property	53.0	NaN	for free	2.0	skilled	2.0	NaN	yes	bad
6	no checking	24.0	existing paid	furniture/equipment	2835.0	500<=X<1000	>=7	3.0	male single	NaN	4.0	life insurance	53.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
978	no checking	24.0	delayed previously	new car	2538.0	<100	>=7	4.0	male single	NaN	4.0	car	47.0	NaN	own	2.0	unskilled resident	2.0	NaN	yes	bad
980	0<=X<200	30.0	critical/other existing credit	furniture/equipment	8386.0	<100	4<=X<7	2.0	male single	NaN	2.0	life insurance	49.0	NaN	own	1.0	skilled	1.0	NaN	yes	bad
987	no checking	13.0	existing paid	radio/tv	1409.0	100<=X<500	unemployed	2.0	female div/dep/mar	NaN	4.0	real estate	64.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
989	0<=X<200	24.0	critical/other existing credit	radio/tv	1743.0	<100	>=7	4.0	male single	NaN	2.0	life insurance	48.0	NaN	own	2.0	unskilled resident	1.0	NaN	yes	good
994	no checking	12.0	existing paid	new car	2390.0	no known savings	>=7	4.0	male single	NaN	3.0	car	50.0	NaN	own	1.0	skilled	1.0	yes	yes	good

274 rows × 21 columns

邏輯條件可以放多個：

df[(df['age'] > 40) & (df['duration'] > 50)]

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
29	<0	60.0	delayed previously	business	6836.0	<100	>=7	3.0	male single	NaN	4.0	no known property	63.0	NaN	own	2.0	skilled	1.0	yes	yes	bad
95	0<=X<200	54.0	no credits/all paid	business	15945.0	<100	<1	3.0	male single	NaN	4.0	no known property	58.0	NaN	rent	1.0	skilled	1.0	yes	yes	bad
373	no checking	60.0	critical/other existing credit	new car	13756.0	no known savings	>=7	2.0	male single	NaN	4.0	no known property	63.0	bank	for free	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
374	0<=X<200	60.0	all paid	other	14782.0	100<=X<500	>=7	3.0	female div/dep/mar	NaN	4.0	no known property	60.0	bank	for free	2.0	high qualif/self emp/mgmt	1.0	yes	yes	bad
672	no checking	60.0	existing paid	new car	10366.0	<100	>=7	2.0	male single	NaN	4.0	life insurance	42.0	NaN	own	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
938	0<=X<200	60.0	existing paid	education	6288.0	<100	1<=X<4	4.0	male single	NaN	4.0	no known property	42.0	NaN	for free	1.0	skilled	1.0	NaN	yes	bad

df.query('age > 40 & duration > 50')

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
29	<0	60.0	delayed previously	business	6836.0	<100	>=7	3.0	male single	NaN	4.0	no known property	63.0	NaN	own	2.0	skilled	1.0	yes	yes	bad
95	0<=X<200	54.0	no credits/all paid	business	15945.0	<100	<1	3.0	male single	NaN	4.0	no known property	58.0	NaN	rent	1.0	skilled	1.0	yes	yes	bad
373	no checking	60.0	critical/other existing credit	new car	13756.0	no known savings	>=7	2.0	male single	NaN	4.0	no known property	63.0	bank	for free	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
374	0<=X<200	60.0	all paid	other	14782.0	100<=X<500	>=7	3.0	female div/dep/mar	NaN	4.0	no known property	60.0	bank	for free	2.0	high qualif/self emp/mgmt	1.0	yes	yes	bad
672	no checking	60.0	existing paid	new car	10366.0	<100	>=7	2.0	male single	NaN	4.0	life insurance	42.0	NaN	own	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
938	0<=X<200	60.0	existing paid	education	6288.0	<100	1<=X<4	4.0	male single	NaN	4.0	no known property	42.0	NaN	for free	1.0	skilled	1.0	NaN	yes	bad

為了避免條件太多，造成閱讀困難，可以做一些前處理：

c1 = df['age'] > 40
c2 = df['duration'] > 50
c3 = df['credit_amount'] > 10000

cond = c1 & c2 & c3

df[cond]

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
95	0<=X<200	54.0	no credits/all paid	business	15945.0	<100	<1	3.0	male single	NaN	4.0	no known property	58.0	NaN	rent	1.0	skilled	1.0	yes	yes	bad
373	no checking	60.0	critical/other existing credit	new car	13756.0	no known savings	>=7	2.0	male single	NaN	4.0	no known property	63.0	bank	for free	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
374	0<=X<200	60.0	all paid	other	14782.0	100<=X<500	>=7	3.0	female div/dep/mar	NaN	4.0	no known property	60.0	bank	for free	2.0	high qualif/self emp/mgmt	1.0	yes	yes	bad
672	no checking	60.0	existing paid	new car	10366.0	<100	>=7	2.0	male single	NaN	4.0	life insurance	42.0	NaN	own	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good

或是搭配f-string以及.query()：

c1 = 'age > 40'
c2 = 'duration > 50'
c3 = 'credit_amount > 10000'

cond = f'{c1} & {c2} & {c3}'

df.query(cond)

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
95	0<=X<200	54.0	no credits/all paid	business	15945.0	<100	<1	3.0	male single	NaN	4.0	no known property	58.0	NaN	rent	1.0	skilled	1.0	yes	yes	bad
373	no checking	60.0	critical/other existing credit	new car	13756.0	no known savings	>=7	2.0	male single	NaN	4.0	no known property	63.0	bank	for free	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
374	0<=X<200	60.0	all paid	other	14782.0	100<=X<500	>=7	3.0	female div/dep/mar	NaN	4.0	no known property	60.0	bank	for free	2.0	high qualif/self emp/mgmt	1.0	yes	yes	bad
672	no checking	60.0	existing paid	new car	10366.0	<100	>=7	2.0	male single	NaN	4.0	life insurance	42.0	NaN	own	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good

其他根據欄位值篩選，常用的方法有：

df[df['age'].between(30, 35)]

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
5	no checking	36.0	existing paid	education	9055.0	no known savings	1<=X<4	2.0	male single	NaN	4.0	no known property	35.0	NaN	for free	1.0	unskilled resident	2.0	yes	yes	good
7	0<=X<200	36.0	existing paid	used car	6948.0	<100	1<=X<4	2.0	male single	NaN	2.0	car	35.0	NaN	rent	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
15	<0	24.0	existing paid	radio/tv	1282.0	100<=X<500	1<=X<4	4.0	female div/dep/mar	NaN	2.0	car	32.0	NaN	own	1.0	unskilled resident	1.0	NaN	yes	bad
19	no checking	24.0	existing paid	radio/tv	3430.0	500<=X<1000	>=7	3.0	male single	NaN	2.0	car	31.0	NaN	own	1.0	skilled	2.0	yes	yes	good
28	0<=X<200	7.0	existing paid	radio/tv	2415.0	<100	1<=X<4	3.0	male single	guarantor	2.0	real estate	34.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
984	no checking	24.0	critical/other existing credit	furniture/equipment	2028.0	<100	4<=X<7	2.0	male single	NaN	2.0	life insurance	30.0	NaN	own	2.0	unskilled resident	1.0	NaN	yes	good
986	>=200	42.0	no credits/all paid	business	6289.0	<100	<1	2.0	male div/sep	NaN	1.0	life insurance	33.0	NaN	own	2.0	skilled	1.0	NaN	yes	good
991	no checking	15.0	all paid	radio/tv	1569.0	100<=X<500	>=7	4.0	male single	NaN	4.0	car	34.0	bank	own	1.0	unskilled resident	2.0	NaN	yes	good
993	<0	36.0	existing paid	furniture/equipment	3959.0	<100	unemployed	4.0	male single	NaN	3.0	life insurance	30.0	NaN	own	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
995	no checking	12.0	existing paid	furniture/equipment	1736.0	<100	4<=X<7	3.0	female div/dep/mar	NaN	4.0	real estate	31.0	NaN	own	1.0	unskilled resident	1.0	NaN	yes	good

217 rows × 21 columns

或是根據欄位值清單篩選：

df[df['purpose'].isin(['new car', 'used card'])]

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
4	<0	24.0	delayed previously	new car	4870.0	<100	1<=X<4	3.0	male single	NaN	4.0	no known property	53.0	NaN	for free	2.0	skilled	2.0	NaN	yes	bad
9	0<=X<200	30.0	critical/other existing credit	new car	5234.0	<100	unemployed	4.0	male mar/wid	NaN	2.0	car	28.0	NaN	own	2.0	high qualif/self emp/mgmt	1.0	NaN	yes	bad
10	0<=X<200	12.0	existing paid	new car	1295.0	<100	<1	3.0	female div/dep/mar	NaN	1.0	car	25.0	NaN	rent	1.0	skilled	1.0	NaN	yes	bad
13	<0	24.0	critical/other existing credit	new car	1199.0	<100	>=7	4.0	male single	NaN	4.0	car	60.0	NaN	own	2.0	unskilled resident	1.0	NaN	yes	bad
14	<0	15.0	existing paid	new car	1403.0	<100	1<=X<4	2.0	female div/dep/mar	NaN	4.0	car	28.0	NaN	rent	1.0	skilled	1.0	NaN	yes	good
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
972	<0	24.0	all paid	new car	1193.0	<100	unemployed	1.0	female div/dep/mar	co applicant	4.0	no known property	29.0	NaN	rent	2.0	unemp/unskilled non res	1.0	NaN	yes	bad
978	no checking	24.0	delayed previously	new car	2538.0	<100	>=7	4.0	male single	NaN	4.0	car	47.0	NaN	own	2.0	unskilled resident	2.0	NaN	yes	bad
979	0<=X<200	15.0	all paid	new car	1264.0	100<=X<500	1<=X<4	2.0	male mar/wid	NaN	2.0	life insurance	25.0	NaN	rent	1.0	skilled	1.0	NaN	yes	bad
982	>=200	21.0	existing paid	new car	2923.0	100<=X<500	1<=X<4	1.0	female div/dep/mar	NaN	1.0	car	28.0	bank	own	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
994	no checking	12.0	existing paid	new car	2390.0	no known savings	>=7	4.0	male single	NaN	3.0	car	50.0	NaN	own	1.0	skilled	1.0	yes	yes	good

234 rows × 21 columns

排序資料#

要排序資料可以使用前面用過的.sort_values()語法，參數by指定欄位，ascending指定方向：

df.sort_values(by='credit_amount', ascending=False)

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	residence_since	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
915	0<=X<200	48.0	no credits/all paid	other	18424.0	<100	1<=X<4	1.0	female div/dep/mar	NaN	2.0	life insurance	32.0	bank	own	1.0	high qualif/self emp/mgmt	1.0	yes	no	bad
95	0<=X<200	54.0	no credits/all paid	business	15945.0	<100	<1	3.0	male single	NaN	4.0	no known property	58.0	NaN	rent	1.0	skilled	1.0	yes	yes	bad
818	<0	36.0	existing paid	other	15857.0	<100	unemployed	2.0	male div/sep	co applicant	3.0	car	43.0	NaN	own	1.0	high qualif/self emp/mgmt	1.0	NaN	yes	good
887	0<=X<200	48.0	existing paid	business	15672.0	<100	1<=X<4	2.0	male single	NaN	2.0	car	23.0	NaN	own	1.0	skilled	1.0	yes	yes	bad
637	no checking	60.0	delayed previously	radio/tv	15653.0	<100	4<=X<7	2.0	male single	NaN	4.0	car	21.0	NaN	own	2.0	skilled	1.0	yes	yes	good
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
458	<0	6.0	existing paid	domestic appliance	343.0	<100	<1	4.0	female div/dep/mar	NaN	1.0	real estate	27.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
157	<0	12.0	all paid	retraining	339.0	<100	>=7	4.0	male mar/wid	NaN	1.0	car	45.0	bank	own	1.0	unskilled resident	1.0	NaN	yes	good
177	<0	6.0	critical/other existing credit	radio/tv	338.0	500<=X<1000	>=7	4.0	male single	NaN	4.0	car	52.0	NaN	own	2.0	skilled	1.0	NaN	yes	good
309	0<=X<200	9.0	existing paid	new car	276.0	<100	1<=X<4	4.0	male mar/wid	NaN	4.0	real estate	22.0	NaN	rent	1.0	unskilled resident	1.0	NaN	yes	good
725	no checking	6.0	critical/other existing credit	new car	250.0	>=1000	1<=X<4	2.0	female div/dep/mar	NaN	2.0	real estate	41.0	bank	own	2.0	unskilled resident	1.0	NaN	yes	good

1000 rows × 21 columns

另外，若要根據index排序，可以使用.sort_index()。例如，想將資料根據年齡統計筆數後，查看年紀最小的幾個樣本數：

df['age'].value_counts().sort_index().head(10)

age
0     2
0    14
0    14
0    27
0    48
0    44
0    41
0    50
0    51
0    43
Name: count, dtype: int64

Intro to Pandas

Contents

Intro to Pandas#

什麼是 DataFrame？#

認識你的資料#

快速查看DataFrame內容#

檢查遺漏值#

確認欄位值#

資料的統計資訊#

查看你的資料#

抓特定欄位的資料#

抓特定列的資料#

依條件判斷抓取資料#

排序資料#