Intro to Pandas#

pandas 是使用Python進行資料分析,或是快速進行一些資料處理時必用的套件。

在小型專案中,可能也適合使用pandas來建立資料流程的管線。

但在較大型的專案中,資料量大或是資料處理流程的複雜度高的情況下,搭配使用其他工具可能就比較適合了(例如spark或是dbt),端看使用的情境以及專案團隊或組織的資源。

總而言之,pandas是使用Python進行資料分析必學的套件,就讓我們開始吧!

以下假設你已經建立好jupyter notebook的環境(也可以使用google colab),並且安裝好pandas套件。

什麼是 DataFrame?#

首先先來瞭解什麼是DataFrame。

DataFrame是pandas讀取資料表後的形式,跟excel資料表非常相似。

首先第一步是把pandas套件import進來。

import pandas as pd

定義一個DataFrame

# pd.Timestamp 是pandas建立時間資料的方法
# pd.Categorical 是pandas建立類別資料的方法。類別資料與字串不同,具有其他特性,之後會說明。

df = pd.DataFrame(
    {
        "col_1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "col_2": [i/10 for i in range(1,11)],
        "col_3": ['a', 'i', 'u', 'e', 'o', 'a', 'e', 'i', 'o', 'u'],
        "col_4": pd.Timestamp("20240101"),
        "col_5": [True, False] * 5,
        "col_6": pd.Categorical(["low"] * 5 + ["high"] * 5)
    }
)

直接執行它,就會印出DataFrame的內容。

df # or display(df)
col_1 col_2 col_3 col_4 col_5 col_6
0 1 0.1 a 2024-01-01 True low
1 2 0.2 i 2024-01-01 False low
2 3 0.3 u 2024-01-01 True low
3 4 0.4 e 2024-01-01 False low
4 5 0.5 o 2024-01-01 True low
5 6 0.6 a 2024-01-01 False high
6 7 0.7 e 2024-01-01 True high
7 8 0.8 i 2024-01-01 False high
8 9 0.9 o 2024-01-01 True high
9 10 1.0 u 2024-01-01 False high

具有以下組成部分:

  • 索引(index): 最左側數字0~1的部分

    df.index
RangeIndex(start=0, stop=10, step=1)
  • 欄位(column):上方col_1 ~ col_6的部分

df.columns
Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6'], dtype='object')
  • 資料(data):資料內容本身。

df.values
array([[1, 0.1, 'a', Timestamp('2024-01-01 00:00:00'), True, 'low'],
       [2, 0.2, 'i', Timestamp('2024-01-01 00:00:00'), False, 'low'],
       [3, 0.3, 'u', Timestamp('2024-01-01 00:00:00'), True, 'low'],
       [4, 0.4, 'e', Timestamp('2024-01-01 00:00:00'), False, 'low'],
       [5, 0.5, 'o', Timestamp('2024-01-01 00:00:00'), True, 'low'],
       [6, 0.6, 'a', Timestamp('2024-01-01 00:00:00'), False, 'high'],
       [7, 0.7, 'e', Timestamp('2024-01-01 00:00:00'), True, 'high'],
       [8, 0.8, 'i', Timestamp('2024-01-01 00:00:00'), False, 'high'],
       [9, 0.9, 'o', Timestamp('2024-01-01 00:00:00'), True, 'high'],
       [10, 1.0, 'u', Timestamp('2024-01-01 00:00:00'), False, 'high']],
      dtype=object)

可以注意到印出的結果有array的字樣,其實pandas是基於另外一個套件numpy而延伸出來的工具。

numpy也是一個python套件,提供大量的陣列運算功能,而且速度非常快。

numpy的基本資料結構被稱作array

事實上numpy array是一個可以容納多維的陣列,稱作ndarray (n-dimensional array)。

type(df.values)
numpy.ndarray

建立一個numpy array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

Pandas DataFrame的每一欄具有特定的資料型態。

df.dtypes
col_1            int64
col_2          float64
col_3           object
col_4    datetime64[s]
col_5             bool
col_6         category
dtype: object

pandas資料型態大致如下:

  • Int64: integer

  • float64: float

  • object: strings (注意當資料型態混用的時候也會是object)

  • category: pandas特殊的資料型態,基本上也是string,但具有其他方便的特性。

  • bool: 布林值

  • datetime64[ns]: 時間

認識你的資料#

以下的資料內容取自在kaggle找到的資料集:credit_risk_customers (kaggle.com)。可以自行下載成csv檔。

這是一個跟銀行客戶信用風險相關的資料集,資料內容說明請見:OpenML

首先透過以下語法讀入該csv檔:

import pandas as pd

df = pd.read_csv('./data/credit_customers.csv')

在進行資料分析的一開始,通常會先大致瞭解一下資料的長相。

快速查看DataFrame內容#

以下語法可以快速查看DataFrame的相關資訊:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   checking_status         1000 non-null   object 
 1   duration                1000 non-null   float64
 2   credit_history          1000 non-null   object 
 3   purpose                 1000 non-null   object 
 4   credit_amount           1000 non-null   float64
 5   savings_status          1000 non-null   object 
 6   employment              1000 non-null   object 
 7   installment_commitment  1000 non-null   float64
 8   personal_status         1000 non-null   object 
 9   other_parties           93 non-null     object 
 10  residence_since         1000 non-null   float64
 11  property_magnitude      1000 non-null   object 
 12  age                     1000 non-null   float64
 13  other_payment_plans     186 non-null    object 
 14  housing                 1000 non-null   object 
 15  existing_credits        1000 non-null   float64
 16  job                     1000 non-null   object 
 17  num_dependents          1000 non-null   float64
 18  own_telephone           404 non-null    object 
 19  foreign_worker          1000 non-null   object 
 20  class                   1000 non-null   object 
dtypes: float64(7), object(14)
memory usage: 164.2+ KB

查看DataFrame前幾筆資料,可以使用df.head(),預設是顯示5筆,但也可指定筆數。

df.head(3)
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties ... property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
0 <0 6.0 critical/other existing credit radio/tv 1169.0 no known savings >=7 4.0 male single NaN ... real estate 67.0 NaN own 2.0 skilled 1.0 yes yes good
1 0<=X<200 48.0 existing paid radio/tv 5951.0 <100 1<=X<4 2.0 female div/dep/mar NaN ... real estate 22.0 NaN own 1.0 skilled 1.0 NaN yes bad
2 no checking 12.0 critical/other existing credit education 2096.0 <100 4<=X<7 2.0 male single NaN ... real estate 49.0 NaN own 1.0 unskilled resident 2.0 NaN yes good

3 rows × 21 columns

反過來就是:

df.tail()
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties ... property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
995 no checking 12.0 existing paid furniture/equipment 1736.0 <100 4<=X<7 3.0 female div/dep/mar NaN ... real estate 31.0 NaN own 1.0 unskilled resident 1.0 NaN yes good
996 <0 30.0 existing paid used car 3857.0 <100 1<=X<4 4.0 male div/sep NaN ... life insurance 40.0 NaN own 1.0 high qualif/self emp/mgmt 1.0 yes yes good
997 no checking 12.0 existing paid radio/tv 804.0 <100 >=7 4.0 male single NaN ... car 38.0 NaN own 1.0 skilled 1.0 NaN yes good
998 <0 45.0 existing paid radio/tv 1845.0 <100 1<=X<4 4.0 male single NaN ... no known property 23.0 NaN for free 1.0 skilled 1.0 yes yes bad
999 0<=X<200 45.0 critical/other existing credit used car 4576.0 100<=X<500 unemployed 3.0 male single NaN ... car 27.0 NaN own 1.0 skilled 1.0 NaN yes good

5 rows × 21 columns

若要隨機查看的話:

df.sample(3)
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties ... property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
754 no checking 12.0 delayed previously repairs 1555.0 >=1000 >=7 4.0 male single NaN ... no known property 55.0 NaN for free 2.0 skilled 2.0 NaN yes bad
815 0<=X<200 36.0 delayed previously new car 7432.0 <100 1<=X<4 2.0 female div/dep/mar NaN ... life insurance 54.0 NaN rent 1.0 skilled 1.0 NaN yes good
366 no checking 18.0 critical/other existing credit used car 3850.0 <100 4<=X<7 3.0 male single NaN ... car 27.0 NaN own 2.0 skilled 1.0 NaN yes good

3 rows × 21 columns

注意到,由於資料的欄位數較多,實際顯示出來的欄位數較少,中間會以”…”的方式略過。

若想要查看完整的欄位,可以使用以下語法:

pd.set_option("display.max_columns", None)

意思是告訴pandas要顯示出全部的欄位數:

df.head()
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
0 <0 6.0 critical/other existing credit radio/tv 1169.0 no known savings >=7 4.0 male single NaN 4.0 real estate 67.0 NaN own 2.0 skilled 1.0 yes yes good
1 0<=X<200 48.0 existing paid radio/tv 5951.0 <100 1<=X<4 2.0 female div/dep/mar NaN 2.0 real estate 22.0 NaN own 1.0 skilled 1.0 NaN yes bad
2 no checking 12.0 critical/other existing credit education 2096.0 <100 4<=X<7 2.0 male single NaN 3.0 real estate 49.0 NaN own 1.0 unskilled resident 2.0 NaN yes good
3 <0 42.0 existing paid furniture/equipment 7882.0 <100 4<=X<7 2.0 male single guarantor 4.0 life insurance 45.0 NaN for free 1.0 skilled 2.0 NaN yes good
4 <0 24.0 delayed previously new car 4870.0 <100 1<=X<4 3.0 male single NaN 4.0 no known property 53.0 NaN for free 2.0 skilled 2.0 NaN yes bad

查看資料的維度:

df.shape
(1000, 21)

回傳的是一個tuple,第1個元素是index的長度,就是列(row)的數量,第2個元素是column的長度,就是欄(column)的數量。

若要快速得知資料的統計資訊:

df.describe()
duration credit_amount installment_commitment residence_since age existing_credits num_dependents
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 20.903000 3271.258000 2.973000 2.845000 35.546000 1.407000 1.155000
std 12.058814 2822.736876 1.118715 1.103718 11.375469 0.577654 0.362086
min 4.000000 250.000000 1.000000 1.000000 19.000000 1.000000 1.000000
25% 12.000000 1365.500000 2.000000 2.000000 27.000000 1.000000 1.000000
50% 18.000000 2319.500000 3.000000 3.000000 33.000000 1.000000 1.000000
75% 24.000000 3972.250000 4.000000 4.000000 42.000000 2.000000 1.000000
max 72.000000 18424.000000 4.000000 4.000000 75.000000 4.000000 2.000000

注意到只有數值型的資料才會列出。若要納入非數值型的資料的話:

df.describe(include='all')
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
count 1000 1000.000000 1000 1000 1000.000000 1000 1000 1000.000000 1000 93 1000.000000 1000 1000.000000 186 1000 1000.000000 1000 1000.000000 404 1000 1000
unique 4 NaN 5 10 NaN 5 5 NaN 4 2 NaN 4 NaN 2 3 NaN 4 NaN 1 2 2
top no checking NaN existing paid radio/tv NaN <100 1<=X<4 NaN male single guarantor NaN car NaN bank own NaN skilled NaN yes yes good
freq 394 NaN 530 280 NaN 603 339 NaN 548 52 NaN 332 NaN 139 713 NaN 630 NaN 404 963 700
mean NaN 20.903000 NaN NaN 3271.258000 NaN NaN 2.973000 NaN NaN 2.845000 NaN 35.546000 NaN NaN 1.407000 NaN 1.155000 NaN NaN NaN
std NaN 12.058814 NaN NaN 2822.736876 NaN NaN 1.118715 NaN NaN 1.103718 NaN 11.375469 NaN NaN 0.577654 NaN 0.362086 NaN NaN NaN
min NaN 4.000000 NaN NaN 250.000000 NaN NaN 1.000000 NaN NaN 1.000000 NaN 19.000000 NaN NaN 1.000000 NaN 1.000000 NaN NaN NaN
25% NaN 12.000000 NaN NaN 1365.500000 NaN NaN 2.000000 NaN NaN 2.000000 NaN 27.000000 NaN NaN 1.000000 NaN 1.000000 NaN NaN NaN
50% NaN 18.000000 NaN NaN 2319.500000 NaN NaN 3.000000 NaN NaN 3.000000 NaN 33.000000 NaN NaN 1.000000 NaN 1.000000 NaN NaN NaN
75% NaN 24.000000 NaN NaN 3972.250000 NaN NaN 4.000000 NaN NaN 4.000000 NaN 42.000000 NaN NaN 2.000000 NaN 1.000000 NaN NaN NaN
max NaN 72.000000 NaN NaN 18424.000000 NaN NaN 4.000000 NaN NaN 4.000000 NaN 75.000000 NaN NaN 4.000000 NaN 2.000000 NaN NaN NaN

檢查遺漏值#

大致瞭解資料長相之後,可以開始進一步確認資料的細節。資料是否有遺漏值?欄位值有哪些取值?欄位的最大最小值等統計資訊?等等。

首先先來看跟遺漏值相關的語法:

df.isna()
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
0 False False False False False False False False False True False False False True False False False False False False False
1 False False False False False False False False False True False False False True False False False False True False False
2 False False False False False False False False False True False False False True False False False False True False False
3 False False False False False False False False False False False False False True False False False False True False False
4 False False False False False False False False False True False False False True False False False False True False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 False False False False False False False False False True False False False True False False False False True False False
996 False False False False False False False False False True False False False True False False False False False False False
997 False False False False False False False False False True False False False True False False False False True False False
998 False False False False False False False False False True False False False True False False False False False False False
999 False False False False False False False False False True False False False True False False False False True False False

1000 rows × 21 columns

以上這個語法會回傳每個欄位每一列是否為遺漏值的DataFrame,False代表該欄位值沒有遺漏值。True代表有遺漏值。

但通常我們關心的是哪些欄位有遺漏值?有多少筆資料有遺漏值?

所以會想要在把上面的資料根據欄位「加總」,因為True代表1,False代表0,加總後就可以知道遺漏值的數量。

可以這樣寫:

df.isna().sum()
checking_status             0
duration                    0
credit_history              0
purpose                     0
credit_amount               0
savings_status              0
employment                  0
installment_commitment      0
personal_status             0
other_parties             907
residence_since             0
property_magnitude          0
age                         0
other_payment_plans       814
housing                     0
existing_credits            0
job                         0
num_dependents              0
own_telephone             596
foreign_worker              0
class                       0
dtype: int64

上面的語法結構值得說明一下,由於df.isnull()回傳的是一個DataFrame,所以我們可以直接接著call其他DataFrame的方法。

.sum()是一個把DataFrame資料加總的方法,預設是by欄位加總。可以藉由調整參數``axis```來改變加總的方向。

不指定axis參數的話,預設是axis=0,代表沿著axis 0, 把值一個一個加起來,所以是by欄位加總。

axis=1代表by列加總,就可以知道哪幾筆資料有遺漏值。

df.isna().sum(axis=1)
0      2
1      3
2      3
3      2
4      3
      ..
995    3
996    2
997    3
998    2
999    3
Length: 1000, dtype: int64

如果不關心筆數,也可以使用.any(),代表只要有一筆資料為True就會輸出True:

df.isna().any()
checking_status           False
duration                  False
credit_history            False
purpose                   False
credit_amount             False
savings_status            False
employment                False
installment_commitment    False
personal_status           False
other_parties              True
residence_since           False
property_magnitude        False
age                       False
other_payment_plans        True
housing                   False
existing_credits          False
job                       False
num_dependents            False
own_telephone              True
foreign_worker            False
class                     False
dtype: bool

如何確認資料是否有任何遺漏值?

df.isna().any().any()
True

反向的語法是:

df.notna().sum()
checking_status           1000
duration                  1000
credit_history            1000
purpose                   1000
credit_amount             1000
savings_status            1000
employment                1000
installment_commitment    1000
personal_status           1000
other_parties               93
residence_since           1000
property_magnitude        1000
age                       1000
other_payment_plans        186
housing                   1000
existing_credits          1000
job                       1000
num_dependents            1000
own_telephone              404
foreign_worker            1000
class                     1000
dtype: int64

確認欄位值#

如果想知道特定欄位有哪些取值,通常是要初步看一下資料的內容,以及看欄位有沒有什麼異常值。

例如想知道客戶來申辦貸款的目的有哪些?可以對purpose這個欄位使用以下操作:

df['purpose'].unique()
array(['radio/tv', 'education', 'furniture/equipment', 'new car',
       'used car', 'business', 'domestic appliance', 'repairs', 'other',
       'retraining'], dtype=object)

就可以把資料中purpose實際上出現哪些不同值給列出來。

其中,df['purpose']是篩選出欄位資料的語法。也可以使用:

df.purpose
0                 radio/tv
1                 radio/tv
2                education
3      furniture/equipment
4                  new car
              ...         
995    furniture/equipment
996               used car
997               radio/tv
998               radio/tv
999               used car
Name: purpose, Length: 1000, dtype: object

注意到.purpose的語法事實上是把欄位當作是DataFrame的「屬性(attribute)」來取用,所以若欄位名稱與DataFrame原本就有的屬性重複的話會無法使用。

補充說明一下,這裡回傳的東西,其實是pandas Series。Series是pandas的另一個資料結構,DataFrame中每一個欄位都儲存為Series。

type(df.purpose)
pandas.core.series.Series

可以把Series理解為單一維度的numpy array,與numpy array不同的地方在於Series內含具有標籤的索引(index)。

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

print(s)
print(s.index)
a    1
b    2
c    3
d    4
e    5
dtype: int64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

除了看哪些欄位取值,還可以直接算出該欄位有幾種不同取值:

df['purpose'].nunique()
10

或是每個取值有多少筆資料:

df['purpose'].value_counts()
purpose
radio/tv               280
new car                234
furniture/equipment    181
used car               103
business                97
education               50
repairs                 22
domestic appliance      12
other                   12
retraining               9
Name: count, dtype: int64

進一步看佔比:

df['purpose'].value_counts(normalize=True)
purpose
radio/tv               0.280
new car                0.234
furniture/equipment    0.181
used car               0.103
business               0.097
education              0.050
repairs                0.022
domestic appliance     0.012
other                  0.012
retraining             0.009
Name: proportion, dtype: float64

注意到value_counts()會忽略空值:

df['other_parties'].value_counts() # 明顯不到1000筆資料
other_parties
guarantor       52
co applicant    41
Name: count, dtype: int64

可以使用dropna=False參數:

df['other_parties'].value_counts(dropna=False)
other_parties
NaN             907
guarantor        52
co applicant     41
Name: count, dtype: int64

針對數值型欄位可以使用bins參數,放入要切分的組數。 這邊組別的切分是依據欄位值的全距去切指定的等份。

df['age'].value_counts(bins=7)
(27.0, 35.0]                  297
(18.942999999999998, 27.0]    291
(35.0, 43.0]                  194
(43.0, 51.0]                  113
(51.0, 59.0]                   54
(59.0, 67.0]                   41
(67.0, 75.0]                   10
Name: count, dtype: int64

如果要看個別欄位的前幾大或倒數前幾大的值是哪些(參數可指定列數):

df['duration'].nlargest()
677    72.0
29     60.0
134    60.0
255    60.0
332    60.0
Name: duration, dtype: float64
df['duration'].nsmallest()
234    4.0
314    4.0
527    4.0
734    4.0
837    4.0
Name: duration, dtype: float64

或是可以用排序的方法:

df['duration'].sort_values(ascending=False).head()
677    72.0
973    60.0
685    60.0
672    60.0
714    60.0
Name: duration, dtype: float64
df['duration'].sort_values().head()
314    4.0
734    4.0
527    4.0
943    4.0
837    4.0
Name: duration, dtype: float64

資料的統計資訊#

除了使用.describe()快速查看資料統計數據外,還可以使用各種方法來依照需求做計算。

df['duration'].min()
4.0
df['duration'].max()
72.0
df['duration'].mean()
20.903
df['duration'].median()
18.0
df['duration'].var()
145.41500600600602
df['duration'].std()
12.058814452756375
df['duration'].quantile(0.8)
30.0
df['duration'].quantile([0.1,0.3,0.45,0.6,0.8,1])
0.10     9.0
0.30    12.0
0.45    18.0
0.60    24.0
0.80    30.0
1.00    72.0
Name: duration, dtype: float64

還可以很方便算多個欄位的共變數、相關係數:

df[['age', 'duration', 'credit_amount']].cov()
age duration credit_amount
age 129.401285 -4.956995 1.050523e+03
duration -4.956995 145.415006 2.127375e+04
credit_amount 1050.522655 21273.749776 7.967843e+06
df[['age', 'duration', 'credit_amount']].corr()
age duration credit_amount
age 1.000000 -0.036136 0.032716
duration -0.036136 1.000000 0.624984
credit_amount 0.032716 0.624984 1.000000

查看你的資料#

有了對資料的大致認識後,可能會想對資料的部分欄位或部份樣本做一些後續處理,例如填補遺漏值、對欄位做加工等等。

但在此之前需要先知道怎麼擷取你想要的特定資料。

抓特定欄位的資料#

首先第一種方法已經介紹過了,就是直接在DataFrame後面加[想要抓的欄位]

df['class']
0      good
1       bad
2      good
3      good
4       bad
       ... 
995    good
996    good
997    good
998     bad
999    good
Name: class, Length: 1000, dtype: object

當要一次抓取多個欄位時,必須放入一個欄位名稱的list:

df[['age', 'duration', 'credit_amount']]
age duration credit_amount
0 67.0 6.0 1169.0
1 22.0 48.0 5951.0
2 49.0 12.0 2096.0
3 45.0 42.0 7882.0
4 53.0 24.0 4870.0
... ... ... ...
995 31.0 12.0 1736.0
996 40.0 30.0 3857.0
997 38.0 12.0 804.0
998 23.0 45.0 1845.0
999 27.0 45.0 4576.0

1000 rows × 3 columns

另一種方法是使用.loc[],使用方式如下:

df.loc[:, 'class']
0      good
1       bad
2      good
3      good
4       bad
       ... 
995    good
996    good
997    good
998     bad
999    good
Name: class, Length: 1000, dtype: object

多個欄位的話一樣是要放list:

df.loc[:, ['age', 'duration', 'credit_amount']]
age duration credit_amount
0 67.0 6.0 1169.0
1 22.0 48.0 5951.0
2 49.0 12.0 2096.0
3 45.0 42.0 7882.0
4 53.0 24.0 4870.0
... ... ... ...
995 31.0 12.0 1736.0
996 40.0 30.0 3857.0
997 38.0 12.0 804.0
998 23.0 45.0 1845.0
999 27.0 45.0 4576.0

1000 rows × 3 columns

.loc[]類似的,iloc[]也可以做到篩選,只是要指定的是「第幾個」欄位:

df.iloc[:, [12, 1, 4]]
age duration credit_amount
0 67.0 6.0 1169.0
1 22.0 48.0 5951.0
2 49.0 12.0 2096.0
3 45.0 42.0 7882.0
4 53.0 24.0 4870.0
... ... ... ...
995 31.0 12.0 1736.0
996 40.0 30.0 3857.0
997 38.0 12.0 804.0
998 23.0 45.0 1845.0
999 27.0 45.0 4576.0

1000 rows × 3 columns

注意到:的使用,可以一次指定多個欄位,使用方式跟list的index一樣:

df.iloc[:, 0:4]
checking_status duration credit_history purpose
0 <0 6.0 critical/other existing credit radio/tv
1 0<=X<200 48.0 existing paid radio/tv
2 no checking 12.0 critical/other existing credit education
3 <0 42.0 existing paid furniture/equipment
4 <0 24.0 delayed previously new car
... ... ... ... ...
995 no checking 12.0 existing paid furniture/equipment
996 <0 30.0 existing paid used car
997 no checking 12.0 existing paid radio/tv
998 <0 45.0 existing paid radio/tv
999 0<=X<200 45.0 critical/other existing credit used car

1000 rows × 4 columns

.loc[]也可以使用:,但就必須符合原本的欄位順序:

print(df.columns)

# 順序錯了,選不到東西
df.loc[:, "purpose":"duration"]
Index(['checking_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings_status', 'employment',
       'installment_commitment', 'personal_status', 'other_parties',
       'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
       'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
       'foreign_worker', 'class'],
      dtype='object')
0
1
2
3
4
...
995
996
997
998
999

1000 rows × 0 columns

df.loc[:, "duration":"purpose"]
duration credit_history purpose
0 6.0 critical/other existing credit radio/tv
1 48.0 existing paid radio/tv
2 12.0 critical/other existing credit education
3 42.0 existing paid furniture/equipment
4 24.0 delayed previously new car
... ... ... ...
995 12.0 existing paid furniture/equipment
996 30.0 existing paid used car
997 12.0 existing paid radio/tv
998 45.0 existing paid radio/tv
999 45.0 critical/other existing credit used car

1000 rows × 3 columns

再來一種方法是使用.filter()搭配items=[欄位名稱]的參數:

df.filter(items=['age', 'duration', 'credit_amount'])
age duration credit_amount
0 67.0 6.0 1169.0
1 22.0 48.0 5951.0
2 49.0 12.0 2096.0
3 45.0 42.0 7882.0
4 53.0 24.0 4870.0
... ... ... ...
995 31.0 12.0 1736.0
996 40.0 30.0 3857.0
997 38.0 12.0 804.0
998 23.0 45.0 1845.0
999 27.0 45.0 4576.0

1000 rows × 3 columns

根據資料型態抓欄位:

df.select_dtypes(include='float')
duration credit_amount installment_commitment residence_since age existing_credits num_dependents
0 6.0 1169.0 4.0 4.0 67.0 2.0 1.0
1 48.0 5951.0 2.0 2.0 22.0 1.0 1.0
2 12.0 2096.0 2.0 3.0 49.0 1.0 2.0
3 42.0 7882.0 2.0 4.0 45.0 1.0 2.0
4 24.0 4870.0 3.0 4.0 53.0 2.0 2.0
... ... ... ... ... ... ... ...
995 12.0 1736.0 3.0 4.0 31.0 1.0 1.0
996 30.0 3857.0 4.0 4.0 40.0 1.0 1.0
997 12.0 804.0 4.0 4.0 38.0 1.0 1.0
998 45.0 1845.0 4.0 4.0 23.0 1.0 1.0
999 45.0 4576.0 3.0 4.0 27.0 1.0 1.0

1000 rows × 7 columns

根據欄位名稱的一些特徵來抓資料:

df.filter(like='_')
checking_status credit_history credit_amount savings_status installment_commitment personal_status other_parties residence_since property_magnitude other_payment_plans existing_credits num_dependents own_telephone foreign_worker
0 <0 critical/other existing credit 1169.0 no known savings 4.0 male single NaN 4.0 real estate NaN 2.0 1.0 yes yes
1 0<=X<200 existing paid 5951.0 <100 2.0 female div/dep/mar NaN 2.0 real estate NaN 1.0 1.0 NaN yes
2 no checking critical/other existing credit 2096.0 <100 2.0 male single NaN 3.0 real estate NaN 1.0 2.0 NaN yes
3 <0 existing paid 7882.0 <100 2.0 male single guarantor 4.0 life insurance NaN 1.0 2.0 NaN yes
4 <0 delayed previously 4870.0 <100 3.0 male single NaN 4.0 no known property NaN 2.0 2.0 NaN yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 no checking existing paid 1736.0 <100 3.0 female div/dep/mar NaN 4.0 real estate NaN 1.0 1.0 NaN yes
996 <0 existing paid 3857.0 <100 4.0 male div/sep NaN 4.0 life insurance NaN 1.0 1.0 yes yes
997 no checking existing paid 804.0 <100 4.0 male single NaN 4.0 car NaN 1.0 1.0 NaN yes
998 <0 existing paid 1845.0 <100 4.0 male single NaN 4.0 no known property NaN 1.0 1.0 yes yes
999 0<=X<200 critical/other existing credit 4576.0 100<=X<500 3.0 male single NaN 4.0 car NaN 1.0 1.0 NaN yes

1000 rows × 14 columns

若要把DataFrame欄位根據指定的順序做排序的話,就必須先設定好欄位順序在list內,然後再把資料抓出來:

ordered_cols = ['class', 'credit_amount', 'age', 'duration', 'existing_credits']

df[ordered_cols]
class credit_amount age duration existing_credits
0 good 1169.0 67.0 6.0 2.0
1 bad 5951.0 22.0 48.0 1.0
2 good 2096.0 49.0 12.0 1.0
3 good 7882.0 45.0 42.0 1.0
4 bad 4870.0 53.0 24.0 2.0
... ... ... ... ... ...
995 good 1736.0 31.0 12.0 1.0
996 good 3857.0 40.0 30.0 1.0
997 good 804.0 38.0 12.0 1.0
998 bad 1845.0 23.0 45.0 1.0
999 good 4576.0 27.0 45.0 1.0

1000 rows × 5 columns

抓特定列的資料#

直接根據列名稱去抓資料的情境雖然比較少,但一樣可以透過.loc[].iloc[]做到:

df.loc[7:10, :]
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
7 0<=X<200 36.0 existing paid used car 6948.0 <100 1<=X<4 2.0 male single NaN 2.0 car 35.0 NaN rent 1.0 high qualif/self emp/mgmt 1.0 yes yes good
8 no checking 12.0 existing paid radio/tv 3059.0 >=1000 4<=X<7 2.0 male div/sep NaN 4.0 real estate 61.0 NaN own 1.0 unskilled resident 1.0 NaN yes good
9 0<=X<200 30.0 critical/other existing credit new car 5234.0 <100 unemployed 4.0 male mar/wid NaN 2.0 car 28.0 NaN own 2.0 high qualif/self emp/mgmt 1.0 NaN yes bad
10 0<=X<200 12.0 existing paid new car 1295.0 <100 <1 3.0 female div/dep/mar NaN 1.0 car 25.0 NaN rent 1.0 skilled 1.0 NaN yes bad

在這邊,注意到因為index的順序跟label名稱一樣,但.loc[]的行為是判斷label名稱,.iloc[]則是類似list做indexing的概念,並不會包含到指定的最後一個index:

df.iloc[7:10, :]
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
7 0<=X<200 36.0 existing paid used car 6948.0 <100 1<=X<4 2.0 male single NaN 2.0 car 35.0 NaN rent 1.0 high qualif/self emp/mgmt 1.0 yes yes good
8 no checking 12.0 existing paid radio/tv 3059.0 >=1000 4<=X<7 2.0 male div/sep NaN 4.0 real estate 61.0 NaN own 1.0 unskilled resident 1.0 NaN yes good
9 0<=X<200 30.0 critical/other existing credit new car 5234.0 <100 unemployed 4.0 male mar/wid NaN 2.0 car 28.0 NaN own 2.0 high qualif/self emp/mgmt 1.0 NaN yes bad

依條件判斷抓取資料#

最直接的方法,例如要抓age > 40以上的資料:

df[df['age'] > 40]
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
0 <0 6.0 critical/other existing credit radio/tv 1169.0 no known savings >=7 4.0 male single NaN 4.0 real estate 67.0 NaN own 2.0 skilled 1.0 yes yes good
2 no checking 12.0 critical/other existing credit education 2096.0 <100 4<=X<7 2.0 male single NaN 3.0 real estate 49.0 NaN own 1.0 unskilled resident 2.0 NaN yes good
3 <0 42.0 existing paid furniture/equipment 7882.0 <100 4<=X<7 2.0 male single guarantor 4.0 life insurance 45.0 NaN for free 1.0 skilled 2.0 NaN yes good
4 <0 24.0 delayed previously new car 4870.0 <100 1<=X<4 3.0 male single NaN 4.0 no known property 53.0 NaN for free 2.0 skilled 2.0 NaN yes bad
6 no checking 24.0 existing paid furniture/equipment 2835.0 500<=X<1000 >=7 3.0 male single NaN 4.0 life insurance 53.0 NaN own 1.0 skilled 1.0 NaN yes good
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
978 no checking 24.0 delayed previously new car 2538.0 <100 >=7 4.0 male single NaN 4.0 car 47.0 NaN own 2.0 unskilled resident 2.0 NaN yes bad
980 0<=X<200 30.0 critical/other existing credit furniture/equipment 8386.0 <100 4<=X<7 2.0 male single NaN 2.0 life insurance 49.0 NaN own 1.0 skilled 1.0 NaN yes bad
987 no checking 13.0 existing paid radio/tv 1409.0 100<=X<500 unemployed 2.0 female div/dep/mar NaN 4.0 real estate 64.0 NaN own 1.0 skilled 1.0 NaN yes good
989 0<=X<200 24.0 critical/other existing credit radio/tv 1743.0 <100 >=7 4.0 male single NaN 2.0 life insurance 48.0 NaN own 2.0 unskilled resident 1.0 NaN yes good
994 no checking 12.0 existing paid new car 2390.0 no known savings >=7 4.0 male single NaN 3.0 car 50.0 NaN own 1.0 skilled 1.0 yes yes good

274 rows × 21 columns

也可以利用.loc[]

df.loc[df['age'] > 40, :]
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
0 <0 6.0 critical/other existing credit radio/tv 1169.0 no known savings >=7 4.0 male single NaN 4.0 real estate 67.0 NaN own 2.0 skilled 1.0 yes yes good
2 no checking 12.0 critical/other existing credit education 2096.0 <100 4<=X<7 2.0 male single NaN 3.0 real estate 49.0 NaN own 1.0 unskilled resident 2.0 NaN yes good
3 <0 42.0 existing paid furniture/equipment 7882.0 <100 4<=X<7 2.0 male single guarantor 4.0 life insurance 45.0 NaN for free 1.0 skilled 2.0 NaN yes good
4 <0 24.0 delayed previously new car 4870.0 <100 1<=X<4 3.0 male single NaN 4.0 no known property 53.0 NaN for free 2.0 skilled 2.0 NaN yes bad
6 no checking 24.0 existing paid furniture/equipment 2835.0 500<=X<1000 >=7 3.0 male single NaN 4.0 life insurance 53.0 NaN own 1.0 skilled 1.0 NaN yes good
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
978 no checking 24.0 delayed previously new car 2538.0 <100 >=7 4.0 male single NaN 4.0 car 47.0 NaN own 2.0 unskilled resident 2.0 NaN yes bad
980 0<=X<200 30.0 critical/other existing credit furniture/equipment 8386.0 <100 4<=X<7 2.0 male single NaN 2.0 life insurance 49.0 NaN own 1.0 skilled 1.0 NaN yes bad
987 no checking 13.0 existing paid radio/tv 1409.0 100<=X<500 unemployed 2.0 female div/dep/mar NaN 4.0 real estate 64.0 NaN own 1.0 skilled 1.0 NaN yes good
989 0<=X<200 24.0 critical/other existing credit radio/tv 1743.0 <100 >=7 4.0 male single NaN 2.0 life insurance 48.0 NaN own 2.0 unskilled resident 1.0 NaN yes good
994 no checking 12.0 existing paid new car 2390.0 no known savings >=7 4.0 male single NaN 3.0 car 50.0 NaN own 1.0 skilled 1.0 yes yes good

274 rows × 21 columns

或是使用.query()方法:

df.query('age > 40')
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
0 <0 6.0 critical/other existing credit radio/tv 1169.0 no known savings >=7 4.0 male single NaN 4.0 real estate 67.0 NaN own 2.0 skilled 1.0 yes yes good
2 no checking 12.0 critical/other existing credit education 2096.0 <100 4<=X<7 2.0 male single NaN 3.0 real estate 49.0 NaN own 1.0 unskilled resident 2.0 NaN yes good
3 <0 42.0 existing paid furniture/equipment 7882.0 <100 4<=X<7 2.0 male single guarantor 4.0 life insurance 45.0 NaN for free 1.0 skilled 2.0 NaN yes good
4 <0 24.0 delayed previously new car 4870.0 <100 1<=X<4 3.0 male single NaN 4.0 no known property 53.0 NaN for free 2.0 skilled 2.0 NaN yes bad
6 no checking 24.0 existing paid furniture/equipment 2835.0 500<=X<1000 >=7 3.0 male single NaN 4.0 life insurance 53.0 NaN own 1.0 skilled 1.0 NaN yes good
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
978 no checking 24.0 delayed previously new car 2538.0 <100 >=7 4.0 male single NaN 4.0 car 47.0 NaN own 2.0 unskilled resident 2.0 NaN yes bad
980 0<=X<200 30.0 critical/other existing credit furniture/equipment 8386.0 <100 4<=X<7 2.0 male single NaN 2.0 life insurance 49.0 NaN own 1.0 skilled 1.0 NaN yes bad
987 no checking 13.0 existing paid radio/tv 1409.0 100<=X<500 unemployed 2.0 female div/dep/mar NaN 4.0 real estate 64.0 NaN own 1.0 skilled 1.0 NaN yes good
989 0<=X<200 24.0 critical/other existing credit radio/tv 1743.0 <100 >=7 4.0 male single NaN 2.0 life insurance 48.0 NaN own 2.0 unskilled resident 1.0 NaN yes good
994 no checking 12.0 existing paid new car 2390.0 no known savings >=7 4.0 male single NaN 3.0 car 50.0 NaN own 1.0 skilled 1.0 yes yes good

274 rows × 21 columns

邏輯條件可以放多個:

df[(df['age'] > 40) & (df['duration'] > 50)]
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
29 <0 60.0 delayed previously business 6836.0 <100 >=7 3.0 male single NaN 4.0 no known property 63.0 NaN own 2.0 skilled 1.0 yes yes bad
95 0<=X<200 54.0 no credits/all paid business 15945.0 <100 <1 3.0 male single NaN 4.0 no known property 58.0 NaN rent 1.0 skilled 1.0 yes yes bad
373 no checking 60.0 critical/other existing credit new car 13756.0 no known savings >=7 2.0 male single NaN 4.0 no known property 63.0 bank for free 1.0 high qualif/self emp/mgmt 1.0 yes yes good
374 0<=X<200 60.0 all paid other 14782.0 100<=X<500 >=7 3.0 female div/dep/mar NaN 4.0 no known property 60.0 bank for free 2.0 high qualif/self emp/mgmt 1.0 yes yes bad
672 no checking 60.0 existing paid new car 10366.0 <100 >=7 2.0 male single NaN 4.0 life insurance 42.0 NaN own 1.0 high qualif/self emp/mgmt 1.0 yes yes good
938 0<=X<200 60.0 existing paid education 6288.0 <100 1<=X<4 4.0 male single NaN 4.0 no known property 42.0 NaN for free 1.0 skilled 1.0 NaN yes bad
df.query('age > 40 & duration > 50')
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
29 <0 60.0 delayed previously business 6836.0 <100 >=7 3.0 male single NaN 4.0 no known property 63.0 NaN own 2.0 skilled 1.0 yes yes bad
95 0<=X<200 54.0 no credits/all paid business 15945.0 <100 <1 3.0 male single NaN 4.0 no known property 58.0 NaN rent 1.0 skilled 1.0 yes yes bad
373 no checking 60.0 critical/other existing credit new car 13756.0 no known savings >=7 2.0 male single NaN 4.0 no known property 63.0 bank for free 1.0 high qualif/self emp/mgmt 1.0 yes yes good
374 0<=X<200 60.0 all paid other 14782.0 100<=X<500 >=7 3.0 female div/dep/mar NaN 4.0 no known property 60.0 bank for free 2.0 high qualif/self emp/mgmt 1.0 yes yes bad
672 no checking 60.0 existing paid new car 10366.0 <100 >=7 2.0 male single NaN 4.0 life insurance 42.0 NaN own 1.0 high qualif/self emp/mgmt 1.0 yes yes good
938 0<=X<200 60.0 existing paid education 6288.0 <100 1<=X<4 4.0 male single NaN 4.0 no known property 42.0 NaN for free 1.0 skilled 1.0 NaN yes bad

為了避免條件太多,造成閱讀困難,可以做一些前處理:

c1 = df['age'] > 40
c2 = df['duration'] > 50
c3 = df['credit_amount'] > 10000

cond = c1 & c2 & c3

df[cond]
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
95 0<=X<200 54.0 no credits/all paid business 15945.0 <100 <1 3.0 male single NaN 4.0 no known property 58.0 NaN rent 1.0 skilled 1.0 yes yes bad
373 no checking 60.0 critical/other existing credit new car 13756.0 no known savings >=7 2.0 male single NaN 4.0 no known property 63.0 bank for free 1.0 high qualif/self emp/mgmt 1.0 yes yes good
374 0<=X<200 60.0 all paid other 14782.0 100<=X<500 >=7 3.0 female div/dep/mar NaN 4.0 no known property 60.0 bank for free 2.0 high qualif/self emp/mgmt 1.0 yes yes bad
672 no checking 60.0 existing paid new car 10366.0 <100 >=7 2.0 male single NaN 4.0 life insurance 42.0 NaN own 1.0 high qualif/self emp/mgmt 1.0 yes yes good

或是搭配f-string以及.query()

c1 = 'age > 40'
c2 = 'duration > 50'
c3 = 'credit_amount > 10000'

cond = f'{c1} & {c2} & {c3}'

df.query(cond)
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
95 0<=X<200 54.0 no credits/all paid business 15945.0 <100 <1 3.0 male single NaN 4.0 no known property 58.0 NaN rent 1.0 skilled 1.0 yes yes bad
373 no checking 60.0 critical/other existing credit new car 13756.0 no known savings >=7 2.0 male single NaN 4.0 no known property 63.0 bank for free 1.0 high qualif/self emp/mgmt 1.0 yes yes good
374 0<=X<200 60.0 all paid other 14782.0 100<=X<500 >=7 3.0 female div/dep/mar NaN 4.0 no known property 60.0 bank for free 2.0 high qualif/self emp/mgmt 1.0 yes yes bad
672 no checking 60.0 existing paid new car 10366.0 <100 >=7 2.0 male single NaN 4.0 life insurance 42.0 NaN own 1.0 high qualif/self emp/mgmt 1.0 yes yes good

其他根據欄位值篩選,常用的方法有:

df[df['age'].between(30, 35)]
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
5 no checking 36.0 existing paid education 9055.0 no known savings 1<=X<4 2.0 male single NaN 4.0 no known property 35.0 NaN for free 1.0 unskilled resident 2.0 yes yes good
7 0<=X<200 36.0 existing paid used car 6948.0 <100 1<=X<4 2.0 male single NaN 2.0 car 35.0 NaN rent 1.0 high qualif/self emp/mgmt 1.0 yes yes good
15 <0 24.0 existing paid radio/tv 1282.0 100<=X<500 1<=X<4 4.0 female div/dep/mar NaN 2.0 car 32.0 NaN own 1.0 unskilled resident 1.0 NaN yes bad
19 no checking 24.0 existing paid radio/tv 3430.0 500<=X<1000 >=7 3.0 male single NaN 2.0 car 31.0 NaN own 1.0 skilled 2.0 yes yes good
28 0<=X<200 7.0 existing paid radio/tv 2415.0 <100 1<=X<4 3.0 male single guarantor 2.0 real estate 34.0 NaN own 1.0 skilled 1.0 NaN yes good
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
984 no checking 24.0 critical/other existing credit furniture/equipment 2028.0 <100 4<=X<7 2.0 male single NaN 2.0 life insurance 30.0 NaN own 2.0 unskilled resident 1.0 NaN yes good
986 >=200 42.0 no credits/all paid business 6289.0 <100 <1 2.0 male div/sep NaN 1.0 life insurance 33.0 NaN own 2.0 skilled 1.0 NaN yes good
991 no checking 15.0 all paid radio/tv 1569.0 100<=X<500 >=7 4.0 male single NaN 4.0 car 34.0 bank own 1.0 unskilled resident 2.0 NaN yes good
993 <0 36.0 existing paid furniture/equipment 3959.0 <100 unemployed 4.0 male single NaN 3.0 life insurance 30.0 NaN own 1.0 high qualif/self emp/mgmt 1.0 yes yes good
995 no checking 12.0 existing paid furniture/equipment 1736.0 <100 4<=X<7 3.0 female div/dep/mar NaN 4.0 real estate 31.0 NaN own 1.0 unskilled resident 1.0 NaN yes good

217 rows × 21 columns

或是根據欄位值清單篩選:

df[df['purpose'].isin(['new car', 'used card'])]
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
4 <0 24.0 delayed previously new car 4870.0 <100 1<=X<4 3.0 male single NaN 4.0 no known property 53.0 NaN for free 2.0 skilled 2.0 NaN yes bad
9 0<=X<200 30.0 critical/other existing credit new car 5234.0 <100 unemployed 4.0 male mar/wid NaN 2.0 car 28.0 NaN own 2.0 high qualif/self emp/mgmt 1.0 NaN yes bad
10 0<=X<200 12.0 existing paid new car 1295.0 <100 <1 3.0 female div/dep/mar NaN 1.0 car 25.0 NaN rent 1.0 skilled 1.0 NaN yes bad
13 <0 24.0 critical/other existing credit new car 1199.0 <100 >=7 4.0 male single NaN 4.0 car 60.0 NaN own 2.0 unskilled resident 1.0 NaN yes bad
14 <0 15.0 existing paid new car 1403.0 <100 1<=X<4 2.0 female div/dep/mar NaN 4.0 car 28.0 NaN rent 1.0 skilled 1.0 NaN yes good
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
972 <0 24.0 all paid new car 1193.0 <100 unemployed 1.0 female div/dep/mar co applicant 4.0 no known property 29.0 NaN rent 2.0 unemp/unskilled non res 1.0 NaN yes bad
978 no checking 24.0 delayed previously new car 2538.0 <100 >=7 4.0 male single NaN 4.0 car 47.0 NaN own 2.0 unskilled resident 2.0 NaN yes bad
979 0<=X<200 15.0 all paid new car 1264.0 100<=X<500 1<=X<4 2.0 male mar/wid NaN 2.0 life insurance 25.0 NaN rent 1.0 skilled 1.0 NaN yes bad
982 >=200 21.0 existing paid new car 2923.0 100<=X<500 1<=X<4 1.0 female div/dep/mar NaN 1.0 car 28.0 bank own 1.0 high qualif/self emp/mgmt 1.0 yes yes good
994 no checking 12.0 existing paid new car 2390.0 no known savings >=7 4.0 male single NaN 3.0 car 50.0 NaN own 1.0 skilled 1.0 yes yes good

234 rows × 21 columns

排序資料#

要排序資料可以使用前面用過的.sort_values()語法,參數by指定欄位,ascending指定方向:

df.sort_values(by='credit_amount', ascending=False)
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties residence_since property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
915 0<=X<200 48.0 no credits/all paid other 18424.0 <100 1<=X<4 1.0 female div/dep/mar NaN 2.0 life insurance 32.0 bank own 1.0 high qualif/self emp/mgmt 1.0 yes no bad
95 0<=X<200 54.0 no credits/all paid business 15945.0 <100 <1 3.0 male single NaN 4.0 no known property 58.0 NaN rent 1.0 skilled 1.0 yes yes bad
818 <0 36.0 existing paid other 15857.0 <100 unemployed 2.0 male div/sep co applicant 3.0 car 43.0 NaN own 1.0 high qualif/self emp/mgmt 1.0 NaN yes good
887 0<=X<200 48.0 existing paid business 15672.0 <100 1<=X<4 2.0 male single NaN 2.0 car 23.0 NaN own 1.0 skilled 1.0 yes yes bad
637 no checking 60.0 delayed previously radio/tv 15653.0 <100 4<=X<7 2.0 male single NaN 4.0 car 21.0 NaN own 2.0 skilled 1.0 yes yes good
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
458 <0 6.0 existing paid domestic appliance 343.0 <100 <1 4.0 female div/dep/mar NaN 1.0 real estate 27.0 NaN own 1.0 skilled 1.0 NaN yes good
157 <0 12.0 all paid retraining 339.0 <100 >=7 4.0 male mar/wid NaN 1.0 car 45.0 bank own 1.0 unskilled resident 1.0 NaN yes good
177 <0 6.0 critical/other existing credit radio/tv 338.0 500<=X<1000 >=7 4.0 male single NaN 4.0 car 52.0 NaN own 2.0 skilled 1.0 NaN yes good
309 0<=X<200 9.0 existing paid new car 276.0 <100 1<=X<4 4.0 male mar/wid NaN 4.0 real estate 22.0 NaN rent 1.0 unskilled resident 1.0 NaN yes good
725 no checking 6.0 critical/other existing credit new car 250.0 >=1000 1<=X<4 2.0 female div/dep/mar NaN 2.0 real estate 41.0 bank own 2.0 unskilled resident 1.0 NaN yes good

1000 rows × 21 columns

另外,若要根據index排序,可以使用.sort_index()。例如,想將資料根據年齡統計筆數後,查看年紀最小的幾個樣本數:

df['age'].value_counts().sort_index().head(10)
age
19.0     2
20.0    14
21.0    14
22.0    27
23.0    48
24.0    44
25.0    41
26.0    50
27.0    51
28.0    43
Name: count, dtype: int64