Common pandas operations

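The previous chapter covered how to extract basic information about a dataset: its dimensions, a quick look at the contents, summary statistics, missing values, and whether any column contains outliers.

This section introduces operations and transformations on a DataFrame or on its columns.

We use the same dataset as before. First, load the data: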

import pandas as pd

df = pd.read_csv('./data/credit_customers.csv')

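Basic operations

Renaming columns

One of the columns is named "class", which is also a Python keyword, so it is advisable to rename it to avoid trouble later.

That column also happens to be the label this dataset uses to predict default, so we can name it "label".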

df.rename(columns={'class': 'label'})

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	...	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	label
0	<0	6.0	critical/other existing credit	radio/tv	1169.0	no known savings	>=7	4.0	male single	NaN	...	real estate	67.0	NaN	own	2.0	skilled	1.0	yes	yes	good
1	0<=X<200	48.0	existing paid	radio/tv	5951.0	<100	1<=X<4	2.0	female div/dep/mar	NaN	...	real estate	22.0	NaN	own	1.0	skilled	1.0	NaN	yes	bad
2	no checking	12.0	critical/other existing credit	education	2096.0	<100	4<=X<7	2.0	male single	NaN	...	real estate	49.0	NaN	own	1.0	unskilled resident	2.0	NaN	yes	good
3	<0	42.0	existing paid	furniture/equipment	7882.0	<100	4<=X<7	2.0	male single	guarantor	...	life insurance	45.0	NaN	for free	1.0	skilled	2.0	NaN	yes	good
4	<0	24.0	delayed previously	new car	4870.0	<100	1<=X<4	3.0	male single	NaN	...	no known property	53.0	NaN	for free	2.0	skilled	2.0	NaN	yes	bad
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	no checking	12.0	existing paid	furniture/equipment	1736.0	<100	4<=X<7	3.0	female div/dep/mar	NaN	...	real estate	31.0	NaN	own	1.0	unskilled resident	1.0	NaN	yes	good
996	<0	30.0	existing paid	used car	3857.0	<100	1<=X<4	4.0	male div/sep	NaN	...	life insurance	40.0	NaN	own	1.0	high qualif/self emp/mgmt	1.0	yes	yes	good
997	no checking	12.0	existing paid	radio/tv	804.0	<100	>=7	4.0	male single	NaN	...	car	38.0	NaN	own	1.0	skilled	1.0	NaN	yes	good
998	<0	45.0	existing paid	radio/tv	1845.0	<100	1<=X<4	4.0	male single	NaN	...	no known property	23.0	NaN	for free	1.0	skilled	1.0	yes	yes	bad
999	0<=X<200	45.0	critical/other existing credit	used car	4576.0	100<=X<500	unemployed	3.0	male single	NaN	...	car	27.0	NaN	own	1.0	skilled	1.0	NaN	yes	good

1000 rows × 21 columns

Note that this call returns a new DataFrame; the original DataFrame is not modified:

# The column is still named "class"
df.columns

Index(['checking_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings_status', 'employment',
       'installment_commitment', 'personal_status', 'other_parties',
       'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
       'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
       'foreign_worker', 'class'],
      dtype='object')

To modify the DataFrame directly, pass inplace=True:

df.rename(columns={'class': 'label'}, inplace=True)

With inplace=True the call returns nothing (None) and modifies the contents of the DataFrame directly.

df.columns

Index(['checking_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings_status', 'employment',
       'installment_commitment', 'personal_status', 'other_parties',
       'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
       'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
       'foreign_worker', 'label'],
      dtype='object')

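Many DataFrame methods accept inplace=True.

In practice, however, using inplace=True is not recommended, because it modifies the data directly: to undo the operation you must either apply the reverse operation or reload the DataFrame, which costs time. Use it only when there is a clear reason to.

A better approach is to assign the result to a new variable. First, the code is easier to read (the assignment makes it explicit that new data is produced). Second, if you later want a different name, only this step has to be redone.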

df_processed = df.rename(columns={'class': 'label'})

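The downsides are extra memory use and the need to name a new object, so in practice several related processing steps are usually combined and run together through method chaining or the .pipe() method (covered in detail in a later chapter).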
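As a rough preview of what chaining and .pipe() can look like, here is a minimal sketch. The toy DataFrame and the helper add_age_in_months are hypothetical and only stand in for the credit dataset; they are not part of the original tutorial.

```python
import pandas as pd

# Toy data standing in for the credit dataset (hypothetical values).
raw = pd.DataFrame({"class": ["good", "bad", "good"], "age": [67.0, 22.0, 49.0]})


def add_age_in_months(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a new frame with an extra age_months column."""
    return frame.assign(age_months=frame["age"] * 12)


# Each step returns a new DataFrame, so the steps can be chained
# without naming every intermediate result.
processed = (
    raw
    .rename(columns={"class": "label"})
    .pipe(add_age_in_months)
)

print(processed.columns.tolist())  # ['label', 'age', 'age_months']
print(raw.columns.tolist())        # ['class', 'age'] (raw is untouched)
```

Only the final result needs a name, which softens both drawbacks mentioned above.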

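Filling missing values

Missing values can be filled with the syntax below. Pay attention to the data type: the fill value should have the same type as the column.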

df['other_payment_plans'].fillna('other')

    other
    other
    other
    other
    other
       ...  
  other
  other
  other
  other
  other
Name: other_payment_plans, Length: 1000, dtype: object

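As before, the original data stays unchanged unless the result is written back. Calling an inplace method on a selected column, as in df['other_payment_plans'].fillna('other', inplace=True), triggers a chained-assignment warning in recent pandas 2.x releases and no longer updates df under Copy-on-Write (the default planned for pandas 3.0), so assigning the result back to the column is the safer form:

df['other_payment_plans'] = df['other_payment_plans'].fillna('other')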

df['other_payment_plans'].isna().sum()
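When several columns need different fill values, DataFrame.fillna also accepts a dict keyed by column name. The sketch below uses a small hypothetical frame, and filling age with the median is only an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column.
toy = pd.DataFrame({
    "other_payment_plans": ["bank", np.nan, "stores"],
    "age": [67.0, np.nan, 49.0],
})

# A dict maps each column to its own fill value, so every column
# keeps its original type (text stays text, numbers stay numbers).
filled = toy.fillna({"other_payment_plans": "other", "age": toy["age"].median()})

print(filled["other_payment_plans"].tolist())  # ['bank', 'other', 'stores']
print(filled["age"].tolist())                  # [67.0, 58.0, 49.0]
```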

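Converting column values

For later modeling, text values usually need to be converted to numbers before they can be fed to a model.

For example, to change the values of label from good/bad to 0/1, store the mapping in a dictionary and use the .map() method: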

mapper = {
    'good': 0,
    'bad': 1
}

df['label'].map(mapper)

    0
    1
    0
    0
    1
      ..
  0
  0
  0
  1
  0
Name: label, Length: 1000, dtype: int64

Note that any value not defined in the mapper becomes a missing value (NaN) after .map().

mapper = {
    'bad': 1
}

df['label'].map(mapper)

    NaN
    1.0
    NaN
    NaN
    1.0
      ... 
  NaN
  NaN
  NaN
  1.0
  NaN
Name: label, Length: 1000, dtype: float64
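Because unmapped values silently turn into NaN, it can help to check which original values the mapper missed. A minimal sketch, assuming a hypothetical label Series that contains one unexpected value:

```python
import pandas as pd

# Hypothetical labels; "unknown" is not covered by the mapper.
labels = pd.Series(["good", "bad", "good", "unknown"], name="label")
mapper = {"good": 0, "bad": 1}

encoded = labels.map(mapper)

# Rows that are NaN after mapping but were not missing before
# are exactly the values the mapper does not define.
unmapped = labels[encoded.isna() & labels.notna()].unique().tolist()
print(unmapped)  # ['unknown']
```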

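The .map() method returns a new Series and does not modify the original DataFrame. It also has no inplace parameter, so the usual approach is to create a new column.

How do you create a new column? The next chapter covers this in detail; here is a preview of one approach: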

df.loc[:, "label_new"] = df['label'].map(mapper)

df.filter(like='label')

	label	label_new
0	good	NaN
1	bad	1.0
2	good	NaN
3	good	NaN
4	bad	1.0
...	...	...
995	good	NaN
996	good	NaN
997	good	NaN
998	bad	1.0
999	good	NaN

1000 rows × 2 columns

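Numeric column operations

Numeric columns often need to be combined through arithmetic. The methods below cover the common cases.

Column arithmetic

Addition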

df['age'] + 1

    68.0
    23.0
    50.0
    46.0
    54.0
       ... 
  32.0
  41.0
  39.0
  24.0
  28.0
Name: age, Length: 1000, dtype: float64

df['age'].add(1)

    68.0
    23.0
    50.0
    46.0
    54.0
       ... 
  32.0
  41.0
  39.0
  24.0
  28.0
Name: age, Length: 1000, dtype: float64

Subtraction

df['age'] - 1

    66.0
    21.0
    48.0
    44.0
    52.0
       ... 
  30.0
  39.0
  37.0
  22.0
  26.0
Name: age, Length: 1000, dtype: float64

df['age'].sub(1)

    66.0
    21.0
    48.0
    44.0
    52.0
       ... 
  30.0
  39.0
  37.0
  22.0
  26.0
Name: age, Length: 1000, dtype: float64

Multiplication

df['installment_commitment'] * 0.01

    0.04
    0.02
    0.02
    0.02
    0.03
       ... 
  0.03
  0.04
  0.04
  0.04
  0.03
Name: installment_commitment, Length: 1000, dtype: float64

df['installment_commitment'].mul(0.01)

    0.04
    0.02
    0.02
    0.02
    0.03
       ... 
  0.03
  0.04
  0.04
  0.04
  0.03
Name: installment_commitment, Length: 1000, dtype: float64

Division

df['duration'] / df['age']

    0.089552
    2.181818
    0.244898
    0.933333
    0.452830
         ...   
  0.387097
  0.750000
  0.315789
  1.956522
  1.666667
Length: 1000, dtype: float64

df['duration'].div(df['age'])

    0.089552
    2.181818
    0.244898
    0.933333
    0.452830
         ...   
  0.387097
  0.750000
  0.315789
  1.956522
  1.666667
Length: 1000, dtype: float64

Floor division (quotient)

df['age'] // 12

    5.0
    1.0
    4.0
    3.0
    4.0
      ... 
  2.0
  3.0
  3.0
  1.0
  2.0
Name: age, Length: 1000, dtype: float64

df['age'].floordiv(12)

    5.0
    1.0
    4.0
    3.0
    4.0
      ... 
  2.0
  3.0
  3.0
  1.0
  2.0
Name: age, Length: 1000, dtype: float64

Modulo (remainder)

df['age'] % 12

     7.0
    10.0
     1.0
     9.0
     5.0
       ... 
   7.0
   4.0
   2.0
  11.0
   3.0
Name: age, Length: 1000, dtype: float64

df['age'].mod(12)

     7.0
    10.0
     1.0
     9.0
     5.0
       ... 
   7.0
   4.0
   2.0
  11.0
   3.0
Name: age, Length: 1000, dtype: float64
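The operator and method forms above give the same results. One practical difference is that the method forms (.add(), .sub(), .mul(), .div(), and so on) accept a fill_value argument for missing data. A small sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

# With the operator, a missing value on either side gives NaN.
plain_sum = a + b
print(plain_sum.tolist())   # [11.0, nan, nan]

# The method form can substitute 0 for a missing value before adding.
filled_sum = a.add(b, fill_value=0)
print(filled_sum.tolist())  # [11.0, 20.0, 3.0]
```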

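Rounding

The number in parentheses is the number of decimal places to round to.

Note that the rounding rule here is "round half to even": rounded to an integer, 5.5 becomes 6 and 4.5 becomes 4. This rule is also known as round-to-even.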

df.loc[:, 'relative_duration'] = df['duration'] / df['age']

df['relative_duration'].round(1)

    0.1
    2.2
    0.2
    0.9
    0.5
      ... 
  0.4
  0.8
  0.3
  2.0
  1.7
Name: relative_duration, Length: 1000, dtype: float64
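The round-half-to-even rule is easiest to see on values that sit exactly halfway between two integers. The sketch below also shows one possible way to get conventional "round half up" behavior for non-negative values; that workaround is an illustration, not something the original text prescribes.

```python
import numpy as np
import pandas as pd

halves = pd.Series([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])

# Series.round() sends exact halves to the nearest even integer.
half_even = halves.round()
print(half_even.tolist())  # [0.0, 2.0, 2.0, 4.0, 4.0, 6.0]

# Assumed alternative for non-negative data: shift by 0.5 and floor.
half_up = np.floor(halves + 0.5)
print(half_up.tolist())    # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```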

Powers

The exponent is given in parentheses.

# Square

df['relative_duration'].pow(2)

    0.008020
    4.760331
    0.059975
    0.871111
    0.205055
         ...   
  0.149844
  0.562500
  0.099723
  3.827977
  2.777778
Name: relative_duration, Length: 1000, dtype: float64

Absolute value

df['age'].sub(df['age'].mean())

    31.454
   -13.546
    13.454
     9.454
    17.454
        ...  
  -4.546
   4.454
   2.454
 -12.546
  -8.546
Name: age, Length: 1000, dtype: float64

df['age'].sub(df['age'].mean()).abs()

    31.454
    13.546
    13.454
     9.454
    17.454
        ...  
   4.546
   4.454
   2.454
  12.546
   8.546
Name: age, Length: 1000, dtype: float64

Restricting values to a range

The first argument is the lower bound and the second is the upper bound.

df['relative_duration'].clip(0.05, 2)

    0.089552
    2.000000
    0.244898
    0.933333
    0.452830
         ...   
  0.387097
  0.750000
  0.315789
  1.956522
  1.666667
Name: relative_duration, Length: 1000, dtype: float64
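The two bounds can also be passed by keyword (lower and upper), and either one may be omitted. A short sketch on hypothetical values:

```python
import pandas as pd

values = pd.Series([-1.0, 0.5, 3.0])

# Clip on both sides: anything below 0 becomes 0, anything above 1 becomes 1.
both = values.clip(lower=0, upper=1)
print(both.tolist())        # [0.0, 0.5, 1.0]

# Clip only from above; the lower side is left alone.
upper_only = values.clip(upper=1)
print(upper_only.tolist())  # [-1.0, 0.5, 1.0]
```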

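Discretizing numeric values

Splitting a numeric variable into bins can make descriptive statistics and visual analysis easier.

The bins parameter gives the cut points, and right=False makes each interval closed on the left and open on the right, i.e. [a, b).

In the example below, the first bin is 0 ≤ age < 20, the second is 20 ≤ age < 40, and so on.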

df.loc[:, "age_bins"] = pd.cut(df['age'], bins=[0, 20, 40, 60, 80], right=False)

df['age_bins'].value_counts().sort_index()

age_bins
[0, 20)       2
[20, 40)    699
[40, 60)    248
[60, 80)     51
Name: count, dtype: int64
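To check how right=False treats values that fall exactly on a cut point, the sketch below bins a few hypothetical boundary ages. The labels argument, which replaces the interval notation with custom names, is an optional extra used here for readability.

```python
import pandas as pd

ages = pd.Series([19, 20, 39, 40])

# With right=False, 20 belongs to [20, 40) and 40 belongs to [40, 60).
binned = pd.cut(
    ages,
    bins=[0, 20, 40, 60],
    right=False,
    labels=["<20", "20-39", "40-59"],
)
print(binned.astype(str).tolist())  # ['<20', '20-39', '20-39', '40-59']
```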

Another option is to cut directly at the quantiles of the data.

pd.qcut(df['age'], q=10)

      (52.0, 75.0]
    (18.999, 23.0]
      (45.0, 52.0]
      (39.0, 45.0]
      (52.0, 75.0]
            ...      
    (30.0, 33.0]
    (39.0, 45.0]
    (36.0, 39.0]
  (18.999, 23.0]
    (26.0, 28.0]
Name: age, Length: 1000, dtype: category
Categories (10, interval[float64, right]): [(18.999, 23.0] < (23.0, 26.0] < (26.0, 28.0] < (28.0, 30.0] ... (36.0, 39.0] < (39.0, 45.0] < (45.0, 52.0] < (52.0, 75.0]]
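Unlike pd.cut, which uses fixed cut points, pd.qcut picks the cut points so that each bin holds roughly the same number of rows. A minimal sketch on hypothetical scores from 1 to 100:

```python
import pandas as pd

scores = pd.Series(range(1, 101))

# q=4 splits at the 25th, 50th, and 75th percentiles.
quartiles = pd.qcut(scores, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
counts = quartiles.value_counts().sort_index()
print(counts.tolist())  # [25, 25, 25, 25]
```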

String column operations

The following examples come from the official pandas API documentation:

Slicing strings

s = pd.Series(["koala", "dog", "chameleon"])

s.str.slice(start=1) # = s.str[1:]

      oala
        og
  hameleon
dtype: object

s.str.slice(start=-1) # = s.str[-1:]

  a
  g
  n
dtype: object

s.str.slice(stop=2) # = s.str[:2]

  ko
  do
  ch
dtype: object

s.str.slice(step=2) # = s.str[::2]

    kaa
     dg
  caeen
dtype: object

s.str.slice(start=0, stop=5, step=3) # = s.str[0:5:3]

  kl
   d
  cm
dtype: object

Checking whether a string is present

import numpy as np

s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.nan])

Check whether a given substring is contained in each value:

s1.str.contains('og')

  False
   True
  False
  False
    NaN
dtype: object

The na parameter sets the value to use when an entry is missing; below it is set to False:

s1.str.contains('og', na=False)

  False
   True
  False
  False
  False
dtype: bool

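The following examples use regular expressions, a pattern language dedicated to string processing (regex=True is already the default for str.contains). The topic is large, so resources are listed at the end of this subsection for further reading.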

s1.str.contains('house|dog', regex=True)

  False
   True
   True
  False
    NaN
dtype: object

s1.str.contains('\\d', regex=True)

  False
  False
  False
   True
    NaN
dtype: object

References:

The Ultimate Guide to using the Python regex module

Regular Expressions: Regexes in Python (Part 1) – Real Python

Regular Expressions: Regexes in Python (Part 2) – Real Python
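Since str.contains treats its pattern as a regular expression by default, characters such as "." have a special meaning. The sketch below, on hypothetical strings, contrasts the default with regex=False (literal matching) and also shows the case parameter.

```python
import pandas as pd

names = pd.Series(["a.b", "ab"])

# As a regex, "." matches any character, so both values match.
as_regex = names.str.contains(".", regex=True)
print(as_regex.tolist())     # [True, True]

# With regex=False, "." is a literal dot.
as_literal = names.str.contains(".", regex=False)
print(as_literal.tolist())   # [True, False]

# case=False makes the match case-insensitive.
ignore_case = names.str.contains("A", case=False)
print(ignore_case.tolist())  # [True, True]
```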

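Replacing strings

s = pd.Series(['foo', 'fuz', np.nan])

A string can simply be replaced with another string.

s.str.replace('f', 'b')

  boo
  buz
  NaN
dtype: object

Regular expressions also work: . matches any single character, so "f" together with the one character after it is replaced with "ba".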

s.str.replace('f.', 'ba', regex=True)

  bao
  baz
  NaN
dtype: object

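Datetime column operations

Generating date sequences

A sequence of consecutive dates can be created by giving a start date and an end date: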

pd.date_range(start='20231129', end='20231207')

DatetimeIndex(['2023-11-29', '2023-11-30', '2023-12-01', '2023-12-02',
               '2023-12-03', '2023-12-04', '2023-12-05', '2023-12-06',
               '2023-12-07'],
              dtype='datetime64[ns]', freq='D')

You can also give only the start date and use the periods parameter to set the length:

pd.date_range(start='20231129', periods=9)

DatetimeIndex(['2023-11-29', '2023-11-30', '2023-12-01', '2023-12-02',
               '2023-12-03', '2023-12-04', '2023-12-05', '2023-12-06',
               '2023-12-07'],
              dtype='datetime64[ns]', freq='D')

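You can also pass freq="M" for a monthly frequency (pandas 2.2 and later deprecate this alias in favor of "ME", for month end):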

pd.date_range(start='20230101', periods=12, freq='M')

DatetimeIndex(['2023-01-31', '2023-02-28', '2023-03-31', '2023-04-30',
               '2023-05-31', '2023-06-30', '2023-07-31', '2023-08-31',
               '2023-09-30', '2023-10-31', '2023-11-30', '2023-12-31'],
              dtype='datetime64[ns]', freq='M')

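The resulting dates, however, fall on month ends. To step forward one calendar month from the start date instead, use a DateOffset object: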

from pandas.tseries.offsets import DateOffset

pd.date_range(start='20230101', periods=12, freq=DateOffset(months=1))

DatetimeIndex(['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01',
               '2023-05-01', '2023-06-01', '2023-07-01', '2023-08-01',
               '2023-09-01', '2023-10-01', '2023-11-01', '2023-12-01'],
              dtype='datetime64[ns]', freq='<DateOffset: months=1>')
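If the goal is simply the first day of every month, the built-in "MS" (month start) frequency alias is another option. A minimal sketch:

```python
import pandas as pd

# "MS" anchors every date to the first day of its month.
month_starts = pd.date_range(start="2023-01-01", periods=3, freq="MS")
print(month_starts.strftime("%Y-%m-%d").tolist())
# ['2023-01-01', '2023-02-01', '2023-03-01']
```

Note that "MS" always lands on day 1, whereas DateOffset(months=1) steps forward from whatever day the start date has.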

For a fixed interval of n days (or n weeks, n seconds, and so on), use a Timedelta object:

pd.date_range(start='20230101', end='20231231', freq=pd.Timedelta(days=15))

DatetimeIndex(['2023-01-01', '2023-01-16', '2023-01-31', '2023-02-15',
               '2023-03-02', '2023-03-17', '2023-04-01', '2023-04-16',
               '2023-05-01', '2023-05-16', '2023-05-31', '2023-06-15',
               '2023-06-30', '2023-07-15', '2023-07-30', '2023-08-14',
               '2023-08-29', '2023-09-13', '2023-09-28', '2023-10-13',
               '2023-10-28', '2023-11-12', '2023-11-27', '2023-12-12',
               '2023-12-27'],
              dtype='datetime64[ns]', freq='15D')

Date arithmetic

Use a Timedelta object:

pd.to_datetime("19930103", format='%Y%m%d') - pd.Timedelta(days=765)

Timestamp('1990-11-30 00:00:00')

A DateOffset object can do the same:

pd.to_datetime("19930103", format='%Y%m%d') - DateOffset(days=765)

Timestamp('1990-11-30 00:00:00')
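For whole days the two objects agree, but they differ for calendar units: Timedelta is a fixed duration, while DateOffset follows the calendar. A small sketch with a hypothetical date at the end of January:

```python
import pandas as pd

day = pd.Timestamp("2023-01-31")

# Exactly 30 days later.
plus_30_days = day + pd.Timedelta(days=30)
print(plus_30_days)   # 2023-03-02 00:00:00

# One calendar month later, capped at the last day of February.
plus_one_month = day + pd.DateOffset(months=1)
print(plus_one_month)  # 2023-02-28 00:00:00
```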

Number of days between two dates

Simply subtract one date from the other:

(pd.to_datetime("20240314", format='%Y%m%d') - 
 pd.to_datetime("20200524", format='%Y%m%d')).days
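The same subtraction works on whole columns; the result is a timedelta Series whose day counts are available through .dt.days. The frame and its column names below are hypothetical.

```python
import pandas as pd

# Hypothetical start/end dates; 2024 is a leap year.
periods = pd.DataFrame({
    "start": pd.to_datetime(["2023-02-01", "2024-02-01"]),
    "end": pd.to_datetime(["2023-03-01", "2024-03-01"]),
})

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
periods["days"] = (periods["end"] - periods["start"]).dt.days
print(periods["days"].tolist())  # [28, 29]
```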

Datetime to string

First, generate a sequence of dates with pd.date_range.

dates = pd.date_range(start='20231129', end='20231207')

print(dates)

DatetimeIndex(['2023-11-29', '2023-11-30', '2023-12-01', '2023-12-02',
               '2023-12-03', '2023-12-04', '2023-12-05', '2023-12-06',
               '2023-12-07'],
              dtype='datetime64[ns]', freq='D')

The following method converts the datetimes to strings:

dates.strftime('%Y%m%d')

Index(['20231129', '20231130', '20231201', '20231202', '20231203', '20231204',
       '20231205', '20231206', '20231207'],
      dtype='object')

String to datetime

We reuse the result of the previous step.

dates = dates.strftime('%Y%m%d')

The function to use is pd.to_datetime():

pd.to_datetime(dates, format='%Y%m%d')

DatetimeIndex(['2023-11-29', '2023-11-30', '2023-12-01', '2023-12-02',
               '2023-12-03', '2023-12-04', '2023-12-05', '2023-12-06',
               '2023-12-07'],
              dtype='datetime64[ns]', freq=None)

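The date and time format codes used here are the same as those of Python's built-in datetime handling; see the earlier chapter for details.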
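Real-world text columns often contain values that do not match the expected format. By default pd.to_datetime raises an error in that case; passing errors="coerce" turns such values into NaT instead. A sketch on hypothetical input:

```python
import pandas as pd

raw_dates = pd.Series(["20231129", "not a date"])

# Unparseable entries become NaT rather than raising an exception.
parsed = pd.to_datetime(raw_dates, format="%Y%m%d", errors="coerce")
print(parsed.isna().tolist())  # [False, True]
```

Counting the NaT values afterwards is a quick way to see how many rows failed to parse.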