Predicting Antibody Production from Mutation Site

Taeyoon Kim

2020-11-26 08:00

0. Purpose

In this post we use mljar-supervised, an AutoML package, to tackle the problem of predicting antibody expression level from the position of a mutation.

0.1. Data source

Ohri R, Bhakta S, Fourie-O'Donohue A, et al. High-Throughput Cysteine Scanning To Identify Stable Antibody Conjugation Sites for Maleimide- and Disulfide-Based Linkers. Bioconjug Chem. 2018;29(2):473-485. doi:10.1021/acs.bioconjchem.7b00791

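0.2. Dataset description

The dataset consists of the following nine columns.

Linker: linker type, either the maleimide linker (vc) or the pyridyl disulfide linker (PDS)
HC/LC: whether the site is on the antibody's light chain (LC) or heavy chain (HC)
Residue: the original amino acid residue
MutationSite: the position of the mutation
Conc.(mg/mL): protein concentration, calculated from the absorbance at 280 nm and the antibody's extinction coefficient at 280 nm
%Agg: percentage of aggregated antibody, assessed by SE-HPLC
DAR: drug-to-antibody ratio, the number of drug molecules conjugated to the mAb, quantified by LC/MS analysis
Reox: whether the antibody reoxidized normally, as judged by LC/MS analysis
Avg.Stability: the change in DAR after 48 hours at 37 °C in rat plasma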

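1. mljar-supervised

mljar-supervised is an automated machine learning Python package that works on tabular data. It abstracts away the usual routine of preprocessing the data, building machine learning models, and tuning hyperparameters to find the best model, and it is designed to save us time 😎.

1.1. Installation

It can be installed with pip as follows.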

pip install mljar-supervised

For more detailed installation instructions, see https://github.com/mljar/mljar-supervised.

2. Analysis

First, we use pandas to take a look at the contents of the dataset.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("adc.csv")
df.head(10)

Out[1]:

	Linker	HC/LC	Residue	MutationSite	Conc.(mg/mL)	%Agg	DAR	Reox	Avg.Stability
0	PDS	HC	E	1	3.7	43	1.0	Y	3
1	PDS	HC	V	2	3.1	13	1.6	Y	36
2	PDS	HC	Q	3	2.4	9	1.5	Y	48
3	PDS	HC	L	4	2.2	26	1.4	Y	8
4	PDS	HC	V	5	3.1	12	1.6	Y	84
5	PDS	HC	E	6	2.7	11	0.3	Y	19
6	PDS	HC	S	7	2.7	25	0.8	Y	12
7	PDS	HC	G	8	2.1	30	1.7	Y	1
8	PDS	HC	G	9	3.4	35	1.5	Y	2
9	PDS	HC	G	10	2.2	9	1.5	Y	1

We also look at the dataset's column list, its shape, and some simple summary statistics.

In [2]:

df.columns

Out[2]:

Index(['Linker', 'HC/LC', 'Residue', 'MutationSite', 'Conc.(mg/mL)', '%Agg',
       'DAR', 'Reox', 'Avg.Stability'],
      dtype='object')

In [3]:

df.shape

Out[3]:

(1325, 9)

In [4]:

df.describe()

Out[4]:

	MutationSite	Conc.(mg/mL)	%Agg	DAR	Avg.Stability
count	1325.000000	1325.000000	1325.000000	1325.000000	1325.00000
mean	187.340377	1.910189	24.507925	0.815019	40.01434
std	125.469780	0.909779	28.507051	0.705048	35.43867
min	1.000000	0.000000	0.000000	0.000000	0.00000
25%	83.000000	1.600000	7.000000	0.000000	3.00000
50%	166.000000	1.900000	13.000000	0.800000	37.00000
75%	284.000000	2.400000	27.000000	1.500000	67.00000
max	450.000000	5.100000	100.000000	2.000000	150.00000

We also make a quick visualization with seaborn.

In [5]:

g = sns.PairGrid(df, diag_sharey=False)
g.map_upper(sns.scatterplot, s=15)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=2)

Out[5]:

<seaborn.axisgrid.PairGrid at 0x7ffa3cf29880>

[Figure: PairGrid of the numeric columns, with scatter plots above the diagonal and KDE plots on and below the diagonal]

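Most of the variables appear to split into two groups. Concentration, for example, clusters around 0 and around 2 mg/mL, and %Agg separates into a group near 0 and a group near 100. A dataset like this is likely to be biased, so it is arguably not a good fit for machine learning.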
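One way to check the two-group impression numerically is to count how many rows sit in each suspected cluster. The following is a minimal sketch, assuming a hypothetical helper named `share_within` and a small made-up sample rather than the real `adc.csv`; the interval bounds are illustrative choices, not values taken from the paper.

```python
import pandas as pd


def share_within(values: pd.Series, low: float, high: float) -> float:
    """Return the fraction of values that fall in the closed interval [low, high]."""
    return float(values.between(low, high).mean())


# Made-up sample for illustration only; these are NOT rows from adc.csv.
toy = pd.DataFrame(
    {
        "Conc.(mg/mL)": [0.0, 0.0, 1.8, 2.1, 2.4, 1.9, 0.0, 2.2],
        "%Agg": [100, 100, 7, 12, 9, 30, 100, 5],
    }
)

zero_conc = share_within(toy["Conc.(mg/mL)"], 0.0, 0.0)
high_agg = share_within(toy["%Agg"], 90, 100)
print(f"share with zero concentration: {zero_conc:.3f}")
print(f"share with %Agg in [90, 100]: {high_agg:.3f}")
```

Applied to the real `df`, the same helper would show how large each group actually is, which is more informative than judging the split from the plot alone.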

3. Using mljar-supervised

We split the dataset into a training set and a test set, then print the size of each.

In [6]:

from sklearn.model_selection import train_test_split
from supervised.automl import AutoML  # mljar-supervised

X = df[
    [
        "Linker",
        "HC/LC",
        "Residue",
        "MutationSite",
        "%Agg",
        "DAR",
        "Reox",
        "Avg.Stability",
    ]
]
y = df["Conc.(mg/mL)"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print(X_train.shape, X_test.shape)

(1060, 8) (265, 8)

3.1. Training a model with AutoML

In [7]:

automl = AutoML(mode="Explain")
automl.fit(X_train, y_train)

AutoML directory: AutoML_1
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble availabe models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
1_Baseline rmse 0.933119 trained in 0.12 seconds
2_DecisionTree rmse 0.604529 trained in 8.69 seconds
3_Linear rmse 0.640778 trained in 2.3 seconds
* Step default_algorithms will try to check up to 3 models
4_Default_RandomForest rmse 0.584858 trained in 5.53 seconds
5_Default_Xgboost rmse 0.582731 trained in 2.88 seconds
6_Default_NeuralNetwork rmse 0.64666 trained in 5.2 seconds
* Step ensemble will try to check up to 1 model
Ensemble rmse 0.579981 trained in 0.12 seconds
AutoML fit time: 27.45 seconds

Out[7]:

AutoML()

The trained models are scored by RMSE, and the best value is 0.579981, achieved by the ensemble. mljar-supervised saves its results in a folder named AutoML_XX, which contains the analysis output for each model.
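For readers who want to see what the RMSE figure measures, here is a minimal sketch with NumPy. The `rmse` helper and the sample values are made up for illustration. The mean-only predictor is an assumption about what a baseline model does; it is not taken from the mljar-supervised source, although the Baseline score of 0.933119 in the log is close to the standard deviation of the target shown by `df.describe()` (0.909779), which would be consistent with it.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values for illustration only.
y_true = np.array([0.0, 2.0, 2.0, 4.0])
y_model = np.array([0.5, 2.0, 2.5, 3.0])

# A predictor that always returns the mean of the target (assumed baseline).
y_mean = np.full_like(y_true, y_true.mean())

baseline_score = rmse(y_true, y_mean)  # equals the population std of y_true
model_score = rmse(y_true, y_model)
print(baseline_score, model_score)
```

Because a mean-only predictor scores exactly the standard deviation of the target, a model's RMSE is easiest to interpret when read against that number rather than in isolation.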

3.2. Visualizing the predictions

We draw a scatter plot with the true values on the x-axis and the predicted values on the y-axis.

In [8]:

predict_1 = automl.predict_all(X_test)

df_plot = pd.DataFrame()
df_plot["True_value"] = y_test.values
df_plot["Pred_1"] = predict_1.prediction.values

g = sns.jointplot(
    x="True_value", y="Pred_1", data=df_plot, kind="reg", truncate=False, height=7
)

3.3. Hyperparameter optimization

With mljar-supervised, hyperparameters can be tuned easily, as follows.

In [9]:

automl = AutoML(algorithms=["Random Forest", "Xgboost"], mode="Compete")
automl.fit(X_train, y_train)

AutoML directory: AutoML_2
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Random Forest', 'Xgboost']
AutoML will stack models
AutoML will ensemble availabe models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'ensemble', 'stack', 'ensemble_stacked']
Skip simple_algorithms because no parameters were generated.
* Step default_algorithms will try to check up to 2 models
1_Default_RandomForest rmse 0.596139 trained in 7.92 seconds
2_Default_Xgboost rmse 0.584039 trained in 1.59 seconds
* Step not_so_random will try to check up to 18 models
3_Xgboost rmse 0.599567 trained in 4.96 seconds
4_Xgboost rmse 0.593866 trained in 2.01 seconds
5_Xgboost rmse 0.593954 trained in 1.53 seconds
6_Xgboost rmse 0.596786 trained in 2.01 seconds
7_Xgboost rmse 0.589096 trained in 1.93 seconds
8_Xgboost rmse 0.603613 trained in 1.65 seconds
9_Xgboost rmse 0.607813 trained in 1.82 seconds
10_Xgboost rmse 0.596479 trained in 2.32 seconds
11_Xgboost rmse 0.598566 trained in 3.15 seconds
12_RandomForest rmse 0.595774 trained in 6.92 seconds
13_RandomForest rmse 0.601367 trained in 8.41 seconds
14_RandomForest rmse 0.593025 trained in 9.41 seconds
15_RandomForest rmse 0.596608 trained in 9.5 seconds
16_RandomForest rmse 0.591571 trained in 7.94 seconds
17_RandomForest rmse 0.594513 trained in 8.38 seconds
18_RandomForest rmse 0.594856 trained in 6.64 seconds
19_RandomForest rmse 0.596395 trained in 7.75 seconds
20_RandomForest rmse 0.5851 trained in 8.1 seconds
* Step golden_features will try to check up to 1 model
Add Golden Feature: %Agg_diff_DAR
Add Golden Feature: %Agg_diff_Avg.Stability
Add Golden Feature: Avg.Stability_ratio_MutationSite
Add Golden Feature: MutationSite_ratio_Avg.Stability
Add Golden Feature: DAR_diff_Avg.Stability
Created 5 Golden Features in 0.16 seconds.
2_Default_Xgboost_GoldenFeatures rmse 0.583598 trained in 2.04 seconds
* Step insert_random_feature will try to check up to 1 model
2_Default_Xgboost_GoldenFeatures_RandomFeature rmse 0.587742 trained in 3.47 seconds
Drop features ['DAR_diff_Avg.Stability', 'MutationSite_ratio_Avg.Stability', 'Avg.Stability_ratio_MutationSite', 'random_feature', 'Residue_T', 'Residue_L', 'Residue_F', 'Residue_N', 'Residue_Y', 'Linker', 'Residue_P', 'Residue_R', 'Reox', 'Residue_K', 'Residue_Q', 'Residue_I', 'Residue_H', 'Residue_C', 'Residue_D', 'Residue_G', 'Residue_W', 'Residue_A', 'Residue_M', 'Residue_S', 'Residue_E']
* Step features_selection will try to check up to 2 models
20_RandomForest_SelectedFeatures rmse 0.584829 trained in 8.33 seconds
2_Default_Xgboost_GoldenFeatures_SelectedFeatures rmse 0.577184 trained in 1.58 seconds
* Step hill_climbing_1 will try to check up to 7 models
21_RandomForest_SelectedFeatures rmse 0.585223 trained in 7.64 seconds
22_RandomForest rmse 0.591711 trained in 7.91 seconds
23_RandomForest rmse 0.593245 trained in 9.45 seconds
24_RandomForest rmse 0.590171 trained in 6.85 seconds
25_Xgboost_GoldenFeatures_SelectedFeatures rmse 0.574951 trained in 1.61 seconds
26_Xgboost_GoldenFeatures rmse 0.584706 trained in 1.81 seconds
27_Xgboost rmse 0.579199 trained in 1.72 seconds
* Step hill_climbing_2 will try to check up to 4 models
28_RandomForest_SelectedFeatures rmse 0.586703 trained in 5.99 seconds
29_Xgboost_GoldenFeatures_SelectedFeatures rmse 0.575996 trained in 1.61 seconds
30_Xgboost_GoldenFeatures_SelectedFeatures rmse 0.579304 trained in 1.57 seconds
31_Xgboost rmse 0.581736 trained in 1.63 seconds
* Step ensemble will try to check up to 1 model
Ensemble rmse 0.569821 trained in 3.57 seconds
* Step stack will try to check up to 10 models
25_Xgboost_GoldenFeatures_SelectedFeatures_Stacked rmse 0.580141 trained in 2.0 seconds
29_Xgboost_GoldenFeatures_SelectedFeatures_Stacked rmse 0.582319 trained in 2.21 seconds
2_Default_Xgboost_GoldenFeatures_SelectedFeatures_Stacked rmse 0.582473 trained in 2.06 seconds
27_Xgboost_Stacked rmse 0.582971 trained in 2.06 seconds
30_Xgboost_GoldenFeatures_SelectedFeatures_Stacked rmse 0.5845 trained in 1.84 seconds
31_Xgboost_Stacked rmse 0.581005 trained in 2.06 seconds
2_Default_Xgboost_GoldenFeatures_Stacked rmse 0.584416 trained in 2.16 seconds
2_Default_Xgboost_Stacked rmse 0.583641 trained in 2.09 seconds
26_Xgboost_GoldenFeatures_Stacked rmse 0.580362 trained in 2.2 seconds
2_Default_Xgboost_GoldenFeatures_RandomFeature_Stacked rmse 0.585285 trained in 4.73 seconds
* Step ensemble_stacked will try to check up to 1 model
Ensemble_Stacked rmse 0.569267 trained in 6.07 seconds
AutoML fit time: 210.63 seconds

Out[9]:

AutoML(algorithms=['Random Forest', 'Xgboost'], mode='Compete')

The best RMSE is now 0.569267, which is not much of an improvement. Let's visualize the predictions again.

In [10]:

predict_2 = automl.predict_all(X_test)

df_plot["Pred_2"] = predict_2.prediction.values
g = sns.jointplot(
    x="True_value", y="Pred_2", data=df_plot, kind="reg", truncate=False, height=7
)
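The gap between the two runs can be put into numbers using the RMSE values printed in the training logs above. This is a rough sketch with a hypothetical helper, `relative_improvement`. The two runs may use different internal validation schemes, so the comparison should be read as indicative only.

```python
def relative_improvement(reference: float, candidate: float) -> float:
    """Fractional RMSE reduction of candidate relative to reference (positive = better)."""
    return (reference - candidate) / reference


# RMSE values copied from the training logs above.
baseline_rmse = 0.933119  # 1_Baseline, mode="Explain"
explain_rmse = 0.579981   # Ensemble, mode="Explain"
compete_rmse = 0.569267   # Ensemble_Stacked, mode="Compete"

explain_gain = relative_improvement(baseline_rmse, explain_rmse)
compete_gain = relative_improvement(explain_rmse, compete_rmse)
print(f"Explain vs. Baseline: {explain_gain:.1%}")
print(f"Compete vs. Explain:  {compete_gain:.1%}")
```

The second figure comes out below 2%, which supports the impression that the longer "Compete" run bought very little on this dataset.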

Let's compare the original model and the tuned model in a single scatter plot.

In [14]:

plt.figure(figsize=(7, 7))
plt.plot(y_test, predict_1.prediction, ".", label="model1")
plt.plot(y_test, predict_2.prediction, ".", label="model2")
plt.ylim(-0.1, 4.5)
plt.xlim(-0.1, 4.5)
plt.xlabel("True value")
plt.ylabel("Predicted value")
plt.legend()

Out[14]:

<matplotlib.legend.Legend at 0x7ff97a4c3850>

There is no big difference, but the hyperparameter-tuned model appears to have a few points that stray further from the center.
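The visual impression of stray points can be checked by counting predictions whose absolute error exceeds some cutoff. This is a minimal sketch, assuming a hypothetical helper `count_large_errors`, an arbitrary threshold of 1.0 mg/mL, and made-up values. In the notebook the inputs would be `y_test`, `predict_1.prediction` and `predict_2.prediction`.

```python
import numpy as np


def count_large_errors(y_true, y_pred, threshold: float) -> int:
    """Count predictions whose absolute error is strictly greater than threshold."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return int(np.sum(errors > threshold))


# Made-up values for illustration only; not taken from the real predictions.
y_true = [2.0, 2.5, 0.0, 3.0, 1.5]
pred_a = [2.1, 2.0, 0.4, 2.6, 1.6]
pred_b = [2.0, 2.4, 1.5, 1.8, 1.5]

counts = {
    name: count_large_errors(y_true, pred, threshold=1.0)
    for name, pred in {"model1": pred_a, "model2": pred_b}.items()
}
print(counts)
```

A model can have a slightly lower overall RMSE and still produce more large misses, so a count like this complements the scatter plot.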

4. Closing thoughts

Until now, choosing a suitable model and tuning its hyperparameters has been considered the user's job in machine learning. With the arrival of AutoML, users can now concentrate on feature engineering. Among AutoML tools, the mljar-supervised package is easy to use and produces quite clean reports. Tools that automate even feature engineering may well appear in the future.
