컬럼의 DBC 측정하기

Taeyoon Kim

2020-09-02 20:07

1. DBC(dynamic binding capacity)란 무엇인가?¶

DBC는 단백질 정제에서 크로마토그래피 컬럼의 특정 실험 조건에 결합 할 수 있는 표적 단백질의 최대 양을 나타냅니다. 그러나 DBC는 유속이나 버퍼의 조성 및 단백질 시료에 따라 달라지기 때문에 실험 조건에 따라 달라 집니다.

그렇기 때문에 단백질 정제에서 적합한 레진을 선택하기 위해서는 DBC에 대한 정보가 매우 중요합니다. 보통 레진의 제조사는 일반적인 단백질 정제 조건에서의 DBC 정보를 제공하고는 있지만 특정 단백질에 대한 최상의 레진 및 실험 조건을 찾기 위해 직접 DBC를 측정해야 할 일 이 생깁니다.

1.1. breakthrough 곡선¶

^{image from cytiva}

breakthrough 곡선은 컬럼에 결합하지 못하고 흘러 나온 단백질의 양을 그래프로 표현한 것입니다. 그리고 전체의 10%의 단백질이 흘러나온 경우를 QB10 값이라 합니다.

1.2. DBC 계산법¶

^{image from cytiva}

위 그래프의 파란색 영역은 지연 볼륨(delay volume)으로 사용한 컬럼의 볼륨과 AKTA 시스템에 의해 발생합니다. 주황색 영역은 비 결합 단백질의 양을 나타냅니다. 파란색 곡선은 단백질 breakthrough를 나타내는 것으로 녹색 영역은 X% breakthrough에서 누출 된 단백질의 양을 나타냅니다. X% 에서의 총 DBC는 회색 영역으로 표시됩니다.

위 그림을 예시로 Aoffset은 50 mAu이고 Amax는 1000 mAu 이라는 것을 알 수 있습니다.

50 ml(Vx)에서 측정된 UV 흡광도는 126 mAu 이지만 아래의 계산을 통해 실제 값은 76 mAu라고 생각할 수 있습니다.

A- Aoffset = 126 – 50 = 76 mAu

그리고 최대 UV 흡광도도 마찬가지로 950 mAu가 될 것입니다.

Amax - Aoffset = 1000 – 50 = 950 mAu

따라서 breakthrough의 비율은

(A - Aoffset) / (Amax - Aoffset) = Percentage of breakthrough

를 통해 계산합니다. 결과적으로 50ml에서 breakthrough의 비율은 8%가 됩니다.

% breakthrough = 76 / 950 = 8%

2. DBC 계산하기¶

실제 실험을 통해 얻은 데이터를 통해 DBC(QB10)을 계산해보도록 하겠습니다.

2.1. 실험 데이터 불러오기¶

In [1]:

import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
import numpy as np
from scipy.optimize import curve_fit
from pynverse import inversefunc

delayed_vol = 2.0  # 컬럼을 거치는데 필요한 부피(ml)
protein_conc = 3.97  # 사용한 단백질의 농도(mg/ml)

df = pd.read_csv("0827_ProteinA_DBC.csv")
df

Out[1]:

	Residence_time	Volume(ml)	Breakthrough(%)
0	3	13.498873	2.998795
1	3	17.998690	4.517550
2	3	22.498837	12.963373
3	3	26.998789	29.624571
4	3	31.499197	49.684577
5	3	35.999281	60.876402
6	3	40.499182	75.469976
7	3	44.999430	82.386165
8	3	49.499411	87.710916
9	3	53.999578	87.831578
10	4	13.498536	2.192587
11	4	17.998536	2.716336
12	4	22.498784	5.259035
13	4	26.999043	14.780687
14	4	31.498960	31.632741
15	4	35.999253	48.004374
16	4	40.499124	65.935281
17	4	44.999433	77.252233
18	4	49.499291	84.863446
19	4	53.999555	88.617977

In [2]:

# 테이블 두개로 분리하기
df1 = df[df["Residence_time"] == 3]
df2 = df[df["Residence_time"] == 4]
df1

Out[2]:

	Residence_time	Volume(ml)	Breakthrough(%)
0	3	13.498873	2.998795
1	3	17.998690	4.517550
2	3	22.498837	12.963373
3	3	26.998789	29.624571
4	3	31.499197	49.684577
5	3	35.999281	60.876402
6	3	40.499182	75.469976
7	3	44.999430	82.386165
8	3	49.499411	87.710916
9	3	53.999578	87.831578

2.2. 전체 데이터 시각화¶

시각화를 통해 전체 데이터의 모습을 확인합니다.

In [3]:

plt.plot(df1["Volume(ml)"], df1["Breakthrough(%)"], label="Residence_time=3")
plt.plot(df2["Volume(ml)"], df2["Breakthrough(%)"], label="Residence_time=4")
plt.ylabel("Breakthrough(%)")
plt.xlabel("Volume(ml)")
plt.legend()

Out[3]:

<matplotlib.legend.Legend at 0x7eff93892670>

No description has been provided for this image

2.3. 그래프 피팅하기¶

시그모이드 함수 정의하고 curvefit으로 피팅하기

In [5]:

def sigmoid(x, L, x0, k, b):
    y = L / (1 + np.exp(-k * (x - x0))) + b
    return y


def fitting(xdata, ydata):
    p0 = [max(ydata), np.median(xdata), 1, min(ydata)]
    popt, pcov = curve_fit(sigmoid, xdata, ydata, p0, method="dogbox")
    residuals = ydata - sigmoid(xdata, *popt)
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((ydata - np.mean(ydata)) ** 2)
    r_squared = 1 - (ss_res / ss_tot)
    return xdata, ydata, popt, r_squared


Run1 = fitting(df1["Volume(ml)"], df1["Breakthrough(%)"])
Run2 = fitting(df2["Volume(ml)"], df2["Breakthrough(%)"])

2.3.1. Residence time 3분 DBC 계산¶

피팅이 올바른지 그림으로 확인한다.

In [8]:

x_fit = np.linspace(1, 60, 100)
y_fit = sigmoid(x_fit, *Run1[2])
plt.plot(Run1[0], Run1[1], "o", label="data", color="k")
plt.plot(x_fit, y_fit, label="fit", color="r")
plt.text(40, 20, s=f"r_squared={Run1[3]:.3f}", fontsize=10)
plt.ylabel("Breakthrough(%)")
plt.xlabel("Volume(ml)")
plt.legend()

Out[8]:

<matplotlib.legend.Legend at 0x7eff93853520>

In [9]:

qb10_run1 = inversefunc(sigmoid, args=tuple(Run1[2]), y_values=10)
print(f"10% breakthrough 일때 fraction(ml)의 값 {qb10_run1:.3f} ml 이고")
dbc_run1 = (qb10_run1 - delayed_vol) * protein_conc
print(f"따라서 DBC(QB10)의 값은 {dbc_run1:.3f} mg/ml 이다.")

10% breakthrough 일때 fraction(ml)의 값 20.363 ml 이고
따라서 DBC(QB10)의 값은 72.902 mg/ml 이다.

2.3.1. Residence time 4분 DBC 계산¶

앞서 사용한 방법과 동일하게 진행한다.

In [10]:

x_fit = np.linspace(1, 60, 100)
y_fit = sigmoid(x_fit, *Run2[2])
plt.plot(Run2[0], Run2[1], "o", label="data", color="k")
plt.plot(x_fit, y_fit, label="fit", color="r")
plt.text(40, 20, s=f"r_squared={Run2[3]:.3f}", fontsize=10)
plt.ylabel("Breakthrough(%)")
plt.xlabel("Volume(ml)")
plt.legend()

Out[10]:

<matplotlib.legend.Legend at 0x7eff93881640>

In [11]:

qb10_run2 = inversefunc(sigmoid, args=tuple(Run2[2]), y_values=10)
print(f"10% breakthrough 일때 fraction(ml)의 값 {qb10_run2:.3f} ml 이고")
dbc_run2 = (qb10_run2 - delayed_vol) * protein_conc
print(f"따라서 DBC(QB10)의 값은 {dbc_run2:.3f} mg/ml 이다.")

10% breakthrough 일때 fraction(ml)의 값 24.400 ml 이고
따라서 DBC(QB10)의 값은 88.928 mg/ml 이다.

3. 결론¶

Residence time은 평균적으로 단백질 시료가 정제 컬럼에 체류하는 시간의 값 입니다. 따라서 평균 체류 시간이 길수록 DBC는 증가할 것으로 예측할 수 있습니다. 위의 예시를 통해서도 3분일때는 72.902 mg/ml, 4분일때는 88.928 mg/ml 이라는 확인 할 수 있습니다.

3.1. 참고 자료¶

GE Healthcare Life Sciences
Pranav Gupte et. al, Dynamic Binding Capacities of Protein A Resins for Antibody Capture: A Comparative Evaluation, 2020, BioProcessInternational.

10분안에 배우는 Pingouin

Taeyoon Kim

2020-07-20 22:17

logo

Pingouin은 pandas와 numpy를 기반으로 한 오픈소스 통계 패키지입니다. 아래의 목록은 pingouin으로 할 수 있는 대표적인 기능입니다. 모든 기능에 대해서는 API 문서를 참고하세요.

분산 분석(ANOVAs): N-ways, repeated measures, mixed, ancova
Pairwise 사후 검정(post-hocs tests), pairwise 상관관계
견고한(Robust), 부분(partial), 거리(distance), 반복 측정 상관관계
선형(Linear) 회귀, 로지스틱(logistic) 회귀, 매개(mediation) 분석
베이즈 인자(Bayes factor)
다변량(Multivariate) 테스트
신뢰성과 일관성 검정
효과 크기 및 검정력 분석
효과 크기 또는 상관 계수의 모수(Parametric) 혹은 부트스트랩(bootstrapped) 신뢰구간
순환(Circular) 통계
카이제곱 검정(chi-squared test)
Bland-Altman plot, Q-Q plot, paired plot, robust correlation 시각화

Pingouin은 간단하지만 완전한 통계 기능를 위해 설계되었습니다. 예를 들어 기존의 SciPy 패키지의 ttest_ind 함수는 T-value과 p-value만 알려주지만 Pingouin의 ttest 함수는 T-value, p-value뿐만 아니라 자유도, 효과 크기(Cohen 's d), 95% 신뢰 구간, 통계적 검정력등을 동시에 출력합니다.

0. 설치법¶

Pingouin은 파이썬3 패키지이며 현재 버전 3.6, 3.7에서 테스트되었습니다. 따라서 파이썬 2.7에서는 작동하지 않습니다.

Pingouin의 주요 종속 패키지는 다음과 같습니다:

NumPy (>= 1.15)
SciPy (>= 1.3.0)
Pandas (>= 0.24)
Pandas-flavor (>= 0.1.2)
Matplotlib (>= 3.0.2)
Seaborn (>= 0.9.0)

또한 일부 기능에는 다음 패키지가 필요합니다:

Statsmodels
Scikit-learn
Mpmath

가장 쉬운 방법은 아래와 같이 pip 명령어를 사용하는 것입니다.

pip install pingouin

혹은 conda를 사용할 수도 있습니다.

conda install -c conda-forge pingouin

아직 Pingouin은 개발 중에 있으며 버그 수정을 위해 새로운 버전이 계속 배포되고 있습니다(한 달에 약 1 회). 그러니 항상 최신 버전의 Pingouin을 사용하고 있는지 확인하세요. 새로운 버전이 출시 될 때마다 터미널 창에 다음 명령어를 입력해 업그레이드 할 수 있습니다.

pip install --upgrade pingouin

# conda를 사용할 경우
conda update pingouin

자세한 내용은 Pingouin 공식 페이지을 참고하세요.

1. 예제 코드 살펴보기¶

필요한 패키지를 불러옵니다.

In [1]:

import numpy as np
import pingouin as pg

np.random.seed(42)

1.1. One-sample T-test¶

모집단의 평균은 4로 가정

In [2]:

mu = 4
x = [5.5, 2.4, 6.8, 9.6, 4.2]
pg.ttest(x, mu)

Out[2]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	1.397391	4	two-sided	0.234824	[-1.68, 5.08]	0.624932	0.766	0.191796

자유도(dof)는 4, T-value(T)는 1.3973 이며 p-Value가 일반적인 기준(0.05) 이상이기 때문에 표본 x의 평균은 모집단의 평균과 차이가 없다(귀무가설)고 볼 수 있다.

1.2. Paired T-test¶

꼬리를 one-sided로 설정하면 pingouin이 알아서 꼬리의 방향을 알려줍니다. 아래 코드의 경우 T-value가 음수이기 때문에 꼬리의 방향이 less로 표현됩니다.

In [3]:

pre = [5.5, 2.4, 6.8, 9.6, 4.2]
post = [6.4, 3.4, 6.4, 11.0, 4.8]
pg.ttest(pre, post, paired=True, tail="one-sided")

Out[3]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	-2.307832	4	less	0.041114	[-inf, -0.05]	0.250801	3.122	0.12048

꼬리의 방향이 less라는 것은 표본 x의 평균이 표본 y의 평균보다 작다는 것을 뜻합니다.

일부러 꼬리의 방향을 반대(greater)로 한 대립 가설을 확인해 봅니다.

In [4]:

pg.ttest(pre, post, paired=True, tail="greater")

Out[4]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	-2.307832	4	greater	0.958886	[-1.35, inf]	0.250801	0.32	0.016865

p-Value가 엉망인것을 알 수 있습니다.

1.3. Two-sample T-test¶

1.3.1. 표본 크기가 같은 경우¶

In [5]:

x = np.random.normal(loc=7, size=20)
y = np.random.normal(loc=4, size=20)
pg.ttest(x, y, correction="auto")

Out[5]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	10.151246	38	two-sided	2.245760e-12	[2.48, 3.71]	3.210106	2.223e+09	1.0

1.3.2. 표본 크기가 다른경우¶

In [6]:

x = np.random.normal(loc=7, size=20)
y = np.random.normal(loc=4, size=15)
pg.ttest(x, y, correction="auto")

Out[6]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	8.352442	24.033207	two-sided	1.443438e-08	[2.21, 3.65]	2.995554	5.808e+06	1.0

1.4. Pearson’s correlation¶

In [7]:

mean, cov, n = [4, 5], [(1, 0.6), (0.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T
pg.corr(x, y)

Out[7]:

	n	r	CI95%	r2	adj_r2	p-val	BF10	power
pearson	30	0.63893	[0.36, 0.81]	0.408231	0.364397	0.000145	220.85	0.978466

1.5. Robust correlation¶

In [8]:

# 표본 x에 아웃라이어 추가
x[5] = 18
# Use the robust Shepherd's pi correlation
pg.corr(x, y, method="shepherd")

Out[8]:

	n	outliers	r	CI95%	r2	adj_r2	p-val	power
shepherd	30	3	0.391331	[0.04, 0.66]	0.15314	0.090409	0.043538	0.586546

1.6. 데이터의 정규성 테스트¶

pingouin.normality()함수를 pandas의 데이터 프레임형식에 사용할 수 있습니다.

In [9]:

# 일변량 정규성(Univariate normality)
pg.normality(x)

Out[9]:

	W	pval	normal
0	0.485009	3.733778e-09	False

In [10]:

# 다변량 정규성(Multivariate normality)
pg.multivariate_normality(np.column_stack((x, y)))

Out[10]:

(False, 6.620602006901788e-07)

1.7. Q-Q plot 시각화¶

In [11]:

x = np.random.normal(size=50)
ax = pg.qqplot(x, dist="norm")

1.8. One-way ANOVA¶

기본 내장되어 있는 데이터프레임(mixed_anova)을 사용합니다.

In [12]:

# Read an example dataset
df = pg.read_dataset("mixed_anova")
df.tail()

Out[12]:

	Scores	Time	Group	Subject
175	6.176981	June	Meditation	55
176	8.523692	June	Meditation	56
177	6.522273	June	Meditation	57
178	4.990568	June	Meditation	58
179	7.822986	June	Meditation	59

In [13]:

# Run the ANOVA
aov = pg.anova(data=df, dv="Scores", between="Group", detailed=True)
aov

Out[13]:

	Source	SS	DF	MS	F	p-unc	np2
0	Group	5.459963	1	5.459963	5.243656	0.0232	0.028616
1	Within	185.342729	178	1.041251	NaN	NaN	NaN

1.9. Repeated measures ANOVA¶

In [14]:

pg.rm_anova(data=df, dv="Scores", within="Time", subject="Subject", detailed=True)

Out[14]:

	Source	SS	DF	MS	F	p-unc	np2	eps
0	Time	7.628428	2	3.814214	3.912796	0.022629	0.062194	0.998751
1	Error	115.027023	118	0.974805	NaN	NaN	NaN	NaN

1.10. Post-hoc tests corrected for multiple-comparisons¶

In [15]:

# FDR-corrected post hocs with Hedges'g effect size
posthoc = pg.pairwise_ttests(
    data=df,
    dv="Scores",
    within="Time",
    subject="Subject",
    parametric=True,
    padjust="fdr_bh",
    effsize="hedges",
)

posthoc

Out[15]:

	Contrast	A	B	Paired	Parametric	T	dof	Tail	p-unc	p-corr	p-adjust	BF10	hedges
0	Time	August	January	True	True	-1.740370	59.0	two-sided	0.087008	0.130512	fdr_bh	0.582	-0.327583
1	Time	August	June	True	True	-2.743238	59.0	two-sided	0.008045	0.024134	fdr_bh	4.232	-0.482547
2	Time	January	June	True	True	-1.023620	59.0	two-sided	0.310194	0.310194	fdr_bh	0.232	-0.169520

1.11. Two-way mixed ANOVA¶

In [16]:

# Compute the two-way mixed ANOVA and export to a .csv file
aov = pg.mixed_anova(
    data=df,
    dv="Scores",
    between="Group",
    within="Time",
    subject="Subject",
    correction=False,
    effsize="np2",
)
aov

Out[16]:

	Source	SS	DF1	DF2	MS	F	p-unc	np2	eps
0	Group	5.459963	1	58	5.459963	5.051709	0.028420	0.080120	NaN
1	Time	7.628428	2	116	3.814214	4.027394	0.020369	0.064929	0.998751
2	Interaction	5.167192	2	116	2.583596	2.727996	0.069545	0.044922	NaN

1.12. Bland-Altman plot¶

In [17]:

mean, cov = [10, 11], [[1, 0.8], [0.8, 1]]
x, y = np.random.multivariate_normal(mean, cov, 30).T
ax = pg.plot_blandaltman(x, y)

1.13. paired T-test 검정력 시각화¶

T-검정의 표본 크기와 효과 크기(Cohen'd)에 따른 검정력 곡선을 시각화합니다.

In [18]:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="ticks", context="notebook", font_scale=1.2)
d = 0.5  # Fixed effect size
n = np.arange(5, 80, 5)  # Incrementing sample size
# Compute the achieved power
pwr = pg.power_ttest(d=d, n=n, contrast="paired", tail="two-sided")
plt.plot(n, pwr, "ko-.")
plt.axhline(0.8, color="r", ls=":")
plt.xlabel("Sample size")
plt.ylabel("Power (1 - type II error)")
plt.title("Achieved power of a paired T-test")
sns.despine()

1.14. Paired plot¶

mixed_anova데이터셋을 가지고 명상이 학교 성적에 미치는 영향에 대한 시각화를 해본다.

In [19]:

df = pg.read_dataset("mixed_anova").query("Group == 'Meditation' and Time != 'January'")

pg.plot_paired(data=df, dv="Scores", within="Time", subject="Subject")

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f24693d2dc0>

2. 마무리하며,¶

파이썬을 사용한 통계 분석은 R언어에 비해 사용법과 기능이 부족했습니다. 그렇기 때문에 pingouin의 등장이 반갑습니다. 아직 0.3.6버전이지만 기존에 존재했던 통계 분석 패키지들을 뛰어 넘는 성능과 완성도를 보여주고 있어 앞으로가 더 기대가 됩니다.

In [20]:

pg.__version__

Out[20]:

'0.3.6'