
Pingouin is an open-source statistical package built on pandas and NumPy. The list below covers its main features; see the API documentation for the complete set.

ANOVAs: N-way, repeated measures, mixed, ANCOVA
Pairwise post-hoc tests and pairwise correlations
Robust, partial, distance, and repeated-measures correlations
Linear regression, logistic regression, and mediation analysis
Bayes factors
Multivariate tests
Reliability and consistency tests
Effect sizes and power analysis
Parametric or bootstrapped confidence intervals around an effect size or a correlation coefficient
Circular statistics
Chi-squared tests
Plotting: Bland-Altman plot, Q-Q plot, paired plot, robust correlation

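Pingouin is designed to offer simple yet exhaustive statistical functions. For example, the ttest_ind function in SciPy returns only the T-value and the p-value, whereas Pingouin's ttest function returns the T-value and p-value together with the degrees of freedom, the effect size (Cohen's d), the 95% confidence interval, the statistical power, and more, all in a single call.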

0. Installation

Pingouin is a Python 3 package and is currently tested on Python 3.6 and 3.7. It does not work with Python 2.7.

Pingouin's main dependencies are:

NumPy (>= 1.15)
SciPy (>= 1.3.0)
Pandas (>= 0.24)
Pandas-flavor (>= 0.1.2)
Matplotlib (>= 3.0.2)
Seaborn (>= 0.9.0)

In addition, some functions require:

Statsmodels
Scikit-learn
Mpmath

The easiest way to install Pingouin is with pip:

pip install pingouin

Alternatively, you can use conda:

conda install -c conda-forge pingouin

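Pingouin is still under active development, and new versions with bug fixes are released regularly (about once a month). Make sure you are always using the latest version. Whenever a new version is released, you can upgrade by running the following command in a terminal: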

pip install --upgrade pingouin

# If you use conda
conda update pingouin
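
To confirm which version is actually installed, before or after upgrading, the standard library can be queried without importing the package itself. This is a minimal sketch that assumes Python 3.8 or later (where importlib.metadata is available); the helper name installed_version is illustrative and not part of Pingouin.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None when pingouin is not installed
print(installed_version("pingouin"))
```

Returning None instead of raising keeps the check usable in scripts that only want to warn about a missing or outdated package.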

For details, see the official Pingouin page.

1. Walking through example code

Import the required packages.

In [1]:

import numpy as np
import pingouin as pg

np.random.seed(42)

1.1. One-sample T-test

Assume a population mean of 4.

In [2]:

mu = 4
x = [5.5, 2.4, 6.8, 9.6, 4.2]
pg.ttest(x, mu)

Out[2]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	1.397391	4	two-sided	0.234824	[-1.68, 5.08]	0.624932	0.766	0.191796

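The degrees of freedom (dof) are 4 and the T-value (T) is 1.3974. Because the p-value (0.2348) is above the conventional threshold of 0.05, we cannot reject the null hypothesis that the mean of sample x equals the population mean.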
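
The T, dof, and cohen-d columns of Out[2] can be checked by hand from the textbook one-sample formulas. The sketch below uses only the standard library and the same x and mu as In [2]; it does not reproduce the p-value, CI, BF10, or power columns, which need distribution functions that the standard library lacks.

```python
import math
import statistics

mu = 4
x = [5.5, 2.4, 6.8, 9.6, 4.2]

n = len(x)
mean = statistics.mean(x)
sd = statistics.stdev(x)      # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(n)        # standard error of the mean

t_value = (mean - mu) / se    # one-sample T statistic
dof = n - 1                   # degrees of freedom
cohen_d = (mean - mu) / sd    # standardized mean difference

print(round(t_value, 6), dof, round(cohen_d, 6))
```

The printed values should agree with the T, dof, and cohen-d entries of Out[2] up to rounding.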

1.2. Paired T-test

When the tail is set to one-sided, pingouin determines the direction of the tail automatically. In the code below the T-value is negative, so the direction is reported as less.

In [3]:

pre = [5.5, 2.4, 6.8, 9.6, 4.2]
post = [6.4, 3.4, 6.4, 11.0, 4.8]
pg.ttest(pre, post, paired=True, tail="one-sided")

Out[3]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	-2.307832	4	less	0.041114	[-inf, -0.05]	0.250801	3.122	0.12048

A direction of less means that the alternative hypothesis being tested is that the mean of pre is smaller than the mean of post.
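
A paired T-test is a one-sample T-test on the pairwise differences, so the T and dof columns of Out[3] can be recomputed by hand. The sketch below also recomputes an effect size. Dividing the mean difference by the root of the averaged variances of pre and post, and reporting its magnitude, happens to match the cohen-d column. That is an inference from the numbers shown, not a statement about pingouin's internals.

```python
import math
import statistics

pre = [5.5, 2.4, 6.8, 9.6, 4.2]
post = [6.4, 3.4, 6.4, 11.0, 4.8]

# Paired test: work on the per-subject differences
diff = [a - b for a, b in zip(pre, post)]
n = len(diff)
t_value = statistics.mean(diff) / (statistics.stdev(diff) / math.sqrt(n))
dof = n - 1

# Effect size (assumption): mean difference over the root of the averaged variances
avg_sd = math.sqrt((statistics.variance(pre) + statistics.variance(post)) / 2)
cohen_d = abs(statistics.mean(pre) - statistics.mean(post)) / avg_sd

print(round(t_value, 6), dof, round(cohen_d, 6))
```

The negative sign of t_value is what leads to the less direction: pre is, on average, lower than post.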

Now let's deliberately test the alternative hypothesis in the opposite direction (greater).

In [4]:

pg.ttest(pre, post, paired=True, tail="greater")

Out[4]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	-2.307832	4	greater	0.958886	[-1.35, inf]	0.250801	0.32	0.016865

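The p-value is now 0.958886 (that is, 1 - 0.041114), so the data give no support to the alternative that the mean of pre is greater than the mean of post.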

1.3. Two-sample T-test

1.3.1. Equal sample sizes

In [5]:

x = np.random.normal(loc=7, size=20)
y = np.random.normal(loc=4, size=20)
pg.ttest(x, y, correction="auto")

Out[5]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	10.151246	38	two-sided	2.245760e-12	[2.48, 3.71]	3.210106	2.223e+09	1.0

1.3.2. Unequal sample sizes

In [6]:

x = np.random.normal(loc=7, size=20)
y = np.random.normal(loc=4, size=15)
pg.ttest(x, y, correction="auto")

Out[6]:

	T	dof	tail	p-val	CI95%	cohen-d	BF10	power
T-test	8.352442	24.033207	two-sided	1.443438e-08	[2.21, 3.65]	2.995554	5.808e+06	1.0
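
Out[5] reports dof = 38 (that is, 20 + 20 - 2), while Out[6] reports a fractional dof of 24.03. A fractional value is what Welch's unequal-variance correction produces, and the pingouin documentation describes correction="auto" as applying that correction when the sample sizes differ. The sketch below implements the Welch-Satterthwaite formula with the standard library. The variances passed in are hypothetical, chosen only for illustration, because the random samples above are not reproduced here.

```python
def welch_dof(var_x, n_x, var_y, n_y):
    """Welch-Satterthwaite degrees of freedom for two independent samples."""
    a = var_x / n_x
    b = var_y / n_y
    return (a + b) ** 2 / (a ** 2 / (n_x - 1) + b ** 2 / (n_y - 1))


# Equal variances and equal sizes: the result coincides with n_x + n_y - 2
print(welch_dof(1.0, 20, 1.0, 20))

# Unequal sizes and variances (hypothetical values): the result is fractional
print(welch_dof(1.0, 20, 1.5, 15))
```

The Welch dof always lies between the smaller of n_x - 1 and n_y - 1 and the pooled value n_x + n_y - 2.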

1.4. Pearson's correlation

In [7]:

mean, cov, n = [4, 5], [(1, 0.6), (0.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T
pg.corr(x, y)

Out[7]:

	n	r	CI95%	r2	adj_r2	p-val	BF10	power
pearson	30	0.63893	[0.36, 0.81]	0.408231	0.364397	0.000145	220.85	0.978466
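
Several columns of Out[7] follow from r and n alone. The sketch below recomputes the 95% confidence interval with the Fisher z-transform, along with r2 and adj_r2, using only the standard library (NormalDist requires Python 3.8 or later). The n - 3 denominator in adj_r2 is an assumption: it is used because it reproduces the reported value, whereas the usual one-predictor textbook form divides by n - 2.

```python
import math
from statistics import NormalDist

r, n = 0.63893, 30  # values copied from Out[7]

# 95% CI via the Fisher z-transform
z = math.atanh(r)
se = 1 / math.sqrt(n - 3)
z_crit = NormalDist().inv_cdf(0.975)
ci = (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))

r2 = r ** 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 3)  # assumed form, see note above

print([round(v, 2) for v in ci], round(r2, 6), round(adj_r2, 6))
```

Because r is entered here already rounded to five decimals, the last digit of r2 and adj_r2 may differ from the table.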

1.5. Robust correlation

In [8]:

# Add an outlier to sample x
x[5] = 18
# Use the robust Shepherd's pi correlation
pg.corr(x, y, method="shepherd")

Out[8]:

	n	outliers	r	CI95%	r2	adj_r2	p-val	power
shepherd	30	3	0.391331	[0.04, 0.66]	0.15314	0.090409	0.043538	0.586546

1.6. Testing data for normality

The pingouin.normality() function can also be applied to a pandas DataFrame.

In [9]:

# Univariate normality
pg.normality(x)

Out[9]:

	W	pval	normal
0	0.485009	3.733778e-09	False

In [10]:

# Multivariate normality
pg.multivariate_normality(np.column_stack((x, y)))

Out[10]:

(False, 6.620602006901788e-07)

1.7. Q-Q plot

In [11]:

x = np.random.normal(size=50)
ax = pg.qqplot(x, dist="norm")


1.8. One-way ANOVA

This example uses the built-in example dataset (mixed_anova).

In [12]:

# Read an example dataset
df = pg.read_dataset("mixed_anova")
df.tail()

Out[12]:

	Scores	Time	Group	Subject
175	6.176981	June	Meditation	55
176	8.523692	June	Meditation	56
177	6.522273	June	Meditation	57
178	4.990568	June	Meditation	58
179	7.822986	June	Meditation	59

In [13]:

# Run the ANOVA
aov = pg.anova(data=df, dv="Scores", between="Group", detailed=True)
aov

Out[13]:

	Source	SS	DF	MS	F	p-unc	np2
0	Group	5.459963	1	5.459963	5.243656	0.0232	0.028616
1	Within	185.342729	178	1.041251	NaN	NaN	NaN
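
The columns of the ANOVA table in Out[13] are tied together by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, F is the ratio of the two mean squares, and np2 (partial eta-squared) is the share of the effect in the effect-plus-error sum of squares. A short check using the values copied from the table:

```python
# Values copied from Out[13]
ss_between, df_between = 5.459963, 1
ss_within, df_within = 185.342729, 178

ms_between = ss_between / df_between   # MS for Group
ms_within = ss_within / df_within      # MS for Within (error)
f_value = ms_between / ms_within       # F statistic
np2 = ss_between / (ss_between + ss_within)  # partial eta-squared

print(round(ms_within, 6), round(f_value, 6), round(np2, 6))
```

The results should agree with the MS, F, and np2 columns up to rounding of the inputs.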

1.9. Repeated measures ANOVA

In [14]:

pg.rm_anova(data=df, dv="Scores", within="Time", subject="Subject", detailed=True)

Out[14]:

	Source	SS	DF	MS	F	p-unc	np2	eps
0	Time	7.628428	2	3.814214	3.912796	0.022629	0.062194	0.998751
1	Error	115.027023	118	0.974805	NaN	NaN	NaN	NaN

1.10. Post-hoc tests corrected for multiple-comparisons

In [15]:

# FDR-corrected post hocs with Hedges' g effect size
posthoc = pg.pairwise_ttests(
    data=df,
    dv="Scores",
    within="Time",
    subject="Subject",
    parametric=True,
    padjust="fdr_bh",
    effsize="hedges",
)

posthoc

Out[15]:

	Contrast	A	B	Paired	Parametric	T	dof	Tail	p-unc	p-corr	p-adjust	BF10	hedges
0	Time	August	January	True	True	-1.740370	59.0	two-sided	0.087008	0.130512	fdr_bh	0.582	-0.327583
1	Time	August	June	True	True	-2.743238	59.0	two-sided	0.008045	0.024134	fdr_bh	4.232	-0.482547
2	Time	January	June	True	True	-1.023620	59.0	two-sided	0.310194	0.310194	fdr_bh	0.232	-0.169520
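
The p-corr column in Out[15] can be recovered from p-unc with the Benjamini-Hochberg procedure, which is what padjust="fdr_bh" requests. The function below is a plain standard-library sketch of that procedure, not pingouin's own implementation; the name fdr_bh is illustrative.

```python
def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_unc = [0.087008, 0.008045, 0.310194]  # p-unc column of Out[15]
print([round(p, 6) for p in fdr_bh(p_unc)])
```

Because the inputs are the already-rounded p-unc values, the last digit of an adjusted value can differ slightly from the table.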

1.11. Two-way mixed ANOVA

In [16]:

# Compute the two-way mixed ANOVA
aov = pg.mixed_anova(
    data=df,
    dv="Scores",
    between="Group",
    within="Time",
    subject="Subject",
    correction=False,
    effsize="np2",
)
aov

Out[16]:

	Source	SS	DF1	DF2	MS	F	p-unc	np2	eps
0	Group	5.459963	1	58	5.459963	5.051709	0.028420	0.080120	NaN
1	Time	7.628428	2	116	3.814214	4.027394	0.020369	0.064929	0.998751
2	Interaction	5.167192	2	116	2.583596	2.727996	0.069545	0.044922	NaN
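
When only F and the two degrees of freedom are available, partial eta-squared can still be recovered, because F = (SS_effect / DF1) / (SS_error / DF2) implies np2 = F * DF1 / (F * DF1 + DF2). The sketch below applies that identity to each row of Out[16]; the helper name is illustrative.

```python
def partial_eta_squared(f_value, df1, df2):
    """Partial eta-squared recovered from an F statistic and its two dofs."""
    return f_value * df1 / (f_value * df1 + df2)


# (F, DF1, DF2) copied from Out[16]
rows = {
    "Group": (5.051709, 1, 58),
    "Time": (4.027394, 2, 116),
    "Interaction": (2.727996, 2, 116),
}

for source, (f_value, df1, df2) in rows.items():
    print(source, round(partial_eta_squared(f_value, df1, df2), 6))
```

The printed values should match the np2 column up to rounding.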

1.12. Bland-Altman plot

In [17]:

mean, cov = [10, 11], [[1, 0.8], [0.8, 1]]
x, y = np.random.multivariate_normal(mean, cov, 30).T
ax = pg.plot_blandaltman(x, y)
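
A Bland-Altman plot shows, for each pair, the difference between the two measurements against their mean, together with the mean difference (bias) and the limits of agreement at bias ± 1.96 standard deviations. The sketch below computes those quantities with the standard library. The five data points are hypothetical and are not the random sample drawn in In [17].

```python
import statistics

# Hypothetical paired measurements, for illustration only
x = [10.1, 9.8, 10.5, 11.2, 9.4]
y = [10.9, 11.0, 11.2, 12.5, 10.1]

means = [(a + b) / 2 for a, b in zip(x, y)]  # horizontal axis of the plot
diffs = [a - b for a, b in zip(x, y)]        # vertical axis of the plot

bias = statistics.mean(diffs)                # mean difference
sd = statistics.stdev(diffs)                 # sample SD of the differences
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd  # limits of agreement

print(round(bias, 3), round(lower, 3), round(upper, 3))
```

If the two methods agree well, the bias is close to zero and most differences fall inside the limits.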

1.13. Power curve of a paired T-test

Plot the achieved power of the T-test as a function of sample size for a fixed effect size (Cohen's d = 0.5).

In [18]:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="ticks", context="notebook", font_scale=1.2)
d = 0.5  # Fixed effect size
n = np.arange(5, 80, 5)  # Incrementing sample size
# Compute the achieved power
pwr = pg.power_ttest(d=d, n=n, contrast="paired", tail="two-sided")
plt.plot(n, pwr, "ko-.")
plt.axhline(0.8, color="r", ls=":")
plt.xlabel("Sample size")
plt.ylabel("Power (1 - type II error)")
plt.title("Achieved power of a paired T-test")
sns.despine()
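
To get a feel for where such a curve crosses the 0.8 line without any third-party package, the power can be approximated with the normal distribution. This is a rough sketch and not the calculation behind pg.power_ttest: it replaces the noncentral t distribution with a normal one, so it overstates power somewhat at small sample sizes. The function name approx_power_paired is illustrative.

```python
import math
from statistics import NormalDist


def approx_power_paired(d, n, alpha=0.05):
    """Normal approximation to the two-sided power of a paired T-test."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n)  # approximate noncentrality for n pairs
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


sizes = range(5, 80, 5)  # same grid of sample sizes as above
powers = {n: approx_power_paired(0.5, n) for n in sizes}

# Smallest sample size on the grid whose approximate power reaches 0.8
first_n = next(n for n in sizes if powers[n] >= 0.8)
print(first_n, round(powers[first_n], 3))
```

Under this approximation, d = 0.5 first reaches 80% power at 35 pairs on the 5-step grid.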

1.14. Paired plot

Using the mixed_anova dataset, visualize the effect of meditation on school performance.

In [19]:

df = pg.read_dataset("mixed_anova").query("Group == 'Meditation' and Time != 'January'")

pg.plot_paired(data=df, dv="Scores", within="Time", subject="Subject")

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f24693d2dc0>

2. Closing thoughts

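Statistical analysis in Python has long lagged behind R in both ease of use and breadth of features, which is why the arrival of pingouin is so welcome. Although it is only at version 0.3.6, it already shows a level of performance and polish that surpasses the existing statistical analysis packages, and that makes its future all the more promising.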

In [20]:

pg.__version__

Out[20]:

'0.3.6'