Adding Statistical Annotations to Plots

Taeyoon Kim

2020-03-03 18:21

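0. Introduction to statannot

statannot is a Python package that automatically adds statistical test annotations to box plots drawn with seaborn, the Python visualization library. See the project's official page for details.

Annotating bar plots is also possible, but support for it is not yet complete.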

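1. statannot features

- Statistical tests using scipy.stats methods:
  - Mann-Whitney
  - t-test (independent and paired)
  - Welch's t-test
  - Levene test
  - Wilcoxon test
  - Kruskal-Wallis test
- Annotations can be placed inside or outside the plot.
- The annotation format (stars (*), p-values, and so on) can be chosen by the user.
- Custom p-values can optionally be supplied as input (in which case no statistical test is performed).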

2. statannot dependencies

    Python >= 3.5
    numpy >= 1.12.1
    seaborn >= 0.8.1
    matplotlib >= 2.2.2
    pandas >= 0.23.0
    scipy >= 1.1.0

3. Installing statannot

Install from PyPI:

pip install statannot

To install the latest development version:

pip install git+https://github.com/webermarcolivier/statannot.git

4. Usage examples

First, import the required libraries.

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from statannot import add_stat_annotation

sns.set(style="ticks")

4.1. Boxplot non-hue

4.1.1. Multiple comparison correction

Bonferroni correction is applied by default.

In [2]:

df = sns.load_dataset("tips")
x = "day"
y = "total_bill"
order = ["Sun", "Thur", "Fri", "Sat"]
ax = sns.boxplot(data=df, x=x, y=y, order=order)
ax, test_results = add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    order=order,
    box_pairs=[("Thur", "Fri"), ("Thur", "Sat"), ("Fri", "Sun")],
    test="Mann-Whitney",
    text_format="star",
    loc="outside",
    verbose=2,
)
sns.despine()  # remove unneeded axis borders

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

Thur v.s. Fri: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=6.305e+02
Thur v.s. Sat: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.407e-01 U_stat=2.180e+03
Sun v.s. Fri: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=8.041e-02 U_stat=9.605e+02
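
The legend printed above can be read as a simple threshold lookup. The following is a minimal sketch of that mapping, written from the printed legend rather than from statannot's source code. The function name `pvalue_to_stars` is hypothetical.

```python
def pvalue_to_stars(p):
    """Map a p-value to the symbol shown in the printed legend.

    Hypothetical helper: the thresholds are copied from the legend above,
    not taken from statannot's implementation.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    # (upper bound, symbol) pairs, checked from the smallest bound upward
    thresholds = [(1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")]
    for upper, symbol in thresholds:
        if p <= upper:
            return symbol
    return "ns"


# Bonferroni-corrected p-values reported in the output above
for p in (1.000e+00, 1.407e-01, 8.041e-02):
    print(p, pvalue_to_stars(p))
```

All three corrected p-values are above 0.05, so each pair in this plot is labeled "ns".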


4.1.2. Statistical test results

The add_stat_annotation function returns a tuple, (ax, test_results). Here test_results is a list of StatResult objects, each holding the original data together with the test results (p-value and so on).

In [3]:

for res in test_results:
    print(res)

print("\nStatResult attributes:", test_results[0].__dict__.keys())

Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=6.305e+02
Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.407e-01 U_stat=2.180e+03
Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=8.041e-02 U_stat=9.605e+02

StatResult attributes: dict_keys(['test_str', 'test_short_name', 'stat_str', 'stat', 'pval', 'box1', 'box2'])
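
Because each result exposes `pval` and `stat` as plain attributes, the list is easy to post-process. The sketch below uses `SimpleNamespace` stand-ins instead of real StatResult objects so that it runs without statannot. Only the `pval` and `stat` values printed above are reproduced, and the pair labels and the `alpha` threshold are illustrative assumptions.

```python
from types import SimpleNamespace

# Stand-ins for the three StatResult objects printed above: only the
# pval and stat attributes are reproduced, with the printed values.
test_results = [
    SimpleNamespace(pval=1.000e+00, stat=6.305e+02),
    SimpleNamespace(pval=1.407e-01, stat=2.180e+03),
    SimpleNamespace(pval=8.041e-02, stat=9.605e+02),
]
# Pair labels in the order reported by the verbose output in 4.1.1
pairs = [("Thur", "Fri"), ("Thur", "Sat"), ("Sun", "Fri")]

alpha = 0.05  # illustrative significance level, not a statannot setting
rows = [
    {
        "pair": pair,
        "pval": res.pval,
        "stat": res.stat,
        "significant": res.pval <= alpha,
    }
    for pair, res in zip(pairs, test_results)
]
for row in rows:
    print(row)
```

With real results, the same comprehension should work on the list returned by add_stat_annotation, provided the attribute names match those listed above.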

4.1.3. Without multiple comparison correction

In [4]:

x = "day"
y = "total_bill"
order = ["Sun", "Thur", "Fri", "Sat"]
ax = sns.boxplot(data=df, x=x, y=y, order=order)
ax, test_results = add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    order=order,
    box_pairs=[("Thur", "Fri"), ("Thur", "Sat"), ("Fri", "Sun")],
    test="Mann-Whitney",
    comparisons_correction=None,
    text_format="star",
    loc="outside",
    verbose=2,
)
sns.despine()  # remove unneeded axis borders

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

Thur v.s. Fri: Mann-Whitney-Wilcoxon test two-sided, P_val=6.477e-01 U_stat=6.305e+02
Thur v.s. Sat: Mann-Whitney-Wilcoxon test two-sided, P_val=4.690e-02 U_stat=2.180e+03
Sun v.s. Fri: Mann-Whitney-Wilcoxon test two-sided, P_val=2.680e-02 U_stat=9.605e+02
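
Comparing these uncorrected p-values with the corrected ones in 4.1.1 shows what the Bonferroni correction does here: each p-value is multiplied by the number of comparisons (three) and capped at 1. The sketch below is a hand-written version of that rule, not statannot's own code, applied to the values printed above.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    n = len(pvalues)
    return [min(p * n, 1.0) for p in pvalues]


# Uncorrected p-values printed above (Thur/Fri, Thur/Sat, Sun/Fri)
raw = [6.477e-01, 4.690e-02, 2.680e-02]
adjusted = bonferroni(raw)
for p_raw, p_adj in zip(raw, adjusted):
    print("{:.3e} -> {:.3e}".format(p_raw, p_adj))
```

The results agree with the corrected values in 4.1.1, allowing for rounding in the printed inputs. Thur vs. Sat and Sun vs. Fri, which would be starred without correction, both become "ns" once corrected.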

4.1.4. Annotation location

Statistical annotations can be placed inside the plot (loc='inside') or outside it (loc='outside'). The example below places them inside.

In [5]:

x = "day"
y = "total_bill"
order = ["Sun", "Thur", "Fri", "Sat"]
ax = sns.boxplot(data=df, x=x, y=y, order=order)
add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    order=order,
    box_pairs=[("Sun", "Thur"), ("Sun", "Sat"), ("Fri", "Sun")],
    perform_stat_test=False,
    pvalues=[0.1, 0.1, 0.001],
    test=None,
    text_format="star",
    loc="inside",
    verbose=2,
)
sns.despine()  # remove unneeded axis borders

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

Sun v.s. Thur: Custom statistical test, P_val:1.000e-01
Sun v.s. Fri: Custom statistical test, P_val:1.000e-03
Sun v.s. Sat: Custom statistical test, P_val:1.000e-01

4.2. Boxplot with hue

Create box plots whose boxes have different ymax positions.

In [6]:

df = sns.load_dataset("diamonds")
df = df[df["color"].map(lambda x: x in "EIJ")]
# Modifying data to yield unequal boxes in the hue value
df.loc[df["cut"] == "Ideal", "price"] = df.loc[df["cut"] == "Ideal", "price"].map(
    lambda x: min(x, 5000)
)
df.loc[df["cut"] == "Premium", "price"] = df.loc[df["cut"] == "Premium", "price"].map(
    lambda x: min(x, 7500)
)
df.loc[df["cut"] == "Good", "price"] = df.loc[df["cut"] == "Good", "price"].map(
    lambda x: min(x, 15000)
)
df.loc[df["cut"] == "Very Good", "price"] = df.loc[
    df["cut"] == "Very Good", "price"
].map(lambda x: min(x, 3000))
df.head()

Out[6]:

	carat	cut	color	clarity	depth	table	price	x	y	z
0	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75

In [7]:

x = "color"
y = "price"
hue = "cut"
hue_order = ["Ideal", "Premium", "Good", "Very Good", "Fair"]
box_pairs = [
    (("E", "Ideal"), ("E", "Very Good")),
    (("E", "Ideal"), ("E", "Premium")),
    (("E", "Ideal"), ("E", "Good")),
    (("I", "Ideal"), ("I", "Premium")),
    (("I", "Ideal"), ("I", "Good")),
    (("J", "Ideal"), ("J", "Premium")),
    (("J", "Ideal"), ("J", "Good")),
    (("E", "Good"), ("I", "Ideal")),
    (("I", "Premium"), ("J", "Ideal")),
]
ax = sns.boxplot(
    data=df,
    x=x,
    y=y,
    hue=hue,
)
add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    hue=hue,
    box_pairs=box_pairs,
    test="Mann-Whitney",
    loc="inside",
    verbose=2,
)
plt.legend(loc="upper left", bbox_to_anchor=(1.03, 1))
sns.despine()  # remove unneeded axis borders

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

E_Ideal v.s. E_Premium: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.404e-30 U_stat=3.756e+06
I_Ideal v.s. I_Premium: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.627e-60 U_stat=1.009e+06
J_Ideal v.s. J_Premium: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=3.616e-36 U_stat=2.337e+05
E_Ideal v.s. E_Good: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.681e-18 U_stat=1.480e+06
I_Ideal v.s. I_Good: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.507e-12 U_stat=4.359e+05
J_Ideal v.s. J_Good: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=9.056e-04 U_stat=1.174e+05
E_Ideal v.s. E_Very Good: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.562e-01 U_stat=4.850e+06
E_Good v.s. I_Ideal: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=9.882e+05
I_Premium v.s. J_Ideal: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.643e-26 U_stat=8.084e+05

4.3. Box plot with bucketed categories

In [8]:

df = sns.load_dataset("tips")
df["tip_bucket"] = pd.cut(df["tip"], 3)
df.head()

Out[8]:

	total_bill	tip	sex	smoker	day	time	size	tip_bucket
0	16.99	1.01	Female	No	Sun	Dinner	2	(0.991, 4.0]
1	10.34	1.66	Male	No	Sun	Dinner	3	(0.991, 4.0]
2	21.01	3.50	Male	No	Sun	Dinner	3	(0.991, 4.0]
3	23.68	3.31	Male	No	Sun	Dinner	2	(0.991, 4.0]
4	24.59	3.61	Female	No	Sun	Dinner	4	(0.991, 4.0]

In [9]:

# In this case we just have to pass the list of categories objects to the add_stat_annotation function.
tip_bucket_list = df["tip_bucket"].unique()
tip_bucket_list

Out[9]:

[(0.991, 4.0], (4.0, 7.0], (7.0, 10.0]]
Categories (3, interval[float64]): [(0.991, 4.0] < (4.0, 7.0] < (7.0, 10.0]]
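
One thing to keep in mind when indexing into this list: `unique()` returns values in order of first appearance, while `.cat.categories` returns the intervals in ascending order. The sketch below shows the difference on a small made-up series (not the tips data), chosen so that the buckets do not first appear in ascending order.

```python
import pandas as pd

# Small illustrative series (not the tips data), chosen so that the
# buckets do not first appear in ascending order.
tip = pd.Series([5.0, 1.0, 10.0, 2.0])
buckets = pd.cut(tip, 3)

by_appearance = list(buckets.unique())      # order of first appearance
by_category = list(buckets.cat.categories)  # ascending interval order

print(by_appearance)
print(by_category)
```

In the tips data the two orders happen to coincide, as Out[9] shows, so index 0 is the lowest bucket and index 2 the highest. If positional indexing needs to be robust to row order, `df["tip_bucket"].cat.categories` is the safer source.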

In [10]:

x = "day"
y = "total_bill"
hue = "tip_bucket"
data = df
ax = sns.boxplot(data=df, x=x, y=y, hue=hue)
add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    hue=hue,
    box_pairs=[(("Sat", tip_bucket_list[2]), ("Fri", tip_bucket_list[0]))],
    test="t-test_ind",
    loc="inside",
    verbose=2,
)
plt.legend(loc="upper left", bbox_to_anchor=(1.03, 1))
sns.despine()  # remove unneeded axis borders

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

Fri_(0.991, 4.0] v.s. Sat_(7.0, 10.0]: t-test independent samples with Bonferroni correction, P_val=6.176e-07 stat=-7.490e+00

4.4. Adjusting the y offsets

In [11]:

df = sns.load_dataset("tips")
x = "day"
y = "total_bill"
hue = "smoker"
ax = sns.boxplot(data=df, x=x, y=y, hue=hue)
add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    hue=hue,
    box_pairs=[
        (("Thur", "No"), ("Fri", "No")),
        (("Sat", "Yes"), ("Sat", "No")),
        (("Sun", "No"), ("Thur", "Yes")),
    ],
    test="t-test_ind",
    text_format="full",
    loc="inside",
    comparisons_correction=None,
    line_offset_to_box=0.2,
    line_offset=0.1,
    line_height=0.05,
    text_offset=8,
    verbose=2,
)
plt.legend(loc="upper left", bbox_to_anchor=(1.03, 1))
sns.despine()  # remove unneeded axis borders

Sat_Yes v.s. Sat_No: t-test independent samples, P_val=4.304e-01 stat=7.922e-01
Thur_No v.s. Fri_No: t-test independent samples, P_val=7.425e-01 stat=-3.305e-01
Thur_Yes v.s. Sun_No: t-test independent samples, P_val=5.623e-01 stat=-5.822e-01

4.5. Using custom p-values as input

In [12]:

df = sns.load_dataset("iris")
x = "species"
y = "sepal_length"

box_pairs = [
    ("setosa", "versicolor"),
    ("setosa", "virginica"),
    ("versicolor", "virginica"),
]

from scipy.stats import bartlett

test_short_name = "Bartlett"
pvalues = []
for pair in box_pairs:
    data1 = df.groupby(x)[y].get_group(pair[0])
    data2 = df.groupby(x)[y].get_group(pair[1])
    stat, p = bartlett(data1, data2)
    print(
        "Performing Bartlett statistical test for equal variances on pair:",
        pair,
        "stat={:.2e} p-value={:.2e}".format(stat, p),
    )
    pvalues.append(p)
print("pvalues:", pvalues)

Performing Bartlett statistical test for equal variances on pair: ('setosa', 'versicolor') stat=6.89e+00 p-value=8.66e-03
Performing Bartlett statistical test for equal variances on pair: ('setosa', 'virginica') stat=1.60e+01 p-value=6.38e-05
Performing Bartlett statistical test for equal variances on pair: ('versicolor', 'virginica') stat=2.09e+00 p-value=1.48e-01
pvalues: [0.008659557933879902, 6.378941946712463e-05, 0.14778816016231236]

In [13]:

ax = sns.boxplot(data=df, x=x, y=y)
sns.despine()  # remove unneeded axis borders
ax, test_results = add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    box_pairs=box_pairs,
    perform_stat_test=False,
    pvalues=pvalues,
    test_short_name=test_short_name,
    text_format="star",
    verbose=2,
)

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

setosa v.s. versicolor: Custom statistical test, P_val:8.660e-03
versicolor v.s. virginica: Custom statistical test, P_val:1.478e-01
setosa v.s. virginica: Custom statistical test, P_val:6.379e-05

4.6. Adding custom annotation text

In [14]:

df = sns.load_dataset("tips")
x = "day"
y = "total_bill"
order = ["Sun", "Thur", "Fri", "Sat"]
ax = sns.boxplot(data=df, x=x, y=y, order=order)
sns.despine()  # remove unneeded axis borders
ax, test_results = add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    order=order,
    box_pairs=[("Thur", "Fri"), ("Thur", "Sat"), ("Fri", "Sun")],
    text_annot_custom=["first pair", "second pair", "third pair"],
    perform_stat_test=False,
    pvalues=[0, 0, 0],
    loc="outside",
    verbose=0,
)
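
The example above passes one string per pair to text_annot_custom. The labels do not have to be typed by hand: the sketch below builds them from the Bartlett p-values computed in 4.5. It assumes that labels are matched to box_pairs by position, as the example suggests. The "p = ..." label format is an arbitrary choice.

```python
box_pairs = [
    ("setosa", "versicolor"),
    ("setosa", "virginica"),
    ("versicolor", "virginica"),
]
# Bartlett p-values printed in 4.5, in the same order as box_pairs
pvalues = [0.008659557933879902, 6.378941946712463e-05, 0.14778816016231236]

# One label per pair, assumed to be matched to box_pairs by position
text_annot_custom = ["p = {:.2e}".format(p) for p in pvalues]

for pair, label in zip(box_pairs, text_annot_custom):
    print(pair, label)
```

The resulting list can be passed as `text_annot_custom=text_annot_custom` in a call like the one above.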

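5. Closing remarks

Until now, adding statistical annotations to a plot meant tedious manual work, but the statannot package takes care of it with little effort. Its defaults are sensible too, so I expect to keep using statannot from now on.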