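Adding Statistical Annotations to Plots

0. Introduction to statannot

statannot is a Python package that automatically adds statistical-test annotations to box plots drawn with the Python visualization library seaborn. See the official project page for details.

Annotating bar plots is also possible, but that support is not yet complete.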

1. statannot features

  • Statistical tests using scipy.stats methods:
    • Mann-Whitney
    • t-test (independent and paired)
    • Welch's t-test
    • Levene test
    • Wilcoxon test
    • Kruskal-Wallis test
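  • Annotations can be placed inside or outside the plot
  • The annotation format is user-selectable: stars (*), p-values, and so on
  • Optionally, custom p-values can be supplied as input (in which case no statistical test is performed)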
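
The test names in the list above correspond to functions in scipy.stats. The following is a rough sketch of calling those functions directly on two made-up samples. The sample values are invented for illustration, and the exact arguments that statannot passes internally are not shown in this post, so the keyword arguments below are assumptions.

```python
from scipy import stats

# Two small made-up samples. Equal length is only required by the paired tests.
a = [12.1, 14.3, 11.8, 15.2, 13.7, 12.9, 14.8, 13.1]
b = [15.4, 16.9, 14.7, 18.0, 15.9, 16.3, 17.5, 15.0]

# Map each test name from the feature list to a scipy.stats call.
results = {
    "Mann-Whitney": stats.mannwhitneyu(a, b, alternative="two-sided"),
    "t-test (independent)": stats.ttest_ind(a, b),
    "t-test (paired)": stats.ttest_rel(a, b),
    "Welch's t-test": stats.ttest_ind(a, b, equal_var=False),
    "Levene": stats.levene(a, b),
    "Wilcoxon": stats.wilcoxon(a, b),
    "Kruskal-Wallis": stats.kruskal(a, b),
}

# Every result object exposes the p-value as .pvalue.
for name, res in results.items():
    print(f"{name}: p-value = {res.pvalue:.3e}")
```

Note that the paired tests (paired t-test, Wilcoxon) compare the samples element by element, so they only make sense when the two groups are matched observations.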

2. statannot dependencies

    Python >= 3.5
    numpy >= 1.12.1
    seaborn >= 0.8.1
    matplotlib >= 2.2.2
    pandas >= 0.23.0
    scipy >= 1.1.0
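
As a quick sanity check, the installed versions can be compared against these minimums. The sketch below uses only the standard library (importlib.metadata, available from Python 3.8). The helper names version_tuple and check_minimums are hypothetical, and the version parsing is deliberately simplified: it only reads the leading integers of each component.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
MINIMUMS = {
    "numpy": "1.12.1",
    "seaborn": "0.8.1",
    "matplotlib": "2.2.2",
    "pandas": "0.23.0",
    "scipy": "1.1.0",
}


def version_tuple(text):
    """Turn '1.12.1' or '2.0.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_minimums(minimums=MINIMUMS):
    """Return a {package: status string} report for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        status = "ok" if ok else "too old, need >= " + minimum
        report[name] = f"{installed} ({status})"
    return report


for name, status in check_minimums().items():
    print(f"{name}: {status}")
```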

3. Installing statannot

Install from PyPI:

pip install statannot

To install the latest development version, run:

pip install git+https://github.com/webermarcolivier/statannot.git

4. Usage examples

First, import the required libraries.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from statannot import add_stat_annotation

sns.set(style="ticks")

4.1. Boxplot without hue

4.1.1. Multiple comparison correction

Bonferroni correction is applied by default.

In [2]:
df = sns.load_dataset("tips")
x = "day"
y = "total_bill"
order = ["Sun", "Thur", "Fri", "Sat"]
ax = sns.boxplot(data=df, x=x, y=y, order=order)
ax, test_results = add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    order=order,
    box_pairs=[("Thur", "Fri"), ("Thur", "Sat"), ("Fri", "Sun")],
    test="Mann-Whitney",
    text_format="star",
    loc="outside",
    verbose=2,
)
sns.despine()  # remove the top and right axis spines
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

Thur v.s. Fri: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=6.305e+02
Thur v.s. Sat: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.407e-01 U_stat=2.180e+03
Sun v.s. Fri: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=8.041e-02 U_stat=9.605e+02
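
The legend printed above defines how p-values map to star labels when text_format="star" is used. The thresholds can be written out as a small function. This is a sketch based only on the printed legend, not statannot's own implementation, and the name pvalue_to_stars is hypothetical.

```python
def pvalue_to_stars(p):
    """Map a p-value to the star label given by the printed legend."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    # Upper bounds are inclusive, checked from the strictest threshold up.
    thresholds = [(1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")]
    for upper, label in thresholds:
        if p <= upper:
            return label
    return "ns"


# The three corrected p-values from the output above.
for p in (1.000e+00, 1.407e-01, 8.041e-02):
    print(f"{p:.3e} -> {pvalue_to_stars(p)}")
```

All three corrected p-values exceed 0.05, so each pair in this plot is labeled "ns".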

4.1.2. Statistical test results

The add_stat_annotation function returns the tuple (ax, test_results), where test_results is a list of StatResult objects that contain both the original data and the test results (p-value and so on).

In [3]:
for res in test_results:
    print(res)

print("\nStatResult attributes:", test_results[0].__dict__.keys())
Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=6.305e+02
Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.407e-01 U_stat=2.180e+03
Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=8.041e-02 U_stat=9.605e+02

StatResult attributes: dict_keys(['test_str', 'test_short_name', 'stat_str', 'stat', 'pval', 'box1', 'box2'])
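
Because each result exposes its values as plain attributes, the list is easy to post-process. The sketch below uses SimpleNamespace stand-ins rather than real StatResult objects so that it runs without statannot. Only the attribute names (pval, stat, box1, box2) come from the output above. That box1 and box2 hold the group labels is an assumption, and summarize is a hypothetical helper.

```python
from types import SimpleNamespace

# Stand-ins for StatResult objects, filled with the values printed above.
# Assumption: box1/box2 carry the labels of the two compared boxes.
test_results = [
    SimpleNamespace(box1="Thur", box2="Fri", pval=1.000e+00, stat=6.305e+02),
    SimpleNamespace(box1="Thur", box2="Sat", pval=1.407e-01, stat=2.180e+03),
    SimpleNamespace(box1="Sun", box2="Fri", pval=8.041e-02, stat=9.605e+02),
]


def summarize(results, alpha=0.05):
    """Collect each result into a dict and flag p-values at or below alpha."""
    rows = []
    for res in results:
        rows.append(
            {
                "pair": (res.box1, res.box2),
                "pval": res.pval,
                "stat": res.stat,
                "significant": res.pval <= alpha,
            }
        )
    return rows


for row in summarize(test_results):
    print(row)
```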

4.1.3. Without multiple comparison correction

In [4]:
x = "day"
y = "total_bill"
order = ["Sun", "Thur", "Fri", "Sat"]
ax = sns.boxplot(data=df, x=x, y=y, order=order)
ax, test_results = add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    order=order,
    box_pairs=[("Thur", "Fri"), ("Thur", "Sat"), ("Fri", "Sun")],
    test="Mann-Whitney",
    comparisons_correction=None,
    text_format="star",
    loc="outside",
    verbose=2,
)
sns.despine()  # remove the top and right axis spines
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

Thur v.s. Fri: Mann-Whitney-Wilcoxon test two-sided, P_val=6.477e-01 U_stat=6.305e+02
Thur v.s. Sat: Mann-Whitney-Wilcoxon test two-sided, P_val=4.690e-02 U_stat=2.180e+03
Sun v.s. Fri: Mann-Whitney-Wilcoxon test two-sided, P_val=2.680e-02 U_stat=9.605e+02
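
Comparing this output with the corrected output in 4.1.1 shows what the correction does. The corrected p-values are consistent with the standard Bonferroni adjustment: multiply each p-value by the number of comparisons (3 here) and cap the result at 1. A minimal sketch in plain Python, not statannot's own code (the function name bonferroni is hypothetical):

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    n = len(pvalues)
    return [min(p * n, 1.0) for p in pvalues]


# Uncorrected p-values from the output above (Thur/Fri, Thur/Sat, Sun/Fri).
uncorrected = [6.477e-01, 4.690e-02, 2.680e-02]
corrected = bonferroni(uncorrected)

for raw, adj in zip(uncorrected, corrected):
    print(f"{raw:.3e} -> {adj:.3e}")
```

Two of the three pairs fall below 0.05 before correction but not after it, which is why the stars disappear once the correction is applied.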

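4.1.4. Annotation location

Statistical annotations can be placed inside the plot (loc='inside') or outside it (loc='outside'). The example below places them inside. It also supplies custom p-values with perform_stat_test=False, so no statistical test is run.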

In [5]:
x = "day"
y = "total_bill"
order = ["Sun", "Thur", "Fri", "Sat"]
ax = sns.boxplot(data=df, x=x, y=y, order=order)
add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    order=order,
    box_pairs=[("Sun", "Thur"), ("Sun", "Sat"), ("Fri", "Sun")],
    perform_stat_test=False,
    pvalues=[0.1, 0.1, 0.001],
    test=None,
    text_format="star",
    loc="inside",
    verbose=2,
)
sns.despine()  # remove the top and right axis spines
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

Sun v.s. Thur: Custom statistical test, P_val:1.000e-01
Sun v.s. Fri: Custom statistical test, P_val:1.000e-03
Sun v.s. Sat: Custom statistical test, P_val:1.000e-01

4.2. Boxplot with hue

Create a box plot in which the boxes have different maximum y positions (ymax).

In [6]:
df = sns.load_dataset("diamonds")
# Keep only colors E, I and J; copy so the assignments below modify an independent frame
df = df[df["color"].isin(["E", "I", "J"])].copy()
# Modifying data to yield unequal boxes in the hue value
df.loc[df["cut"] == "Ideal", "price"] = df.loc[df["cut"] == "Ideal", "price"].map(
    lambda x: min(x, 5000)
)
df.loc[df["cut"] == "Premium", "price"] = df.loc[df["cut"] == "Premium", "price"].map(
    lambda x: min(x, 7500)
)
df.loc[df["cut"] == "Good", "price"] = df.loc[df["cut"] == "Good", "price"].map(
    lambda x: min(x, 15000)
)
df.loc[df["cut"] == "Very Good", "price"] = df.loc[
    df["cut"] == "Very Good", "price"
].map(lambda x: min(x, 3000))
df.head()
Out[6]:
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
In [7]:
x = "color"
y = "price"
hue = "cut"
hue_order = ["Ideal", "Premium", "Good", "Very Good", "Fair"]
box_pairs = [
    (("E", "Ideal"), ("E", "Very Good")),
    (("E", "Ideal"), ("E", "Premium")),
    (("E", "Ideal"), ("E", "Good")),
    (("I", "Ideal"), ("I", "Premium")),
    (("I", "Ideal"), ("I", "Good")),
    (("J", "Ideal"), ("J", "Premium")),
    (("J", "Ideal"), ("J", "Good")),
    (("E", "Good"), ("I", "Ideal")),
    (("I", "Premium"), ("J", "Ideal")),
]
ax = sns.boxplot(
    data=df,
    x=x,
    y=y,
    hue=hue,
)
add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    hue=hue,
    box_pairs=box_pairs,
    test="Mann-Whitney",
    loc="inside",
    verbose=2,
)
plt.legend(loc="upper left", bbox_to_anchor=(1.03, 1))
sns.despine()  # remove the top and right axis spines
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

E_Ideal v.s. E_Premium: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.404e-30 U_stat=3.756e+06
I_Ideal v.s. I_Premium: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.627e-60 U_stat=1.009e+06
J_Ideal v.s. J_Premium: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=3.616e-36 U_stat=2.337e+05
E_Ideal v.s. E_Good: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.681e-18 U_stat=1.480e+06
I_Ideal v.s. I_Good: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.507e-12 U_stat=4.359e+05
J_Ideal v.s. J_Good: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=9.056e-04 U_stat=1.174e+05
E_Ideal v.s. E_Very Good: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.562e-01 U_stat=4.850e+06
E_Good v.s. I_Ideal: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=9.882e+05
I_Premium v.s. J_Ideal: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.643e-26 U_stat=8.084e+05
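
With a hue variable, each box is identified by an (x value, hue value) tuple, and each entry of box_pairs is a pair of such tuples. The list above was written by hand. As a sketch, within-group pairs can also be generated with itertools. The helper name within_group_pairs is hypothetical and not part of statannot.

```python
from itertools import combinations


def within_group_pairs(x_levels, hue_levels):
    """Pair up every two hue levels inside each x level."""
    return [
        ((x, h1), (x, h2))
        for x in x_levels
        for h1, h2 in combinations(hue_levels, 2)
    ]


pairs = within_group_pairs(["E", "I", "J"], ["Ideal", "Premium", "Good"])
for pair in pairs:
    print(pair)
```

Generating every combination quickly increases the number of comparisons, and with Bonferroni correction each added comparison makes the threshold for significance stricter, so it is usually better to list only the pairs of interest.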

4.3. Boxplot with bucketed categories

In [8]:
df = sns.load_dataset("tips")
df["tip_bucket"] = pd.cut(df["tip"], 3)
df.head()
Out[8]:
   total_bill   tip     sex smoker  day    time  size    tip_bucket
0       16.99  1.01  Female     No  Sun  Dinner     2  (0.991, 4.0]
1       10.34  1.66    Male     No  Sun  Dinner     3  (0.991, 4.0]
2       21.01  3.50    Male     No  Sun  Dinner     3  (0.991, 4.0]
3       23.68  3.31    Male     No  Sun  Dinner     2  (0.991, 4.0]
4       24.59  3.61  Female     No  Sun  Dinner     4  (0.991, 4.0]
In [9]:
# Here we only need to pass the category objects (intervals) to add_stat_annotation.
tip_bucket_list = df["tip_bucket"].unique()
tip_bucket_list
Out[9]:
[(0.991, 4.0], (4.0, 7.0], (7.0, 10.0]]
Categories (3, interval[float64]): [(0.991, 4.0] < (4.0, 7.0] < (7.0, 10.0]]
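
One detail worth noting: unique() returns the buckets in order of first appearance in the data, while .cat.categories returns them sorted. In the tips data the two happen to agree, but that is not guaranteed in general. The sketch below shows the difference on a small made-up series (the values are invented for illustration).

```python
import pandas as pd

# Made-up tip values whose first element falls into the highest bucket.
tips = pd.Series([7.5, 1.0, 5.2, 3.1, 10.0, 2.4])
buckets = pd.cut(tips, 3)  # three equal-width interval buckets

appearance_order = list(buckets.unique())     # order of first appearance
sorted_order = list(buckets.cat.categories)   # always sorted low to high

print("unique():      ", appearance_order)
print("cat.categories:", sorted_order)
```

When indexing into the bucket list by position, as the next cell does with tip_bucket_list[2] and tip_bucket_list[0], the sorted form makes it clearer which bucket each index refers to.
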
In [10]:
x = "day"
y = "total_bill"
hue = "tip_bucket"
data = df
ax = sns.boxplot(data=df, x=x, y=y, hue=hue)
add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    hue=hue,
    box_pairs=[(("Sat", tip_bucket_list[2]), ("Fri", tip_bucket_list[0]))],
    test="t-test_ind",
    loc="inside",
    verbose=2,
)
plt.legend(loc="upper left", bbox_to_anchor=(1.03, 1))
sns.despine()  # remove the top and right axis spines
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

Fri_(0.991, 4.0] v.s. Sat_(7.0, 10.0]: t-test independent samples with Bonferroni correction, P_val=6.176e-07 stat=-7.490e+00

4.4. Adjusting the y offsets

In [11]:
df = sns.load_dataset("tips")
x = "day"
y = "total_bill"
hue = "smoker"
ax = sns.boxplot(data=df, x=x, y=y, hue=hue)
add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    hue=hue,
    box_pairs=[
        (("Thur", "No"), ("Fri", "No")),
        (("Sat", "Yes"), ("Sat", "No")),
        (("Sun", "No"), ("Thur", "Yes")),
    ],
    test="t-test_ind",
    text_format="full",
    loc="inside",
    comparisons_correction=None,
    line_offset_to_box=0.2,
    line_offset=0.1,
    line_height=0.05,
    text_offset=8,
    verbose=2,
)
plt.legend(loc="upper left", bbox_to_anchor=(1.03, 1))
sns.despine()  # remove the top and right axis spines
Sat_Yes v.s. Sat_No: t-test independent samples, P_val=4.304e-01 stat=7.922e-01
Thur_No v.s. Fri_No: t-test independent samples, P_val=7.425e-01 stat=-3.305e-01
Thur_Yes v.s. Sun_No: t-test independent samples, P_val=5.623e-01 stat=-5.822e-01

4.5. Using custom p-values as input

In [12]:
df = sns.load_dataset("iris")
x = "species"
y = "sepal_length"

box_pairs = [
    ("setosa", "versicolor"),
    ("setosa", "virginica"),
    ("versicolor", "virginica"),
]

from scipy.stats import bartlett

test_short_name = "Bartlett"
pvalues = []
for pair in box_pairs:
    data1 = df.groupby(x)[y].get_group(pair[0])
    data2 = df.groupby(x)[y].get_group(pair[1])
    stat, p = bartlett(data1, data2)
    print(
        "Performing Bartlett statistical test for equal variances on pair:",
        pair,
        "stat={:.2e} p-value={:.2e}".format(stat, p),
    )
    pvalues.append(p)
print("pvalues:", pvalues)
Performing Bartlett statistical test for equal variances on pair: ('setosa', 'versicolor') stat=6.89e+00 p-value=8.66e-03
Performing Bartlett statistical test for equal variances on pair: ('setosa', 'virginica') stat=1.60e+01 p-value=6.38e-05
Performing Bartlett statistical test for equal variances on pair: ('versicolor', 'virginica') stat=2.09e+00 p-value=1.48e-01
pvalues: [0.008659557933879902, 6.378941946712463e-05, 0.14778816016231236]
In [13]:
ax = sns.boxplot(data=df, x=x, y=y)
sns.despine()  # remove the top and right axis spines
ax, test_results = add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    box_pairs=box_pairs,
    perform_stat_test=False,
    pvalues=pvalues,
    test_short_name=test_short_name,
    text_format="star",
    verbose=2,
)
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

setosa v.s. versicolor: Custom statistical test, P_val:8.660e-03
versicolor v.s. virginica: Custom statistical test, P_val:1.478e-01
setosa v.s. virginica: Custom statistical test, P_val:6.379e-05

4.6. Adding custom annotation text

In [14]:
df = sns.load_dataset("tips")
x = "day"
y = "total_bill"
order = ["Sun", "Thur", "Fri", "Sat"]
ax = sns.boxplot(data=df, x=x, y=y, order=order)
sns.despine()  # remove the top and right axis spines
ax, test_results = add_stat_annotation(
    ax,
    data=df,
    x=x,
    y=y,
    order=order,
    box_pairs=[("Thur", "Fri"), ("Thur", "Sat"), ("Fri", "Sun")],
    text_annot_custom=["first pair", "second pair", "third pair"],
    perform_stat_test=False,
    pvalues=[0, 0, 0],
    loc="outside",
    verbose=0,
)
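
In the example above, text_annot_custom receives one string per entry in box_pairs, and the pvalues are only placeholders. The strings do not have to be fixed text. As a sketch, they could be built from computed p-values, such as the Bartlett p-values from 4.5. The helper name pvalue_labels is hypothetical, and it is an assumption here that the strings are matched to box_pairs by position.

```python
def pvalue_labels(pvalues, fmt="p = {:.2e}"):
    """Format each p-value as a short annotation string."""
    return [fmt.format(p) for p in pvalues]


# Bartlett p-values printed in section 4.5.
bartlett_pvalues = [0.008659557933879902, 6.378941946712463e-05, 0.14778816016231236]
labels = pvalue_labels(bartlett_pvalues)
print(labels)
```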

5. Closing thoughts

Adding statistical annotations to a plot by hand used to be tedious, but the statannot package makes it simple. The defaults are also sensible, so I expect to keep using statannot from now on.