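Appendix

Chapter 1 - Introduction to Causal Inference

The core of causal inference is separating causation from correlation. This section covers the basic concepts and goals of causal inference and how it differs from machine learning.

The following sections introduce the terminology needed to understand causal inference, explain why it is needed, and describe what it can do.

1.1 The Concept of Causal Inference

A causal relationship is the direct effect by which a change in one variable changes another. It must be distinguished from correlation, in which two variables merely move together.

Association is not causation. Sometimes, however, an association does reflect a causal relationship.

1.2 The Purpose of Causal Inference

The main purpose of causal inference is to estimate the effect of a treatment on an outcome. That estimate provides the evidence for business decisions.

1.3 Machine Learning and Causal Inference

Machine learning focuses on prediction ('What will happen?'), whereas causal inference answers questions such as 'Why did it happen?' or 'What would change if I intervened?'

1.4 Association and Causation

Correlation does not imply causation. The true causal effect can be recovered only after ruling out the confounders hidden in the data.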

from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from cycler import cycler

# Default property cycler (grayscale colors, line styles, markers)
default_cycler = (
    cycler(color=["0.3", "0.5", "0.7", "0.5"])
    + cycler(linestyle=["-", "--", ":", "-."])
    + cycler(marker=["o", "v", "d", "p"])
)

# Color, line style, and marker lists for manual use
color = ["0.3", "0.5", "0.7", "0.5"]
linestyle = ["-", "--", ":", "-."]
marker = ["o", "v", "d", "p"]

# matplotlib rc settings
plt.rc("axes", prop_cycle=default_cycler)
plt.rc("font", size=20)

# Build the file path with pathlib
data_path = Path("../data/xmas_sales.csv")

# Load the data
data = pd.read_csv(data_path)
data.tail()

	store	weeks_to_xmas	avg_week_sales	is_on_sale	weekly_amount_sold
1995	499	0	23.10	1	15.60
1996	500	3	20.52	0	154.68
1997	500	2	20.52	0	93.52
1998	500	1	20.52	1	111.16
1999	500	0	20.52	0	3.77

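1.4.1 Treatment and Outcome

The treatment is the intervention whose effect we want to know, and the outcome is the variable that changes as a result.

1.4.2 The Fundamental Problem of Causal Inference

The fundamental problem of causal inference is that, for any individual unit, we can never observe the outcome with treatment and the outcome without treatment at the same time.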

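def plot_weekly_sales_boxplot(data: pd.DataFrame, ax: plt.Axes = None, figsize=(3, 4)):
    """
    Draw a box plot of weekly amount sold, split by sale status.

    Args:
        data (pd.DataFrame): Data to plot (requires the 'weekly_amount_sold' and 'is_on_sale' columns).
        ax (plt.Axes, optional): Axes to draw on. If None, a new figure and axes are created.
        figsize (tuple, optional): Figure size, used only when a new figure is created.
    """
    # Remember whether this function owns the figure; ax is reassigned below
    created_fig = ax is None
    if created_fig:
        fig, ax = plt.subplots(1, 1, figsize=figsize)

    sns.boxplot(y="weekly_amount_sold", x="is_on_sale", data=data, ax=ax)

    # Use smaller fonts than the global rc setting
    ax.set_xlabel("Is On Sale", fontsize=10)
    ax.set_ylabel("Weekly Amount Sold", fontsize=10)
    ax.tick_params(axis="both", which="major", labelsize=10)

    plt.tight_layout()

    # Show the plot only when the figure was created here
    if created_fig:
        plt.show()


# Call the plotting function
plot_weekly_sales_boxplot(data, figsize=(4, 3))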

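1.4.3 Causal Models

A causal model is a framework that specifies the relationships between variables so that the effect of an intervention can be inferred.

1.4.4 Interventions

An intervention is the act of deliberately setting a particular variable in the system to a chosen value.

1.4.5 Individual Treatment Effect

The individual treatment effect (ITE) is the difference between a specific unit's outcome with treatment and its outcome without treatment.

1.4.6 Potential Outcomes

Potential outcomes are the hypothetical outcome values a unit would show under each treatment state.

1.4.7 Consistency and SUTVA Assumptions

1.4.8 Causal Estimands

Estimands such as the average treatment effect (ATE) summarize the causal effect across a whole population.

1.4.9 Example of Causal Estimands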

pd.DataFrame(
    dict(
        i=[1, 2, 3, 4, 5, 6],
        y0=[200, 120, 300, 450, 600, 600],
        y1=[220, 140, 400, 500, 600, 800],
        t=[0, 0, 0, 1, 1, 1],
        x=[0, 0, 1, 0, 0, 1],
    )
).assign(
    y=lambda d: (d["t"] * d["y1"] + (1 - d["t"]) * d["y0"]).astype(int),
    te=lambda d: d["y1"] - d["y0"],
)

	i	y0	y1	t	x	y	te
0	1	200	220	0	0	200	20
1	2	120	140	0	0	120	20
2	3	300	400	0	1	300	100
3	4	450	500	1	0	500	50
4	5	600	600	1	0	600	0
5	6	600	800	1	1	800	200
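
The table above lists both potential outcomes for every unit, which is only possible in a made-up example. Under that assumption, the average treatment effect can be read straight off the `te` column. The sketch below rebuilds the same toy table; ATT and ATC (the averages over treated and untreated units) are not named in the text above and are included only for contrast.

```python
import pandas as pd

# Same toy values as the table above (both potential outcomes are visible,
# which never happens with real data).
df = pd.DataFrame(
    dict(
        y0=[200, 120, 300, 450, 600, 600],
        y1=[220, 140, 400, 500, 600, 800],
        t=[0, 0, 0, 1, 1, 1],
    )
)

te = df["y1"] - df["y0"]  # individual treatment effects

ate = te.mean()                # average over all units
att = te[df["t"] == 1].mean()  # average over treated units only
atc = te[df["t"] == 0].mean()  # average over untreated units only

print(f"ATE={ate:.2f}, ATT={att:.2f}, ATC={atc:.2f}")
```

In this toy example the treated units have larger individual effects on average, so ATT exceeds ATE.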

pd.DataFrame(
    dict(
        i=[1, 2, 3, 4, 5, 6],
        y0=[
            200,
            120,
            300,
            np.nan,
            np.nan,
            np.nan,
        ],
        y1=[np.nan, np.nan, np.nan, 500, 600, 800],
        t=[0, 0, 0, 1, 1, 1],
        x=[0, 0, 1, 0, 0, 1],
    )
).assign(
    y=lambda d: np.where(d["t"] == 1, d["y1"], d["y0"]).astype(int),
    te=lambda d: d["y1"] - d["y0"],
)

	i	y0	y1	t	x	y	te
0	1	200.0	NaN	0	0	200	NaN
1	2	120.0	NaN	0	0	120	NaN
2	3	300.0	NaN	0	1	300	NaN
3	4	NaN	500.0	1	0	500	NaN
4	5	NaN	600.0	1	0	600	NaN
5	6	NaN	800.0	1	1	800	NaN
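
With only the observed table, the one comparison available is the difference in average `y` between treated and untreated units. A minimal sketch, assuming the hypothetical potential outcomes from the earlier full table, checks that this naive difference equals the average effect on the treated plus a bias term, E[Y0|T=1] - E[Y0|T=0]. The names `naive`, `att`, and `bias` are illustrative.

```python
import numpy as np
import pandas as pd

# Observed data only: one outcome per unit
obs = pd.DataFrame(
    dict(
        y=[200, 120, 300, 500, 600, 800],
        t=[0, 0, 0, 1, 1, 1],
    )
)
naive = obs.loc[obs["t"] == 1, "y"].mean() - obs.loc[obs["t"] == 0, "y"].mean()

# Hypothetical full potential outcomes (unobservable in practice),
# used here only to verify the decomposition.
y0 = np.array([200, 120, 300, 450, 600, 600])
y1 = np.array([220, 140, 400, 500, 600, 800])
t = np.array([0, 0, 0, 1, 1, 1])

att = (y1 - y0)[t == 1].mean()                # effect on the treated
bias = y0[t == 1].mean() - y0[t == 0].mean()  # baseline gap between groups

print(f"naive={naive:.2f}, ATT={att:.2f}, bias={bias:.2f}")
```

Most of the naive difference here is bias: the treated units would have had higher outcomes even without treatment.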

1.5 Bias

1.5.1 The Bias Equation

1.5.2 A Visual Guide to Bias

from pathlib import Path
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from cycler import cycler

# Nord theme color palette
nord_colors = [
    "#2E3440",
    "#3B4252",
    "#434C5E",
    "#4C566A",
    "#D8DEE9",
    "#E5E9F0",
    "#ECEFF4",
    "#8FBCBB",
    "#88C0D0",
    "#81A1C1",
    "#5E81AC",
]


def plot_sales_with_regression(data: pd.DataFrame, ax: plt.Axes = None):
    """
    Draw a scatter plot of weekly amount sold against average weekly sales, with a regression line.
    On-sale and not-on-sale observations get different colors and markers from the Nord palette.

    Args:
        data (pd.DataFrame): Data to plot (requires the 'avg_week_sales', 'weekly_amount_sold', and 'is_on_sale' columns).
        ax (plt.Axes, optional): Axes to draw on. If None, a new figure and axes are created.
    """
    # Remember whether this function owns the figure; ax is reassigned below
    created_fig = ax is None
    if created_fig:
        fig, ax = plt.subplots(figsize=(4, 4))

    # Colors and markers from the Nord palette
    color_on_sale = nord_colors[7]
    color_not_on_sale = nord_colors[4]
    marker_on_sale = "o"
    marker_not_on_sale = "v"

    # Regression line fitted on all observations
    sns.regplot(
        data=data,
        x="avg_week_sales",
        y="weekly_amount_sold",
        ci=None,
        scatter=False,
        color=nord_colors[3],
        ax=ax,
    )

    # Scatter plot of on-sale observations
    on_sale_data = data[data["is_on_sale"] == 1]
    ax.scatter(
        x=on_sale_data["avg_week_sales"],
        y=on_sale_data["weekly_amount_sold"],
        label="on sale",
        color=color_on_sale,
        alpha=0.9,
        marker=marker_on_sale,
        s=20,  # smaller points (optional)
    )

    # Scatter plot of not-on-sale observations
    not_on_sale_data = data[data["is_on_sale"] == 0]
    ax.scatter(
        x=not_on_sale_data["avg_week_sales"],
        y=not_on_sale_data["weekly_amount_sold"],
        label="not on sale",
        color=color_not_on_sale,
        alpha=0.9,
        marker=marker_not_on_sale,
        s=20,  # smaller points (optional)
    )

    # Legend, axis labels, ticks, and title with reduced font sizes
    ax.legend(fontsize=10)
    ax.set_xlabel("Average Weekly Sales", fontsize=8)
    ax.set_ylabel("Weekly Amount Sold", fontsize=8)
    ax.tick_params(axis="both", which="major", labelsize=6)
    ax.set_title("Weekly Sales vs. Average Weekly Sales", fontsize=10)

    # Adjust the layout
    plt.tight_layout()

    # Show the plot only when the figure was created here
    if created_fig:
        plt.show()


# Call the plotting function
plot_sales_with_regression(data)

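1.6 Identifying the Treatment Effect

1.6.1 The Independence Assumption

1.6.2 Identification with Randomization

1.7 Summary

Chapter 2 - Randomized Experiments and Statistics Review

The randomized controlled trial (RCT) is often called the gold standard of causal inference. This section covers the statistical foundations for designing experiments and analyzing their results.

2.1 Securing Independence Through Random Assignment

Randomly splitting units into a treatment group and a control group breaks any correlation between treatment status and other potential factors, which secures independence.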
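
The effect of random assignment on covariate balance can be illustrated with a small simulation. This is a sketch on synthetic data, not part of the original notebook: `x` is a hypothetical pre-treatment covariate, and the logistic self-selection rule is an assumption chosen only to create an imbalance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(size=n)  # hypothetical pre-treatment covariate

# Self-selection: units with larger x are more likely to take the treatment
p_selected = 1 / (1 + np.exp(-x))
t_selected = rng.binomial(1, p_selected)

# Random assignment: a coin flip that ignores x
t_random = rng.binomial(1, 0.5, size=n)


def mean_gap(x: np.ndarray, t: np.ndarray) -> float:
    """Difference in the covariate mean between treated and control units."""
    return x[t == 1].mean() - x[t == 0].mean()


gap_selected = mean_gap(x, t_selected)
gap_random = mean_gap(x, t_random)

print(f"covariate gap under self-selection: {gap_selected:.3f}")
print(f"covariate gap under randomization:  {gap_random:.3f}")
```

Under self-selection the two groups differ systematically in `x`; under the coin flip the gap shrinks toward zero as the sample grows.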

2.2 An A/B Testing Example

In business settings, A/B tests are widely used to measure the effect of website UI changes or marketing messages.

import pandas as pd  # for data manipulation
import numpy as np  # for numerical computation

data = pd.read_csv("../data/cross_sell_email.csv")
data

	gender	cross_sell_email	age	conversion
0	0	short	15	0
1	1	short	27	0
...	...	...	...	...
321	1	no_email	16	0
322	1	long	24	1

323 rows × 4 columns

(data.groupby(["cross_sell_email"]).mean())

	gender	age	conversion
cross_sell_email
long	0.550459	21.752294	0.055046
no_email	0.542553	20.489362	0.042553
short	0.633333	20.991667	0.125000

X = ["gender", "age"]

mu = data.groupby("cross_sell_email")[X].mean()
var = data.groupby("cross_sell_email")[X].var()

norm_diff = (mu - mu.loc["no_email"]) / np.sqrt((var + var.loc["no_email"]) / 2)

norm_diff

	gender	age
cross_sell_email
long	0.015802	0.221423
no_email	0.000000	0.000000
short	0.184341	0.087370

2.3 The Ideal Experiment

2.4 The Most Dangerous Equation

import warnings

warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
from matplotlib import pyplot as plt
from cycler import cycler
import matplotlib

default_cycler = cycler(color=["0.1", "0.5", "1.0"])

color = ["0.3", "0.5", "0.7", "0.9"]
linestyle = ["-", "--", ":", "-."]
marker = ["o", "v", "d", "p"]

plt.rc("axes", prop_cycle=default_cycler)

matplotlib.rcParams.update({"font.size": 18})

df = pd.read_csv("data/enem_scores.csv")
df.sort_values(by="avg_score", ascending=False).head(10)

	year	school_id	number_of_students	avg_score
16670	2007	33062633	68	82.97
16796	2007	33065403	172	82.04
...	...	...	...	...
14636	2007	31311723	222	79.41
17318	2007	33087679	210	79.38

10 rows × 4 columns

plot_data = df.assign(top_school=df["avg_score"] >= np.quantile(df["avg_score"], 0.99))[
    ["top_school", "number_of_students"]
].query(
    f"number_of_students<{np.quantile(df['number_of_students'], 0.98)}"
)  # remove outliers

plt.figure(figsize=(8, 4))
ax = sns.boxplot(x="top_school", y="number_of_students", data=plot_data)

plt.title("Number of Students of 1% Top Schools (Right)")

q_99 = np.quantile(df["avg_score"], 0.99)
q_01 = np.quantile(df["avg_score"], 0.01)

plot_data = df.sample(10000).assign(
    Group=lambda d: np.select(
        [(d["avg_score"] > q_99) | (d["avg_score"] < q_01)],
        ["Top and Bottom"],
        "Middle",
    )
)
plt.figure(figsize=(10, 5))
sns.scatterplot(
    y="avg_score",
    x="number_of_students",
    data=plot_data.query("Group=='Middle'"),
    label="Middle",
)
ax = sns.scatterplot(
    y="avg_score",
    x="number_of_students",
    data=plot_data.query("Group!='Middle'"),
    color="0.7",
    label="Top and Bottom",
)

plt.title("ENEM Score by Number of Students in the School")
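
Assuming the section title refers to the standard error of the mean, SE = sigma / sqrt(n), the pattern in the plots (small schools at both extremes) can arise from sampling noise alone. The sketch below uses synthetic data in which school size has no effect on scores; the sizes 20 and 500 and the score distribution are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: every school draws student scores from the same
# distribution, so school size has no causal effect on scores.
sizes = rng.choice([20, 500], size=4000)
school_means = np.array([rng.normal(50, 10, size=s).mean() for s in sizes])

# Which schools land in the top 1% of average scores?
top = school_means >= np.quantile(school_means, 0.99)
share_small_top = (sizes[top] == 20).mean()

sd_small = school_means[sizes == 20].std()
sd_large = school_means[sizes == 500].std()

print("SD of school means, small schools:", sd_small)
print("SD of school means, large schools:", sd_large)
print("Share of small schools in the top 1%:", share_small_top)
```

Averages over few students vary more, so small schools dominate the top (and, by symmetry, the bottom) of the ranking even though nothing about size improves scores.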

2.5 The Standard Error of Our Estimates

data = pd.read_csv("../data/cross_sell_email.csv")

short_email = data.query("cross_sell_email=='short'")["conversion"]
long_email = data.query("cross_sell_email=='long'")["conversion"]
email = data.query("cross_sell_email!='no_email'")["conversion"]
no_email = data.query("cross_sell_email=='no_email'")["conversion"]

data.groupby("cross_sell_email").size()

cross_sell_email
long        109
no_email     94
short       120
dtype: int64

def se(y: pd.Series):
    return y.std() / np.sqrt(len(y))


print("SE for Long Email:", se(long_email))
print("SE for Short Email:", se(short_email))

SE for Long Email: 0.021946024609185506
SE for Short Email: 0.030316953129541618

print("SE for Long Email:", long_email.sem())
print("SE for Short Email:", short_email.sem())

SE for Long Email: 0.021946024609185506
SE for Short Email: 0.030316953129541618
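
For a 0/1 outcome such as conversion, the sample-based standard error used above is closely related to the plug-in formula for a proportion, sqrt(p(1-p)/n). A minimal sketch on simulated data (the series `y` and its conversion rate are assumptions, not the email data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
y = pd.Series(rng.binomial(1, 0.1, size=200))  # hypothetical 0/1 conversions

p_hat = y.mean()
n = len(y)

se_sample = y.std() / np.sqrt(n)                # same formula as se() above (ddof=1)
se_binomial = np.sqrt(p_hat * (1 - p_hat) / n)  # plug-in formula for a proportion

print("sample-based SE:", se_sample)
print("binomial SE:    ", se_binomial)
print("ratio:          ", se_sample / se_binomial, "vs sqrt(n/(n-1)) =", np.sqrt(n / (n - 1)))
```

The two differ only by the factor sqrt(n/(n-1)) that comes from the unbiased variance estimate, so they agree closely for samples of this size.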

2.6 Confidence Intervals

n = 100
conv_rate = 0.08


def run_experiment():
    return np.random.binomial(1, conv_rate, size=n)


np.random.seed(42)

experiments = [run_experiment().mean() for _ in range(10000)]

plt.figure(figsize=(10, 4))
freq, bins, img = plt.hist(experiments, bins=20, label="Experiment Means", color="0.6")
plt.vlines(
    conv_rate,
    ymin=0,
    ymax=freq.max(),
    linestyles="dashed",
    label="True Mean",
    color="0.3",
)
plt.legend()

np.random.seed(42)
plt.figure(figsize=(10, 4))
plt.hist(np.random.binomial(1, 0.08, 100), bins=20)

(array([92.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  8.]),
 array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
        0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ]),
 <BarContainer object of 20 artists>)

x = np.linspace(-4, 4, 100)
y = stats.norm.pdf(x, 0, 1)

plt.figure(figsize=(10, 4))
plt.plot(x, y, linestyle="solid")
plt.fill_between(x.clip(-3, +3), 0, y, alpha=0.5, label="~99.7% mass", color="C2")
plt.fill_between(x.clip(-2, +2), 0, y, alpha=0.5, label="~95% mass", color="C1")
plt.fill_between(x.clip(-1, +1), 0, y, alpha=0.5, label="~68% mass", color="C0")
plt.ylabel("Density")
plt.legend()

exp_se = short_email.sem()
exp_mu = short_email.mean()
ci = (exp_mu - 2 * exp_se, exp_mu + 2 * exp_se)
print("95% CI for Short Email: ", ci)

95% CI for Short Email:  (0.06436609374091676, 0.18563390625908324)

x = np.linspace(exp_mu - 4 * exp_se, exp_mu + 4 * exp_se, 100)
y = stats.norm.pdf(x, exp_mu, exp_se)

plt.figure(figsize=(10, 4))
plt.plot(x, y, lw=3)
plt.vlines(ci[1], ymin=0, ymax=4, ls="dotted")
plt.vlines(ci[0], ymin=0, ymax=4, ls="dotted", label="95% CI")
plt.xlabel("Conversion")
plt.legend()

from scipy import stats

z = np.abs(stats.norm.ppf((1 - 0.99) / 2))
print(z)
ci = (exp_mu - z * exp_se, exp_mu + z * exp_se)
ci

2.5758293035489004

(0.04690870373460816, 0.20309129626539185)

stats.norm.ppf((1 - 0.99) / 2)

-2.5758293035489004
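
The same critical-value lookup generalizes to any confidence level. The helper below is a sketch; `normal_ci` is a hypothetical name, and the mean and standard error are copied from the short-email outputs above.

```python
from scipy import stats


def normal_ci(mean: float, se: float, level: float = 0.95) -> tuple:
    """Two-sided normal-approximation confidence interval at the given level."""
    z = stats.norm.ppf(1 - (1 - level) / 2)  # critical value for the chosen level
    return (mean - z * se, mean + z * se)


# Mean and standard error of the short-email group, copied from the outputs above
exp_mu, exp_se = 0.125, 0.030316953129541618

for level in (0.90, 0.95, 0.99):
    print(level, normal_ci(exp_mu, exp_se, level))
```

At level 0.99 this reproduces the interval computed above; higher confidence levels give wider intervals.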

x = np.linspace(exp_mu - 4 * exp_se, exp_mu + 4 * exp_se, 100)
y = stats.norm.pdf(x, exp_mu, exp_se)

plt.figure(figsize=(10, 4))
plt.plot(x, y, lw=3)
plt.vlines(ci[1], ymin=0, ymax=4, ls="dotted")
plt.vlines(ci[0], ymin=0, ymax=4, ls="dotted", label="99% CI")


ci_95 = (exp_mu - 1.96 * exp_se, exp_mu + 1.96 * exp_se)

plt.vlines(ci_95[1], ymin=0, ymax=4, ls="dashed")
plt.vlines(ci_95[0], ymin=0, ymax=4, ls="dashed", label="95% CI")
plt.xlabel("Conversion")
plt.legend()

def ci(y: pd.Series):
    return (y.mean() - 2 * y.sem(), y.mean() + 2 * y.sem())


print("95% CI for Short Email:", ci(short_email))
print("95% CI for Long Email:", ci(long_email))
print("95% CI for No Email:", ci(no_email))

95% CI for Short Email: (0.06436609374091676, 0.18563390625908324)
95% CI for Long Email: (0.01115382234126202, 0.09893792077800403)
95% CI for No Email: (0.0006919679286838468, 0.08441441505003955)

plt.figure(figsize=(10, 4))

x = np.linspace(-0.05, 0.25, 100)
short_dist = stats.norm.pdf(x, short_email.mean(), short_email.sem())
plt.plot(x, short_dist, lw=2, label="Short", linestyle=linestyle[0])
plt.fill_between(
    x.clip(ci(short_email)[0], ci(short_email)[1]),
    0,
    short_dist,
    alpha=0.2,
    color="0.0",
)

long_dist = stats.norm.pdf(x, long_email.mean(), long_email.sem())
plt.plot(x, long_dist, lw=2, label="Long", linestyle=linestyle[1])
plt.fill_between(
    x.clip(ci(long_email)[0], ci(long_email)[1]), 0, long_dist, alpha=0.2, color="0.4"
)

no_email_dist = stats.norm.pdf(x, no_email.mean(), no_email.sem())
plt.plot(x, no_email_dist, lw=2, label="No email", linestyle=linestyle[2])
plt.fill_between(
    x.clip(ci(no_email)[0], ci(no_email)[1]), 0, no_email_dist, alpha=0.2, color="0.8"
)

plt.xlabel("Conversion")
plt.legend()

2.7 Hypothesis Testing

import seaborn as sns
from matplotlib import pyplot as plt

np.random.seed(123)

n1 = np.random.normal(4, 3, 30000)
n2 = np.random.normal(1, 4, 30000)
n_diff = n2 - n1

plt.figure(figsize=(10, 4))
# sns.distplot is deprecated; sns.kdeplot draws the same density curves
sns.kdeplot(n1, label="$N(4,3^2)$", color="0.0", linestyle=linestyle[0])
sns.kdeplot(n2, label="$N(1,4^2)$", color="0.4", linestyle=linestyle[1])
sns.kdeplot(
    n_diff,
    label="$N(-3, 5^2) = N(1,4^2) - N(4,3^2)$",
    color="0.8",
    linestyle=linestyle[2],  # distinct style for the difference curve
)
plt.legend();

diff_mu = short_email.mean() - no_email.mean()
diff_se = np.sqrt(no_email.sem() ** 2 + short_email.sem() ** 2)

ci = (diff_mu - 1.96 * diff_se, diff_mu + 1.96 * diff_se)
print(f"95% CI for the difference (short email - no email):\n{ci}")

95% CI for the difference (short email - no email):
(0.01023980847439844, 0.15465380854687816)

x = np.linspace(diff_mu - 4 * diff_se, diff_mu + 4 * diff_se, 100)
y = stats.norm.pdf(x, diff_mu, diff_se)

plt.figure(figsize=(10, 3))
plt.plot(x, y, lw=3)
plt.vlines(ci[1], ymin=0, ymax=4, ls="dotted")
plt.vlines(ci[0], ymin=0, ymax=4, ls="dotted", label="95% CI")
plt.xlabel("Diff. in Conversion (Short - No Email)\n")
plt.legend()
plt.subplots_adjust(bottom=0.15)

2.7.1 Null Hypothesis

# shifting the CI
diff_mu_shifted = short_email.mean() - no_email.mean() - 0.01
diff_se = np.sqrt(no_email.sem() ** 2 + short_email.sem() ** 2)

ci = (diff_mu_shifted - 1.96 * diff_se, diff_mu_shifted + 1.96 * diff_se)
print(f"95% CI 1% difference between (short email - no email):\n{ci}")

95% CI 1% difference between (short email - no email):
(0.00023980847439844521, 0.14465380854687815)

2.7.2 Test Statistic

t_stat = (diff_mu - 0) / diff_se
t_stat

2.2379512318715364

2.8 p-values

x = np.linspace(-4, 4, 100)
y = stats.norm.pdf(x, 0, 1)

plt.figure(figsize=(10, 4))
plt.plot(x, y, lw=2)
plt.vlines(t_stat, ymin=0, ymax=0.1, ls="dotted", label="T-Stat", lw=2)
plt.fill_between(x.clip(t_stat), 0, y, alpha=0.4, label="P-value")
plt.legend()

print("P-value:", (1 - stats.norm.cdf(t_stat)) * 2)

P-value: 0.025224235562152142
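
The steps above (difference in means, combined standard error, test statistic, two-sided p-value) can be wrapped in one helper. This is a sketch: `diff_in_means_test` is a hypothetical name, and the 0/1 outcomes are rebuilt from the group sizes and conversion rates printed earlier rather than loaded from the CSV.

```python
import numpy as np
import pandas as pd
from scipy import stats


def diff_in_means_test(treated: pd.Series, control: pd.Series, null: float = 0.0):
    """Return (difference, standard error, z statistic, two-sided p-value)."""
    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.sem() ** 2 + control.sem() ** 2)
    z = (diff - null) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return diff, se, z, p_value


# Assumption: outcomes reconstructed from the printed summaries
# (short: 15 conversions out of 120; no_email: 4 out of 94).
short = pd.Series([1] * 15 + [0] * 105)
no_mail = pd.Series([1] * 4 + [0] * 90)

diff, se, z, p_value = diff_in_means_test(short, no_mail)
print(f"diff={diff:.4f}, se={se:.4f}, z={z:.4f}, p={p_value:.4f}")
```

Passing a nonzero `null` tests against a shifted hypothesis, in the same spirit as the shifted interval in the null-hypothesis section.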

2.9 Power

stats.norm.cdf(0.84)

0.7995458067395503

2.10 Sample Size Calculation

# Rule of thumb: n = 16 * sigma^2 / delta^2, here with delta = 0.08.
# The book has np.ceil(16 * no_email.std()**2/0.01), which is missing the **2 in the denominator.
np.ceil(16 * (no_email.std() / 0.08) ** 2)

103.0
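
The constant 16 in the calculation above can be traced back to the 1.96 critical value and the 0.84 power quantile. One common derivation, sketched here under the assumption of two equal-sized groups with a shared standard deviation, gives n = 2 (z_alpha + z_power)^2 sigma^2 / delta^2 per group. The function name `required_n` is hypothetical.

```python
import numpy as np
from scipy import stats


def required_n(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size to detect a difference `delta` between two equal-sized groups."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided significance quantile
    z_power = stats.norm.ppf(power)          # quantile matching the desired power
    return int(np.ceil(2 * (z_alpha + z_power) ** 2 * (sigma / delta) ** 2))


# The multiplier that the rule of thumb rounds up to 16
constant = 2 * (stats.norm.ppf(0.975) + stats.norm.ppf(0.80)) ** 2
print("multiplier:", constant)

# Illustrative inputs, not taken from the email experiment
print("n per group:", required_n(sigma=1.0, delta=0.5))
```

Because the exact multiplier is slightly below 16, the rule of thumb is mildly conservative.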

data.groupby("cross_sell_email").size()

cross_sell_email
long        109
no_email     94
short       120
dtype: int64

2.11 Summary

Chapter 3 - Graphical Causal Models

3.1 Thinking About Causality

import warnings

warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import graphviz as gr

color = ["0.3", "0.5", "0.7", "0.9"]
linestyle = ["-", "--", ":", "-."]
marker = ["o", "v", "d", "p"]

pd.set_option("display.max_rows", 6)

gr.set_default_format("png");

import pandas as pd

data = pd.read_csv("../data/cross_sell_email.csv")
data

	gender	cross_sell_email	age	conversion
0	0	short	15	0
1	1	short	27	0
2	1	long	17	0
...	...	...	...	...
320	0	no_email	15	0
321	1	no_email	16	0
322	1	long	24	1

323 rows × 4 columns

3.1.1 Visualizing Causal Relationships

import graphviz as gr

g_cross_sell = gr.Digraph()

g_cross_sell.edge("U", "conversion")
g_cross_sell.edge("U", "age")
g_cross_sell.edge("U", "gender")

g_cross_sell.edge("rnd", "cross_sell_email")
g_cross_sell.edge("cross_sell_email", "conversion")
g_cross_sell.edge("age", "conversion")
g_cross_sell.edge("gender", "conversion")

g_cross_sell

# rankdir:LR layers the graph from left to right
g_cross_sell = gr.Digraph(graph_attr={"rankdir": "LR"})

g_cross_sell.edge("U", "conversion")
g_cross_sell.edge("U", "X")

g_cross_sell.edge("cross_sell_email", "conversion")
g_cross_sell.edge("X", "conversion")

g_cross_sell

3.1.2 Deciding Whether to Hire Consultants

g_consultancy = gr.Digraph(graph_attr={"rankdir": "LR"})

g_consultancy.edge("U1", "profits_next_6m")
g_consultancy.edge("U2", "consultancy")
g_consultancy.edge("U3", "profits_prev_6m")

g_consultancy.edge("consultancy", "profits_next_6m")

g_consultancy.edge("profits_prev_6m", "consultancy")
g_consultancy.edge("profits_prev_6m", "profits_next_6m")

g_consultancy

g_consultancy = gr.Digraph(graph_attr={"rankdir": "LR"})

g_consultancy.edge("consultancy", "profits_next_6m")
g_consultancy.edge("profits_prev_6m", "consultancy")
g_consultancy.edge("profits_prev_6m", "profits_next_6m")

g_consultancy

3.2 Crash Course in Graphical Models

3.2.1 Chains

g = gr.Digraph(graph_attr={"rankdir": "LR"})

g.edge("T", "M")
g.edge("M", "Y")
g.node("M", "M")


g.edge("causal knowledge", "solve problems")
g.edge("solve problems", "job promotion")

g

g = gr.Digraph(graph_attr={"rankdir": "LR"})

g.edge("T", "M")
g.edge("M", "Y")
g.node("M", "M")
g.node("M", color="lightgrey", style="filled")


g.edge("causal knowledge", "solve problems")
g.edge("solve problems", "job promotion")
g.node("solve problems", color="lightgrey", style="filled")

g

3.2.2 Forks

g = gr.Digraph()


g.edge("X", "Y")
g.edge("X", "T")
g.node("X", "X")

g.edge("statistics", "causal inference")
g.edge("statistics", "machine learning")

g

g = gr.Digraph()

g.edge("good programmer", "can invert a binary tree")
g.edge("good programmer", "good employee")

g

3.2.3 Colliders

g = gr.Digraph()

g.edge("Y", "X")
g.edge("T", "X")

g.edge("statistics", "job promotion")
g.edge("flatter", "job promotion")

g

g = gr.Digraph()

g.edge("Y", "X1")
g.edge("T", "X1")
g.edge("X1", "X2")
g.node("X2", color="lightgrey", style="filled")

g.edge("statistics", "job promotion")
g.edge("flatter", "job promotion")
g.edge("job promotion", "high salary")

g.node("high salary", color="lightgrey", style="filled")

g
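
The collider structure drawn above (T -> X <- Y) has a consequence that is easy to check by simulation: two independent causes become associated once we condition on their common effect. The sketch below uses synthetic standard-normal variables; the threshold `x > 1` is an arbitrary way to condition on the collider.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# T and Y are generated independently; X is their common effect (a collider).
t = rng.normal(size=n)
y = rng.normal(size=n)
x = t + y + rng.normal(size=n)

# Unconditional association between T and Y
corr_all = np.corrcoef(t, y)[0, 1]

# Conditioning on the collider: keep only units with a high value of X
selected = x > 1
corr_selected = np.corrcoef(t[selected], y[selected])[0, 1]

print(f"corr(T, Y) overall:       {corr_all:.3f}")
print(f"corr(T, Y) given X > 1:   {corr_selected:.3f}")
```

Among units with a high X, a low T must be offset by a high Y, which produces the negative association.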

3.2.4 The Flow of Association Cheat Sheet

3.2.5 Querying a Graph in Python

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("C", "A")
g.edge("C", "B")
g.edge("D", "A")
g.edge("B", "E")
g.edge("F", "E")
g.edge("A", "G")

g

import networkx as nx

model = nx.DiGraph(
    [
        ("C", "A"),
        ("C", "B"),
        ("D", "A"),
        ("B", "E"),
        ("F", "E"),
        ("A", "G"),
    ]
)

print("Are D and C dependent?")
print(not (nx.d_separated(model, {"D"}, {"C"}, set())))  # set() is the empty conditioning set

print("Are D and C dependent given A?")
print(not (nx.d_separated(model, {"D"}, {"C"}, {"A"})))

print("Are D and C dependent given G?")
print(not (nx.d_separated(model, {"D"}, {"C"}, {"G"})))

Are D and C dependent?
False
Are D and C dependent given A?
True
Are D and C dependent given G?
True

print("Are G and D dependent?")
print(not (nx.d_separated(model, {"G"}, {"D"}, set())))

print("Are G and D dependent given A?")
print(not (nx.d_separated(model, {"G"}, {"D"}, {"A"})))

Are G and D dependent?
True
Are G and D dependent given A?
False

print("Are A and B dependent?")
print(not (nx.d_separated(model, {"A"}, {"B"}, set())))

print("Are A and B dependent given C?")
print(not (nx.d_separated(model, {"A"}, {"B"}, {"C"})))

Are A and B dependent?
True
Are A and B dependent given C?
False

print("Are G and F dependent?")
print(not (nx.d_separated(model, {"G"}, {"F"}, set())))

print("Are G and F dependent given E?")
print(not (nx.d_separated(model, {"G"}, {"F"}, {"E"})))

Are G and F dependent?
False
Are G and F dependent given E?
True

3.3 Identification Revisited

consultancy_sev = gr.Digraph(graph_attr={"rankdir": "LR"})
consultancy_sev.edge("profits_prev_6m", "profits_next_6m")
consultancy_sev.edge("profits_prev_6m", "consultancy")

consultancy_sev

consultancy_model_severed = nx.DiGraph(
    [
        ("profits_prev_6m", "profits_next_6m"),
        ("profits_prev_6m", "consultancy"),
        #     ("consultancy", "profits_next_6m"), # causal relationship removed
    ]
)

not (
    nx.d_separated(consultancy_model_severed, {"consultancy"}, {"profits_next_6m"}, set())
)

True

g_consultancy = gr.Digraph(graph_attr={"rankdir": "LR"})
g_consultancy.edge("profits_prev_6m", "profits_next_6m")
g_consultancy.edge("profits_prev_6m", "consultancy")
g_consultancy.edge("consultancy", "profits_next_6m")
g_consultancy.node("profits_prev_6m", color="lightgrey", style="filled")

g_consultancy

3.4 Conditional Independence Assumption and the Adjustment Formula

3.5 Positivity Assumption

3.6 A Concrete Identification Example

df = pd.DataFrame(
    dict(
        profits_prev_6m=[1.0, 1.0, 1.0, 5.0, 5.0, 5.0],
        consultancy=[0, 0, 1, 0, 1, 1],
        profits_next_6m=[1, 1.1, 1.2, 5.5, 5.7, 5.7],
    )
)

df

	profits_prev_6m	consultancy	profits_next_6m
0	1.0	0	1.0
1	1.0	0	1.1
2	1.0	1	1.2
3	5.0	0	5.5
4	5.0	1	5.7
5	5.0	1	5.7

(
    df.query("consultancy==1")["profits_next_6m"].mean()
    - df.query("consultancy==0")["profits_next_6m"].mean()
)

1.666666666666667

avg_df = df.groupby(["consultancy", "profits_prev_6m"])["profits_next_6m"].mean()

avg_df.loc[1] - avg_df.loc[0]

profits_prev_6m
1.0    0.15
5.0    0.20
Name: profits_next_6m, dtype: float64
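
The two within-stratum differences above can be combined into a single adjusted estimate by weighting each stratum by its share of the sample, which is the adjustment formula ATE = sum over x of (E[Y|T=1, X=x] - E[Y|T=0, X=x]) P(X=x). A minimal sketch on the same toy data; `adjusted_ate` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    dict(
        profits_prev_6m=[1.0, 1.0, 1.0, 5.0, 5.0, 5.0],
        consultancy=[0, 0, 1, 0, 1, 1],
        profits_next_6m=[1, 1.1, 1.2, 5.5, 5.7, 5.7],
    )
)

# Step 1: treated-minus-control difference inside each stratum of the confounder
strata_means = df.groupby(["consultancy", "profits_prev_6m"])["profits_next_6m"].mean()
strata_effect = strata_means.loc[1] - strata_means.loc[0]

# Step 2: weight each stratum by its share of the sample, P(X = x)
weights = df["profits_prev_6m"].value_counts(normalize=True)
adjusted_ate = (strata_effect * weights).sum()

print("effect by stratum:")
print(strata_effect)
print("adjusted ATE:", adjusted_ate)
```

Both strata hold half of the sample here, so the adjusted estimate is the simple average of the two stratum effects, far below the unadjusted difference.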

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("U", "T")
g.edge("U", "Y")
g.edge("T", "M")
g.edge("M", "Y")

g

3.7 Confounding Bias

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("X", "T")
g.edge("X", "Y")
g.edge("T", "Y")

g.edge("Manager Quality", "Training")
g.edge("Manager Quality", "Engagement")
g.edge("Training", "Engagement")

g

3.7.1 Proxy Confounders

g = gr.Digraph()
g.edge("X1", "U")
g.edge("U", "X2")
g.edge("U", "T")
g.edge("T", "Y")
g.edge("U", "Y")

g.edge("Manager Quality", "Team's Attrition")
g.edge("Manager Quality", "Team's Past Performance")
g.edge("Manager's Tenure", "Manager Quality")
g.edge("Manager's Education Level", "Manager Quality")

g.edge("Manager Quality", "Training")
g.edge("Training", "Engagement")
g.edge("Manager Quality", "Engagement")

g

3.7.2 Randomization Revisited

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("rnd", "T")
g.edge("T", "Y")
g.edge("U", "Y")

g

3.8 Selection Bias

3.8.1 Conditioning on a Collider

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("T", "S")
g.edge("T", "Y")
g.edge("Y", "S")
g.node("S", color="lightgrey", style="filled")

g.edge("RND", "New Feature")
g.edge("New Feature", "Customer Satisfaction")
g.edge("Customer Satisfaction", "NPS")
g.edge("Customer Satisfaction", "Response")
g.edge("New Feature", "Response")
g.node("Response", "Response", color="lightgrey", style="filled")

g

nps_model = nx.DiGraph(
    [
        ("RND", "New Feature"),
        #     ("New Feature", "Customer Satisfaction"),
        ("Customer Satisfaction", "NPS"),
        ("Customer Satisfaction", "Response"),
        ("New Feature", "Response"),
    ]
)


not (nx.d_separated(nps_model, {"NPS"}, {"New Feature"}, {"Response"}))

True

np.random.seed(2)
n = 100000
new_feature = np.random.binomial(1, 0.5, n)

satisfaction_0 = np.random.normal(0, 0.5, n)
satisfaction_1 = satisfaction_0 + 0.4
satisfaction = new_feature * satisfaction_1 + (1 - new_feature) * satisfaction_0

nps_0 = np.random.normal(satisfaction_0, 1)
nps_1 = np.random.normal(satisfaction_1, 1)
nps = new_feature * nps_1 + (1 - new_feature) * nps_0


responded = (np.random.normal(0 + new_feature + satisfaction, 1) > 1).astype(int)

tr_df = pd.DataFrame(
    dict(
        new_feature=new_feature, responded=responded, nps_0=nps_0, nps_1=nps_1, nps=nps
    )
)

tr_df_measurable = pd.DataFrame(
    dict(
        new_feature=new_feature,
        responded=responded,
        nps_0=np.nan,
        nps_1=np.nan,
        nps=np.where(responded, nps, np.nan),
    )
)

tr_df.groupby("new_feature").mean()

	responded	nps_0	nps_1	nps
new_feature
0	0.183715	-0.005047	0.395015	-0.005047
1	0.639342	-0.005239	0.401082	0.401082

tr_df_measurable.groupby("new_feature").mean().assign(**{"nps": np.nan})

	responded	nps_0	nps_1	nps
new_feature
0	0.183715	NaN	NaN	NaN
1	0.639342	NaN	NaN	NaN

tr_df_measurable.groupby(["responded", "new_feature"]).mean()

		nps_0	nps_1	nps
responded	new_feature
0	0	NaN	NaN	NaN
0	1	NaN	NaN	NaN
1	0	NaN	NaN	0.314073
1	1	NaN	NaN	0.536106

tr_df.groupby(["responded", "new_feature"]).mean()

		nps_0	nps_1	nps
responded	new_feature
0	0	-0.076869	0.320616	-0.076869
0	1	-0.234852	0.161725	0.161725
1	0	0.314073	0.725585	0.314073
1	1	0.124287	0.536106	0.536106

3.8.2 Adjusting for Selection Bias

g = gr.Digraph()

g.edge("U", "X")
g.edge("X", "S")
g.edge("U", "Y")
g.edge("T", "Y")
g.edge("T", "S")
g.node("S", color="lightgrey", style="filled")

g.edge("New Feature", "Customer Satisfaction")
g.edge("Unknown Stuff", "Customer Satisfaction")
g.edge("Unknown Stuff", "Time in App")
g.edge("Time in App", "Response")
g.edge("New Feature", "Response")

g.node("Response", "Response", color="lightgrey", style="filled")

g

g = gr.Digraph(graph_attr={"rankdir": "LR"})

g.edge("X1", "U")
g.edge("U", "X2")
g.edge("X5", "S")
g.edge("U", "Y", style="dashed")
g.edge("U", "S", style="dashed")
g.edge("U", "X3")
g.edge("X3", "S")
g.edge("Y", "X4")
g.edge("X4", "S")
g.edge("T", "X5")
g.edge("T", "Y")
g.edge("T", "S", style="dashed")
g.node("S", color="lightgrey", style="filled")

g

g = gr.Digraph(graph_attr={"rankdir": "LR"})

g.edge("Y", "X")
g.edge("T", "X")
g.edge("T", "Y")

g;

3.8.3 Conditioning on a Mediator

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("T", "M")
g.edge("T", "Y")
g.edge("M", "Y")
g.node("M", color="lightgrey", style="filled")

g.edge("woman", "seniority")
g.edge("woman", "salary")
g.edge("seniority", "salary")
g.node("seniority", color="lightgrey", style="filled")

g

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("T", "M")
g.edge("T", "Y")
g.edge("M", "Y")
g.edge("M", "X")
g.node("X", color="lightgrey", style="filled")

g

3.9 Summary

g = gr.Digraph(graph_attr={"rankdir": "LR", "ratio": "0.3"})
g.edge("U", "T")
g.edge("U", "Y")
g.edge("T", "Y")

g

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("T", "M")
g.edge("M", "Y")
g.edge("T", "Y")
g.edge("T", "S")
g.edge("Y", "S")

g.node("M", color="lightgrey", style="filled")
g.node("S", color="lightgrey", style="filled")

g

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("T", "In-Game Purchase")
g.edge("T", "In-Game Purchase > 0")
g.edge("In-Game Purchase", "In-Game Purchase > 0")

g.node("In-Game Purchase > 0", color="lightgrey", style="filled")

g

g = gr.Digraph(graph_attr={"rankdir": "LR"})
g.edge("loan amount", "Default at yr=1")
g.edge("Default at yr=1", "Default at yr=2")
g.edge("Default at yr=2", "Default at yr=3")
g.edge("U", "Default at yr=1")
g.edge("U", "Default at yr=2")
g.edge("U", "Default at yr=3")

g.node("Default at yr=1", color="lightgrey", style="filled")
g.node("Default at yr=2", color="darkgrey", style="filled")

g