Contents

0. Introduction
1. Importing libraries
2. Loading the data
3. Inspecting the data
4. Exploring the data
5. Building a predictive model
6. References

0. Introduction

About the dataset

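The dataset used here is the Breast Cancer Wisconsin (Diagnostic) dataset.^[1] It contains 30 features together with the breast cancer diagnosis for each sample. It has 569 records in total and was provided by the University of Wisconsin. The table below briefly describes the features in the dataset.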

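Feature	Description
id	Patient identification number
diagnosis	Tumor diagnosis (M = malignant, B = benign)
radius	Size of the cell nucleus (mean distance from the center to points on the perimeter)
texture	Texture (standard deviation of the gray-scale values)
perimeter	Perimeter
area	Area
smoothness	Smoothness (local variation in radius lengths)
compactness	Compactness (computed as $\text{perimeter}^2/\text{area} - 1$)
concavity	Concavity (severity of the concave portions of the contour)
concave points	Number of concave portions of the contour
symmetry	Symmetry
fractal dimension	Fractal dimension (computed as $\text{coastline approximation} - 1$)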

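Each measurement is reported as _mean (the mean), _se (the standard error), and _worst (the mean of the three largest values), so the ten measurements above yield 30 features in total.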
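As a quick check on that count, the 30 column names can be rebuilt by combining the ten measurements with the three suffixes. The sketch below is plain Python and assumes the column spellings printed by describe() in section 3 (for example, `concave points_mean` contains a space).

```python
# Ten base measurements, spelled as in the dataset's column names
measurements = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# Columns are grouped by suffix: all *_mean first, then *_se, then *_worst
feature_names = [f"{m}_{s}" for s in suffixes for m in measurements]

print(len(feature_names))   # 30
print(feature_names[:3])
```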

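H2O

H2O is Java-based software for data modeling. Its primary purpose is to act as a parallel computing platform that puts many CPUs and large amounts of memory to work. Although it is written in Java, it provides Python and R interfaces.^[2]

Analysis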

We first analyze the feature values in the dataset to get a better understanding of breast cancer diagnosis. We then build a model with the GBM algorithm and use it to predict the diagnosis.

1. Importing libraries

Import the libraries used in the analysis.

In [1]:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

%matplotlib inline

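We use H2O to load the data, so H2O must be initialized first.

1.1. Starting H2O

H2O first tries to connect to an existing instance. If none is available, it starts a new one. Once the information about the new instance is printed to the console, H2O is ready to use.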

In [2]:

h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_222"; OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1~deb9u1-b10); OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
  Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp1sor8o01
  JVM stdout: /tmp/tmp1sor8o01/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmp1sor8o01/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.

H2O cluster uptime:	02 secs
H2O cluster timezone:	Etc/UTC
H2O data parsing timezone:	UTC
H2O cluster version:	3.26.0.5
H2O cluster version age:	18 days
H2O cluster name:	H2O_from_python_unknownUser_8bo2vb
H2O cluster total nodes:	1
H2O cluster free memory:	3.556 Gb
H2O cluster total cores:	4
H2O cluster allowed cores:	4
H2O cluster status:	accepting new members, healthy
H2O connection url:	http://127.0.0.1:54321
H2O connection proxy:	None
H2O internal security:	False
H2O API Extensions:	Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version:	3.6.6 final

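The output includes additional information such as the H2O cluster uptime, timezone, version, version age, cluster name, allocated hardware resources (number of nodes, memory, cores), connection URL, exposed H2O API extensions, and the Python version in use.

2. Loading the data

Now load the data with H2O using the code below.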

In [3]:

data_df = h2o.import_file("../input/data.csv", destination_frame="data_df")

Parse progress: |█████████████████████████████████████████████████████████| 100%

3. Inspecting the data

Use H2O's describe() function to inspect the loaded dataset.

Calling describe() is equivalent to calling summary().

In [4]:

data_df.describe()

Rows:569
Cols:33

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	symmetry_mean	fractal_dimension_mean	radius_se	texture_se	perimeter_se	area_se	smoothness_se	compactness_se	concavity_se	concave points_se	symmetry_se	fractal_dimension_se	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst	C33
type	int	enum	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	real	int
mins	8670.0		6.981	9.71	43.79	143.5	0.05263	0.01938	0.0	0.0	0.106	0.04996	0.1115	0.3602	0.757	6.802	0.001713	0.002252	0.0	0.0	0.007882	0.0008948	7.93	12.02	50.41	185.2	0.07117	0.02729	0.0	0.0	0.1565	0.05504	NaN
mean	30371831.43233744		14.127291739894554	19.289648506151146	91.96903339191564	654.8891036906855	0.09636028119507907	0.10434098418277679	0.0887993158172232	0.04891914586994728	0.18116186291739894	0.06279760984182778	0.40517205623901575	1.2168534270650264	2.8660592267135323	40.337079086116	0.007040978910369067	0.025478138840070295	0.031893716344463974	0.011796137082601054	0.020542298769771525	0.0037949038664323374	16.269189806678384	25.677223198594024	107.26121265377856	880.5831282952548	0.1323685940246046	0.2542650439367311	0.27218848330404216	0.11460622319859404	0.2900755711775044	0.0839458172231986	0.0
maxs	911320502.0		28.11	39.28	188.5	2501.0	0.1634	0.3454	0.4268	0.2012	0.304	0.09744	2.873	4.885	21.98	542.2	0.03113	0.1354	0.396	0.05279	0.07895	0.02984	36.04	49.54	251.2	4254.0	0.2226	1.058	1.252	0.291	0.6638	0.2075	NaN
sigma	125020585.61222367		3.524048826212078	4.301035768166949	24.2989810387549	351.914129181653	0.014064128137673616	0.052812757932512194	0.07971980870789348	0.038802844859153605	0.027414281336035712	0.00706036279508446	0.2773127329861039	0.5516483926172023	2.0218545540421076	45.49100551613181	0.003002517943839066	0.017908179325677388	0.030186060322988408	0.00617028517404687	0.008266371528798399	0.002646070967089195	4.833241580469323	6.14625762303832	33.602542269036356	569.356992669949	0.022832429404835465	0.157336488913742	0.2086242806081323	0.06573234119594207	0.06186746753751869	0.018061267348893986	-0.0
zeros	0		0	0	0	0	0	0	13	13	0	0	0	0	0	0	0	0	13	13	0	0	0	0	0	0	0	0	13	13	0	0	0
missing	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	569
0	842302.0	M	17.99	10.38	122.8	1001.0	0.1184	0.2776	0.3001	0.1471	0.2419	0.07871	1.095	0.9053	8.589	153.4	0.006399	0.04904	0.05373	0.01587	0.03003	0.006193	25.38	17.33	184.6	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.1189	nan
1	842517.0	M	20.57	17.77	132.9	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	0.5435	0.7339	3.398	74.08	0.005225	0.01308	0.0186	0.0134	0.01389	0.003532	24.99	23.41	158.8	1956.0	0.1238	0.1866	0.2416	0.186	0.275	0.08902	nan
2	84300903.0	M	19.69	21.25	130.0	1203.0	0.1096	0.1599	0.1974	0.1279	0.2069	0.05999	0.7456	0.7869	4.585	94.03	0.00615	0.04006	0.03832	0.02058	0.0225	0.004571	23.57	25.53	152.5	1709.0	0.1444	0.4245	0.4504	0.243	0.3613	0.08758	nan
3	84348301.0	M	11.42	20.38	77.58	386.1	0.1425	0.2839	0.2414	0.1052	0.2597	0.09744	0.4956	1.156	3.445	27.23	0.00911	0.07458	0.05661	0.01867	0.05963	0.009208	14.91	26.5	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.173	nan
4	84358402.0	M	20.29	14.34	135.1	1297.0	0.1003	0.1328	0.198	0.1043	0.1809	0.05883	0.7572	0.7813	5.438	94.44	0.01149	0.02461	0.05688	0.01885	0.01756	0.005115	22.54	16.67	152.2	1575.0	0.1374	0.205	0.4	0.1625	0.2364	0.07678	nan
5	843786.0	M	12.45	15.7	82.57	477.1	0.1278	0.17	0.1578	0.08089	0.2087	0.07613	0.3345	0.8902	2.217	27.19	0.00751	0.03345	0.03672	0.01137	0.02165	0.005082	15.47	23.75	103.4	741.6	0.1791	0.5249	0.5355	0.1741	0.3985	0.1244	nan
6	844359.0	M	18.25	19.98	119.6	1040.0	0.09463	0.109	0.1127	0.074	0.1794	0.05742	0.4467	0.7732	3.18	53.91	0.004314	0.01382	0.02254	0.01039	0.01369	0.002179	22.88	27.66	153.2	1606.0	0.1442	0.2576	0.3784	0.1932	0.3063	0.08368	nan
7	84458202.0	M	13.71	20.83	90.2	577.9	0.1189	0.1645	0.09366	0.05985	0.2196	0.07451	0.5835	1.377	3.856	50.96	0.008805	0.03029	0.02488	0.01448	0.01486	0.005412	17.06	28.14	110.6	897.0	0.1654	0.3682	0.2678	0.1556	0.3196	0.1151	nan
8	844981.0	M	13.0	21.82	87.5	519.8	0.1273	0.1932	0.1859	0.09353	0.235	0.07389	0.3063	1.002	2.406	24.32	0.005731	0.03502	0.03553	0.01226	0.02143	0.003749	15.49	30.73	106.2	739.3	0.1703	0.5401	0.539	0.206	0.4378	0.1072	nan
9	84501001.0	M	12.46	24.04	83.97	475.9	0.1186	0.2396	0.2273	0.08543	0.203	0.08243	0.2976	1.599	2.039	23.94	0.007149	0.07217	0.07743	0.01432	0.01789	0.01008	15.09	40.68	97.65	711.4	0.1853	1.058	1.105	0.221	0.4366	0.2075	nan

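The dataset consists of 569 rows and 33 columns. The summary shows the data type, minimum and maximum values, and other statistics for each column. Note that the last column, C33, is missing in all 569 rows, so it carries no information.

4. Exploring the data

Let's explore the data using the functionality H2O provides.

Group the rows by the diagnosis column to count the benign and malignant tumors.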

In [5]:

df_group = data_df.group_by("diagnosis").count()
df_group.get_frame()

diagnosis	nrow
B	357
M	212

There are 357 benign (B) and 212 malignant (M) samples.
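The two counts also give the class balance, which is a useful reference point when reading accuracy figures later. The following sketch uses only the two numbers from the group_by output above; the variable names are illustrative.

```python
# Class counts taken from the group_by output above
counts = {"B": 357, "M": 212}
total = sum(counts.values())

# Share of each class in the full dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Under these counts, a trivial classifier that always predicts benign would already be right for roughly 63% of the rows, so a model should be judged against that baseline rather than against 50%.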

Next, we visualize how benign and malignant tumors are distributed over each feature.

In [6]:

features = [f for f in data_df.columns if f not in ["id", "diagnosis", "C33"]]

# Split the rows by diagnosis and convert them to pandas DataFrames for plotting
t0 = data_df[data_df["diagnosis"] == "M"].as_data_frame()
t1 = data_df[data_df["diagnosis"] == "B"].as_data_frame()

# Draw one kernel density panel per feature on a 6 x 5 grid
fig, axes = plt.subplots(6, 5, figsize=(16, 24))
for ax, feature in zip(axes.ravel(), features):
    # bw is the bandwidth argument in the seaborn version used here
    sns.kdeplot(t0[feature], bw=0.5, label="Malignant", ax=ax)
    sns.kdeplot(t1[feature], bw=0.5, label="Benign", ax=ax)
    ax.set_xlabel(feature, fontsize=12)
    ax.tick_params(axis="both", which="major", labelsize=12)
plt.show()

[Figure: 6 x 5 grid of kernel density plots, one per feature, comparing malignant and benign tumors]

The figure above shows which features separate benign from malignant tumors. These features are listed below.

radius_mean
texture_mean
perimeter_mean
area_mean
radius_worst
texture_worst
perimeter_worst
area_worst

However, some features hardly separate benign from malignant tumors at all. The following features show no clear difference between the two tumor types.

compactness_se
concavity_se
concave points_se
symmetry_se
smoothness_se

Next, we draw a heat map of the correlations between the features.

In [7]:

plt.figure(figsize=(16, 16))
corr = data_df[features].cor().as_data_frame()
corr.index = features
sns.heatmap(
    corr,
    annot=True,
    cmap="coolwarm",
    linecolor="white",
    vmin=-1,
    vmax=1,
    cbar_kws={"orientation": "vertical"},
)
plt.title("Correlation Heatmap", fontsize=14)
plt.show()

Some of the features are closely related to each other:

radius_mean and perimeter_mean
radius_mean and texture_mean
perimeter_worst and radius_worst
perimeter_worst and area_worst
area_se and perimeter_se
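To make "closely related" concrete, the sketch below computes a Pearson correlation coefficient by hand for radius_mean and perimeter_mean. It uses only the ten sample rows printed by describe() in section 3, so the value is illustrative rather than the full-dataset figure, and it assumes the heat map shows Pearson correlation. `pearson` is a helper written for this example.

```python
from math import sqrt

# radius_mean and perimeter_mean for the ten rows printed by describe()
radius = [17.99, 20.57, 19.69, 11.42, 20.29, 12.45, 18.25, 13.71, 13.0, 12.46]
perimeter = [122.8, 132.9, 130.0, 77.58, 135.1, 82.57, 119.6, 90.2, 87.5, 83.97]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


r = pearson(radius, perimeter)
print(f"r = {r:.3f}")
```

A value close to 1 is expected here, since the perimeter of a roughly round nucleus grows almost linearly with its radius. Strongly correlated pairs like this carry largely redundant information.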

5. Building a predictive model

5.1. Splitting the data

Split the data into training, validation, and test sets, using 60%, 20%, and 20% of the rows respectively.

In [8]:

train_df, valid_df, test_df = data_df.split_frame(ratios=[0.6, 0.2], seed=2018)
target = "diagnosis"
train_df[target] = train_df[target].asfactor()
valid_df[target] = valid_df[target].asfactor()
test_df[target] = test_df[target].asfactor()
print(
    "Number of rows in the training, validation, and test sets: ",
    train_df.shape[0],
    valid_df.shape[0],
    test_df.shape[0],
)

Number of rows in the training, validation, and test sets:  344 124 101
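The printed sizes (344, 124, 101) are not exactly 60%, 20%, and 20% of 569 rows. The H2O documentation for split_frame describes the split as probabilistic rather than exact: rows are assigned to partitions at random according to the ratios. The sketch below imitates that idea in plain Python. It illustrates the principle only, is not H2O's implementation, and will not reproduce H2O's exact sizes; `probabilistic_split` is a hypothetical helper.

```python
import random


def probabilistic_split(n_rows, ratios, seed):
    """Assign each row index to a partition by an independent random draw.

    `ratios` are the fractions of the first partitions; the last partition
    receives the remainder (here 1 - 0.6 - 0.2 = 0.2).
    """
    rng = random.Random(seed)
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        cumulative = 0.0
        for k, ratio in enumerate(ratios):
            cumulative += ratio
            if u < cumulative:
                parts[k].append(row)
                break
        else:
            parts[-1].append(row)  # no ratio matched: remainder partition
    return parts


train, valid, test = probabilistic_split(569, [0.6, 0.2], seed=2018)
sizes = [len(train), len(valid), len(test)]
exact = [round(569 * r, 1) for r in (0.6, 0.2, 0.2)]
print("random sizes:", sizes, "exact 60/20/20 would be:", exact)
```

Every row lands in exactly one partition, but the partition sizes fluctuate around the requested fractions from seed to seed.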

5.2. Training a GBM (Gradient Boosting Machine) model

Train a machine learning model using GBM.

In [9]:

predictors = features
gbm = H2OGradientBoostingEstimator()
gbm.train(x=predictors, y=target, training_frame=train_df)

gbm Model Build progress: |███████████████████████████████████████████████| 100%
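For readers new to gradient boosting, the core idea is that each new tree is fitted to the errors left by the trees before it, and its prediction is added with a small learning rate. The sketch below shows this for squared error with one-split "stumps" on made-up one-dimensional data. It is a simplified illustration and not H2O's implementation (H2O's GBM grows deeper trees and uses a classification loss for this task); `fit_stump`, `boost`, and `mse` are names invented for the example, and n_trees=50 simply mirrors the 50 trees reported in the model summary in section 5.3.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Return the one-split regression stump that best fits the residuals.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Splitting at the largest x would leave the right side empty, so skip it
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = mean(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Made-up toy data, not taken from the breast cancer dataset
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 0.9, 1.1, 1.0, 3.1, 2.9, 3.0, 3.2]


def mse(model):
    """Mean squared error of a model on the toy data."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


baseline = mse(lambda x: mean(ys))  # constant prediction
model = boost(xs, ys)
print("baseline MSE:", baseline, "boosted MSE:", mse(model))
```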

5.3. Evaluating the model

Let's inspect the trained model. First, print a summary of the model.

In [10]:

gbm.summary()

Model Summary:

		number_of_trees	number_of_internal_trees	model_size_in_bytes	min_depth	max_depth	mean_depth	min_leaves	max_leaves	mean_leaves
0		50.0	50.0	9916.0	4.0	5.0	4.98	7.0	14.0	11.18

This shows that the model consists of 50 trees (and 50 internal trees). The tree depth ranges from 4 to 5, and the number of leaves per tree ranges from 7 to 14.

Next, let's check the model's performance on the validation set.

In [11]:

print(gbm.model_performance(valid_df))

ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.013297212568117986
RMSE: 0.11531354026357003
LogLoss: 0.050535167368489856
Mean Per-Class Error: 0.012820512820512775
AUC: 0.9987933634992459
pr_auc: 0.9719453672942044
Gini: 0.9975867269984917

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.5851063290701174:

		B	M	Error	Rate
0	B	85.0	0.0	0.0	(0.0/85.0)
1	M	1.0	38.0	0.0256	(1.0/39.0)
2	Total	86.0	38.0	0.0081	(1.0/124.0)

Maximum Metrics: Maximum metrics at their respective thresholds

	metric	threshold	value	idx
0	max f1	0.585106	0.987013	34.0
1	max f2	0.149309	0.979899	39.0
2	max f0point5	0.585106	0.994764	34.0
3	max accuracy	0.585106	0.991935	34.0
4	max precision	0.995426	1.000000	0.0
5	max recall	0.149309	1.000000	39.0
6	max specificity	0.995426	1.000000	0.0
7	max absolute_mcc	0.585106	0.981341	34.0
8	max min_per_class_accuracy	0.585106	0.974359	34.0
9	max mean_per_class_accuracy	0.585106	0.987179	34.0
10	max tns	0.995426	85.000000	0.0
11	max fns	0.995426	38.000000	0.0
12	max fps	0.002134	85.000000	107.0
13	max tps	0.149309	39.000000	39.0
14	max tnr	0.995426	1.000000	0.0
15	max fnr	0.995426	0.974359	0.0
16	max fpr	0.002134	1.000000	107.0
17	max tpr	0.149309	1.000000	39.0

Gains/Lift Table: Avg response rate: 31.45 %, avg score: 31.14 %

	group	cumulative_data_fraction	lower_threshold	lift	cumulative_lift	response_rate	score	cumulative_response_rate	cumulative_score	capture_rate	cumulative_capture_rate	gain	cumulative_gain
0	1	0.016129	0.995336	3.179487	3.179487	1.000000	0.995382	1.000000	0.995382	0.051282	0.051282	217.948718	217.948718
1	2	0.048387	0.995332	3.179487	3.179487	1.000000	0.995332	1.000000	0.995348	0.102564	0.153846	217.948718	217.948718
2	3	0.056452	0.995242	3.179487	3.179487	1.000000	0.995250	1.000000	0.995334	0.025641	0.179487	217.948718	217.948718
3	4	0.104839	0.994650	3.179487	3.179487	1.000000	0.995004	1.000000	0.995182	0.153846	0.333333	217.948718	217.948718
4	5	0.153226	0.993092	3.179487	3.179487	1.000000	0.993745	1.000000	0.994728	0.153846	0.487179	217.948718	217.948718
5	6	0.201613	0.989837	3.179487	3.179487	1.000000	0.991822	1.000000	0.994031	0.153846	0.641026	217.948718	217.948718
6	7	0.298387	0.595463	3.179487	3.179487	1.000000	0.910300	1.000000	0.966875	0.307692	0.948718	217.948718	217.948718
7	8	0.403226	0.028560	0.489152	2.480000	0.153846	0.191962	0.780000	0.765397	0.051282	1.000000	-51.084813	148.000000
8	9	0.500000	0.005443	0.000000	2.000000	0.000000	0.010610	0.629032	0.619309	0.000000	1.000000	-100.000000	100.000000
9	10	0.596774	0.004044	0.000000	1.675676	0.000000	0.004568	0.527027	0.519622	0.000000	1.000000	-100.000000	67.567568
10	11	0.701613	0.003461	0.000000	1.425287	0.000000	0.003730	0.448276	0.442534	0.000000	1.000000	-100.000000	42.528736
11	12	0.814516	0.003102	0.000000	1.227723	0.000000	0.003131	0.386139	0.381627	0.000000	1.000000	-100.000000	22.772277
12	13	0.895161	0.003047	0.000000	1.117117	0.000000	0.003081	0.351351	0.347524	0.000000	1.000000	-100.000000	11.711712
13	14	1.000000	0.002134	0.000000	1.000000	0.000000	0.002776	0.314516	0.311381	0.000000	1.000000	-100.000000	0.000000

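The confusion matrix shows that only one sample was misclassified: one malignant tumor was predicted as benign, out of 124 validation samples. The model reaches an AUC of 0.9988, a Gini coefficient of 0.9976, and a LogLoss of 0.051.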
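Several of the reported numbers can be re-derived from the confusion matrix with a few lines of arithmetic. The sketch below treats M (malignant) as the positive class and uses the standard definitions of accuracy, precision, recall, and F1, plus the relation Gini = 2 * AUC - 1. The counts and the AUC are copied from the output above; the variable names are illustrative.

```python
# Counts from the confusion matrix above, with M (malignant) as the positive class
tn, fp = 85, 0   # actual B: predicted B, predicted M
fn, tp = 1, 38   # actual M: predicted B, predicted M

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Average of the per-class error rates (B error rate and M error rate)
mean_per_class_error = (fp / (fp + tn) + fn / (fn + tp)) / 2

auc = 0.9987933634992459  # AUC reported above
gini = 2 * auc - 1        # Gini coefficient derived from AUC

print(f"accuracy={accuracy:.6f} precision={precision:.6f} recall={recall:.6f}")
print(f"f1={f1:.6f} mean_per_class_error={mean_per_class_error:.6f} gini={gini:.6f}")
```

These values agree with the max accuracy (0.991935), max f1 (0.987013), Mean Per-Class Error, and Gini entries in the report, which confirms how the table should be read.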

With results this good on the validation set, we do not tune the model any further. We can now use the test set for prediction. Before doing so, let's look at the variable importance plot of the model.

In [12]:

gbm.varimp_plot()

The plot shows that the most important variables are perimeter_worst, concave points_mean, radius_worst, and concave points_worst.

Now let's use the model to make predictions.

5.4. Predicting

In [13]:

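# Convert predictions and labels to NumPy arrays so that == compares element-wise
# (comparing two Python lists with == would return a single bool).
pred_val = gbm.predict(test_df[predictors])["predict"].as_data_frame()["predict"].values
true_val = test_df[target].as_data_frame()[target].values
prediction_acc = np.mean(pred_val == true_val)
print("Prediction accuracy: ", prediction_acc)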

gbm prediction progress: |████████████████████████████████████████████████| 100%
Prediction accuracy:  1.0

The printed accuracy is 1.0, meaning that 100% of the test-set predictions matched the true labels.
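An accuracy computed as the mean of `pred == true` is only meaningful when the comparison is element-wise. With plain Python lists, `==` compares the two lists as a whole and returns a single boolean, so its mean can only ever be 0.0 or 1.0. The sketch below contrasts the two behaviours on made-up labels (they are not the notebook's predictions), so any reported accuracy of exactly 1.0 is worth checking this way.

```python
# Made-up labels for illustration only (not the notebook's predictions)
true_labels = ["M", "B", "B", "M", "B"]
pred_labels = ["M", "B", "M", "M", "B"]

# Comparing two lists with == gives ONE boolean for the whole list
whole_list_equal = pred_labels == true_labels

# Element-wise comparison gives one boolean per row; their mean is the accuracy
matches = [p == t for p, t in zip(pred_labels, true_labels)]
accuracy = sum(matches) / len(matches)

print(whole_list_equal)  # False
print(accuracy)          # 0.8
```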

6. References

[1] Breast Cancer Wisconsin (Diagnostic) Data Set, https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
[2] SRK, Getting started with H2O, https://www.kaggle.com/sudalairajkumar/getting-started-with-h2o