96well 모양으로 시각화하기

Taeyoon Kim

2019-03-07 06:07

논문을 읽다 보면 위의 그림의 A 처럼 96well 형태로 시각화한것을 볼 수 있습니다. 가만히 생각해보면 위의 그림은 산점도(scatter plot)을 그리고 Binding 값을 점의 크기로 표현하면 될 것같습니다. 오늘은 그런 그림을 파이썬으로 그려봅니다.

필요한 라이브러리 불러오기¶

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from IPython.display import set_matplotlib_formats

set_matplotlib_formats("retina")

데이터 불러오기¶

시각화할 데이터를 pandas로 불러옵니다.

In [2]:

df = pd.read_csv("../data/20190213_test.csv")
df.tail()

Out[2]:

	HC	LC	Binding
44	DC	NL	8425
45	DC	CC	2700
46	DC	NC	20775
47	DC	DL	14750
48	DC	DC	8425

데이터프레임을 출력해보면 48개의 데이터가 있고, HC, LC, Binding 행으로 구성되어 있습니다. Binding 행에 대해 기술 통계분석을 해봅니다.

In [3]:

df.describe()

Out[3]:

	Binding
count	49.000000
mean	13840.306122
std	8812.934115
min	2700.000000
25%	2700.000000
50%	14750.000000
75%	20775.000000
max	26800.000000

위의 결과를 통해 Binding의 최대값은 26800, 최소값은 2700, 평균은 13840인것을 알 수 있습니다.

Seaborn 사용하기¶

seaborn 라이브러리를 사용해 시각화해봅니다.

In [4]:

plt.ylim(-1, 7)
plt.xlim(-1, 7)
ax = sns.scatterplot(
    x="LC", y="HC", hue="Binding", size="Binding", data=df, sizes=(100, 500)
)
# Put the legend out of the figure
ax.legend(loc="upper left", bbox_to_anchor=(1.04, 1), ncol=1)

Out[4]:

<matplotlib.legend.Legend at 0x186e5198>

No description has been provided for this image

똑같지는 않지만 비슷한 그래프를 그릴 수 있습니다. 게다가 코드도 간단합니다.

Matplotlib 사용하기¶

matplotlib을 사용해 구현해봅니다. seaborn보다는 복잡하지만 좀 더 유연하게 사용할 수 있습니다.

In [5]:

plt.scatter(
    df["HC"].values,
    df["LC"].values,
    c=df["Binding"].values,
    s=df["Binding"].values / 50,
    cmap="Purples",
)
plt.ylim(-1, 7)
plt.xlim(-1, 7)

g1 = plt.scatter([], [], s=0, marker="o", color="#555555")
g2 = plt.scatter([], [], s=200, marker="o", color="#555555")
g3 = plt.scatter([], [], s=300, marker="o", color="#555555")
g4 = plt.scatter([], [], s=400, marker="o", color="#555555")

plt.legend(
    (g1, g2, g3, g4),
    ("0", "10000", "20000", "30000"),
    scatterpoints=1,
    loc="upper left",
    ncol=1,
    bbox_to_anchor=(1.04, 1),
)

plt.show()

주택가격 예측하기

Taeyoon Kim

2019-02-22 07:07

필요한 라이브러리 불러오기¶

In [1]:

import pandas as pd
import numpy as np
import os
import tarfile
from six.moves import urllib
import matplotlib.pyplot as plt

# 항상 동일한 결과를 얻기 위해 random.seed 값을 설정한다
np.random.seed(42)

# 주피터 노트북 안에 그림이 나오도록 설정
%matplotlib inline

0. 분석할 데이터셋¶

아래의 코드는 분석할 데이터셋을 로컬 디스크에 다운로드 합니다.

In [2]:

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("input", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()


fetch_housing_data()

1. `pandas`로 EDA 하기¶

EDA는 Exploratory data analysis의 준말로, 데이터셋을 둘러보면 어떠한 특징(feature)들을 가졌는지 확인해봄니다.

1.1. 다운로드한 CSV파일을 불러와서 표 출력하기¶

In [27]:

housing = pd.read_csv("./input/housing/housing.csv")
housing.tail()

Out[27]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
20635	-121.09	39.48	25.0	1665.0	374.0	845.0	330.0	1.5603	78100.0	INLAND
20636	-121.21	39.49	18.0	697.0	150.0	356.0	114.0	2.5568	77100.0	INLAND
20637	-121.22	39.43	17.0	2254.0	485.0	1007.0	433.0	1.7000	92300.0	INLAND
20638	-121.32	39.43	18.0	1860.0	409.0	741.0	349.0	1.8672	84700.0	INLAND
20639	-121.24	39.37	16.0	2785.0	616.0	1387.0	530.0	2.3886	89400.0	INLAND

총 20639 개의 개별 데이터가 있는것과 좌표를 뜻하는 longitude, latitude 행이 앞쪽에 있는 것을 알 수 있습니다. 또한, 집에 대한 정보들이 여러 행으로 구분되어 있습니다.

1.2. 기초 통계분석 하기¶

위에서 살펴본 표는 대부분 수치(숫자)형으로 나와있음으로, 간단하게 describe()함수로 기초 통계분석을 할 수 있습니다.

In [4]:

housing.describe()

Out[4]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	20640.000000	20640.000000	20640.000000	20640.000000	20433.000000	20640.000000	20640.000000	20640.000000	20640.000000
mean	-119.569704	35.631861	28.639486	2635.763081	537.870553	1425.476744	499.539680	3.870671	206855.816909
std	2.003532	2.135952	12.585558	2181.615252	421.385070	1132.462122	382.329753	1.899822	115395.615874
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000
25%	-121.800000	33.930000	18.000000	1447.750000	296.000000	787.000000	280.000000	2.563400	119600.000000
50%	-118.490000	34.260000	29.000000	2127.000000	435.000000	1166.000000	409.000000	3.534800	179700.000000
75%	-118.010000	37.710000	37.000000	3148.000000	647.000000	1725.000000	605.000000	4.743250	264725.000000
max	-114.310000	41.950000	52.000000	39320.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000

기초 통계분석을 통해 평균값(mean), 표준편차(std), 최소값(min), 최대값(max), 각종 백분위수(25%, 50% 75%) 등을 알수 있습니다.

1.3. 히스토그램 그리기¶

간단한 히스토그램을 그려 전체 데이터의 모양을 어떻게 생겼는지 확인합니다.

In [5]:

fig = housing.hist(bins=50, figsize=(20, 15))

1.4. 지리적 데이터 시각화¶

사용한 데이터셋에는 각각의 위도 경도값이 들어있습니다. 이것을 이용해 지리적 정보를 시각화 합니다.

In [6]:

import matplotlib.image as mpimg

california_img = mpimg.imread("./input/housing/california.png")
ax = housing.plot(
    kind="scatter",
    x="longitude",
    y="latitude",
    figsize=(10, 7),
    s=housing["population"] / 100,
    label="population",
    c="median_house_value",
    cmap=plt.get_cmap(),
    colorbar=False,
    alpha=0.5,
)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.6)
prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk" % (round(v / 1000)) for v in tick_values], fontsize=14)
plt.show()

1.5. 상관관계 조사¶

숫자형 특성이 11개이므로 만약 모든 상관관계를 조사한다면 총 121개의 그래프가 그려집니다. 그렇게 많은 그래프를 하나로 표현하면 오히려 더 알아보기 힘들기 때문에 여기에서는 앞서 히스토그램에서 상관관계가 높아보이는 특성 몇 개만 추려서 시각화 합니다.

In [7]:

from pandas.tools.plotting import scatter_matrix

corr_matrix = housing.corr()
attributes = [
    "median_house_value",
    "median_income",
    "total_rooms",
    "housing_median_age",
]
fig = scatter_matrix(housing[attributes], figsize=(10, 10), alpha=0.2)

결과를 보면 중간 소득(median_income)이 중간 주택 가격(median_house_value)과 가장 상관 관계가 있어보입니다.

좀더 크게 산점도를 그려보겠습니다.

In [8]:

housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.axis([0, 16, 0, 550000])

Out[8]:

[0, 16, 0, 550000]

2. 데이터 전처리¶

2.1. 특성 추가하기¶

기계학습의 성능을 높이기 위해 기존의 특성을 가지고 추가특성을 만듭니다. 먼저 heatmap을 그려 특성간의 상관관계를 확인합니다.

In [41]:

import seaborn as sns

corr = housing.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 7))

# Draw the heatmap with the mask and correct aspect ratio

with sns.axes_style("white"):
    sns.heatmap(corr, vmax=0.3, cmap="YlGnBu", square=True, linewidths=0.3)

위의 결과를 통해 total_rooms, households, total_bedrooms, population간의 높은 상관관계가 있음을 알 수 있습니다. 다음과 같은 특성을 추가합니다.

rooms_per_household
bedrooms_per_room
population_per_household

In [59]:

housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

rooms_per_household와 평균 주택 가격의 산점도를 그려봅니다.

In [10]:

housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value", alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()

2.2. one-hot encoding¶

데이터셋의 ocean_proximity행은 수치형 데이터가 아닙니다. 따라서 기계학습에 사용하기 위해 get_dummies기능을 사용해 one-hot encoding을 수행합니다.

In [60]:

df = pd.get_dummies(data=housing, columns=["ocean_proximity"])

2.3. 결측치의 처리¶

데이터셋의 결측값을 각각의 행의 평균값으로 치환해주는 작업입니다.

In [61]:

from sklearn.preprocessing import Imputer

imputer = Imputer(strategy="median")

# housing_num = housing.drop('ocean_proximity', axis=1)
# 다른 방법: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(df)
X = imputer.transform(df)
housing_tr = pd.DataFrame(X, columns=df.columns)
housing_tr.tail()

Out[61]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	rooms_per_household	bedrooms_per_room	population_per_household	ocean_proximity_INLAND
20635	-121.09	39.48	25.0	1665.0	374.0	845.0	330.0	1.5603	78100.0	5.045455	0.224625	2.560606	1.0
20636	-121.21	39.49	18.0	697.0	150.0	356.0	114.0	2.5568	77100.0	6.114035	0.215208	3.122807	1.0
20637	-121.22	39.43	17.0	2254.0	485.0	1007.0	433.0	1.7000	92300.0	5.205543	0.215173	2.325635	1.0
20638	-121.32	39.43	18.0	1860.0	409.0	741.0	349.0	1.8672	84700.0	5.329513	0.219892	2.123209	1.0
20639	-121.24	39.37	16.0	2785.0	616.0	1387.0	530.0	2.3886	89400.0	5.254717	0.221185	2.616981	1.0

2.4. 데이터셋 레이블 분리하기¶

기계학습을 위해 데이터셋의 평균주택가격(median_house_value)행을 분리해 레이블로 사용합니다.

In [62]:

housing_labels = housing_tr["median_house_value"].copy()  # 레이블로 사용
housing_tr.drop("median_house_value", axis=1, inplace=True)  # 레이블 삭제

2.5. 데이터셋 정규화¶

기계학습의 성능을 높이기 위해 데이터셋을 정규화합니다.

In [63]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# scaler.fit(housing_tr)
scaled_df = scaler.fit_transform(housing_tr)
scaled_df.shape

Out[63]:

(20640, 16)

총 데이터의 숫자는 20640이고 16개의 특성으로 구성된 배열임을 알 수 있습니다.

2.6. 학습용, 확인용 데이터 나누기¶

무작위 샘플링을 통해 데이터를 학습용과 확인용으로 나눕니다. 학습용을 80% 확인용은 20% 비율로 분리합니다.

In [64]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    scaled_df, housing_labels, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

(16512, 16) (4128, 16)

학습용은 16512개 확인용은 4128개 입니다.

3. 기계학습 하기¶

3.1. 선형회귀 모델¶

선형회귀를 사용한 기계학습으로 평균주택 가격을 예측합니다.

In [68]:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression(n_jobs=-1)  # -1 means use all cpu core
lin_reg.fit(X_train, y_train)

Out[68]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)

성능의 평가를 위해 예측값과 실제값의 RMS를 구합니다.

In [70]:

from sklearn.metrics import mean_squared_error

y_pred = lin_reg.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)

69136.3530563

대부분의 주택의 중간가격이 120,000 ~ 265,000$ 인데 오차가 약 70,000$ 인것은 만족스럽지 못합니다. 이제 다른 모델을 사용해 기계학습을 해봅니다.

3.2. 의사결정트리 모델¶

DecisionTreeRegressor 모델을 사용해보고 성능은 동일하게 RMS를 구합니다.

In [73]:

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(n_jobs=-1)
tree_reg.fit(X_train, y_train)

y_pred = tree_reg.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)

69856.8366586

오히려 더 나쁜 결과가 나왔습니다. 이것은 의사결정트리 모델이 과접합(overfitting)되었기 때문입니다. 이제 랜덤 포레스트 모델을 사용해봅니다.

3.3 랜덤 포레스트 모델¶

성능의 평가는 RMS를 측정합니다.

In [74]:

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_jobs=-1)
forest_reg.fit(X_train, y_train)

y_pred = forest_reg.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)

52495.5185404

가장 나은 성능을 보여주긴 하지만 그리 만족스럽지는 못합니다. 다른 모델인 서포트 벡터 머신(support vector machine, SVM)도 사용해 보겠습니다.

3.4. 서포트 벡터 머신¶

In [77]:

from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(X_train, y_train)

y_pred = svm_reg.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)

106532.353525

가장 나쁜 수치를 보여주네요. 지금 까지 알아본 모델중에 랜덤 포레스트가 가장 나은 성능을 보입니다.

4. 기계학습 최적화하기¶

랜덤 포레스트 모델의 매개변수(parameter)를 변경해 성능을 끌어올려 보겠습니다.

4.1 GridSearch¶

변경하고 싶은 매개변수의 목록을 만들어 각각의 성능을 확인합니다.

이것은 시간이 많이 걸리는 작업 입니다.

In [78]:

from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42, n_jobs=-1)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(
    forest_reg,
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)
grid_search.fit(X_train, y_train)

Out[78]:

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)

In [86]:

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

63746.5574472 {'max_features': 2, 'n_estimators': 3}
54818.1369471 {'max_features': 2, 'n_estimators': 10}
52761.066563 {'max_features': 2, 'n_estimators': 30}
60211.8968318 {'max_features': 4, 'n_estimators': 3}
52034.9078927 {'max_features': 4, 'n_estimators': 10}
49983.1654019 {'max_features': 4, 'n_estimators': 30}
57632.991143 {'max_features': 6, 'n_estimators': 3}
51015.4104325 {'max_features': 6, 'n_estimators': 10}
49392.4163034 {'max_features': 6, 'n_estimators': 30}
58940.3719018 {'max_features': 8, 'n_estimators': 3}
51942.3176781 {'max_features': 8, 'n_estimators': 10}
49959.1368802 {'max_features': 8, 'n_estimators': 30}
62059.1461145 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
53952.4642105 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
58047.57318 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
51474.6992514 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59148.5539952 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51707.4776245 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

결과를 표로 출력합니다.

In [87]:

cv = pd.DataFrame(grid_search.cv_results_)
cv.head()

Out[87]:

	mean_fit_time	mean_score_time	mean_test_score	mean_train_score	param_bootstrap	param_max_features	param_n_estimators	params	rank_test_score	split0_test_score	...	split2_test_score	split2_train_score	split3_test_score	split3_train_score	split4_test_score	split4_train_score	std_fit_time	std_score_time	std_test_score	std_train_score
0	0.133629	0.103405	-4.063624e+09	-1.094790e+09	NaN	2	3	{'max_features': 2, 'n_estimators': 3}	18	-3.912718e+09	...	-4.054892e+09	-1.131086e+09	-4.054280e+09	-1.092380e+09	-4.397559e+09	-1.104997e+09	0.000421	0.000162	1.797902e+08	2.398293e+07
1	0.145697	0.106882	-3.005028e+09	-5.738667e+08	NaN	2	10	{'max_features': 2, 'n_estimators': 10}	11	-2.965064e+09	...	-3.036337e+09	-5.956773e+08	-2.967483e+09	-5.805910e+08	-3.118674e+09	-5.775896e+08	0.000452	0.000160	6.549213e+07	1.593803e+07
2	0.272286	0.116512	-2.783730e+09	-4.358046e+08	NaN	2	30	{'max_features': 2, 'n_estimators': 30}	9	-2.783921e+09	...	-2.813139e+09	-4.470262e+08	-2.722224e+09	-4.370232e+08	-2.861313e+09	-4.341716e+08	0.001962	0.000833	5.049988e+07	8.623569e+06
3	0.133654	0.103288	-3.625473e+09	-9.618490e+08	NaN	4	3	{'max_features': 4, 'n_estimators': 3}	16	-3.798650e+09	...	-3.783640e+09	-9.712237e+08	-3.299407e+09	-9.286709e+08	-3.630501e+09	-9.372133e+08	0.000599	0.000177	1.796931e+08	2.476000e+07
4	0.205908	0.106540	-2.707632e+09	-5.113191e+08	NaN	4	10	{'max_features': 4, 'n_estimators': 10}	8	-2.697739e+09	...	-2.769564e+09	-5.220880e+08	-2.671836e+09	-5.201207e+08	-2.693382e+09	-4.987163e+08	0.048003	0.000372	3.292795e+07	1.029301e+07

5 rows × 23 columns

가장 좋은 매개변수의 값은 아래와 같습니다. 이때의 성능(RMS)값은 49392.4163034 입니다.

In [82]:

grid_search.best_params_

Out[82]:

{'max_features': 6, 'n_estimators': 30}

In [83]:

grid_search.best_estimator_

Out[83]:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=6, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=-1, oob_score=False, random_state=42,
           verbose=0, warm_start=False)

4.3. 성능에 가장 영향주는 특성¶

gridsearch에서 가장 좋은 성능의 모델을 가지고 성능에 가장 큰 영향을 주는 특성을 확인해보겠습니다.

In [94]:

feature_importances = grid_search.best_estimator_.feature_importances_
plt.bar(range(len(feature_importances)), feature_importances)

Out[94]:

<Container object of 16 artists>

8번째(0부터 시작하기 때문)의 값이 가장 큰 영향을 주는데, 이것은 평균수입(median_income)에 대한 값입니다.

4.2. RandomizedSearch¶

Gridsearch는 계산시간이 아주 올래걸리기 때문에 최근에는 RandomizedSearch 방법으로 최적화를 많이 진행합니다. 이름에서 알 수 있듯, 사용자가 매개변수의 범위를 지정해주면 무작위로 매개변수를 조합한 성능을 측정합니다.

In [84]:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42, n_jobs=-1)
rnd_search = RandomizedSearchCV(
    forest_reg,
    param_distributions=param_distribs,
    n_iter=10,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X_train, y_train)

Out[84]:

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe9181981d0>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe918104898>},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring='neg_mean_squared_error',
          verbose=0)

In [88]:

cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

49093.7903483 {'max_features': 7, 'n_estimators': 180}
51594.3819541 {'max_features': 5, 'n_estimators': 15}
50107.6001672 {'max_features': 3, 'n_estimators': 72}
50577.2720048 {'max_features': 5, 'n_estimators': 21}
49181.7494258 {'max_features': 7, 'n_estimators': 122}
50131.1870262 {'max_features': 3, 'n_estimators': 75}
50001.880295 {'max_features': 3, 'n_estimators': 88}
49438.7625614 {'max_features': 5, 'n_estimators': 100}
49887.3160204 {'max_features': 3, 'n_estimators': 150}
64777.9799569 {'max_features': 5, 'n_estimators': 2}

가장 좋은 성능의 결과를 확인합니다.

In [95]:

final_model = rnd_search.best_estimator_
y_pred = final_model.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)

48434.3695832

5. 마치며¶

기계학습으로 캘리포니아의 주택가격을 예측해보는 것을 살펴보았습니다. 결론적으로 주택가격에 가장 영향을 주는 특성은 평균소득이었으며, 예측 오차는 약 48434$ 입니다.

필요한 라이브러리 불러오기¶

데이터 불러오기¶

Seaborn 사용하기¶

Matplotlib 사용하기¶

필요한 라이브러리 불러오기¶

0. 분석할 데이터셋¶

1. pandas로 EDA 하기¶

1.1. 다운로드한 CSV파일을 불러와서 표 출력하기¶

1.2. 기초 통계분석 하기¶

1.3. 히스토그램 그리기¶

1.4. 지리적 데이터 시각화¶

1.5. 상관관계 조사¶

2. 데이터 전처리¶

2.1. 특성 추가하기¶

2.2. one-hot encoding¶

2.3. 결측치의 처리¶

2.4. 데이터셋 레이블 분리하기¶

2.5. 데이터셋 정규화¶

2.6. 학습용, 확인용 데이터 나누기¶

3. 기계학습 하기¶

3.1. 선형회귀 모델¶

3.2. 의사결정트리 모델¶

3.3 랜덤 포레스트 모델¶

3.4. 서포트 벡터 머신¶

4. 기계학습 최적화하기¶

4.1 GridSearch¶

4.3. 성능에 가장 영향주는 특성¶

4.2. RandomizedSearch¶

5. 마치며¶

1. `pandas`로 EDA 하기¶