edwith Machine Learning Lecture Notes

Taeyoon Kim

2018-05-28 07:27

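0. Before We Begin¶

These are my notes from Professor Sungchul Choi's (최성철) lectures on edwith. All code and content used here are therefore copyrighted by Professor Choi. For more detail, please take the course itself on edwith.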

1. Environment¶

All libraries were installed with Anaconda. The libraries used are listed below.

Jupyter
Numpy
Pandas
Scikit learn

2. Python Code Style¶

This section covers techniques for writing code efficiently in Python. For example, the following code concatenates the items of a list into a single string.

colors = ["red", "blue", "green", "yellow"]
result = ""
for s in colors:
    result += s

Written in a more Pythonic way, the code above becomes:

colors = ["red", "blue", "green", "yellow"]
result = "".join(colors)

This is more concise and easier to read.

2.1. Split & Join¶

Use split and join to convert a string into a list and a list back into a string.

In [ ]:

# split on whitespace
items = "zero one two three".split()
# split on commas
items = "zero,one,two,three".split(",")
# split on "." and unpack the parts
example = "cs50.gachon.edu"
subdomain, domain, tld = example.split(".")
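
The cell above only exercises split. The reverse direction, turning a list back into a string with join, might look like the following minimal sketch. The variable names are illustrative and are not taken from the lecture.

```python
# Join a list of strings back into a single string.
colors = ["red", "blue", "green", "yellow"]
joined_plain = "".join(colors)    # no separator
joined_comma = ", ".join(colors)  # comma and space as the separator

# split and join undo each other when the separator matches
round_trip = "-".join("cs50-gachon-edu".split("-"))

print(joined_plain)
print(joined_comma)
print(round_trip)
```

The separator is the string that join is called on, which is why the call reads "separator first, list second".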

2.2. List Comprehension¶

Next is the list comprehension, one of the most frequently used techniques. For comparison, first look at code that uses a for loop with append().

In [ ]:

result = []
for i in range(10):
    result.append(i)
print(result)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

With a list comprehension, the same code becomes:

In [ ]:

result = [i for i in range(10)]
print(result)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

2.3. Enumerate & Zip¶

This section covers enumerate, which extracts the values of a list together with their indices, and the built-in zip function, which extracts values from two lists in parallel.

2.3.1. Enumerate¶

Extracts the values of a list together with their index numbers.

In [ ]:

my_list = ["a", "b", "c"]
for i, j in enumerate(my_list):
    print(i, j)

0 a
1 b
2 c

In [ ]:

list(enumerate(my_list))  # collect the (index, value) pairs into a list

Out[ ]:

[(0, 'a'), (1, 'b'), (2, 'c')]

2.3.2. Zip¶

Extracts the values of two lists in parallel.

In [ ]:

list_a = ["a1", "a2", "a3"]
list_b = ["b1", "b2", "b3"]
for i in zip(list_a, list_b):
    print(i, end=" ")

('a1', 'b1') ('a2', 'b2') ('a3', 'b3')

2.4. Lambda & Map & Reduce¶

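Lambda: an anonymous function that can be used like a regular function
Map and Reduce: functions that apply a function to the data in a sequence

2.4.1. Lambda¶

The following code can be written more concisely with lambda.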

def f(x, y):
    return x + y
print(f(1, 4))

In [ ]:

# using lambda
f = lambda x, y: x + y

print(f(1, 4))

Personally, I do not like lambda because it hurts readability.

2.4.2. Map & Reduce¶

In [ ]:

# In Python 3, map returns an iterator, so wrap it in list().
ex = [1, 2, 3, 4, 5]
print(list(map(lambda x: x + x, ex)))

[2, 4, 6, 8, 10]

In [ ]:

# Reduce
from functools import reduce

print(reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]))

In [ ]:

# Implement factorial using what we learned above.
def factorial(n):
    return reduce(lambda x, y: x * y, range(1, n + 1))


factorial(10)

Out[ ]:

3628800

2.5. Asterisk¶

This section covers the asterisk (*), which is used in many ways, including multiplication, exponentiation, and variable-length arguments.

In [ ]:

def asterisk_test(a, *args):
    print(a, args)
    print(type(args))


asterisk_test(1, 2, 3, 4, 5, 6)

1 (2, 3, 4, 5, 6)
<class 'tuple'>

In [ ]:

def asterisk_test(a, **kargs):
    print(a, kargs)
    print(type(kargs))


asterisk_test(1, b=2, c=3, d=4, e=5, f=6)

1 {'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}
<class 'dict'>

In [ ]:

def asterisk_test(a, *args):
    print(a, args[0])
    print(type(args))


asterisk_test(1, (2, 3, 4, 5, 6))

1 (2, 3, 4, 5, 6)
<class 'tuple'>

In [ ]:

def asterisk_test(a, args):
    print(a, *args)
    print(type(args))


asterisk_test(1, (2, 3, 4, 5, 6))

1 2 3 4 5 6
<class 'tuple'>

In [ ]:

a, b, c = ([1, 2], [3, 4], [5, 6])
print(a, b, c)

[1, 2] [3, 4] [5, 6]

In [ ]:

data = ([1, 2], [3, 4], [5, 6])
print(*data)

[1, 2] [3, 4] [5, 6]

In [ ]:

for data in zip(*([1, 2], [3, 4], [5, 6])):
    print(sum(data))

9
12

In [ ]:

def asterisk_test(a, b, c, d, e=0):
    print(a, b, c, d, e)


data = {"d": 1, "c": 2, "b": 3, "e": 56}
asterisk_test(10, **data)

10 3 2 1 56

2.6. Collections¶

Using the classes in the collections module, which provides extended data structures for tuple and dict, this section covers the basic concepts of data structures and how to use them.

In [ ]:

from collections import Counter

c = Counter()  # a new, empty counter
c = Counter("gallahad")  # a new counter from an iterable
print(c)

Counter({'a': 3, 'l': 2, 'g': 1, 'h': 1, 'd': 1})
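
The text mentions extensions of both tuple and dict, while the cell above shows only Counter. As a rough sketch of the other commonly used classes in collections (the sample words and the Point type are made up for illustration):

```python
from collections import Counter, defaultdict, namedtuple

# Counter: count items and ask for the most common ones
c = Counter("gallahad")
top_two = c.most_common(2)

# defaultdict: a dict that creates a default value for missing keys
word_lengths = defaultdict(list)
for word in ["red", "blue", "green", "pink"]:
    word_lengths[len(word)].append(word)

# namedtuple: a tuple whose fields can be accessed by name
Point = namedtuple("Point", ["x", "y"])
p = Point(x=11, y=22)

print(top_two)
print(dict(word_lengths))
print(p.x + p.y)
```

defaultdict removes the need to check whether a key already exists before appending, and namedtuple keeps tuple behavior (unpacking, indexing) while adding readable field names.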

3. Solving Linear Algebra Problems¶

This section shows how to handle linear algebra in Python.

In [ ]:

u = [2, 2]
v = [2, 3]
z = [3, 5]

result = [t for t in zip(u, v, z)]
print(result)

[(2, 2, 3), (2, 3, 5)]

In [ ]:

matrix_a = [[3, 6], [4, 5]]
matrix_b = [[5, 8], [6, 7]]
result = [[sum(row) for row in zip(*t)] for t in zip(matrix_a, matrix_b)]

print(result)

[[8, 14], [10, 12]]

In [ ]:

matrix_a = [[1, 2, 3], [4, 5, 6]]
result = [[element for element in t] for t in zip(*matrix_a)]

[t for t in zip(*matrix_a)]
print(result)

[[1, 4], [2, 5], [3, 6]]

In [ ]:

matrix_a = [[1, 1, 2], [2, 1, 1]]
matrix_b = [[1, 1], [2, 1], [1, 3]]
result = [
    [sum(a * b for a, b in zip(row_a, column_b)) for column_b in zip(*matrix_b)]
    for row_a in matrix_a
]
print(result)

[[5, 8], [5, 6]]

Below is the code for the linear algebra assignment, which asks for 12 functions that perform basic vector and matrix operations. The first four are shown here.

In [ ]:

# Problem #1 - vector_size_check
def vector_size_check(*vector_variables):
    return all(len(vector_variables[0]) == len(i) for i in vector_variables)


# Results
print(vector_size_check([1, 2, 3], [2, 3, 4], [5, 6, 7]))  # Expected value: True
print(vector_size_check([1, 3, 4], [4], [6, 7]))  # Expected value: False

True
False

In [ ]:

# Problem #2 - vector_addition
def vector_addition(*vector_variables):
    return [sum(i) for i in zip(*vector_variables)]


# Results
print(vector_addition([1, 3], [2, 4], [6, 7]))  # Expected value: [9, 14]
print(vector_addition([1, 5], [10, 4], [4, 7]))  # Expected value: [15, 16]

[9, 14]
[15, 16]

Thanks to Donghyuk Kim (김동혁, hyukster9@gmail.com) for fixing an error in the code above.

In [ ]:

# Problem #3 - vector_subtraction
def vector_subtraction(*vector_variables):
    if vector_size_check(*vector_variables) is False:
        raise ArithmeticError
    return [i[0] * 2 - sum(i) for i in zip(*vector_variables)]


# Results
print(vector_subtraction([1, 3], [2, 4]))  # Expected value: [-1, -1]
print(vector_subtraction([1, 5], [10, 4], [4, 7]))  # Expected value: [-13, -6]

[-1, -1]
[-13, -6]

In [ ]:

# Problem #4 - scalar_vector_product (one line code available)
def scalar_vector_product(alpha, vector_variable):
    return [alpha * i for i in vector_variable]


# Results
print(scalar_vector_product(5, [1, 2, 3]))  # Expected value: [5, 10, 15]
print(scalar_vector_product(3, [2, 2]))  # Expected value: [6, 6]
print(scalar_vector_product(4, [1]))  # Expected value: [4]

[5, 10, 15]
[6, 6]
[4]

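4. Overview of Machine Learning¶

This chapter covers the terminology and concepts used in machine learning.

Model: a mathematical formula or function used for prediction
- e.g. a linear equation, a probability distribution, a condition rule
Algorithms: the procedure for solving a problem; the procedure for producing a model
Feature: a variable that describes a characteristic of the data
- Continuous feature: temperature, speed, real values in general
- Discrete feature: gender, rank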

5. Working with Data¶

5.1. Using NumPy¶

This section covers the features of numpy, a package for scientific computing, and how to write code with it. First, import the library and create an array.

In [ ]:

import numpy as np

# create an array
test_matrix = np.array([[1, 2, 3, 4], [1, 2, 5, 8]])
test_matrix

Out[ ]:

array([[1, 2, 3, 4],
       [1, 2, 5, 8]])

The shape of an array can be checked as follows.

In [ ]:

test_matrix.shape

Out[ ]:

(2, 4)

5.1.1. Reshaping an Array¶

Flattening an array into a 1D array:

In [ ]:

np.array(test_matrix).reshape(
    8,
)

Out[ ]:

array([1, 2, 3, 4, 1, 2, 5, 8])

Alternatively, flatten() can be used.

In [ ]:

np.array(test_matrix).flatten()

Out[ ]:

array([1, 2, 3, 4, 1, 2, 5, 8])

5.1.2. slicing¶

Selects only a specific region of an array.

In [ ]:

test_matrix[:, 1]  # select the 2nd column

Out[ ]:

array([2, 2])

In [ ]:

test_matrix[1, :]  # select the 2nd row

Out[ ]:

array([1, 2, 5, 8])

5.1.3. Concatenate¶

Joins separate arrays into one.

In [ ]:

a = np.array([[1, 2, 3]])
b = np.array([[2, 3, 4]])
np.concatenate((a, b), axis=0)

Out[ ]:

array([[1, 2, 3],
       [2, 3, 4]])

In [ ]:

np.concatenate((a, b), axis=1)

Out[ ]:

array([[1, 2, 3, 2, 3, 4]])

5.1.4. Changing the Data Type¶

The code below checks the data type and converts it to the desired one.

In [ ]:

test_matrix.dtype

Out[ ]:

dtype('int32')

In [ ]:

# convert to float
test_matrix_float = test_matrix.astype(float)
test_matrix_float.dtype

Out[ ]:

dtype('float64')

5.2. Pandas¶

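This section explains the features of Pandas, often called the Excel of Python, and how to use them. pandas handles data in two structures, Series and DataFrame.

Series: a 1D labeled array
DataFrame: a 2D labeled table

5.2.1. Loading Data¶

External data can be loaded with pandas as follows.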

In [ ]:

import pandas as pd

data_url = "https://www.shanelynn.ie/wp-content/uploads/2015/06/phone_data.csv"
df = pd.read_csv(data_url, index_col=0)
df.tail()

Out[ ]:

	date	duration	item	month	network	network_type
index
825	13/03/15 00:38	1.000	sms	2015-03	world	world
826	13/03/15 00:39	1.000	sms	2015-03	Vodafone	mobile
827	13/03/15 06:58	34.429	data	2015-03	data	data
828	14/03/15 00:13	1.000	sms	2015-03	world	world
829	14/03/15 00:16	1.000	sms	2015-03	world	world

The last index is 829 and the index starts at 0, so 830 rows were loaded in total.

5.2.2. Selecting Data¶

Selecting data by index¶

In [ ]:

df[:3]  # select the first 3 rows

Out[ ]:

	date	duration	item	month	network	network_type
index
0	15/10/14 06:58	34.429	data	2014-11	data	data
1	15/10/14 06:58	13.000	call	2014-11	Vodafone	mobile
2	15/10/14 14:46	23.000	call	2014-11	Meteor	mobile

Selecting data by column¶

Select only the duration and network columns, and take the first 3 rows.

In [ ]:

df[["duration", "network"]][:3]

Out[ ]:

	duration	network
index
0	34.429	data
1	13.000	Vodafone
2	23.000	Meteor

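5.2.3. Manipulating Data¶

The dateutil library makes it easy to parse date data.

In [ ]:

import dateutil.parser  # import the submodule explicitly; a bare `import dateutil` does not guarantee it is loaded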

df["date"] = df["date"].apply(dateutil.parser.parse, dayfirst=True)
df.head()

Out[ ]:

	date	duration	item	month	network	network_type
index
0	2014-10-15 06:58:00	34.429	data	2014-11	data	data
1	2014-10-15 06:58:00	13.000	call	2014-11	Vodafone	mobile
2	2014-10-15 14:46:00	23.000	call	2014-11	Meteor	mobile
3	2014-10-15 14:48:00	4.000	call	2014-11	Tesco	mobile
4	2014-10-15 17:27:00	4.000	call	2014-11	Tesco	mobile

Next, compute the total duration per month.

In [ ]:

df.groupby("month")["duration"].sum()

Out[ ]:

month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

The same result can be obtained with a pivot table.

In [ ]:

df.pivot_table(["duration"], index=df.month, aggfunc="sum")

Out[ ]:

	duration
month
2014-11	26639.441
2014-12	14641.870
2015-01	18223.299
2015-02	15522.299
2015-03	22750.441

5.2.4. Handling NA Values¶

In [ ]:

raw_data = {
    "first_name": ["Jason", np.nan, "Tina", "Jake", "Amy"],
    "last_name": ["Miller", np.nan, "Ali", "Milner", "Cooze"],
    "age": [42, np.nan, 36, 24, 73],
    "sex": ["m", np.nan, "f", "m", "f"],
    "preTestScore": [4, np.nan, np.nan, 2, 3],
    "postTestScore": [25, np.nan, np.nan, 62, 70],
}
df = pd.DataFrame(
    raw_data,
    columns=["first_name", "last_name", "age", "sex", "preTestScore", "postTestScore"],
)
df

Out[ ]:

	first_name	last_name	age	sex	preTestScore	postTestScore
0	Jason	Miller	42.0	m	4.0	25.0
1	NaN	NaN	NaN	NaN	NaN	NaN
2	Tina	Ali	36.0	f	NaN	NaN
3	Jake	Milner	24.0	m	2.0	62.0
4	Amy	Cooze	73.0	f	3.0	70.0

In [ ]:

# drop every row that contains an NA value (thresh=6 keeps only rows with all 6 values present)
df_no_missing = df.dropna(axis=0, thresh=6)
df_no_missing

Out[ ]:

	first_name	last_name	age	sex	preTestScore	postTestScore
0	Jason	Miller	42.0	m	4.0	25.0
3	Jake	Milner	24.0	m	2.0	62.0
4	Amy	Cooze	73.0	f	3.0	70.0

In [ ]:

# drop rows in which every value is NA
df_cleaned = df.dropna(how="all")
df_cleaned

Out[ ]:

	first_name	last_name	age	sex	preTestScore	postTestScore
0	Jason	Miller	42.0	m	4.0	25.0
2	Tina	Ali	36.0	f	NaN	NaN
3	Jake	Milner	24.0	m	2.0	62.0
4	Amy	Cooze	73.0	f	3.0	70.0

In [ ]:

# replace NA values with 0
df.fillna(0)

Out[ ]:

	first_name	last_name	age	sex	preTestScore	postTestScore
0	Jason	Miller	42.0	m	4.0	25.0
1	0	0	0.0	0	0.0	0.0
2	Tina	Ali	36.0	f	0.0	0.0
3	Jake	Milner	24.0	m	2.0	62.0
4	Amy	Cooze	73.0	f	3.0	70.0

5.2.5. One-hot encoding¶

Use the get_dummies() function to perform one-hot encoding.

In [ ]:

# one-hot encode the sex column
pd.concat([df, pd.get_dummies(df["sex"], prefix="sex")], axis=1)

Out[ ]:

	first_name	last_name	age	sex	preTestScore	postTestScore	sex_f	sex_m
0	Jason	Miller	42.0	m	4.0	25.0	0	1
1	NaN	NaN	NaN	NaN	NaN	NaN	0	0
2	Tina	Ali	36.0	f	NaN	NaN	1	0
3	Jake	Milner	24.0	m	2.0	62.0	0	1
4	Amy	Cooze	73.0	f	3.0	70.0	1	0

5.2.6. Feature Scaling¶

This can be implemented with pandas alone, but the preprocessing module of sklearn is more convenient.

In [ ]:

from sklearn import preprocessing

df = df.fillna(0)  # replace NA values with 0 before scaling
minmax_scaler = preprocessing.MinMaxScaler().fit(df[["preTestScore", "postTestScore"]])
df[["preTestScore", "postTestScore"]] = minmax_scaler.transform(
    df[["preTestScore", "postTestScore"]]
)
df

Out[ ]:

	first_name	last_name	age	sex	preTestScore	postTestScore
0	Jason	Miller	42.0	m	1.00	0.357143
1	0	0	0.0	0	0.00	0.000000
2	Tina	Ali	36.0	f	0.00	0.000000
3	Jake	Milner	24.0	m	0.50	0.885714
4	Amy	Cooze	73.0	f	0.75	1.000000

The values of preTestScore and postTestScore have been scaled (normalized) to the range 0 to 1.
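
To make the scaling step less of a black box, here is a minimal NumPy sketch of the min-max formula, x' = (x - min) / (max - min), applied to the same two columns. The helper name min_max_scale is an assumption for illustration and is not part of sklearn.

```python
import numpy as np


def min_max_scale(column):
    """Scale a 1D array to [0, 1] using (x - min) / (max - min)."""
    column = np.asarray(column, dtype=float)
    col_min, col_max = column.min(), column.max()
    return (column - col_min) / (col_max - col_min)


pre = min_max_scale([4, 0, 0, 2, 3])      # preTestScore after fillna(0)
post = min_max_scale([25, 0, 0, 62, 70])  # postTestScore after fillna(0)
print(pre)
print(post.round(6))
```

The results agree with the MinMaxScaler output in the table above. Note that this sketch divides by zero for a constant column, a case a real scaler has to handle separately.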

6. NumPy Assignment Solutions¶

This section works through some of the homework problems.

In [ ]:

import numpy as np

test_matrix = np.array([[1, 2, 3, 4], [1, 2, 5, 8]])


def change_shape_of_ndarray(X, n_row):
    if n_row == 1:
        return X.flatten()
    else:
        return X.reshape(n_row, -1)


change_shape_of_ndarray(test_matrix, 1)

Out[ ]:

array([1, 2, 3, 4, 1, 2, 5, 8])

In [ ]:

test_matrix = np.array([1, 2, 3, 4])


def save_ndarray(X, filename="test.npy"):
    with open(filename, "wb") as f:
        np.save(f, X)  # np.save takes the file first, then the array


def boolean_index(X, condition):
    condition = eval(str("X") + condition)
    return np.where(condition)


boolean_index(test_matrix, ">2")

Out[ ]:

(array([2, 3], dtype=int64),)

In [ ]:

test_matrix = np.array([1, 2, 3, 4])


def find_nearest_value(X, target_value):
    return X[np.argmin(np.abs(X - target_value))]


find_nearest_value(test_matrix, 0.1)

Out[ ]:

1

In [ ]:

test_matrix = np.array([1, 2, 3, 4])


def get_n_largest_values(X, n):
    # sort ascending, reverse to descending order, then take the first n
    return X[np.argsort(X)[::-1][:n]]


get_n_largest_values(test_matrix, 2)

Out[ ]:

array([4, 3])

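7. Linear Regression¶

Linear regression is a regression analysis technique that models the linear correlation between a dependent variable y and one or more independent (explanatory) variables X. --wikipedia

From here on, machine learning is done with the scikit-learn library. Machine learning methods can be categorized in many ways; here they are divided into the following four groups.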

Gradient descent based learning
Probability theory based learning
Information theory based learning
Distance similarity based learning

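The goals of machine learning are as follows.

Minimize the error between the actual values and the predictions of the trained model
Find the optimal weights of the model

Linear regression measures error with the squared error. The goal is therefore to find the weights that minimize the squared error.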

7.1. Cost function¶

The cost function is a formula that expresses the difference between the actual and predicted values. Finding the minimum of the cost function gives the optimal weights. Two methods are covered below.

normal equation
gradient descent

7.2. Normal equation¶

Its characteristics are as follows.

Usable when the inverse of X^TX exists
No hyperparameters
Computation slows down as the number of features grows
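
As a rough illustration of the closed form behind the normal equation, w = (X^T X)^{-1} X^T y, the sketch below fits a line to synthetic data. The data (y = 1 + 2x plus noise) and all variable names are assumptions made for this example, not the lecture's dataset.

```python
import numpy as np

# Hypothetical toy data: y = 1 + 2x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

# Add a column of ones so that the first weight acts as the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equation: w = (X^T X)^{-1} X^T y.
# np.linalg.solve avoids forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # roughly [1, 2]
```

Solving the linear system costs on the order of the cube of the number of features, which is why this approach slows down as features are added.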

Hyper parameter¶

A variable that the user has to set by hand. One example is the learning rate.

Learning rate set too low: training takes a long time
Learning rate set too high: training may fail to converge
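
The effect of the learning rate can be seen on a deliberately simple problem. The sketch below minimizes f(w) = w**2 with plain gradient descent; the function, the starting point, and the three learning rates are arbitrary choices for illustration.

```python
def run_gd(learning_rate, steps=50, start=5.0):
    """Minimize f(w) = w**2 (gradient 2w) with plain gradient descent."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w


w_small = run_gd(0.001)  # too low: has barely moved after 50 steps
w_good = run_gd(0.1)     # reasonable: ends close to the minimum at 0
w_large = run_gd(1.1)    # too high: every step overshoots and |w| grows
print(w_small, w_good, w_large)
```

With the largest rate each update multiplies w by -1.2, so the iterate swings back and forth with growing magnitude instead of converging.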

The practice code follows.

7.2.1. Practice Code¶

In [ ]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

#  LOAD DATASET - simple variable
df_train = pd.read_csv("./data/normal_eq_train.csv")
df_test = pd.read_csv("./data/normal_eq_test.csv")
# df_test.head()
X_train = df_train["x"].values.reshape(-1, 1)
X_test = df_test["x"].values.reshape(-1, 1)
y_train = df_train["y"].values
y_test = df_test["y"].values
df_train.tail()

Out[ ]:

	x	y
695	58	58.595006
696	93	94.625094
697	82	88.603770
698	66	63.648685
699	97	94.975266

Visualizing the data gives the plot below. The linearity is clear.

In [ ]:

plt.scatter(X_train, y_train, alpha=0.3)

Out[ ]:

<matplotlib.collections.PathCollection at 0x136ae470>


Building the model¶

Use the linear regression model from sklearn.

In [ ]:

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

lr = linear_model.LinearRegression(normalize=False)  # why False? (note: the normalize parameter was removed in scikit-learn 1.2)
lr.fit(X_train, y_train)

Out[ ]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Predicting and visualizing¶

Feed in the test values (x) to obtain the predicted values (y), then visualize them.

In [ ]:

y_pred = lr.predict(X_test)

# The coefficients
print("Coefficients: {:02.3f} ".format(lr.coef_[0]))
# The mean squared error
print("Mean squared error: {:02.3f}".format(mean_squared_error(y_test, y_pred)))
# Explained variance score: 1 is perfect prediction
print("Variance score: {:02.3f}".format(r2_score(y_test, y_pred)))

Coefficients: 1.001 
Mean squared error: 9.435
Variance score: 0.989

In [ ]:

# Plot outputs
plt.scatter(X_test, y_test, alpha=0.3)
plt.plot(X_test, y_pred, color="red")
plt.show()

7.3. Gradient Descent¶

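This section covers how to fit a linear regression with the gradient descent algorithm.

7.3.1. Linear regression with Gradient Descent¶

Initialize theta0 and theta1 with arbitrary values
Train until the cost function is minimized
Requires hyperparameters such as the learning rate and the number of iterations
Relatively faster than the normal equation when there are many features
However, it may not converge to the optimum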
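
For reference, the cost function and update rule that the steps above describe can be written out as follows. This is a sketch in the notation of the code later in this section, where m is the number of samples (len(y)) and α is the learning rate (alpha); the superscript (i) for the i-th sample is an added convention.

```latex
h_\theta(x) = \theta_0 + \theta_1 x

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

\theta_1 := \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
```

Both parameters are updated from the same predictions, that is, simultaneously, and the two sums are the partial derivatives of J with respect to theta0 and theta1.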

Example code¶

The data is the Swedish motor insurance dataset. x is the number of claims (Claims) and y is the payment (Payment) column.

In [ ]:

# LOAD DATASET
df = pd.read_csv("./data/SwedishMotorInsurance.csv")
df.tail()

Out[ ]:

	Kilometres	Zone	Bonus	Make	Insured	Claims	Payment
2177	5	7	7	5	8.74	0	0
2178	5	7	7	6	16.61	0	0
2179	5	7	7	7	2.83	1	966
2180	5	7	7	8	13.06	0	0
2181	5	7	7	9	384.87	16	112252

In [ ]:

# X = number of claims
# y = total payment for all the claims, divided by 10,000 to rescale
raw_X = df["Claims"].values.reshape(-1, 1)
y = df["Payment"].values / 10000
plt.plot(raw_X, y, "o", alpha=0.5)

Out[ ]:

[<matplotlib.lines.Line2D at 0x13a35cc0>]

In [ ]:

raw_X[:5], y[:5]

Out[ ]:

(array([[108],
        [ 19],
        [ 13],
        [124],
        [ 40]], dtype=int64),
 array([39.2491,  4.6221,  1.5694, 42.2201, 11.9373]))

In [ ]:

np.ones((len(raw_X), 1))[:3]

Out[ ]:

array([[1.],
       [1.],
       [1.]])

In [ ]:

X = np.concatenate((np.ones((len(raw_X), 1)), raw_X), axis=1)
X[:5]

Out[ ]:

array([[  1., 108.],
       [  1.,  19.],
       [  1.,  13.],
       [  1., 124.],
       [  1.,  40.]])

In [ ]:

w = np.random.normal((0, 4))  # random initial weights, drawn with means (0, 4) and standard deviation 1
w

Out[ ]:

array([-0.00891921,  4.46857823])

In [ ]:

y_predict = np.dot(X, w)
plt.plot(raw_X, y, "o", alpha=0.5)
plt.plot(raw_X, y_predict)

Out[ ]:

[<matplotlib.lines.Line2D at 0x13a78c50>]

In [ ]:

# HYPOTHESIS AND COST FUNCTION
def hypothesis_function(X, theta):
    return X.dot(theta)


def cost_function(h, y):
    return (1 / (2 * len(y))) * np.sum((h - y) ** 2)


h = hypothesis_function(X, w)
cost_function(h, y)

Out[ ]:

341210.1358302579

In [ ]:

# GRADIENT DESCENT
def gradient_descent(X, y, w, alpha, iterations):
    theta = w
    m = len(y)
    theta_list = [theta.tolist()]
    cost = cost_function(hypothesis_function(X, theta), y)
    cost_list = [cost]
    for i in range(iterations):
        t0 = theta[0] - (alpha / m) * np.sum(np.dot(X, theta) - y)
        t1 = theta[1] - (alpha / m) * np.sum((np.dot(X, theta) - y) * X[:, 1])
        theta = np.array([t0, t1])
        if i % 10 == 0:
            theta_list.append(theta.tolist())
            cost = cost_function(hypothesis_function(X, theta), y)
            cost_list.append(cost)
    return theta, theta_list, cost_list


# DO Linear regression with GD
iterations = 70  # number of iterations
alpha = 0.00001  # learning rate

theta, theta_list, cost_list = gradient_descent(X, y, w, alpha, iterations)
cost = cost_function(hypothesis_function(X, theta), y)

print("theta:", theta)
print("cost:", cost_function(hypothesis_function(X, theta), y))

theta: [-0.01387632  0.50162216]
cost: 47.51860200002646

In [ ]:

theta_list[:10]

Out[ ]:

[[-0.008919209046962171, 4.468578233162438],
 [-0.010979772463144966, 2.748558651973508],
 [-0.013688742102085531, 0.5092589839807392],
 [-0.013728097491510558, 0.5016479365747901],
 [-0.01375837638747473, 0.5016221036915547],
 [-0.013788621606274266, 0.50162205194611],
 [-0.013818863883511206, 0.5016220878239617],
 [-0.013849103323916964, 0.5016221239962518]]

In [ ]:

theta_list = np.array(theta_list)
cost_list

Out[ ]:

[341210.1358302579,
 109500.67530520404,
 48.783091900116084,
 47.518661422618486,
 47.518637666285954,
 47.518628518478,
 47.518619372554696,
 47.51861022834695]

In [ ]:

y_predict_step = np.dot(X, theta_list.transpose())
y_predict_step
plt.plot(raw_X, y, "o", alpha=0.3)
for i in range(0, len(cost_list)):
    plt.plot(raw_X, y_predict_step[:, i], label="Line %d" % i)

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
plt.show()

As training proceeds (as the line number increases), the fitted line gets closer to the actual data.

In [ ]:

plt.plot(range(len(cost_list)), cost_list)

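The cost drops from about 341,210 to about 48.8 by the third recorded value (index 2) and then levels off near 47.5. It converges to its minimum rather than to 0; it only looks like 0 because of the scale of the plot.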

7.4. Multivariate Linear Regression¶

This section covers how to implement multivariate linear regression, which is used to analyze data with more than one feature.

In [ ]:

from sklearn.datasets import load_boston  # used in 7.6; removed in scikit-learn 1.2, so an older version is needed
import matplotlib.pyplot as plt
import numpy as np
import random

%matplotlib inline


def gen_data(numPoints, bias, variance):
    x = np.zeros(shape=(numPoints, 3))
    y = np.zeros(shape=numPoints)
    # basically a straight line
    for i in range(0, numPoints):
        # two noisy features that grow with i
        x[i][0] = random.uniform(0, 1) * variance + i
        x[i][1] = random.uniform(0, 1) * variance + i
        # bias feature
        x[i][2] = 1
        # our target variable
        y[i] = (i + bias) + random.uniform(0, 1) * variance + 500
    return x, y


# gen 100 points with a bias of 25 and 10 variance as a bit of noise
x, y = gen_data(100, 25, 10)

plt.plot(x[:, 0:1], "ro")
plt.plot(y, "bo")

plt.show()

In [ ]:

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x[:, 0], x[:, 1], y)

ax.set_xlabel("X0 Label")
ax.set_ylabel("X1 Label")
ax.set_zlabel("Y Label")

plt.show()

In [ ]:

def compute_cost(x, y, theta):
    """
    Compute cost for linear regression
    """
    # Number of training samples
    m = y.size
    predictions = x.dot(theta)
    sqErrors = predictions - y

    J = (1.0 / (2 * m)) * sqErrors.T.dot(sqErrors)
    return J


def minimize_gradient(x, y, theta, iterations=100000, alpha=0.01):
    m = y.size
    cost_history = []
    theta_history = []

    for _ in range(iterations):
        predictions = x.dot(theta)

        for i in range(theta.size):
            partial_marginal = x[:, i]
            errors_xi = (predictions - y) * partial_marginal
            theta[i] = theta[i] - alpha * (1.0 / m) * errors_xi.sum()

        if _ % 1000 == 0:
            theta_history.append(theta.copy())  # copy, because theta is updated in place
            cost_history.append(compute_cost(x, y, theta))

    return theta, np.array(cost_history), np.array(theta_history)


theta_initial = np.ones(3)
theta, cost_history, theta_history = minimize_gradient(
    x, y, theta_initial, 300000, 0.0001
)
print("theta", theta)

theta [5.12894737e-01 5.11459989e-01 5.23264450e+02]

In [ ]:

from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(x[:, :2], y)

# # The coefficients
print("Coefficients: ", regr.coef_)
print("intercept: ", regr.intercept_)

Coefficients:  [0.50658951 0.50525441]
intercept:  524.1326900611566

In [ ]:

print(np.dot(theta, x[10]))
print(regr.predict(x[10, :2].reshape(1, 2)))

537.8284506612534
[538.51880902]

In [ ]:

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter3D(theta_history[:, 0], theta_history[:, 1], cost_history, zdir="z")
plt.show()

In [ ]:

plt.plot(cost_history)
plt.show()

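The cost function converges after roughly the 100th recorded point. Because the cost is recorded every 1,000 iterations, that corresponds to about 100,000 iterations.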

7.5. Performance Measures¶

How should a finished model be evaluated? Answering that requires a number that can be measured. This section covers the measures used to evaluate a model and how to compute them with scikit-learn.

In [ ]:

# Mean Absolute Error(MAE)
from sklearn.metrics import mean_absolute_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mean_absolute_error(y_true, y_pred)

Out[ ]:

0.5

In [ ]:

# Mean Squared Error(MSE); take the square root of the result to get RMSE
from sklearn.metrics import mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mean_squared_error(y_true, y_pred)

Out[ ]:

0.375

In [ ]:

# R squared
from sklearn.metrics import r2_score

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

r2_score(y_true, y_pred)

Out[ ]:

0.9486081370449679
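
To show what these metrics compute, the following sketch reproduces them by hand with NumPy on the same y_true and y_pred. The intermediate names (errors, ss_res, ss_tot) are chosen for this example.

```python
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])
errors = y_true - y_pred

mae = np.mean(np.abs(errors))                   # mean absolute error
mse = np.mean(errors ** 2)                      # mean squared error
rmse = np.sqrt(mse)                             # root mean squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # R squared

print(mae, mse, rmse, r2)
```

The MAE, MSE, and R squared values agree with the sklearn outputs shown above; RMSE is simply the square root of the MSE.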

7.6. Stochastic Gradient Descent¶

This section introduces the stochastic gradient descent (SGD) algorithm and compares its strengths and weaknesses with those of GD. It also covers epoch and batch size, two hyperparameters that are essential when training a model.

7.6.1. Full-batch gradient descent¶

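GD is generally assumed to mean full-batch GD. It trains on the entire dataset at every update and has the following characteristics.

Fewer updates -> can be computationally efficient (speed)
Stable convergence of the cost function
Can get stuck in a local optimum
Can run into memory problems
Large dataset -> slow model/parameter updates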

7.6.2. SGD¶

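Draws a training sample at random from the entire dataset and uses it for each update. Its characteristics are as follows.

Frequent updates make it possible to monitor model performance and the rate of improvement
Converges faster on some problems
Can escape local optima
Takes a long time on large datasets
Hard to tell when the cost has stopped decreasing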

7.6.3. Mini-batch SGD¶

Draws a fixed number of samples (a mini-batch) at random from the dataset for each update.

Combines SGD and batch GD
The most commonly used variant
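
A minimal NumPy sketch of mini-batch SGD for linear regression is shown below. The synthetic data (y = 3 + 2x plus noise), the function name minibatch_sgd, and the default hyperparameters are all assumptions for illustration, not the lecture's code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: y = 3 + 2x plus small noise
x = rng.uniform(0, 1, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=0.05, size=200)
X = np.column_stack([np.ones_like(x), x])  # bias column + feature


def minibatch_sgd(X, y, batch_size=20, epochs=200, alpha=0.1):
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(m)  # reshuffle the samples every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ theta - y[idx]
            gradient = X[idx].T @ error / len(idx)
            theta = theta - alpha * gradient
    return theta


theta = minibatch_sgd(X, y)
print(theta)  # roughly [3, 2]
```

Setting batch_size=1 in this sketch gives plain SGD, and batch_size=len(y) gives full-batch GD, which is why mini-batch is described as a mix of the two.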

7.6.4. SGD implementation issues¶

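Several issues come up when SGD is implemented in practice.

Learning-rate decay
- Decreases the learning rate at a regular interval
- Reduces the learning rate every given number of epochs
- The hyperparameters are hard to set
Stopping condition
- Stops GD when the cost function no longer decreases by more than a given amount during SGD
- Avoids computation that is unnecessary or does not improve performance
- Stopping condition: tol > (previous_loss - loss)
- tol is a hyperparameter set by the user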
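
The two ideas above can be combined in one training loop. The sketch below uses synthetic data and made-up hyperparameter values (alpha, decay_every, decay_rate, tol); it is one possible arrangement, not the way any particular library implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 1.0 + 4.0 * x + rng.normal(scale=0.05, size=100)
X = np.column_stack([np.ones_like(x), x])


def cost(theta):
    return np.mean((X @ theta - y) ** 2) / 2


theta = np.zeros(2)
alpha, decay_every, decay_rate, tol = 0.05, 5, 0.9, 1e-6
previous_loss = cost(theta)
epochs_run = 0
for epoch in range(1, 10001):
    for i in rng.permutation(len(y)):  # one sample per update (SGD)
        theta -= alpha * (X[i] @ theta - y[i]) * X[i]
    if epoch % decay_every == 0:
        alpha *= decay_rate            # learning-rate decay
    loss = cost(theta)
    epochs_run = epoch
    if previous_loss - loss < tol:     # stopping condition
        break
    previous_loss = loss

print(epochs_run, alpha, theta)
```

Because SGD is noisy, the loss can rise slightly from one epoch to the next, which also triggers this stopping rule. Practical implementations often wait for several consecutive epochs without improvement before stopping.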

7.6.5. Epoch and batch-size¶

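Epoch: counted once each time the entire dataset has been fed in as training data
Running full-batch n times is n epochs
Batch size: the number of samples used in a single update
- With 5,120 training samples and a batch size of 512, how many updates make 1 epoch? = 10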
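
The arithmetic in the example above can be checked with a few lines of Python. The helper name updates_per_epoch is made up for this sketch.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


print(updates_per_epoch(5120, 512))   # 10 updates per epoch
print(updates_per_epoch(5120, 5120))  # full batch: 1 update per epoch
print(updates_per_epoch(5120, 1))     # one sample at a time: 5120 updates
```

math.ceil covers the case where the dataset size is not a multiple of the batch size, in which the last batch is smaller than the others.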

Example code¶

In [ ]:

from sklearn.model_selection import train_test_split
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

warnings.filterwarnings("ignore")

# load the data
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
X = df.values
y = boston.target

# feature scaling
std_scaler = StandardScaler()
std_scaler.fit(X)
X_scaled = std_scaler.transform(X)

# split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# fit SGDRegressor
lr_SGD = SGDRegressor()
lr_SGD.fit(X_train, y_train)

y_hat = lr_SGD.predict(X_test)
y_true = y_test

# evaluate
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt((((y_hat - y_true) ** 2).sum() / len(y_true)))
# visualize
plt.scatter(y_true, y_hat, s=10)
plt.xlabel("Prices: $Y_i$")
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title(r"Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")

Out[ ]:

Text(0.5, 1.0, 'Prices vs Predicted prices: $Y_i$ vs $\\hat{Y}_i$')

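7.7. Overfitting and Regularization¶

Overfitting means that a model is fitted so closely to the training data that it performs worse when predicting new data.

Techniques for preventing overfitting¶

Split the dataset
Use more data
Reduce the number of features
Choose parameters appropriately
Regularization

Of the techniques above, the first and the fifth are covered in detail.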

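7.7.1. Splitting the Dataset¶

Testing a model on the data it was trained on gives an over-optimistic score and hides overfitting, because real test data differs from the training data. A model has to generalize so that it can handle new data. This calls for separating the training data from the test data, a technique known as the holdout method.

The split ratio depends on the size of the dataset, but 70% training data and 30% test data is typical.

scikit-learn makes it easy to split a dataset. See the example code below.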

In [ ]:

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

7.7.2. Regularization Methods¶

This section covers the concepts of L1 and L2 regularization and the differences between them.

L1 regularization: adds an L1-norm penalty term to the cost function; known as Lasso.
L2 regularization: adds an L2-norm penalty term to the cost function; known as Ridge.
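
As a rough sketch of what "adding a penalty term" means, the functions below compute a squared-error cost with an L1 or an L2 penalty. The function names, the strength parameter lam, and the tiny dataset are assumptions for illustration. Library implementations scale these terms differently and usually leave the intercept unpenalized.

```python
import numpy as np


def squared_error_cost(X, y, w):
    residual = X @ w - y
    return residual @ residual / (2 * len(y))


def lasso_cost(X, y, w, lam):
    # L1 penalty: sum of the absolute values of the weights
    return squared_error_cost(X, y, w) + lam * np.sum(np.abs(w))


def ridge_cost(X, y, w, lam):
    # L2 penalty: sum of the squared weights
    return squared_error_cost(X, y, w) + lam * np.sum(w ** 2)


X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])

base = squared_error_cost(X, y, w)
print(base, lasso_cost(X, y, w, 0.1), ridge_cost(X, y, w, 0.1))
```

The L1 penalty grows linearly with each weight and can push weights exactly to zero, while the L2 penalty grows quadratically and mainly shrinks large weights.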

Comparison¶

L1(Lasso)	L2(Ridge)
Unstable solution	Stable solution
Possibly multiple solutions	Only one solution
Sparse solution	Non-sparse solution
Feature selection	No feature selection

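Example code¶

This section writes and runs code for SGD Regressor, Ridge, and Lasso, which are among the linear models in scikit-learn, to see the characteristics of each model. The parameters that have to be specified when running each scikit-learn model are explained along the way.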

In [ ]:

# Linear Regression with Ridge & Lasso regression
from sklearn.linear_model import Lasso, Ridge

ridge = Ridge(fit_intercept=True, alpha=0.5)
ridge.fit(X_train, y_train)
# lasso = Lasso(fit_intercept=True, alpha=0.5)
y_hat = ridge.predict(X_test)
y_true = y_test
mse = mean_squared_error(y_hat, y_true)
rmse = np.sqrt((((y_hat - y_true) ** 2).sum() / len(y_true)))
# rmse, mse
plt.scatter(y_true, y_hat, s=10)
plt.xlabel("Prices: $Y_i$")
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title(r"Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")

Out[ ]:

Text(0.5, 1.0, 'Prices vs Predicted prices: $Y_i$ vs $\\hat{Y}_i$')

In [ ]:

from sklearn.model_selection import KFold

print("Ridge Regression")
print("alpha\t RMSE_train\t RMSE_10cv\n")
alpha = np.linspace(0.01, 100, 10)
t_rmse = np.array([])
cv_rmse = np.array([])

for a in alpha:
    ridge = Ridge(fit_intercept=True, alpha=a)

    # computing the RMSE on the held-out test split (stored as rmse_train)
    ridge.fit(X_train, y_train)
    p = ridge.predict(X_test)
    err = p - y_test
    total_error = np.dot(err, err)
    rmse_train = np.sqrt(total_error / len(p))

    # computing RMSE using 10-fold cross validation
    kf = KFold(10)
    xval_err = 0
    for train, test in kf.split(X):
        ridge.fit(X[train], y[train])
        p = ridge.predict(X[test])
        err = p - y[test]
        xval_err += np.dot(err, err)
    rmse_10cv = np.sqrt(xval_err / len(X))

    t_rmse = np.append(t_rmse, [rmse_train])
    cv_rmse = np.append(cv_rmse, [rmse_10cv])
    print("{:.3f}\t {:.4f}\t\t {:.4f}".format(a, rmse_train, rmse_10cv))

Ridge Regression
alpha	 RMSE_train	 RMSE_10cv

0.010	 4.6394		 5.8757
11.120	 4.7653		 5.7211
22.230	 4.7743		 5.6373
33.340	 4.7828		 5.5765
44.450	 4.7923		 5.5322
55.560	 4.8026		 5.4996
66.670	 4.8131		 5.4755
77.780	 4.8237		 5.4576
88.890	 4.8342		 5.4445
100.000	 4.8444		 5.4349

In [ ]:

plt.plot(alpha, t_rmse, label="RMSE-Train")
plt.plot(alpha, cv_rmse, label="RMSE_XVal")
plt.legend(("RMSE-Train", "RMSE_XVal"))
plt.ylabel("RMSE")
plt.xlabel("Alpha")

Out[ ]:

Text(0.5, 0, 'Alpha')

In [ ]:

a = 0.3
for name, met in [
    ("lasso", Lasso(fit_intercept=True, alpha=a)),
    ("ridge", Ridge(fit_intercept=True, alpha=a)),
]:
    met.fit(X_train, y_train)
    # p = np.array([met.predict(xi) for xi in x])
    p = met.predict(X_test)
    e = p - y_test
    total_error = np.dot(e, e)
    rmse_train = np.sqrt(total_error / len(p))

    kf = KFold(10)
    err = 0
    for train, test in kf.split(X):
        met.fit(X[train], y[train])
        p = met.predict(X[test])
        e = p - y[test]
        err += np.dot(e, e)

    rmse_10cv = np.sqrt(err / len(X))
    print("Method: %s" % name)
    print("RMSE on training: %.4f" % rmse_train)
    print("RMSE on 10-fold CV: %.4f" % rmse_10cv)

Method: lasso
RMSE on training: 4.8154
RMSE on 10-fold CV: 5.7637
Method: ridge
RMSE on training: 4.6598
RMSE on 10-fold CV: 5.8487

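7.8. Polynomial Regression¶

This section covers polynomial regression, a regression model for cases where the relationship between X and Y is curved (nonlinear), and how to perform it with scikit-learn.

7.8.1. Polynomial Features¶

A technique that turns a first-degree equation into a higher-degree polynomial, using sklearn.preprocessing.PolynomialFeatures. It is used in the following situations.

When a single variable is suspected of having a nonlinear relationship with Y
Series data that shows a periodic pattern
When a more complex model by itself would solve much of the problem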
- SVM, Tree-based models
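
To make the feature expansion concrete, the sketch below builds degree-2 polynomial features by hand for a row [a, b], producing [1, a, b, a^2, ab, b^2]. The function name polynomial_features is an assumption for illustration; it mimics the idea behind PolynomialFeatures and is not sklearn's implementation.

```python
from itertools import combinations_with_replacement


def polynomial_features(row, degree=2):
    """Return 1 followed by every product of up to `degree` entries of row."""
    features = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            value = 1.0
            for index in combo:
                value *= row[index]
            features.append(value)
    return features


print(polynomial_features([2.0, 3.0]))
```

Two input features become six columns at degree 2, and the count grows quickly with the degree, which is one reason high degrees are usually paired with regularization.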

7.8.2. How to Optimize¶

Find the minimum RMSE
Try all of Ridge, Lasso, and LR
Try degrees from 10 to 50

Example code¶

The code below shows this in detail.

In [ ]:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import warnings

warnings.filterwarnings("ignore")

# Create matrix and vectors
X = [[0.44, 0.68], [0.99, 0.23]]
y = [109.85, 155.72]
X_test = [[0.49, 0.18]]

# PolynomialFeatures (preprocessing)
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
X_test_ = poly.transform(X_test)  # reuse the fitted transformer on the test data

# Instantiate
lg = LinearRegression()

# Fit
lg.fit(X_, y)

# Obtain coefficients
lg.coef_

Out[ ]:

array([  0.        ,  19.4606578 , -15.92235638,  27.82874066,
        -2.52988551, -14.48934431])

In [ ]:

# Predict
lg.predict(X_test_)

Out[ ]:

array([126.84247142])

7.9. Performance measure techniques¶

This section covers several techniques for measuring the performance of machine learning models and how to run them with scikit-learn.

What is cross validation?¶

Separate test data is usually not available, so part of the data originally collected for training is commonly set aside and used as test data. However, performance can vary slightly depending on how the data is split. Cross validation therefore measures performance several times on different training and test splits and reports the mean performance and the performance variance.

The model_selection module of scikit-learn provides several ways to split a whole dataset into training and test data for cross validation.

7.9.1. K-fold cross validation¶

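Split the training data into K folds and run Train and Test K times, then use the mean of the Test scores
Used for tuning model parameters, measuring the final performance of a simple model, and so on
The cross_val_score function handles all of this in a single call
For consistency with its other scorers (where higher is better), sklearn reports MSE as a negative number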
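
To make the points above concrete, here is a small sketch that runs K-fold by hand and then compares the result with cross_val_score. The synthetic data from make_regression and the choice of Ridge are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption; any X, y would do)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5)  # no shuffling, so both approaches see identical folds

# Manual K-fold: fit on K-1 folds, score on the held-out fold
manual_mse = []
for train_index, test_index in kf.split(X):
    model = Ridge().fit(X[train_index], y[train_index])
    y_hat = model.predict(X[test_index])
    manual_mse.append(mean_squared_error(y[test_index], y_hat))

# The same procedure in one call; note that the scores come back negated
scores = cross_val_score(Ridge(), X, y, cv=kf, scoring="neg_mean_squared_error")

print("manual mean MSE: %.4f" % np.mean(manual_mse))
print("cross_val_score mean MSE: %.4f (std %.4f)" % (-scores.mean(), scores.std()))
```

Flipping the sign of the returned scores recovers the per-fold MSE values computed by hand.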

7.9.2. Leave One Out (LOO)¶

Leave only a single sample out as the test set.

7.9.3. Validation set for parameter tuning¶

One of the main reasons for using a validation set is hyperparameter tuning
Number of iterations (SGD), number of branches (Tree-based), etc.
The best parameters are chosen based on validation set performance
If the gap between validation set and training set results widens, the model is overfitting
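
A rough sketch of this tuning loop is shown below. It swaps in Ridge's alpha as the hyperparameter (rather than SGD iterations or tree branches), and the synthetic data and candidate values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=20, noise=15.0, random_state=0)

# Hold out a validation set from the training data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical candidate values
history = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    history.append((alpha, train_rmse, val_rmse))
    # A validation RMSE far above the training RMSE hints at overfitting
    print("alpha=%-6s train RMSE=%.3f  val RMSE=%.3f" % (alpha, train_rmse, val_rmse))

# Choose the candidate with the lowest validation RMSE
best_alpha = min(history, key=lambda row: row[2])[0]
print("best alpha by validation RMSE:", best_alpha)
```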

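7.9.4. Others¶

LeavePOut – holds out P samples at a time (unlike LOO, which holds out one)
ShuffleSplit – independent random splits (samples may repeat across splits)
StratifiedKFold – splits according to the proportions of the Y values
RepeatedKFold – repeats K-Fold several times with different splits
GroupKFold – splits the data by group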
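
As one illustration from this list, the sketch below checks that StratifiedKFold keeps the class ratio of Y in every test fold. The toy labels (eight of class 0, four of class 1) are an assumption chosen so the counts divide evenly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 8 samples of class 0 and 4 of class 1 (a 2:1 ratio)
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)
fold_counts = []
for train_index, test_index in skf.split(X, y):
    # Count how many samples of each class landed in this test fold
    counts = np.bincount(y[test_index], minlength=2)
    fold_counts.append(counts.tolist())
    print("TEST class counts:", counts)
```

Each test fold contains two samples of class 0 and one of class 1, matching the 2:1 ratio of the full data set.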

In [ ]:

from sklearn import datasets

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()
X = boston.data
y = boston.target

In [ ]:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    print("TRAIN - ", len(train_index))
    print("TEST - ", len(test_index))

TRAIN -  404
TEST -  102
TRAIN -  405
TEST -  101
TRAIN -  405
TEST -  101
TRAIN -  405
TEST -  101
TRAIN -  405
TEST -  101

In [ ]:

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

lasso_regressor = Lasso(warm_start=False)
ridge_regressor = Ridge()

lasso_scores = cross_val_score(
    lasso_regressor, X, y, cv=10, scoring="neg_mean_squared_error"
)
ridge_scores = cross_val_score(
    ridge_regressor, X, y, cv=10, scoring="neg_mean_squared_error"
)
np.mean(lasso_scores), np.mean(ridge_scores)

Out[ ]:

(-34.464084588302306, -34.07824620925941)

In [ ]:

from sklearn.model_selection import LeaveOneOut

data = [1, 2, 3, 4]
loo = LeaveOneOut()
# Each split holds out exactly one sample index as the test set
for train_index, test_index in loo.split(data):
    print("%s %s" % (train_index, test_index))

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

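9. Logistic Regression¶

Logistic regression is a statistical technique used to predict the likelihood of an event using a linear combination of independent variables. -- wikipedia

This section covers Logistic Regression, which deals with classification problems. The earlier (linear regression) approach has the following problems.

How should values above 1 or below 0 be interpreted?
Can the output be expressed exactly as 1 or 0?
Is the effect of a variable on Y proportional?
Can the likelihood of an event be expressed as a probability?

The solution is to express the outcome as a probability.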

9.1. Sigmoid function¶

This section covers the Sigmoid function, which expresses the likelihood of belonging to a class as a probability in classification problems.

9.1.1. Odds Ratio¶

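The ratio of the probability that an event occurs to the probability that it does not occur (the odds)

Probability of occurring: P(X)
Probability of not occurring: 1 - P(X)

Expressed as a formula: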

$$\frac{P(X)}{1 - P(X)}$$

9.1.2. Logit function¶

The log odds, computed from the probability of y given the value of X. The formula is as follows.

$$\operatorname{logit}(P) = \ln\left(\frac{P(X)}{1 - P(X)}\right)$$

9.1.3. Sigmoid(=Logistic) Function¶

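The inverse of the Logit function; it produces a probability from z. It maps any real z to a continuous, differentiable value in the range (0, 1), and it is called the sigmoid function because of its S shape. The formula is shown below.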

$$ y = \frac{1}{1 + e^{-z}}$$
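
The relationship between the three formulas above can be checked numerically. The sketch below is a minimal NumPy version; the function names odds, logit, and sigmoid and the sample probabilities are chosen here for illustration.

```python
import numpy as np


def odds(p):
    """Odds: probability of the event over probability of no event."""
    return p / (1.0 - p)


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return np.log(odds(p))


def sigmoid(z):
    """Inverse of logit: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


p = np.array([0.1, 0.5, 0.8])
print(odds(p))            # odds grow without bound as p approaches 1
print(logit(p))           # log odds; 0 when p = 0.5
print(sigmoid(logit(p)))  # recovers the original probabilities
```

Applying sigmoid to the output of logit returns the original probabilities, which is what "inverse function" means here.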

9.2. Cost function¶

This section looks at the Cost Function used in Logistic Regression.
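
A common choice of cost function for logistic regression is the binary cross-entropy (log loss). The form below is the standard textbook one; the symbols $\theta$, $m$, and $h_\theta$ are assumed notation and are not taken from the lecture.

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\ln h_\theta(x^{(i)}) + (1 - y^{(i)})\ln\left(1 - h_\theta(x^{(i)})\right)\right], \quad h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}}$$

A minimal NumPy sketch of this cost, using a tiny hypothetical data set:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy_cost(theta, X, y):
    """Mean binary cross-entropy of the predictions sigmoid(X @ theta)."""
    h = sigmoid(X @ theta)
    eps = 1e-12  # keep h away from exactly 0 or 1 to avoid log(0)
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))


# Tiny hypothetical data set: a bias column plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

cost_zero = cross_entropy_cost(np.zeros(2), X, y)           # every prediction is 0.5
cost_good = cross_entropy_cost(np.array([-4.0, 2.0]), X, y)  # a theta that separates the classes
print(cost_zero, cost_good)
```

With all parameters at zero every prediction is 0.5, so the cost equals ln 2; a parameter vector that separates the two classes gives a lower cost.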

Example Code¶

In [ ]:

import pandas as pd
import numpy as np

df = pd.read_csv("./data/binary.csv")
# rename the 'rank' column because there is also a DataFrame method called 'rank'
df.columns = ["admit", "gre", "gpa", "prestige"]
df.tail()

Out[ ]:

	gre	gpa	prestige
395	620	4.00	2
396	560	3.04	3
397	460	2.63	2
398	700	3.65	2
399	600	3.89	3

In [ ]:

import matplotlib.pyplot as plt

%matplotlib inline
df.hist()

Out[ ]:

[Figure: 2x2 grid of histograms, one per column of df]

In [ ]:

from sklearn import preprocessing  # for Min-Max scaling

y_data = df["admit"].values.reshape(-1, 1)
x_data = df.iloc[:, 1:].values  # .ix was removed from pandas; use .iloc

min_max_scaler = preprocessing.MinMaxScaler()
x_data = min_max_scaler.fit_transform(x_data)

x_data[:5]

Out[ ]:

array([[0.27586207, 0.77586207, 0.66666667],
       [0.75862069, 0.81034483, 0.66666667],
       [1.        , 1.        , 0.        ],
       [0.72413793, 0.53448276, 1.        ],
       [0.51724138, 0.38505747, 1.        ]])

In [ ]:

from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=42
)

logreg = linear_model.LogisticRegression(fit_intercept=True)
logreg.fit(X_train, y_train.ravel())

y_pred = logreg.predict(X_test)
y_true = y_test

accuracy_score(y_true, y_pred)

Out[ ]:

0.725

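The accuracy is about 72.5%, which is not very satisfying.

10. Course Review¶

Professor Sungchul Choi explained things clearly, so the material was easy to follow. Because a large amount of content was covered in five days, it is a pity that the later part went by a bit too quickly. Listening to a lecture in Korean for the first time in a while helped me concentrate, and I suddenly feel motivated. I plan to head over to Kaggle and learn more.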