Visualizing with Altair

Taeyoon Kim

2018-08-28 02:07

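Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite. Its development can be followed on GitHub.
With Altair, you can spend more time understanding your data and its meaning. Altair's API is simple, friendly, and consistent, and it is built on top of the Vega-Lite grammar. This simplicity lets you produce beautiful and effective visualizations with a minimal amount of code.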

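Features¶

A carefully designed, declarative Python API.
An auto-generated internal API that conforms fully to Vega-Lite.
Automatic generation of code that follows the Vega-Lite JSON specification.
Visualizations display in Jupyter Notebook, JupyterLab, nteract, and nbviewer.
Visualizations can be exported to PNG, SVG, and HTML.
A gallery with dozens of examples.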

Altair + Jupyter notebook¶

If you use Jupyter Notebook, Altair works best with version 5.3 or later. To use Altair in the notebook you also need to install the vega package.

Installing Altair¶

To install Altair with conda, run the following command.

conda install -c conda-forge altair vega_datasets vega

A quick taste¶

Let's draw a scatter plot.

In [ ]:

import altair as alt
from vega_datasets import data

alt.renderers.enable("notebook")

iris = data.iris()

alt.Chart(iris).mark_point().encode(x="petalLength", y="petalWidth", color="species")

Out[ ]:

Data format¶

Altair's data handling is built around the pandas DataFrame. This tutorial uses the simple DataFrame below. Labeled columns are essential for Altair visualizations, because encodings refer to columns by name.

In [ ]:

import pandas as pd

data = pd.DataFrame({"a": list("CCCDDDEEE"), "b": [2, 7, 4, 1, 2, 6, 8, 4, 7]})

The Chart object¶

The fundamental object in Altair is the Chart, which takes the data (a DataFrame) as its single argument.

In [ ]:

import altair as alt

chart = alt.Chart(data)

We have defined a Chart object, but we have not yet told the chart to do anything with the data. Let's do that with encodings and marks.

Encodings and marks¶

With a chart object, we can specify how the data should be visualized. This is done through the mark attribute of the Chart object, which is set with the Chart.mark_* methods. For example, mark_point() shows the data as points.

In [ ]:

alt.Chart(data).mark_point()

Out[ ]:

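Here the rendering consists of one point per row in the dataset. Because we have not yet specified positions for these points, they are all drawn on top of each other.

To separate the points visually, we can map encoding channels to columns of the data. For example, we can encode the variable a with the x channel, which represents the x-axis position. This is done with the Chart.encode() method.

The encode() method maps encoding channels (x, y, color, shape, size, and so on) to columns by name. For pandas DataFrames, Altair automatically determines an appropriate data type for each column.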

Now let's look at some examples.

Bar chart¶

In [ ]:

# simple barplot
import altair as alt
import pandas as pd

data = pd.DataFrame(
    {
        "a": ["A", "B", "C", "D", "E", "F", "G", "H", "I"],
        "b": [28, 55, 43, 91, 81, 53, 19, 87, 52],
    }
)

alt.Chart(data).mark_bar().encode(x="a", y="b")

Out[ ]:

Line chart¶

In [ ]:

import altair as alt
import numpy as np
import pandas as pd

x = np.arange(100)
data = pd.DataFrame({"x": x, "sin(x)": np.sin(x / 5)})

alt.Chart(data).mark_line().encode(x="x", y="sin(x)")

Out[ ]:

Line chart with point markers¶

In [ ]:

import altair as alt
import numpy as np
import pandas as pd

x = np.arange(100)
data = pd.DataFrame({"x": x, "sin(x)": np.sin(x / 5)})

alt.Chart(data).mark_line(point=True).encode(x="x", y="sin(x)")

Out[ ]:

Heat map¶

In [ ]:

import altair as alt
import numpy as np
import pandas as pd

# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x**2 + y**2

# Convert this grid to columnar data expected by Altair
data = pd.DataFrame({"x": x.ravel(), "y": y.ravel(), "z": z.ravel()})

alt.Chart(data).mark_rect().encode(x="x:O", y="y:O", color="z:Q")

Out[ ]:

Histogram¶

In [ ]:

import altair as alt
from vega_datasets import data

movies = data.movies.url

alt.Chart(movies).mark_bar().encode(
    alt.X("IMDB_Rating:Q", bin=True),
    y="count()",
)

Out[ ]:

Area chart¶

In [ ]:

import altair as alt
from vega_datasets import data

iowa = data.iowa_electricity()

alt.Chart(iowa).mark_area().encode(x="year:T", y="net_generation:Q", color="source:N")

Out[ ]:

Strip plot¶

In [ ]:

import altair as alt
from vega_datasets import data

source = data.cars()

alt.Chart(source).mark_tick().encode(x="Horsepower:Q", y="Cylinders:O")

Out[ ]:

More complex chart examples¶

In [ ]:

alt.renderers.enable("notebook")
alt.data_transformers.enable("json")

data = pd.DataFrame(
    {
        "Day": range(1, 16),
        "Value": [
            54.8,
            112.1,
            63.6,
            37.6,
            79.7,
            137.9,
            120.1,
            103.3,
            394.8,
            199.5,
            72.3,
            51.1,
            112.0,
            174.5,
            130.5,
        ],
    }
)

data2 = pd.DataFrame([{"ThresholdValue": 300, "Threshold": "hazardous"}])

bar1 = alt.Chart(data).mark_bar().encode(x="Day:O", y="Value:Q")

bar2 = (
    alt.Chart(data)
    .mark_bar(color="#e45755")
    .encode(x="Day:O", y="baseline:Q", y2="Value:Q")
    .transform_filter("datum.Value >= 300")
    .transform_calculate("baseline", "300")
)

rule = alt.Chart(data2).mark_rule().encode(y="ThresholdValue:Q")

text = (
    alt.Chart(data2)
    .mark_text(align="left", dx=215, dy=-5)
    .encode(
        alt.Y("ThresholdValue:Q", axis=alt.Axis(title="PM2.5 Value")),
        text=alt.value("hazardous"),
    )
)

bar1 + text + bar2 + rule

Out[ ]:

In [ ]:

# Re-import: `data` was rebound to a DataFrame in the previous cell
from vega_datasets import data

population = data.population.url

# Define aggregate fields
lower_box = "q1(people):Q"
lower_whisker = "min(people):Q"
upper_box = "q3(people):Q"
upper_whisker = "max(people):Q"

# Compose each layer individually
lower_plot = (
    alt.Chart(population)
    .mark_rule()
    .encode(
        y=alt.Y(lower_whisker, axis=alt.Axis(title="population")),
        y2=lower_box,
        x="age:O",
    )
)

middle_plot = (
    alt.Chart(population)
    .mark_bar(size=5.0)
    .encode(y=lower_box, y2=upper_box, x="age:O")
)

upper_plot = (
    alt.Chart(population).mark_rule().encode(y=upper_whisker, y2=upper_box, x="age:O")
)

middle_tick = (
    alt.Chart(population)
    .mark_tick(color="white", size=5.0)
    .encode(
        y="median(people):Q",
        x="age:O",
    )
)

lower_plot + middle_plot + upper_plot + middle_tick

Out[ ]:

In [ ]:

countries = alt.topo_feature(data.world_110m.url, "countries")

base = (
    alt.Chart(countries)
    .mark_geoshape(fill="#666666", stroke="white")
    .properties(width=300, height=180)
)

projections = ["equirectangular", "mercator", "orthographic", "gnomonic"]
charts = [base.project(proj).properties(title=proj) for proj in projections]

alt.vconcat(alt.hconcat(*charts[:2]), alt.hconcat(*charts[2:]))

Out[ ]:

In [ ]:

barley = data.barley()

points = (
    alt.Chart(barley)
    .mark_point(filled=True)
    .encode(
        alt.X(
            "mean(yield)",
            scale=alt.Scale(zero=False),
            axis=alt.Axis(title="Barley Yield"),
        ),
        y="variety",
        color=alt.value("black"),
    )
)

error_bars = (
    alt.Chart(barley).mark_rule().encode(x="ci0(yield)", x2="ci1(yield)", y="variety")
)

points + error_bars

Out[ ]:

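Wrapping up¶

Having gone through these examples, you have probably felt Altair's potential and conciseness. However, Altair comes with a few caveats, listed below.

The API is still fairly new, so parts of it may contain bugs.
Documentation is still thin; you sometimes have to consult the Vega-Lite documentation to find an answer.
The number of data points it can handle is currently quite small. The limit is 5,000 rows for now, though it is expected to rise.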

Despite these caveats, Altair has a lot of room to grow. Let's see whether it can one day challenge matplotlib's dominance.

Learning seaborn with Pokémon data

Taeyoon Kim

2018-08-21 18:07

For details, please see the original source.

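Introducing Seaborn¶

Seaborn was created to provide a friendlier layer on top of Matplotlib, a powerful but unwieldy visualization library. The official website introduces Seaborn as follows.

If Matplotlib tries to make easy things easy and hard things possible, Seaborn tries to make hard things easy too.

Seaborn offers the following.

Attractive default themes.
Customizable color palettes.
Attractive statistical plots.
Easy and flexible display of results.

The point worth emphasizing is that Seaborn is a first-rate tool for exploratory analysis. With it, you can get a quick and efficient picture of your raw data.

Seaborn is, however, a complement to Matplotlib, not a replacement. Because it runs on top of Matplotlib, you also need to know how to work with Matplotlib.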
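A short sketch of what "on top of Matplotlib" means in practice: axes-level Seaborn functions return a Matplotlib Axes object, so titles, limits, and other details are adjusted with ordinary Matplotlib calls. The four rows below are copied from the first rows of the Pokémon table shown later in this post; the Agg backend line is only there so the snippet runs without a display and can be dropped in a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Attack/Defense of Bulbasaur, Ivysaur, Venusaur, Charmander
toy = pd.DataFrame({"Attack": [49, 62, 82, 52], "Defense": [49, 63, 83, 43]})

# Seaborn draws the plot and hands back a Matplotlib Axes
ax = sns.scatterplot(x="Attack", y="Defense", data=toy)

# From here on it is plain Matplotlib
ax.set_title("Attack vs. Defense (toy data)")
ax.set_xlim(left=0)
plt.close(ax.figure)
```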

Before we start¶

Load the required libraries.

In [ ]:

# importing libraries and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

The Pokémon data¶

We will use Pokemon.csv, a dataset collected from the Pokémon games.

First, load the CSV data with read_csv.

In [ ]:

# Read dataset
df = pd.read_csv("G:/Pokemon.csv", index_col=0)
df.head()

Out[ ]:

	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Stage	Legendary
#
1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	2	False
3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	3	False
4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False
5	Charmeleon	Fire	NaN	405	58	64	58	80	65	80	2	False

Visualizing linear regression¶

Let's plot attack against defense to see whether there is a linear relationship between them.

In [ ]:

sns.lmplot(x="Attack", y="Defense", data=df, hue="Type 1")
plt.ylim(0, None)

Out[ ]:

(0, 220.91225584629962)

For most Pokémon, attack and defense show a clear linear relationship. Ghost-type Pokémon, by contrast, tend to have lower defense as attack increases.

Drawing a box plot¶

Let's draw box plots of every Pokémon stat (attack, defense, HP, and so on).

In [ ]:

sns.boxplot(data=df)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0xc7c65f8>

Let's drop the columns we don't need and draw it again.

In [ ]:

stats_df = df.drop(["Total", "Stage", "Legendary"], axis=1)
sns.boxplot(data=stats_df)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0xce099b0>

The stats all have broadly similar distributions. HP, however, has outliers with very high values.
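The points a box plot draws beyond its whiskers follow a simple rule: by default, Seaborn (through Matplotlib) extends the whiskers to 1.5 times the interquartile range beyond the quartiles and marks anything further out as an outlier. The sketch below applies that rule by hand. The helper name iqr_outliers and the HP values are hypothetical and are not taken from Pokemon.csv.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the entries lying more than `whis` * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - whis * iqr) | (values > q3 + whis * iqr)
    return values[mask]


# Made-up HP values with one extreme entry
hp = pd.Series([45, 60, 80, 39, 58, 78, 44, 59, 250], name="HP")
outliers = iqr_outliers(hp)
print(outliers.tolist())
```

On the real data, something like iqr_outliers(stats_df["HP"]) would list the rows behind the points that the box plot shows above the HP whisker.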

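Drawing a violin plot¶

Let's draw a violin plot of attack for each Pokémon type. A violin plot serves the same basic purpose as a box plot, but also shows the shape of the distribution.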

In [ ]:

sns.set_style("white")
sns.violinplot(x="Type 1", y="Attack", data=df)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0xc75f710>

We can see that Fighting-type Pokémon have high attack.

You can also assign each type a color of your choice. Define the color palette as follows.

In [ ]:

pkmn_type_colors = [
    "#78C850",  # Grass
    "#F08030",  # Fire
    "#6890F0",  # Water
    "#A8B820",  # Bug
    "#A8A878",  # Normal
    "#A040A0",  # Poison
    "#F8D030",  # Electric
    "#E0C068",  # Ground
    "#EE99AC",  # Fairy
    "#C03028",  # Fighting
    "#F85888",  # Psychic
    "#B8A038",  # Rock
    "#705898",  # Ghost
    "#98D8D8",  # Ice
    "#7038F8",  # Dragon
]
sns.violinplot(x="Type 1", y="Attack", data=df, palette=pkmn_type_colors)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0xe1148d0>
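A list palette is positional: colors are handed out in the order in which the categories are drawn, so the list above presumably matches the order in which the types first appear in the data. A less fragile variant, sketched here under that assumption, keeps the colors in a dict keyed by type name and derives the list for whatever order you want to plot. The helper palette_for and the three-type subset are illustrative only.

```python
# Subset of the type colors used above, keyed by type name
type_colors = {
    "Grass": "#78C850",
    "Fire": "#F08030",
    "Water": "#6890F0",
}


def palette_for(types_in_order, color_map, default="#999999"):
    """Build a positional palette that matches the given category order."""
    return [color_map.get(t, default) for t in types_in_order]


# Any explicit order works, including one that differs from the data order
order = ["Water", "Grass", "Fire"]
palette = palette_for(order, type_colors)
print(palette)

# Usage with the DataFrame from this post (not run here):
# sns.violinplot(x="Type 1", y="Attack", data=df, order=order, palette=palette)
```

Passing the same order to both order= and the palette helper keeps colors and categories aligned even if the rows of the CSV are reshuffled.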

Scatter plot (swarm plot)¶

The same data can also be shown as a categorical scatter plot, here a swarm plot.

In [ ]:

# Swarm plot with Pokemon color palette
sns.swarmplot(x="Type 1", y="Attack", data=df, palette=pkmn_type_colors)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0xe3fcc88>

Overlaying the swarm plot on the violin plot¶

The plots can also be overlaid to convey more information.

In [ ]:

plt.figure(figsize=(10, 6))

# Create plot
sns.violinplot(
    x="Type 1",
    y="Attack",
    data=df,
    inner=None,  # Remove the bars inside the violins
    palette=pkmn_type_colors,
)

sns.swarmplot(
    x="Type 1",
    y="Attack",
    data=df,
    color="k",  # Make points black
    alpha=0.7,  # and slightly transparent
)

# Set title with matplotlib
plt.title("Attack by Type")

Out[ ]:

<matplotlib.text.Text at 0xe7cb588>

Putting it all together¶

We could repeat the same plot for each stat. But what if we want to show all of the information in a single figure? In that case we can use the pandas melt() function.

See the example below.

In [ ]:

# Melt DataFrame
melted_df = pd.melt(
    stats_df,
    id_vars=["Name", "Type 1", "Type 2"],  # Variables to keep
    var_name="Stat",  # Name of melted variable
)
# melted_df.head()
print(stats_df.shape)
print(melted_df.shape)

(151, 9)
(906, 5)
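The shapes printed above follow directly from what melt() does: every non-id column becomes a block of rows, so 151 Pokémon times 6 stat columns gives 906 rows, and the 3 id columns plus the new Stat and value columns give 5 columns. The sketch below shows the same reshaping on a two-row frame; the values are the HP and Attack figures of Bulbasaur and Charmander from the table earlier in this post.

```python
import pandas as pd

# Wide form: one row per Pokémon, one column per stat
wide = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander"],
        "HP": [45, 39],
        "Attack": [49, 52],
    }
)

# Long form: one row per (Pokémon, stat) pair
long_df = pd.melt(wide, id_vars=["Name"], var_name="Stat")

print(long_df)
print(wide.shape, "->", long_df.shape)
```

The default name of the value column is "value", which is why the swarm plot in the next section uses y="value".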

Drawing the swarm plot¶

A swarm plot like the one below works well for showing information this dense.

In [ ]:

sns.swarmplot(
    x="Stat",
    y="value",
    data=melted_df,
    hue="Type 1",
    dodge=True,  # Separate points by hue (this parameter was formerly named split)
    palette=pkmn_type_colors,  # Use Pokemon palette
)

# put a legend to the right of the current axis
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))

Out[ ]:

<matplotlib.legend.Legend at 0xf54a1d0>

Drawing a heatmap¶

Let's draw a heatmap to check whether the stats are correlated with one another.

In [ ]:

# Correlate numeric columns only (Name and the two type columns are text)
corr = stats_df.select_dtypes("number").corr()

# Heatmap
sns.heatmap(corr)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0xfd1b588>
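A bare heatmap can be hard to read, since the color scale adapts to the data. One possible refinement, sketched here, pins the scale to the full correlation range of -1 to 1 and prints each coefficient in its cell. The five rows are the HP, Attack, and Defense values from the table at the top of this post, so the numbers are only a toy illustration, not the correlations of the full dataset.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# First five rows of the Pokémon table (toy sample)
toy = pd.DataFrame(
    {
        "HP": [45, 60, 80, 39, 58],
        "Attack": [49, 62, 82, 52, 64],
        "Defense": [49, 63, 83, 43, 58],
    }
)
corr = toy.corr()

# Fixed color range and a diverging colormap keep 0 visually neutral
ax = sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Stat correlations (toy sample)")
plt.close(ax.figure)
```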

Drawing a histogram¶

Let's draw a histogram of attack.

In [ ]:

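# Distribution Plot (a.k.a. Histogram)
# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest equivalent
sns.histplot(df.Attack, kde=True, stat="density")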

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0xf556358>

Count plot¶

This plot shows the number of occurrences of each value.

In [ ]:

sns.countplot(x="Type 1", data=df, palette=pkmn_type_colors)

# Rotate x-labels
plt.xticks(rotation=-45)

Out[ ]:

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
 <a list of 15 Text xticklabel objects>)
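The bar heights in a count plot are the same numbers that pandas reports with value_counts(), which is handy when you want the figures themselves rather than the picture. A small sketch, using the Type 1 values of the five rows shown in the table earlier in this post:

```python
import pandas as pd

# Type 1 of Bulbasaur, Ivysaur, Venusaur, Charmander, Charmeleon
types = pd.Series(["Grass", "Grass", "Grass", "Fire", "Fire"], name="Type 1")

# Counts per category, sorted from most to least frequent
counts = types.value_counts()
print(counts)
```

On the full dataset the equivalent call would be df["Type 1"].value_counts().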

Factor plot¶

In [ ]:

# factorplot was renamed to catplot in seaborn 0.9
g = sns.catplot(
    x="Type 1",
    y="Attack",
    data=df,
    hue="Stage",  # Color by stage
    col="Stage",  # Separate by stage
    kind="swarm",  # Swarmplot
)

# Rotate x-axis labels
g.set_xticklabels(rotation=-45)

# Doesn't work because only rotates last plot
# plt.xticks(rotation=-45)

Out[ ]:

<seaborn.axisgrid.FacetGrid at 0xfbe3390>

Density plot¶

This shows the density of Pokémon over the attack and defense axes.

In [ ]:

# Recent seaborn versions require x and y as keyword arguments
sns.kdeplot(x=df.Attack, y=df.Defense)

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0xf4f99e8>

Joint distribution plot¶

This shows the attack and defense distributions together in a single figure, with attack and defense as the axes.

In [ ]:

sns.jointplot(x="Attack", y="Defense", data=df)

Out[ ]:

<seaborn.axisgrid.JointGrid at 0x10e8c668>

Wrapping up¶

Seaborn is a tool built on Matplotlib that adds color themes, statistical charts, and other features for better-looking visualizations. For more impressive plots, see the example gallery.