ch03 - gghistostats

2020-05-04

Intro

A picture is worth a thousand words. - English Language Adage

The simple graph has brought more information to the data analyst's mind than any other device. - John Tukey

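An English adage says that a single picture is worth thousands of words, and a famous data scientist advises that a clear visualization gives the data analyst more information than any other tool. The point of both is visualization.

This chapter starts with visualization using the ggplot2 package. As a brief introduction, ggplot2 embodies the philosophy of the Grammar of Graphics¹, and its development was led by Hadley Wickham, one of the best-known figures in the R ecosystem. Under the package's philosophy that graphs also have a grammar, R visualization has advanced remarkably, and this has contributed greatly to R's popularity.

Now let's start building R visualizations in earnest.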

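I. Preparation

This chapter uses ggplot2 as the main package for visualization and assumes that it is already installed. If you are installing packages for the first time, go back to R package installation[^2] and review the main R package ecosystem.

Open RStudio and run the code below: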

library(ggplot2)
library(ggstatsplot)
library(dplyr)
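
The code later in this post also calls psych and ggthemes through the `::` prefix, so they need to be installed even though they are not loaded here. A minimal sketch for checking which packages are missing, assuming the package list below is what your session needs (the `pkgs` and `missing_pkgs` names are only illustrative):

```r
# Packages used in this post; adjust the list if your setup differs
pkgs <- c("ggplot2", "ggstatsplot", "dplyr", "psych", "ggthemes")

# requireNamespace() returns FALSE instead of failing when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # To install them, uncomment the next line:
  # install.packages(missing_pkgs)
} else {
  message("All packages are available.")
}
```

The install step is left commented out so that running the check never changes your library by accident.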

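This time, let's create visualizations that can go into a report.

II. Dataset - Texas housing sales data

Real estate is one of the most important issues in South Korea, so this chapter visualizes real-estate data. For now we use the Texas housing sales data that ships with the ggplot2 package, but I hope that you will later bring in Korean real-estate data yourself and visualize it.

The txhousing data consists of 8,602 observations and 9 variables. To learn more about its source and each variable (column), run help(txhousing) in the R source editor and read the help page.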
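
Before plotting, it can help to confirm the shape of the data yourself. The sketch below only uses base R functions on the txhousing data from ggplot2; the `tx_dim` and `na_counts` names are just for illustration.

```r
library(ggplot2)

# Number of rows (observations) and columns (variables)
tx_dim <- dim(txhousing)
print(tx_dim)

# Column names, to see which variables are available
print(names(txhousing))

# Missing values per column; sales contains NAs, which is why the
# later code passes na.rm = TRUE to mean()
na_counts <- colSums(is.na(txhousing))
print(na_counts)
```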

III. ggplot2 with Histogram

In the previous post, ch02 - Histogram, we drew histograms.

Example) Here we draw a graph that compares January and July.

# Mean sales for each calendar month (NA values in sales are ignored)
mean_sales <- txhousing %>%
  group_by(month) %>%
  summarise(grp.mean = mean(sales, na.rm = TRUE))

# Keep January and July; month becomes a character so it maps to discrete colors
jan_jul <- txhousing %>%
  filter(month %in% c(1, 7)) %>%
  mutate(month = as.character(month))
mean_jan_jul <- mean_sales %>%
  filter(month %in% c(1, 7)) %>%
  mutate(month = as.character(month))

# Overlaid histograms on a log10 x-axis, with a dashed line at each month's mean
ggplot(jan_jul, aes(x = sales, color = month)) +
  geom_histogram(fill = "white", alpha = 0.5, position = "identity", bins = 30) +
  geom_vline(data = mean_jan_jul,
             aes(xintercept = grp.mean, color = month),
             linetype = "dashed") +
  scale_x_log10() +
  theme_bw() +
  theme(legend.position = "top")
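
The dashed lines in the plot are drawn at each month's mean. As a quick sanity check, the numbers behind them can be printed next to the medians, which is useful because the x-axis is log-scaled. This is only a sketch; the `sales_summary` name and its column names are illustrative and not part of the original code.

```r
library(ggplot2)
library(dplyr)

# Per-month count, mean, and median of sales for January and July
sales_summary <- txhousing %>%
  filter(month %in% c(1, 7)) %>%
  group_by(month) %>%
  summarise(
    n = sum(!is.na(sales)),                      # non-missing observations
    mean_sales = mean(sales, na.rm = TRUE),      # where the dashed line sits
    median_sales = median(sales, na.rm = TRUE)   # for comparison with the mean
  )

print(sales_summary)
```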

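However, this plot falls a little short of what you would submit as a statistical report. Among the ggplot2 extensions there is a package called ggstatsplot. Judging by its steady stream of version updates, it looks worth consulting when writing statistical reports.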

IV. gghistostats

First, let's follow the tutorial to create a visualization.

(1) Sample Tutorial

library(psych)
dplyr::glimpse(x = psych::sat.act)

## Rows: 700
## Columns: 6
## $ gender    <int> 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2,…
## $ education <int> 3, 3, 3, 4, 2, 5, 5, 3, 4, 5, 3, 4, 4, 4, 3, 4, 3, 4, 4, 4,…
## $ age       <int> 19, 23, 20, 27, 33, 26, 30, 19, 23, 40, 23, 34, 32, 41, 20,…
## $ ACT       <int> 24, 35, 21, 26, 31, 28, 36, 22, 22, 35, 32, 29, 21, 35, 27,…
## $ SATV      <int> 500, 600, 480, 550, 600, 640, 610, 520, 400, 730, 760, 710,…
## $ SATQ      <int> 500, 500, 470, 520, 550, 640, 500, 560, 600, 800, 710, 600,…

The tutorial gives an example that visualizes the ACT variable as a histogram. Rather than copying it as-is, type it out piece by piece and take the time to understand what each argument means.

ggstatsplot::gghistostats(
  data = psych::sat.act, # data from which variable is to be taken
  x = ACT, # numeric variable
  results.subtitle = FALSE, # don't run statistical tests
  messages = FALSE, # turn off messages
  xlab = "ACT Score", # x-axis label
  title = "Distribution of ACT Scores", # title for the plot
  subtitle = "N = 700", # subtitle for the plot
  caption = "Data courtesy of: SAPA project (https://sapa-project.org)", # caption for the plot
  centrality.k = 1 # show 1 decimal place for centrality label
)

This produces a report-style plot.
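
To see what each argument means in the version of ggstatsplot you have installed, one option is to list the arguments the function accepts and compare them with the ones used in the tutorial call. This is a sketch under the assumption that argument names may differ between package versions; `used_args`, `supported_args`, and `unsupported` are illustrative names.

```r
# Arguments accepted by gghistostats() in the installed version
supported_args <- names(formals(ggstatsplot::gghistostats))
print(supported_args)

# Arguments used in the tutorial call above
used_args <- c(
  "data", "x", "results.subtitle", "messages", "xlab",
  "title", "subtitle", "caption", "centrality.k"
)

# Anything listed here is not a named argument in your installed version
unsupported <- setdiff(used_args, supported_args)
print(unsupported)

# The help page describes each argument in detail:
# ?ggstatsplot::gghistostats
```

If `unsupported` is not empty, the help page for your installed version is the place to look for the current name of that argument.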

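(2) Comparison plot of January and July data

Now let's apply it to our data. We only need the January and July data. In this case, switch to grouped_gghistostats and add the grouping.var and plotgrid.args arguments.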

ggstatsplot::grouped_gghistostats(
  data = jan_jul, # switched to the jan_jul data
  x = sales, # sales 
  xlab = "Sales",
  grouping.var = month, # grouping variable 1 = Jan, 7 = Jul
  title.prefix = "Month", # prefix for the fixed title
  k = 1, # number of decimal places in results
  type = "r", # robust test: one-sample percentile bootstrap
  test.value = 20, # test value against which sample mean is to be compared
  test.value.line = TRUE, # show a vertical line at `test.value`
  bar.measure = "density", # density
  centrality.parameter = "median", # plotting centrality parameter
  centrality.line.args = list(color = "#D55E00"),
  centrality.label.args = list(color = "#D55E00"),
  test.value.line.args = list(color = "#009E73"),
  test.value.label.args = list(color = "#009E73"),
  messages = FALSE, # turn off messages
  ggtheme = ggthemes::theme_stata(), # changing default theme
  ggstatsplot.layer = FALSE, # turn off ggstatsplot theme layer
  # arguments relevant for ggstatsplot::combine_plots
  title.text = "Distribution of Sales across Month(Jan",
  caption.text = "Data is from ggplot2 package",
  plotgrid.args = list(
    nrow = 2,
    ncol = 1,
    labels = c("January", "July")
  )
)

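As the plot above shows, visualizations can also be created per group. However, if you want more control over the design, the package authors recommend writing the code together with the purrr package.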
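
As a rough idea of what the purrr approach can look like, the sketch below splits the data by month and builds one gghistostats plot per group, so each plot can get its own title. It is not the code from the package's own purrr guide; the `month_labels` lookup and the `plots` name are hypothetical, and only the basic `data`, `x`, `xlab`, and `title` arguments are used.

```r
library(ggplot2)
library(dplyr)
library(purrr)

# Same January/July subset as earlier in this post
jan_jul <- txhousing %>%
  filter(month %in% c(1, 7)) %>%
  mutate(month = as.character(month))

# Hypothetical lookup from month code to a readable title
month_labels <- c("1" = "January", "7" = "July")

# One plot per month: .x is the subset, .y is the list name ("1" or "7")
plots <- jan_jul %>%
  split(.$month) %>%
  imap(~ ggstatsplot::gghistostats(
    data = .x,
    x = sales,
    xlab = "Sales",
    title = month_labels[[.y]]
  ))

# Draw each plot in turn
walk(plots, print)
```

Because each element of `plots` is built separately, any argument can be varied per group by adding another lookup like `month_labels`.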

using ggstatsplot with the purrr package

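V. Future directions

I also have a paper to write this year, so I am thinking of building up Korean-language material for this package, one piece at a time.
Whenever I find some spare time, I will try writing visualization functions for the paper.
Also, as the code above shows, there are no explanations in Korean, so I plan to work on Korean translations little by little as I write with this package.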

¹ I suggest reading Hadley Wickham's paper on the ggplot2 package: “The Layered Grammar of Graphics”, http://vita.had.co.nz/papers/layered-grammar.pdf