[Python] Collecting Sample Data with the PyDataset Library

 
One-line summary
- Load sample data as easily as you would in R.
 
Sample Dataset
- Write code that fetches sample data.
 - For this, use the PyDataset library.
!pip install pydataset
Collecting pydataset
  Downloading https://files.pythonhosted.org/packages/4f/15/548792a1bb9caf6a3affd61c64d306b08c63c8a5a49e2c2d931b67ec2108/pydataset-0.2.0.tar.gz (15.9MB)
     |████████████████████████████████| 15.9MB 285kB/s 
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pydataset) (1.1.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2.8.1)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (1.19.5)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2018.9)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->pydataset) (1.15.0)
Building wheels for collected packages: pydataset
  Building wheel for pydataset (setup.py) ... done
  Created wheel for pydataset: filename=pydataset-0.2.0-cp37-none-any.whl size=15939431 sha256=ebe470895a3467fe13c7654021e9108227a6dec8ce6da4f9b4e704520bcd6203
  Stored in directory: /root/.cache/pip/wheels/fe/3f/dc/5d02ccc767317191b12d042dd920fcf3432fab74bc7978598b
Successfully built pydataset
Installing collected packages: pydataset
Successfully installed pydataset-0.2.0
from pydataset import data
# Calling data() with no arguments returns the index of available datasets
print(data())
        dataset_id                                             title
0    AirPassengers       Monthly Airline Passenger Numbers 1949-1960
1          BJsales                 Sales Data with Leading Indicator
2              BOD                         Biochemical Oxygen Demand
3     Formaldehyde                     Determination of Formaldehyde
4     HairEyeColor         Hair and Eye Color of Statistics Students
..             ...                                               ...
752        VerbAgg                  Verbal Aggression item responses
753           cake                 Breakage Angle of Chocolate Cakes
754           cbpp                 Contagious bovine pleuropneumonia
755    grouseticks  Data on red grouse ticks from Elston et al. 2001
756     sleepstudy       Reaction times in a sleep deprivation study
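With more than 750 entries in the index, scanning it by eye is not practical. Below is a minimal sketch of a keyword search over the `title` column. The helper name `find_datasets` and the stand-in frame `sample_index` are hypothetical; the stand-in simply reuses five rows from the listing above so the example runs without pydataset installed.

```python
import pandas as pd


def find_datasets(index: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return the index rows whose title contains `keyword` (case-insensitive).

    `find_datasets` is a hypothetical helper, not part of pydataset.
    """
    mask = index["title"].str.contains(keyword, case=False, regex=False)
    return index[mask]


# Stand-in for the frame returned by data(): five rows copied from the listing above.
sample_index = pd.DataFrame(
    {
        "dataset_id": ["VerbAgg", "cake", "cbpp", "grouseticks", "sleepstudy"],
        "title": [
            "Verbal Aggression item responses",
            "Breakage Angle of Chocolate Cakes",
            "Contagious bovine pleuropneumonia",
            "Data on red grouse ticks from Elston et al. 2001",
            "Reaction times in a sleep deprivation study",
        ],
    },
    index=[752, 753, 754, 755, 756],
)

print(find_datasets(sample_index, "sleep"))
```

With pydataset installed, passing the result of `data()` in place of `sample_index` should search the full index, assuming it keeps the `dataset_id` and `title` columns shown above.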
- Write the code that loads a dataset: pass its dataset_id to data(), and add show_doc=True to print its documentation.
 
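# Load the "cake" dataset as a pandas DataFrame
cake = data("cake")
print(cake)
# Print the dataset's documentation (carried over from the R docs)
data("cake", show_doc=True)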
     replicate recipe  temperature  angle  temp
1            1      A          175     42   175
2            1      A          185     46   185
3            1      A          195     47   195
4            1      A          205     39   205
5            1      A          215     53   215
..         ...    ...          ...    ...   ...
266         15      C          185     28   185
267         15      C          195     25   195
268         15      C          205     25   205
269         15      C          215     31   215
270         15      C          225     25   225
cake
PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)
## Breakage Angle of Chocolate Cakes
### Description
Data on the breakage angle of chocolate cakes made with three different
recipes and baked at six different temperatures. This is a split-plot design
with the recipes being whole-units and the different temperatures being
applied to sub-units (within replicates). The experimental notes suggest that
the replicate numbering represents temporal ordering.
### Format
A data frame with 270 observations on the following 5 variables.
`replicate`
a factor with levels `1` to `15`
`recipe`
a factor with levels `A`, `B` and `C`
`temperature`
an ordered factor with levels `175` < `185` < `195` < `205` < `215` < `225`
`angle`
a numeric vector giving the angle at which the cake broke.
`temp`
numeric value of the baking temperature (degrees F).
### Details
The `replicate` factor is nested within the `recipe` factor, and `temperature`
is nested within `replicate`.
### Source
Original data were presented in Cook (1938), and reported in Cochran and Cox
(1957, p. 300). Also cited in Lee, Nelder and Pawitan (2006).
### References
Cook, F. E. (1938) _Chocolate cake, I. Optimum baking temperature_. Master's
Thesis, Iowa State College.
Cochran, W. G., and Cox, G. M. (1957) _Experimental designs_, 2nd Ed. New
York, John Wiley \& Sons.
Lee, Y., Nelder, J. A., and Pawitan, Y. (2006) _Generalized linear models with
random effects. Unified analysis via H-likelihood_. Boca Raton, Chapman and
Hall/CRC.
### Examples
    str(cake)
    ## 'temp' is continuous, 'temperature' an ordered factor with 6 levels
    (fm1 <- lmer(angle ~ recipe * temperature + (1|recipe:replicate), cake, REML= FALSE))
    (fm2 <- lmer(angle ~ recipe + temperature + (1|recipe:replicate), cake, REML= FALSE))
    (fm3 <- lmer(angle ~ recipe + temp        + (1|recipe:replicate), cake, REML= FALSE))
    ## and now "choose" :
    anova(fm3, fm2, fm1)
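The documentation above describes `temperature` as an ordered factor, an R type, while the printed frame appears to hold plain integers, and its examples are in R. As a rough pandas counterpart, the sketch below restores the ordering with `pd.Categorical` and computes a simple per-recipe mean. The frame `cake_sample` is a stand-in built from the ten rows printed earlier, not the full 270-row dataset, so the means are illustrative only.

```python
import pandas as pd

# Stand-in for data("cake"): only the ten rows printed earlier in this post.
cake_sample = pd.DataFrame(
    {
        "replicate": [1] * 5 + [15] * 5,
        "recipe": ["A"] * 5 + ["C"] * 5,
        "temperature": [175, 185, 195, 205, 215, 185, 195, 205, 215, 225],
        "angle": [42, 46, 47, 39, 53, 28, 25, 25, 31, 25],
        "temp": [175, 185, 195, 205, 215, 185, 195, 205, 215, 225],
    },
    index=[1, 2, 3, 4, 5, 266, 267, 268, 269, 270],
)

# Recreate the ordered factor from the documentation: 175 < 185 < ... < 225.
levels = [175, 185, 195, 205, 215, 225]
cake_sample["temperature"] = pd.Categorical(
    cake_sample["temperature"], categories=levels, ordered=True
)

# Mean breakage angle per recipe, over the sample rows only.
mean_angle = cake_sample.groupby("recipe")["angle"].mean()
print(mean_angle)

# Ordered categoricals support comparisons against a level.
hot = cake_sample[cake_sample["temperature"] >= 205]
print(hot[["recipe", "temperature", "angle"]])
```

With pydataset installed, the same steps should apply to the frame returned by `data("cake")`, assuming its columns match the printout above.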