[Python] Collecting Sample Data with the PyDataset Library

One-Line Summary

  • Load sample datasets as easily as you would in R.

Sample Dataset

!pip install pydataset
Collecting pydataset
  Downloading https://files.pythonhosted.org/packages/4f/15/548792a1bb9caf6a3affd61c64d306b08c63c8a5a49e2c2d931b67ec2108/pydataset-0.2.0.tar.gz (15.9MB)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pydataset) (1.1.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2.8.1)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (1.19.5)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2018.9)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->pydataset) (1.15.0)
Building wheels for collected packages: pydataset
  Building wheel for pydataset (setup.py) ... done
  Created wheel for pydataset: filename=pydataset-0.2.0-cp37-none-any.whl size=15939431 sha256=ebe470895a3467fe13c7654021e9108227a6dec8ce6da4f9b4e704520bcd6203
  Stored in directory: /root/.cache/pip/wheels/fe/3f/dc/5d02ccc767317191b12d042dd920fcf3432fab74bc7978598b
Successfully built pydataset
Installing collected packages: pydataset
Successfully installed pydataset-0.2.0
from pydataset import data
print(data())
        dataset_id                                             title
0    AirPassengers       Monthly Airline Passenger Numbers 1949-1960
1          BJsales                 Sales Data with Leading Indicator
2              BOD                         Biochemical Oxygen Demand
3     Formaldehyde                     Determination of Formaldehyde
4     HairEyeColor         Hair and Eye Color of Statistics Students
..             ...                                               ...
752        VerbAgg                  Verbal Aggression item responses
753           cake                 Breakage Angle of Chocolate Cakes
754           cbpp                 Contagious bovine pleuropneumonia
755    grouseticks  Data on red grouse ticks from Elston et al. 2001
756     sleepstudy       Reaction times in a sleep deprivation study
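
With more than 750 entries, the listing is easier to use when filtered by keyword. The following is a minimal sketch, assuming the catalog is a pandas DataFrame with the `dataset_id` and `title` columns shown above. `find_datasets` is a hypothetical helper, not part of pydataset, and the small stand-in catalog is built from rows in the listing so the snippet runs without the package installed.

```python
import pandas as pd


def find_datasets(catalog: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return catalog rows whose title contains keyword (case-insensitive).

    Hypothetical helper for illustration; it is not part of pydataset.
    """
    mask = catalog["title"].str.contains(keyword, case=False, regex=False, na=False)
    return catalog[mask]


# Stand-in catalog copied from rows of the listing above.
# With pydataset installed, you could instead use: catalog = data()
catalog = pd.DataFrame(
    {
        "dataset_id": ["AirPassengers", "BOD", "cake", "sleepstudy"],
        "title": [
            "Monthly Airline Passenger Numbers 1949-1960",
            "Biochemical Oxygen Demand",
            "Breakage Angle of Chocolate Cakes",
            "Reaction times in a sleep deprivation study",
        ],
    }
)

matches = find_datasets(catalog, "cake")
print(matches)
```

The `dataset_id` values in the result are the names to pass to `data(...)`.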
  • Load a dataset by passing its dataset_id to data(); adding show_doc=True also prints the dataset's documentation.
cake = data("cake")
print(cake)

data("cake", show_doc=True)
     replicate recipe  temperature  angle  temp
1            1      A          175     42   175
2            1      A          185     46   185
3            1      A          195     47   195
4            1      A          205     39   205
5            1      A          215     53   215
..         ...    ...          ...    ...   ...
266         15      C          185     28   185
267         15      C          195     25   195
268         15      C          205     25   205
269         15      C          215     31   215
270         15      C          225     25   225
cake

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Breakage Angle of Chocolate Cakes

### Description

Data on the breakage angle of chocolate cakes made with three different
recipes and baked at six different temperatures. This is a split-plot design
with the recipes being whole-units and the different temperatures being
applied to sub-units (within replicates). The experimental notes suggest that
the replicate numbering represents temporal ordering.

### Format

A data frame with 270 observations on the following 5 variables.

`replicate`

a factor with levels `1` to `15`

`recipe`

a factor with levels `A`, `B` and `C`

`temperature`

an ordered factor with levels `175` < `185` < `195` < `205` < `215` < `225`

`angle`

a numeric vector giving the angle at which the cake broke.

`temp`

numeric value of the baking temperature (degrees F).

### Details

The `replicate` factor is nested within the `recipe` factor, and `temperature`
is nested within `replicate`.

### Source

Original data were presented in Cook (1938), and reported in Cochran and Cox
(1957, p. 300). Also cited in Lee, Nelder and Pawitan (2006).

### References

Cook, F. E. (1938) _Chocolate cake, I. Optimum baking temperature_. Master's
Thesis, Iowa State College.

Cochran, W. G., and Cox, G. M. (1957) _Experimental designs_, 2nd Ed. New
York, John Wiley & Sons.

Lee, Y., Nelder, J. A., and Pawitan, Y. (2006) _Generalized linear models with
random effects. Unified analysis via H-likelihood_. Boca Raton, Chapman and
Hall/CRC.

### Examples

    str(cake)
    ## 'temp' is continuous, 'temperature' an ordered factor with 6 levels
    (fm1 <- lmer(angle ~ recipe * temperature + (1|recipe:replicate), cake, REML= FALSE))
    (fm2 <- lmer(angle ~ recipe + temperature + (1|recipe:replicate), cake, REML= FALSE))
    (fm3 <- lmer(angle ~ recipe + temp        + (1|recipe:replicate), cake, REML= FALSE))
    ## and now "choose" :
    anova(fm3, fm2, fm1)
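
The examples in the documentation are written in R and rely on `temperature` being an ordered factor. On the Python side, the printed DataFrame shows `temperature` as plain numbers, so the ordering may need to be restored by hand. The sketch below assumes that is the case and uses pandas only. It does not attempt the mixed-model fits from the R example. `cake_sample` is a stand-in holding the ten rows visible in the printed output above; with pydataset installed you would use `data("cake")` instead.

```python
import pandas as pd

# Ten rows copied from the printed cake output above (rows 1-5 and 266-270).
# This is only a stand-in for the full 270-row DataFrame from data("cake").
cake_sample = pd.DataFrame(
    {
        "replicate": [1, 1, 1, 1, 1, 15, 15, 15, 15, 15],
        "recipe": ["A"] * 5 + ["C"] * 5,
        "temperature": [175, 185, 195, 205, 215, 185, 195, 205, 215, 225],
        "angle": [42, 46, 47, 39, 53, 28, 25, 25, 31, 25],
        "temp": [175, 185, 195, 205, 215, 185, 195, 205, 215, 225],
    }
)

# Recreate the ordered factor described in the Format section:
# 175 < 185 < 195 < 205 < 215 < 225
levels = [175, 185, 195, 205, 215, 225]
cake_sample["temperature"] = pd.Categorical(
    cake_sample["temperature"], categories=levels, ordered=True
)

# Ordered categoricals support comparisons against one of their levels.
hot = cake_sample[cake_sample["temperature"] >= 205]

# A simple descriptive summary: mean breakage angle per recipe.
mean_angle = cake_sample.groupby("recipe")["angle"].mean()

print(hot)
print(mean_angle)
```

The numeric `temp` column is left untouched, so it stays available for anything that needs temperature as a continuous value.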