Finding Missing Values

2021-12-18

Page content

데이터 가져오기

pandas, numpy, matplotlib 라이브러리를 불러온다.
데이터를 불러온다.
- 데이터는 https://ourworldindata.org/coronavirus-source-data 에서 가져왔다. 2020년 6월 1일 기준이다.

import pandas as pd

covidtotals = pd.read_csv("data/covidtotalswithmissings.csv")
print(covidtotals.head())

  iso_code    lastdate     location  total_cases  total_deaths  \
0      AFG  2020-06-01  Afghanistan        15205           257   
1      ALB  2020-06-01      Albania         1137            33   
2      DZA  2020-06-01      Algeria         9394           653   
3      AND  2020-06-01      Andorra          764            51   
4      AGO  2020-06-01       Angola           86             4   

   total_cases_pm  total_deaths_pm  population  pop_density  median_age  \
0         390.589            6.602  38928341.0       54.422        18.6   
1         395.093           11.467   2877800.0      104.871        38.0   
2         214.225           14.891  43851043.0       17.348        29.1   
3        9888.048          660.066     77265.0      163.755         NaN   
4           2.617            0.122  32866268.0       23.890        16.8   

   gdp_per_capita  hosp_beds  
0        1803.987       0.50  
1       11803.431       2.89  
2       13913.839       1.90  
3             NaN        NaN  
4        5819.495        NaN

Missing Value 확인
일부 feature에서 missing value가 있는 것을 확인함.

covidtotals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   iso_code         210 non-null    object 
 1   lastdate         210 non-null    object 
 2   location         210 non-null    object 
 3   total_cases      210 non-null    int64  
 4   total_deaths     210 non-null    int64  
 5   total_cases_pm   209 non-null    float64
 6   total_deaths_pm  209 non-null    float64
 7   population       210 non-null    float64
 8   pop_density      198 non-null    float64
 9   median_age       186 non-null    float64
 10  gdp_per_capita   182 non-null    float64
 11  hosp_beds        164 non-null    float64
dtypes: float64(7), int64(2), object(3)
memory usage: 19.8+ KB

데이터를 크게 두개의 기분으로 분류한다.
- Covid case & Demographic Columns

case_vars = ["location", "total_cases", "total_deaths", "total_cases_pm", "total_deaths_pm"]
demo_vars = ["population", "pop_density", "median_age", "gdp_per_capita", "hosp_beds"]

axis 설정을 통해 인구통계와 Covid Cased의 결측치 값을 설정한다.

covidtotals[demo_vars].isnull().sum(axis=0)

population         0
pop_density       12
median_age        24
gdp_per_capita    28
hosp_beds         46
dtype: int64

covidtotals[case_vars].isnull().sum(axis=0)

location           0
total_cases        0
total_deaths       0
total_cases_pm     1
total_deaths_pm    1
dtype: int64

이번에는 행 방향으로 발생한 결측치를 확인한다.
결측치가 없는 행은 156개이고, 1개만 있는 행은 24개 순으로 집계 되었다.

demovars_misscnt = covidtotals[demo_vars].isnull().sum(axis=1)
demovars_misscnt.value_counts()

0    156
1     24
2     12
3     10
4      8
dtype: int64

인구통계 데이터가 3가지 이상 누락된 국가를 나열한다.
- 5개의 값만 추출했다.

print(covidtotals.loc[demovars_misscnt >= 3, ['location'] + demo_vars].head(5).T)

                     3         5                                24  \
location        Andorra  Anguilla  Bonaire Sint Eustatius and Saba   
population      77265.0   15002.0                          26221.0   
pop_density     163.755       NaN                              NaN   
median_age          NaN       NaN                              NaN   
gdp_per_capita      NaN       NaN                              NaN   
hosp_beds           NaN       NaN                              NaN   

                                    28              64  
location        British Virgin Islands  Faeroe Islands  
population                     30237.0         48865.0  
pop_density                    207.973          35.308  
median_age                         NaN             NaN  
gdp_per_capita                     NaN             NaN  
hosp_beds                          NaN             NaN

이번에는 코로나 사례 데이터에서 누락값을 확인한다.
- 홍콩만 사례가 누락된 것을 확인할 수 있다.

totvars_misscnt = covidtotals[case_vars].isnull().sum(axis=1)
totvars_misscnt.value_counts()

0    209
2      1
dtype: int64

print(covidtotals.loc[totvars_misscnt == 2, ['location'] + case_vars].T)

                        87
location         Hong Kong
location         Hong Kong
total_cases              0
total_deaths             0
total_cases_pm         NaN
total_deaths_pm        NaN

print(covidtotals[covidtotals['location'] == "Hong Kong"])

   iso_code    lastdate   location  total_cases  total_deaths  total_cases_pm  \
87      HKG  2020-05-26  Hong Kong            0             0             NaN   

    total_deaths_pm  population  pop_density  median_age  gdp_per_capita  \
87              NaN   7496988.0     7039.714        44.8        56054.92   

    hosp_beds  
87        NaN

방법 1. Inplace 사용

그러나, 가급적 사용하는 것을 추천하지는 않는다.
- 참조: https://towardsdatascience.com/why-you-should-probably-never-use-pandas-inplace-true-9f9f211849e4

# 결측치 채우기
covidtotals = pd.read_csv("data/covidtotalswithmissings.csv")
covidtotals2 = covidtotals.copy()
covidtotals2[case_vars].isnull().sum(axis = 0)

location           0
total_cases        0
total_deaths       0
total_cases_pm     1
total_deaths_pm    1
dtype: int64

covidtotals2.total_cases_pm.fillna(covidtotals2.total_cases/(covidtotals2.population/10000000), inplace = True)
covidtotals2.total_deaths_pm.fillna(covidtotals2.total_deaths/(covidtotals2.population/10000000), inplace = True)
covidtotals2[case_vars].isnull().sum(axis = 0)

location           0
total_cases        0
total_deaths       0
total_cases_pm     0
total_deaths_pm    0
dtype: int64

방법 2. 매칭을 통한 대체

covidtotals = pd.read_csv("data/covidtotalswithmissings.csv")
covidtotals2 = covidtotals.copy()
covidtotals2[case_vars].isnull().sum(axis = 0)

location           0
total_cases        0
total_deaths       0
total_cases_pm     1
total_deaths_pm    1
dtype: int64

covidtotals2.loc[:, 'total_cases_pm'] = covidtotals2.loc[:, 'total_cases_pm'].fillna(value=covidtotals2.total_cases/(covidtotals.population/10000000))
covidtotals2.loc[:, 'total_deaths_pm'] = covidtotals2.loc[:, 'total_deaths_pm'].fillna(value=covidtotals2.total_deaths/(covidtotals.population/10000000))
covidtotals2[case_vars].isnull().sum(axis = 0)

location           0
total_cases        0
total_deaths       0
total_cases_pm     0
total_deaths_pm    0
dtype: int64

References

Walker, M. (2020). Python Data Cleaning Cookbook: Modern techniques and Python tools to detect and remove dirty data and extract key insights. Packt Publishing.