Finding Missing Values
Page content
데이터 가져오기
- pandas, numpy, matplotlib 라이브러리를 불러온다.
- 데이터를 불러온다.
- 데이터는 https://ourworldindata.org/coronavirus-source-data 에서 가져왔다. 2020년 6월 1일 기준이다.
import pandas as pd
covidtotals = pd.read_csv("data/covidtotalswithmissings.csv")
print(covidtotals.head())
iso_code lastdate location total_cases total_deaths \
0 AFG 2020-06-01 Afghanistan 15205 257
1 ALB 2020-06-01 Albania 1137 33
2 DZA 2020-06-01 Algeria 9394 653
3 AND 2020-06-01 Andorra 764 51
4 AGO 2020-06-01 Angola 86 4
total_cases_pm total_deaths_pm population pop_density median_age \
0 390.589 6.602 38928341.0 54.422 18.6
1 395.093 11.467 2877800.0 104.871 38.0
2 214.225 14.891 43851043.0 17.348 29.1
3 9888.048 660.066 77265.0 163.755 NaN
4 2.617 0.122 32866268.0 23.890 16.8
gdp_per_capita hosp_beds
0 1803.987 0.50
1 11803.431 2.89
2 13913.839 1.90
3 NaN NaN
4 5819.495 NaN
- Missing Value 확인
- 일부 feature에서 missing value가 있는 것을 확인함.
covidtotals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 iso_code 210 non-null object
1 lastdate 210 non-null object
2 location 210 non-null object
3 total_cases 210 non-null int64
4 total_deaths 210 non-null int64
5 total_cases_pm 209 non-null float64
6 total_deaths_pm 209 non-null float64
7 population 210 non-null float64
8 pop_density 198 non-null float64
9 median_age 186 non-null float64
10 gdp_per_capita 182 non-null float64
11 hosp_beds 164 non-null float64
dtypes: float64(7), int64(2), object(3)
memory usage: 19.8+ KB
- 데이터를 크게 두개의 기분으로 분류한다.
- Covid case & Demographic Columns
case_vars = ["location", "total_cases", "total_deaths", "total_cases_pm", "total_deaths_pm"]
demo_vars = ["population", "pop_density", "median_age", "gdp_per_capita", "hosp_beds"]
- axis 설정을 통해 인구통계와 Covid Cased의 결측치 값을 설정한다.
covidtotals[demo_vars].isnull().sum(axis=0)
population 0
pop_density 12
median_age 24
gdp_per_capita 28
hosp_beds 46
dtype: int64
covidtotals[case_vars].isnull().sum(axis=0)
location 0
total_cases 0
total_deaths 0
total_cases_pm 1
total_deaths_pm 1
dtype: int64
- 이번에는 행 방향으로 발생한 결측치를 확인한다.
- 결측치가 없는 행은 156개이고, 1개만 있는 행은 24개 순으로 집계 되었다.
demovars_misscnt = covidtotals[demo_vars].isnull().sum(axis=1)
demovars_misscnt.value_counts()
0 156
1 24
2 12
3 10
4 8
dtype: int64
- 인구통계 데이터가 3가지 이상 누락된 국가를 나열한다.
- 5개의 값만 추출했다.
print(covidtotals.loc[demovars_misscnt >= 3, ['location'] + demo_vars].head(5).T)
3 5 24 \
location Andorra Anguilla Bonaire Sint Eustatius and Saba
population 77265.0 15002.0 26221.0
pop_density 163.755 NaN NaN
median_age NaN NaN NaN
gdp_per_capita NaN NaN NaN
hosp_beds NaN NaN NaN
28 64
location British Virgin Islands Faeroe Islands
population 30237.0 48865.0
pop_density 207.973 35.308
median_age NaN NaN
gdp_per_capita NaN NaN
hosp_beds NaN NaN
- 이번에는 코로나 사례 데이터에서 누락값을 확인한다.
- 홍콩만 사례가 누락된 것을 확인할 수 있다.
totvars_misscnt = covidtotals[case_vars].isnull().sum(axis=1)
totvars_misscnt.value_counts()
0 209
2 1
dtype: int64
print(covidtotals.loc[totvars_misscnt == 2, ['location'] + case_vars].T)
87
location Hong Kong
location Hong Kong
total_cases 0
total_deaths 0
total_cases_pm NaN
total_deaths_pm NaN
print(covidtotals[covidtotals['location'] == "Hong Kong"])
iso_code lastdate location total_cases total_deaths total_cases_pm \
87 HKG 2020-05-26 Hong Kong 0 0 NaN
total_deaths_pm population pop_density median_age gdp_per_capita \
87 NaN 7496988.0 7039.714 44.8 56054.92
hosp_beds
87 NaN
방법 1. Inplace 사용
- 그러나, 가급적 사용하는 것을 추천하지는 않는다.
# 결측치 채우기
covidtotals = pd.read_csv("data/covidtotalswithmissings.csv")
covidtotals2 = covidtotals.copy()
covidtotals2[case_vars].isnull().sum(axis = 0)
location 0
total_cases 0
total_deaths 0
total_cases_pm 1
total_deaths_pm 1
dtype: int64
covidtotals2.total_cases_pm.fillna(covidtotals2.total_cases/(covidtotals2.population/10000000), inplace = True)
covidtotals2.total_deaths_pm.fillna(covidtotals2.total_deaths/(covidtotals2.population/10000000), inplace = True)
covidtotals2[case_vars].isnull().sum(axis = 0)
location 0
total_cases 0
total_deaths 0
total_cases_pm 0
total_deaths_pm 0
dtype: int64
방법 2. 매칭을 통한 대체
covidtotals = pd.read_csv("data/covidtotalswithmissings.csv")
covidtotals2 = covidtotals.copy()
covidtotals2[case_vars].isnull().sum(axis = 0)
location 0
total_cases 0
total_deaths 0
total_cases_pm 1
total_deaths_pm 1
dtype: int64
covidtotals2.loc[:, 'total_cases_pm'] = covidtotals2.loc[:, 'total_cases_pm'].fillna(value=covidtotals2.total_cases/(covidtotals.population/10000000))
covidtotals2.loc[:, 'total_deaths_pm'] = covidtotals2.loc[:, 'total_deaths_pm'].fillna(value=covidtotals2.total_deaths/(covidtotals.population/10000000))
covidtotals2[case_vars].isnull().sum(axis = 0)
location 0
total_cases 0
total_deaths 0
total_cases_pm 0
total_deaths_pm 0
dtype: int64
References
Walker, M. (2020). Python Data Cleaning Cookbook: Modern techniques and Python tools to detect and remove dirty data and extract key insights. Packt Publishing.