Outliers

EDA with Housing Price Prediction - Handling Outliers

강의 홍보

I. 개요

  • 이제 본격적으로 Kaggle 데이터를 활용하여 분석을 진행한다.
  • 데이터는 이미 다운 받은 상태를 전제로 하며, 만약에 데이터가 없다면 이전 포스팅에서 절차를 확인하기 바란다. (미리보기 가능)

II. 구글 드라이브 연동

  • 구글 코랩을 시작하면 언제든지 가장 먼저 해야 하는 것은 드라이브 연동이다.
from google.colab import drive # 패키지 불러오기 
from os.path import join  

ROOT = "/content/drive"     # 드라이브 기본 경로
print(ROOT)                 # print content of ROOT (Optional)
drive.mount(ROOT)           # 드라이브 기본 경로 Mount

MY_GOOGLE_DRIVE_PATH = 'My Drive/Colab Notebooks/inflearn_kaggle/' # 프로젝트 경로
PROJECT_PATH = join(ROOT, MY_GOOGLE_DRIVE_PATH) # 프로젝트 경로
print(PROJECT_PATH)
/content/drive
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
/content/drive/My Drive/Colab Notebooks/inflearn_kaggle/
%cd "{PROJECT_PATH}"
/content/drive/My Drive/Colab Notebooks/inflearn_kaggle
  • 위 코드가 에러 없이 돌아간다면 이제 데이터를 불러올 차례다.
!ls
data  docs  source
  • 필자는 inflearn_kaggle 폴더안에 data, docs, source 등의 하위 폴더를 추가로 만들었다.
  • 즉, data 안에 다운로드 받은 파일이 있을 것이다.

III. 캐글 데이터 수집 및 EDA

  • 우선 데이터를 수집하기에 앞서서 EDA에 관한 필수 패키지를 설치하자.
import pandas as pd
import pandas_profiling
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns

from IPython.core.display import display, HTML
from pandas_profiling import ProfileReport
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
%matplotlib inline
import matplotlib.pylab as plt

plt.rcParams["figure.figsize"] = (14,4)
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.color'] = 'r'
plt.rcParams['axes.grid'] = True 

(1) 데이터 수집

  • 지난 시간에 받은 데이터가 총 4개임을 확인했다.
    • data_description.txt
    • sample_submission.csv
    • test.csv
    • train.csv
  • 여기에서는 우선 test.csv & train.csv 파일을 받도록 한다.
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
print("data import is done")
data import is done

(2) 데이터 확인

  • Kaggle 데이터를 불러오면 우선 확인해야 하는 것은 데이터셋의 크기다.
    • 변수의 갯수
    • Numeric 변수 & Categorical 변수의 개수 등을 파악해야 한다.
  • Point 1 - train데이터에서 굳이 훈련데이터와 테스트 데이터를 구분할 필요는 없다.
    • 보통 Kaggle에서는 테스트 데이터를 주기적으로 업데이트 해준다.
  • Point 2 - 보통 test 데이터의 변수의 개수가 하나 더 작다.
train.shape, test.shape
((1460, 81), (1459, 80))
  • 그 후 train데이터의 상위 5개의 데이터만 확인한다.
display(train.head())
IdMSSubClassMSZoningLotFrontageLotAreaStreetAlleyLotShapeLandContourUtilitiesLotConfigLandSlopeNeighborhoodCondition1Condition2BldgTypeHouseStyleOverallQualOverallCondYearBuiltYearRemodAddRoofStyleRoofMatlExterior1stExterior2ndMasVnrTypeMasVnrAreaExterQualExterCondFoundationBsmtQualBsmtCondBsmtExposureBsmtFinType1BsmtFinSF1BsmtFinType2BsmtFinSF2BsmtUnfSFTotalBsmtSFHeating...CentralAirElectrical1stFlrSF2ndFlrSFLowQualFinSFGrLivAreaBsmtFullBathBsmtHalfBathFullBathHalfBathBedroomAbvGrKitchenAbvGrKitchenQualTotRmsAbvGrdFunctionalFireplacesFireplaceQuGarageTypeGarageYrBltGarageFinishGarageCarsGarageAreaGarageQualGarageCondPavedDriveWoodDeckSFOpenPorchSFEnclosedPorch3SsnPorchScreenPorchPoolAreaPoolQCFenceMiscFeatureMiscValMoSoldYrSoldSaleTypeSaleConditionSalePrice
0160RL65.08450PaveNaNRegLvlAllPubInsideGtlCollgCrNormNorm1Fam2Story7520032003GableCompShgVinylSdVinylSdBrkFace196.0GdTAPConcGdTANoGLQ706Unf0150856GasA...YSBrkr85685401710102131Gd8Typ0NaNAttchd2003.0RFn2548TATAY0610000NaNNaNNaN022008WDNormal208500
1220RL80.09600PaveNaNRegLvlAllPubFR2GtlVeenkerFeedrNorm1Fam1Story6819761976GableCompShgMetalSdMetalSdNone0.0TATACBlockGdTAGdALQ978Unf02841262GasA...YSBrkr1262001262012031TA6Typ1TAAttchd1976.0RFn2460TATAY29800000NaNNaNNaN052007WDNormal181500
2360RL68.011250PaveNaNIR1LvlAllPubInsideGtlCollgCrNormNorm1Fam2Story7520012002GableCompShgVinylSdVinylSdBrkFace162.0GdTAPConcGdTAMnGLQ486Unf0434920GasA...YSBrkr92086601786102131Gd6Typ1TAAttchd2001.0RFn2608TATAY0420000NaNNaNNaN092008WDNormal223500
3470RL60.09550PaveNaNIR1LvlAllPubCornerGtlCrawforNormNorm1Fam2Story7519151970GableCompShgWd SdngWd ShngNone0.0TATABrkTilTAGdNoALQ216Unf0540756GasA...YSBrkr96175601717101031Gd7Typ1GdDetchd1998.0Unf3642TATAY035272000NaNNaNNaN022006WDAbnorml140000
4560RL84.014260PaveNaNIR1LvlAllPubFR2GtlNoRidgeNormNorm1Fam2Story8520002000GableCompShgVinylSdVinylSdBrkFace350.0GdTAPConcGdTAAvGLQ655Unf04901145GasA...YSBrkr1145105302198102141Gd9Typ1TAAttchd2000.0RFn3836TATAY192840000NaNNaNNaN0122008WDNormal250000
......................................................................................................................................................................................................................................................
1455145660RL62.07917PaveNaNRegLvlAllPubInsideGtlGilbertNormNorm1Fam2Story6519992000GableCompShgVinylSdVinylSdNone0.0TATAPConcGdTANoUnf0Unf0953953GasA...YSBrkr95369401647002131TA7Typ1TAAttchd1999.0RFn2460TATAY0400000NaNNaNNaN082007WDNormal175000
1456145720RL85.013175PaveNaNRegLvlAllPubInsideGtlNWAmesNormNorm1Fam1Story6619781988GableCompShgPlywoodPlywoodStone119.0TATACBlockGdTANoALQ790Rec1635891542GasA...YSBrkr2073002073102031TA7Min12TAAttchd1978.0Unf2500TATAY34900000NaNMnPrvNaN022010WDNormal210000
1457145870RL66.09042PaveNaNRegLvlAllPubInsideGtlCrawforNormNorm1Fam2Story7919412006GableCompShgCemntBdCmentBdNone0.0ExGdStoneTAGdNoGLQ275Unf08771152GasA...YSBrkr1188115202340002041Gd9Typ2GdAttchd1941.0RFn1252TATAY0600000NaNGdPrvShed250052010WDNormal266500
1458145920RL68.09717PaveNaNRegLvlAllPubInsideGtlNAmesNormNorm1Fam1Story5619501996HipCompShgMetalSdMetalSdNone0.0TATACBlockTATAMnGLQ49Rec102901078GasA...YFuseA1078001078101021Gd5Typ0NaNAttchd1950.0Unf1240TATAY3660112000NaNNaNNaN042010WDNormal142125
1459146020RL75.09937PaveNaNRegLvlAllPubInsideGtlEdwardsNormNorm1Fam1Story5619651965GableCompShgHdBoardHdBoardNone0.0GdTACBlockTATANoBLQ830LwQ2901361256GasA...YSBrkr1256001256101131TA6Typ0NaNAttchd1965.0Fin1276TATAY736680000NaNNaNNaN062008WDNormal147500

1460 rows × 81 columns