Pandas Filtering

2020-04-03

Data Transformation, Python, Data Preprocessing, Pandas


Overview

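Filtering means extracting only the rows that satisfy a given condition. The condition is evaluated against the value in each row, producing True or False, and the rows that evaluate to True are kept. Conditions are built with comparison operators (>, <, ==, ...).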

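First, let's look at the data. Copy and paste the source code below to load and inspect it. (Reading an .xlsx file with pd.read_excel requires an Excel engine such as openpyxl to be installed.)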

import pandas as pd

url = 'https://github.com/chloevan/datasets/raw/master/entertainment/movie_ticket_sales.xlsx'
sales = pd.read_excel(url)
print(sales.head())

          theater_name                  movie_title ticket_type  \
0     Sumdance Cinemas                Harry Plotter      senior   
1  The Empirical House  10 Things I Hate About Unix       child   
2  The Empirical House         The Seaborn Identity       adult   
3     Sumdance Cinemas  10 Things I Hate About Unix       adult   
4  The Empirical House                Mamma Median!      senior   

   ticket_quantity  
0                4  
1                2  
2                4  
3                2  
4                2

Step 1. Accessing a Single Column

Before a condition can be applied, we first need to access a column. There are several ways to do this; here we use bracket notation:

data['name_of_column']

Here we access the ticket_type column.

print(sales['ticket_type'].head())

0    senior
1     child
2     adult
3     adult
4    senior
Name: ticket_type, dtype: object

From the output above, we can see that the values appear in the order senior, child, adult, and so on.
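Since the post mentions that there are several ways to reach a column, here is a minimal sketch comparing three of them. The small `sales` DataFrame below is a toy stand-in that copies the five rows printed earlier, not the full dataset, and the variable names are only for illustration.

```python
import pandas as pd

# Toy stand-in for the sales data: the five rows shown in the head() output above.
sales = pd.DataFrame({
    "theater_name": ["Sumdance Cinemas", "The Empirical House", "The Empirical House",
                     "Sumdance Cinemas", "The Empirical House"],
    "movie_title": ["Harry Plotter", "10 Things I Hate About Unix", "The Seaborn Identity",
                    "10 Things I Hate About Unix", "Mamma Median!"],
    "ticket_type": ["senior", "child", "adult", "adult", "senior"],
    "ticket_quantity": [4, 2, 4, 2, 2],
})

by_bracket = sales["ticket_type"]       # bracket notation, the style used in this post
by_attribute = sales.ticket_type        # attribute access; only works when the name is a valid Python identifier
by_loc = sales.loc[:, "ticket_type"]    # label-based indexing: all rows, one column

# All three return the same Series.
print(by_bracket.equals(by_attribute) and by_bracket.equals(by_loc))  # True
```

Bracket notation is the safest default because it also works for column names that contain spaces or clash with DataFrame method names.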

Next, a comparison operator is used to evaluate each value as true or false. If we ask for 'senior' only, the child and adult values should come back as False. Let's check.

Step 2. Comparison Operators

The most commonly used comparison operators are listed below.

== (equal to)
!= (not equal to)
< (less than)
> (greater than)
<= (less than or equal to)
>= (greater than or equal to)

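The isin() method can also be used to extract rows that hold specific values, which keeps the code more concise; it is covered in Step 4. For now, we start with a simple equality comparison.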

true_false = sales['ticket_type'] == "senior"
print(true_false.head())

0     True
1    False
2    False
3    False
4     True
Name: ticket_type, dtype: bool

Comparing this with the earlier output, the child, adult, adult values in rows 1 to 3 have all come back as False.
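The other operators in the list work the same way. As a rough sketch, the example below applies != to a text column and >= to a numeric column. The data is a toy copy of the five rows shown earlier, and the names `not_senior` and `at_least_four` are made up for illustration.

```python
import pandas as pd

# Toy stand-in: ticket_type and ticket_quantity of the five rows shown earlier.
sales = pd.DataFrame({
    "ticket_type": ["senior", "child", "adult", "adult", "senior"],
    "ticket_quantity": [4, 2, 4, 2, 2],
})

# != on a text column: True wherever the value is anything other than "senior".
not_senior = sales["ticket_type"] != "senior"

# >= on a numeric column: True wherever at least four tickets were sold.
at_least_four = sales["ticket_quantity"] >= 4

print(not_senior.tolist())     # [False, True, True, True, False]
print(at_least_four.tolist())  # [True, False, True, False, False]
print(at_least_four.sum())     # 2, because True counts as 1 when summed
```

Summing a boolean Series is a quick way to count how many rows satisfy a condition before actually filtering.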

Step 3. Filtering

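Combining column access with a comparison operator, we can now extract the rows that match a condition. The only inconvenience is that the dataset name has to be typed one more time, wrapped around the condition. Let's implement it quickly in code. Again, it is not difficult.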

The condition: extract only the rows where ticket_type is senior.

senior_data = sales[sales['ticket_type'] == "senior"].reset_index(drop = True)
print(senior_data.head())

                      theater_name           movie_title ticket_type  \
0                 Sumdance Cinemas         Harry Plotter      senior   
1              The Empirical House         Mamma Median!      senior   
2              The Empirical House         Mamma Median!      senior   
3                        The Frame         Harry Plotter      senior   
4  Richie's Famous Minimax Theatre  The Seaborn Identity      senior   

   ticket_quantity  
0                4  
1                2  
2                2  
3                2  
4                2
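The filtering line above also calls reset_index(drop = True). A small sketch of why that matters, using a toy copy of the five rows shown at the start of the post (the names `kept` and `renumbered` are illustrative only):

```python
import pandas as pd

# Toy stand-in: the five rows printed by sales.head() earlier in the post.
sales = pd.DataFrame({
    "movie_title": ["Harry Plotter", "10 Things I Hate About Unix", "The Seaborn Identity",
                    "10 Things I Hate About Unix", "Mamma Median!"],
    "ticket_type": ["senior", "child", "adult", "adult", "senior"],
    "ticket_quantity": [4, 2, 4, 2, 2],
})

mask = sales["ticket_type"] == "senior"

# Without reset_index, the surviving rows keep their original labels.
kept = sales[mask]

# With reset_index(drop=True), labels restart at 0 and the old index is discarded.
renumbered = sales[mask].reset_index(drop=True)

print(kept.index.tolist())        # [0, 4]
print(renumbered.index.tolist())  # [0, 1]
```

Whether to reset the index is a judgment call: keeping the original labels makes it easy to trace rows back to the source data, while resetting gives a clean 0..n-1 index for the filtered result.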

Step 4. Using isin()

Now a new requirement comes in: extract the adult rows together with the senior rows. Extracting each group separately is one option, but the code can be expected to get longer. In practice it does. Let's try it.

seniors = sales['ticket_type'] == "senior"
adults = sales['ticket_type'] == "adult"
new_data = sales[seniors | adults].reset_index(drop = True)
print(new_data.head())

          theater_name                  movie_title ticket_type  \
0     Sumdance Cinemas                Harry Plotter      senior   
1  The Empirical House         The Seaborn Identity       adult   
2     Sumdance Cinemas  10 Things I Hate About Unix       adult   
3  The Empirical House                Mamma Median!      senior   
4     Sumdance Cinemas                Harry Plotter       adult   

   ticket_quantity  
0                4  
1                4  
2                2  
3                2  
4                2
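The | operator above combines two boolean Series element by element (logical OR). As a hedged sketch on toy data copied from the first five rows, the example below writes the same OR condition inline and adds an AND condition with &. Each comparison is wrapped in parentheses because | and & bind more tightly than == and >= in Python. The names `either` and `adult_and_big` are illustrative.

```python
import pandas as pd

# Toy stand-in: the five rows printed by sales.head() earlier in the post.
sales = pd.DataFrame({
    "movie_title": ["Harry Plotter", "10 Things I Hate About Unix", "The Seaborn Identity",
                    "10 Things I Hate About Unix", "Mamma Median!"],
    "ticket_type": ["senior", "child", "adult", "adult", "senior"],
    "ticket_quantity": [4, 2, 4, 2, 2],
})

# OR: rows whose ticket_type is senior or adult.
either = (sales["ticket_type"] == "senior") | (sales["ticket_type"] == "adult")

# AND: adult rows that also sold at least four tickets.
adult_and_big = (sales["ticket_type"] == "adult") & (sales["ticket_quantity"] >= 4)

print(either.sum())                                    # 4
print(sales[adult_and_big]["movie_title"].tolist())    # ['The Seaborn Identity']
```

Python's plain `or` and `and` keywords do not work here, because they try to reduce a whole Series to a single True/False value.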

This time, let's use isin().

new_data = sales[sales['ticket_type'].isin(['senior', 'adult'])].reset_index(drop = True)
print(new_data.head())

          theater_name                  movie_title ticket_type  \
0     Sumdance Cinemas                Harry Plotter      senior   
1  The Empirical House         The Seaborn Identity       adult   
2     Sumdance Cinemas  10 Things I Hate About Unix       adult   
3  The Empirical House                Mamma Median!      senior   
4     Sumdance Cinemas                Harry Plotter       adult   

   ticket_quantity  
0                4  
1                4  
2                2  
3                2  
4                2

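The result is identical, but the filter now fits on a single line instead of three. Consider what happens when you need to extract 30 values out of 100 possible ones: without isin(), the same comparison would have to be repeated over and over, and the redundant code would keep growing.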
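To illustrate how isin() scales, the sketch below keeps the wanted values in a list, so adding a value means editing the list rather than writing another comparison. It also shows the ~ operator for the opposite selection. The data is a toy copy of the first five rows, and `wanted`, `selected`, and `excluded` are hypothetical names.

```python
import pandas as pd

# Toy stand-in: ticket_type and ticket_quantity of the five rows shown earlier.
sales = pd.DataFrame({
    "ticket_type": ["senior", "child", "adult", "adult", "senior"],
    "ticket_quantity": [4, 2, 4, 2, 2],
})

# The list can grow to any length without changing the filtering line.
wanted = ["senior", "adult"]

selected = sales[sales["ticket_type"].isin(wanted)].reset_index(drop=True)

# ~ inverts the boolean mask: rows whose ticket_type is NOT in the list.
excluded = sales[~sales["ticket_type"].isin(wanted)].reset_index(drop=True)

print(len(selected), len(excluded))       # 4 1
print(excluded["ticket_type"].tolist())   # ['child']
```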

Conclusion

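So far, we have covered how to approach filtering with Pandas. Real-world work is usually far more complex than this, but the basic principle is the same. Filtering is performed with the true/false results of comparison operators and with isin(), and regular expressions are sometimes used for complex strings. In every case the extraction still relies on a true/false mask, so it is worth practicing with a variety of conditions.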
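The post does not show the regular-expression case, so the following is only a minimal sketch of one common approach, Series.str.contains, on toy data copied from the first five rows. The pattern itself is an arbitrary example chosen for illustration.

```python
import pandas as pd

# Toy stand-in: movie titles of the five rows shown earlier in the post.
sales = pd.DataFrame({
    "movie_title": ["Harry Plotter", "10 Things I Hate About Unix", "The Seaborn Identity",
                    "10 Things I Hate About Unix", "Mamma Median!"],
    "ticket_quantity": [4, 2, 4, 2, 2],
})

# Example pattern: titles that start with "Harry" or contain "Median" anywhere.
pattern = r"^Harry|Median"

# str.contains returns a boolean mask, just like the comparison operators.
mask = sales["movie_title"].str.contains(pattern, regex=True)

print(sales.loc[mask, "movie_title"].tolist())  # ['Harry Plotter', 'Mamma Median!']
```

The result of str.contains is an ordinary boolean Series, so it can be combined with other conditions using & and | in the same way as the masks above.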