ACEA Water, Intro to Time Series Forecasting

Page content

강의 홍보

Overview

Can you build a model to predict the amount of water in each waterbody to help preserve this natural resource? This is an Analytics competition where your task is to create a Notebook that best addresses the Evaluation criteria below. Submissions should be shared directly with host and will be judged by the Acea Group based on how well they addrss:

Methodology/Completeness (min 0 points, max 5 points)

  • Are the statistical models appropriate given the data?
  • Did the author develop one or more machine learning models?
  • Did the author provide a way of assessing the performance and accuracy of their solution?
  • What is the Mean Absolute Error (MAE) of the models?
  • What is the Root Mean Square Error (RMSE) of the models?

Presentation (min 0 points, max 5 points)

  • Does the notebook have a compelling and coherent narrative?
  • Does the notebook contain data visualizations that help to communicate the author’s main points?
  • Did the author include a thorough discussion on the intersection between features and their prediction? For example between rainfall and amount/level of water.
  • Was there discussion of automated insight generation, demonstrating what factors to take into account?
  • Is the code documented in a way that makes it easy to understand and reproduce?
  • Were all external sources of data made public and cited appropriately?

Application (min 0 points, max 5 points)

  • Is the provided model useful/able to forecast water availability in terms of level or water flow in a time interval of the year?
  • Is the provided methodology applicable also on new datasets belong to another waterbody?

1. Problem Definition

  • It’s generally difficult to define problem. Since the goal of this competition is to build a model predict the depth to groundwater or river of an aquifer located in Petrignano, Italy, the problem definition could be related with the future prediction.

1.1 📊 Quick EDA 📈

  • To explore data, let’s import relevant libraries
import numpy as np # Linear Algebra

import pandas as pd # Data Processing, CSV File I/O
import seaborn as sns # Visualisation
import matplotlib.pyplot as plt # Visualisation
from matplotlib.patches import Rectangle # Helper
import os # PATH


PATH = "../input/acea-water-prediction/"

aquifer_auser = pd.read_csv(f"{PATH}Aquifer_Auser.csv")
aquifer_doganella = pd.read_csv(f"{PATH}Aquifer_Doganella.csv")
aquifer_luco = pd.read_csv(f"{PATH}Aquifer_Luco.csv")
aquifer_petrignano = pd.read_csv(f"{PATH}Aquifer_Petrignano.csv")
lake_biliancino = pd.read_csv(f"{PATH}Lake_Bilancino.csv")
river_arno = pd.read_csv(f"{PATH}River_Arno.csv")
water_spring_amiata = pd.read_csv(f"{PATH}Water_Spring_Amiata.csv")
water_spring_lupa = pd.read_csv(f"{PATH}Water_Spring_Lupa.csv")
water_spring_madonna = pd.read_csv(f"{PATH}Water_Spring_Madonna_di_Canneto.csv")
  • To understand each dataset, we will make temp_df dataset, given column category such as Date, Rainfall, Depth, Volume, Hydrometry, Lake_Level, Flow_Rate. Through this, We will know more about the common and difference between datasets.
datasets = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if '.csv' in filename:
            datasets += list([filename])

print("The datasets:", datasets)
datasets_df = pd.DataFrame(columns=['filename'], data=datasets)
datasets_df['waterbody_type'] = datasets_df.filename.apply(lambda x: x.split('_')[0])
print("The datasets_df:", datasets_df)

f, ax = plt.subplots(nrows=1, ncols=2, figsize=(24,6),  gridspec_kw={'width_ratios': [1,2]})
sns.countplot(datasets_df.waterbody_type, palette=['lightblue'], ax= ax[0])
ax[0].set_title('Datasets', fontsize=16)
ax[0].set_ylabel('Number of Datasets', fontsize=14)
ax[0].set_ylim(0, 5)
ax[0].set_xlabel('Waterbody Type', fontsize=14)
ax[0].set_xticklabels(labels=['Aquifer', 'Water Spring', 'Lake', 'River'], fontsize=14)


def get_column_category(x):
    if 'Date' in x:
        return 'Date'
    elif 'Rainfall' in x:
        return 'Rainfall'
    elif 'Depth' in x:
        return 'Depth to Groundwater'
    elif 'Temperature' in x:
        return 'Temperature'
    elif 'Volume' in x:
        return 'Volume'
    elif 'Hydrometry' in x:
        return 'Hydrometry'
    elif 'Lake_Level' in x:
        return 'Lake Level'
    elif 'Flow_Rate' in x:
        return 'Flow Rate'
    else:
        return x

temp_df = pd.DataFrame({'column_name' : aquifer_auser.columns, 'waterbody_type':'Aquifer Auser'})
temp_df = temp_df.append(pd.DataFrame({'column_name' : aquifer_doganella.columns, 'waterbody_type':'Aquifer Doganella'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : aquifer_luco.columns, 'waterbody_type':'Aquifer Luco'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : aquifer_petrignano.columns, 'waterbody_type':'Aquifer Petrignano'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : lake_biliancino.columns, 'waterbody_type':'Lake Biliancino'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : river_arno.columns, 'waterbody_type':'River Arno'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : water_spring_amiata.columns, 'waterbody_type':'Water Spring Amiata'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : water_spring_lupa.columns, 'waterbody_type':'Water Spring Lupa'}))
temp_df = temp_df.append(pd.DataFrame({'column_name' : water_spring_madonna.columns, 'waterbody_type':'Water Spring Madonna'}))
temp_df['column_category'] = temp_df.column_name.apply(lambda x: get_column_category(x))
temp_df = temp_df.groupby('waterbody_type').column_category.value_counts().to_frame()
temp_df.columns = ['counts']
temp_df = temp_df.reset_index(drop=False)
temp_df = temp_df.pivot(index='waterbody_type', columns='column_category')['counts']
temp_df['Number of Features in Dataset'] = temp_df.sum(axis=1)
display(temp_df)

sns.heatmap(temp_df, cmap='Blues', linewidth=1, ax=ax[1], vmin=0, vmax=10, annot=True)

#ax[1].set_ylabel('Waterbody Type', fontsize=16)
ax[1].set_ylabel(' ', fontsize=16)
ax[1].set_xlabel('Column Category', fontsize=16)
ax[1].set_title('Features, Number of Columns and Target Variables', fontsize=16)
ax[1].add_patch(Rectangle((1, 0), 1, 4, fill=False, alpha=1, color='dodgerblue', lw=3))
ax[1].add_patch(Rectangle((2, 4), 1, 1, fill=False, alpha=1, color='dodgerblue', lw=3))
ax[1].add_patch(Rectangle((4, 4), 1, 1, fill=False, alpha=1, color='dodgerblue', lw=3))
ax[1].add_patch(Rectangle((3, 5), 1, 1, fill=False, alpha=1, color='dodgerblue', lw=3))
ax[1].add_patch(Rectangle((2, 6), 1, 3, fill=False, alpha=1, color='dodgerblue', lw=3))
for tick in ax[1].xaxis.get_major_ticks():
    tick.label.set_fontsize(14) 
for tick in ax[1].yaxis.get_major_ticks():
    tick.label.set_fontsize(14)
    #tick.label.set_rotation('horizontal')
    tick.label.set_rotation(45) 
    
plt.show()
The datasets: ['Aquifer_Doganella.csv', 'Aquifer_Auser.csv', 'Water_Spring_Amiata.csv', 'Lake_Bilancino.csv', 'Water_Spring_Madonna_di_Canneto.csv', 'Aquifer_Luco.csv', 'Aquifer_Petrignano.csv', 'Water_Spring_Lupa.csv', 'River_Arno.csv']
The datasets_df:                               filename waterbody_type
0                Aquifer_Doganella.csv        Aquifer
1                    Aquifer_Auser.csv        Aquifer
2              Water_Spring_Amiata.csv          Water
3                   Lake_Bilancino.csv           Lake
4  Water_Spring_Madonna_di_Canneto.csv          Water
5                     Aquifer_Luco.csv        Aquifer
6               Aquifer_Petrignano.csv        Aquifer
7                Water_Spring_Lupa.csv          Water
8                       River_Arno.csv          River


/opt/conda/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
column_category Date Depth to Groundwater Flow Rate Hydrometry Lake Level Rainfall Temperature Volume Number of Features in Dataset
waterbody_type
Aquifer Auser 1.0 5.0 NaN 2.0 NaN 10.0 4.0 5.0 27.0
Aquifer Doganella 1.0 9.0 NaN NaN NaN 2.0 2.0 8.0 22.0
Aquifer Luco 1.0 4.0 NaN NaN NaN 10.0 4.0 3.0 22.0
Aquifer Petrignano 1.0 2.0 NaN 1.0 NaN 1.0 2.0 1.0 8.0
Lake Biliancino 1.0 NaN 1.0 NaN 1.0 5.0 1.0 NaN 9.0
River Arno 1.0 NaN NaN 1.0 NaN 14.0 1.0 NaN 17.0
Water Spring Amiata 1.0 3.0 4.0 NaN NaN 5.0 3.0 NaN 16.0
Water Spring Lupa 1.0 NaN 1.0 NaN NaN 1.0 NaN NaN 3.0
Water Spring Madonna 1.0 NaN 1.0 NaN NaN 1.0 1.0 NaN 4.0

png

  • Both Date and Rainfall appear in all datasets
  • Temperature appear in all datasets except Water Spring Lupa.
  • Target Variables are shared in the right figure.
    • The Rectangle figure indicates target variable - Depth To Groundwater, Flow Rate, Hydrometry Lake Level
  • Since I have no good knowledge of Water Engineering, it’s necessary to define each target variables.
    • Depth to Groundwater: groundwater level detected by the piezometer (m from the ground floor)
    • Flow Rate (water spring/aquifer): flow rate (l/s)
    • Flow Rate (lake): flow rate ($m^3$/s)
    • Hydrometry: groundwater level detected by the hydrometric station (m)
    • Lake Level: river level (m)

1.2 Picking Data ❗

  • Since this Notebook is for newbie studying time series analysis and forecasting, it will begin with one dataset, Lake Biliancino
  • Let’s explore Water Spring Lupa data
Lake_Bilancino = pd.read_csv(f"{PATH}Lake_Bilancino.csv")
display(Lake_Bilancino)
Lake_Bilancino.info()
Date Rainfall_S_Piero Rainfall_Mangona Rainfall_S_Agata Rainfall_Cavallina Rainfall_Le_Croci Temperature_Le_Croci Lake_Level Flow_Rate
0 03/06/2002 NaN NaN NaN NaN NaN NaN 249.43 0.31
1 04/06/2002 NaN NaN NaN NaN NaN NaN 249.43 0.31
2 05/06/2002 NaN NaN NaN NaN NaN NaN 249.43 0.31
3 06/06/2002 NaN NaN NaN NaN NaN NaN 249.43 0.31
4 07/06/2002 NaN NaN NaN NaN NaN NaN 249.44 0.31
... ... ... ... ... ... ... ... ... ...
6598 26/06/2020 0.0 0.0 0.0 0.0 0.0 22.50 250.85 0.60
6599 27/06/2020 0.0 0.0 0.0 0.0 0.0 23.40 250.84 0.60
6600 28/06/2020 0.0 0.0 0.0 0.0 0.0 21.50 250.83 0.60
6601 29/06/2020 0.0 0.0 0.0 0.0 0.0 23.20 250.82 0.60
6602 30/06/2020 0.0 0.0 0.0 0.0 0.0 22.75 250.80 0.60

6603 rows × 9 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6603 entries, 0 to 6602
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Date                  6603 non-null   object 
 1   Rainfall_S_Piero      6026 non-null   float64
 2   Rainfall_Mangona      6026 non-null   float64
 3   Rainfall_S_Agata      6026 non-null   float64
 4   Rainfall_Cavallina    6026 non-null   float64
 5   Rainfall_Le_Croci     6026 non-null   float64
 6   Temperature_Le_Croci  6025 non-null   float64
 7   Lake_Level            6603 non-null   float64
 8   Flow_Rate             6582 non-null   float64
dtypes: float64(8), object(1)
memory usage: 464.4+ KB
  • Missing Value appears in some variables.
Lake_Bilancino = Lake_Bilancino.dropna().reset_index(drop=True)
Lake_Bilancino.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6025 entries, 0 to 6024
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Date                  6025 non-null   object 
 1   Rainfall_S_Piero      6025 non-null   float64
 2   Rainfall_Mangona      6025 non-null   float64
 3   Rainfall_S_Agata      6025 non-null   float64
 4   Rainfall_Cavallina    6025 non-null   float64
 5   Rainfall_Le_Croci     6025 non-null   float64
 6   Temperature_Le_Croci  6025 non-null   float64
 7   Lake_Level            6025 non-null   float64
 8   Flow_Rate             6025 non-null   float64
dtypes: float64(8), object(1)
memory usage: 423.8+ KB
  • Data Type of Date is object, let’s convert it to datetime
Lake_Bilancino['Date_dt'] = pd.to_datetime(Lake_Bilancino.Date, format = '%d/%m/%Y')
  • I have no idea what to do with Rainfall_* variables. So will delete it.
Lake_Bilancino.drop(['Date', 
                     'Rainfall_S_Piero', 
                     'Rainfall_Mangona', 
                     'Rainfall_S_Agata', 
                     'Rainfall_Cavallina', 
                     'Rainfall_Le_Croci'], axis = 1, inplace=True)
display(Lake_Bilancino)
Temperature_Le_Croci Lake_Level Flow_Rate Date_dt
0 6.50 251.21 0.5 2004-01-02
1 4.45 251.28 0.5 2004-01-03
2 2.00 251.35 0.5 2004-01-04
3 0.90 251.37 0.5 2004-01-05
4 2.25 251.42 0.5 2004-01-06
... ... ... ... ...
6020 22.50 250.85 0.6 2020-06-26
6021 23.40 250.84 0.6 2020-06-27
6022 21.50 250.83 0.6 2020-06-28
6023 23.20 250.82 0.6 2020-06-29
6024 22.75 250.80 0.6 2020-06-30

6025 rows × 4 columns

  • Let’s Draw Time Series Data
fig, ax = plt.subplots(nrows = 3, ncols = 1, figsize=(10, 10))
ax[0].plot(Lake_Bilancino['Date_dt'], Lake_Bilancino['Temperature_Le_Croci'])
ax[0].set_title("Temperature_Le_Croci")
ax[1].plot(Lake_Bilancino['Date_dt'], Lake_Bilancino['Lake_Level'])
ax[1].set_title("Lake_Level")
ax[2].plot(Lake_Bilancino['Date_dt'], Lake_Bilancino['Flow_Rate'])
ax[2].set_title("Flow_Rate")

plt.show()

png

Reference Notebook

https://www.kaggle.com/iamleonie/eda-quenching-the-thirst-for-insights