Challenge 1 – Data Cleaning & Outlier Detection for Smart-City Energy

You are part of a smart-city analytics team. The city has installed smart meters in several residential buildings and a public school. The goal is to understand typical and atypical daily energy consumption patterns as a first step towards intelligent demand management.

Learning Objectives

– Perform basic data-quality checks on a real energy dataset.
– Handle missing values from a statistical perspective.
– Transform wide smart-meter data into a tidy daily dataset.
– Use k-means clustering to identify atypical (outlier) days.

Material

Dataset: We use the Household Data package from Open Power System Data (OPSD), 60‑minute resolution.

Package page: https://data.open-power-system-data.org/household_data/2020-04-15/
Direct CSV (60‑minute, single index): https://data.open-power-system-data.org/household_data/2020-04-15/household_data_60min_singleindex.csv
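
A minimal loading sketch, assuming the CSV exposes a timestamp column named utc_timestamp (an assumption to verify against the package page). The helper name load_household_csv is hypothetical; it accepts the direct CSV URL above or a local path. A tiny in-memory sample with made-up values stands in for the real file so the sketch runs offline.

```python
import io
import pandas as pd

def load_household_csv(source, time_col="utc_timestamp"):
    """Read the CSV, parse the timestamp column, and use it as a sorted index."""
    # time_col is an assumed column name; check the actual CSV header.
    df = pd.read_csv(source, parse_dates=[time_col])
    return df.set_index(time_col).sort_index()

# In-memory stand-in for the real file (values are invented for illustration).
sample_csv = io.StringIO(
    "utc_timestamp,DE_KN_residential1_grid_import\n"
    "2016-01-01T00:00:00Z,10.0\n"
    "2016-01-01T01:00:00Z,10.4\n"
    "2016-01-01T02:00:00Z,\n"
)
sample = load_household_csv(sample_csv)

# Inspect the structure: dtypes, non-null counts, and the first rows.
sample.info()
print(sample.head())
```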

Starter notebook to open in Colab: https://drive.google.com/file/d/1cZdDM2zz9dnwd-oUzaczpPgExjSIdLxn/view?usp=sharing

Main Tasks

Load the OPSD 60-minute household data CSV into a pandas DataFrame, parse the timestamp, and inspect the structure.
Perform data-quality checks: compute missing-value ratios, identify problematic columns, and choose a cleaning strategy (e.g. dropping high-NA columns, imputing with forward-fill/backward-fill or interpolation), with justification.
Select several building-level grid-import variables (e.g. DE_KN_residential1_grid_import, DE_KN_residential3_grid_import, DE_KN_residential4_grid_import, DE_KN_public1_grid_import) and aggregate them from hourly to daily energy consumption using resampling.
Transform the data from a wide format to a tidy long format with at least: date, building_id, daily_grid_import_kwh, building_type (e.g. residential or school), and location (e.g. urban or suburban).
Use k-means clustering on daily energy plus simple calendar features (e.g. day_of_week) to identify clusters of days and flag potential outlier days (e.g. the days assigned to the smallest cluster).
Store the resulting daily dataset as energy_daily_features.csv, in your Google Drive or locally, for use in Challenge 3.
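
The cleaning, aggregation, and reshaping tasks above can be sketched as follows. This is one possible approach on synthetic data, not a reference solution. It assumes the grid-import columns hold cumulative meter readings in kWh (check the OPSD documentation; if they already hold per-hour energy, skip the differencing step). The 0.5 missing-value threshold and the building_type and location values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data standing in for the OPSD columns (values are made up).
idx = pd.date_range("2016-01-01", periods=72, freq="h", tz="UTC")
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    {
        # Assumed to be cumulative meter readings in kWh.
        "DE_KN_residential1_grid_import": np.cumsum(rng.uniform(0.2, 1.0, 72)),
        "DE_KN_public1_grid_import": np.cumsum(rng.uniform(1.0, 5.0, 72)),
        "mostly_missing": np.nan,
    },
    index=idx,
)
wide.iloc[10:13, 0] = np.nan  # a short gap to impute

# 1) Missing-value ratio per column; drop columns above a chosen threshold.
na_ratio = wide.isna().mean()
MAX_NA_RATIO = 0.5  # illustrative threshold, not prescribed by the task
clean = wide.loc[:, na_ratio <= MAX_NA_RATIO]

# 2) Fill short gaps by linear interpolation (the samples are equally spaced).
clean = clean.interpolate(limit_direction="both")

# 3) Hourly energy is the difference of the cumulative reading;
#    daily energy is the sum of the hourly values.
hourly_kwh = clean.diff()
daily = hourly_kwh.resample("D").sum(min_count=1)

# 4) Wide -> tidy long format, one row per (date, building).
tidy = (
    daily.rename_axis("date")
    .reset_index()
    .melt(id_vars="date", var_name="building_id",
          value_name="daily_grid_import_kwh")
)

# Hypothetical metadata; confirm building types and locations in the OPSD docs.
metadata = pd.DataFrame(
    {
        "building_id": ["DE_KN_residential1_grid_import",
                        "DE_KN_public1_grid_import"],
        "building_type": ["residential", "school"],
        "location": ["suburban", "urban"],
    }
)
tidy = tidy.merge(metadata, on="building_id", how="left")
print(tidy)
```

Calling tidy.to_csv("energy_daily_features.csv", index=False) would then write the file. Differencing before resampling keeps the hourly series available for later inspection; whichever imputation method is chosen, the justification should state why it suits the gap lengths observed in the real data.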

Deliverables (Challenge 1)

Completed Jupyter/Colab notebook implementing the data cleaning, aggregation, and k-means clustering steps.
Tidy CSV file energy_daily_features.csv containing the daily energy dataset with metadata.
Short written justification (within the notebook) of the data-cleaning decisions and a brief interpretation of the detected outlier days.
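
The outlier days to be interpreted come from the k-means step. A minimal sketch of that step, on synthetic data for a single building, might look like the following. The choice of three clusters, the use of StandardScaler, and treating day_of_week as a plain numeric feature are all simplifying assumptions rather than requirements of the challenge; the two extreme days are injected by hand so that there is something to detect.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic daily data for one building (made-up values, illustration only).
rng = np.random.default_rng(42)
dates = pd.date_range("2016-01-01", periods=60, freq="D")
daily = pd.DataFrame(
    {"date": dates, "daily_grid_import_kwh": rng.normal(12.0, 1.0, 60)}
)
# Two hand-injected atypical days.
daily.loc[[7, 33], "daily_grid_import_kwh"] = [40.0, 42.0]

# Simple calendar feature: 0 = Monday, ..., 6 = Sunday.
daily["day_of_week"] = daily["date"].dt.dayofweek

# Scale the features so that neither dominates the Euclidean distance.
features = daily[["daily_grid_import_kwh", "day_of_week"]]
X = StandardScaler().fit_transform(features)

# n_clusters=3 is an assumed value; in practice compare several choices.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
daily["cluster"] = km.fit_predict(X)

# Treat the days in the smallest cluster as potential outliers.
sizes = daily["cluster"].value_counts()
outlier_cluster = sizes.idxmin()
outlier_days = daily[daily["cluster"] == outlier_cluster]

print(sizes)
print(outlier_days[["date", "daily_grid_import_kwh", "day_of_week"]])
```

On real data the smallest cluster is only a candidate set: it is worth checking whether those days coincide with holidays, meter faults, or imputed gaps before labelling them as atypical.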

Data Centric Smart Experiments

Turning real-world phenomena into data, enabled by the Big Data trend
