Challenge 1 – Data Cleaning & Outlier Detection for Smart-City Energy

You are part of a smart-city analytics team. The city has installed smart meters in several residential buildings and a public school. The goal is to understand typical and atypical daily energy consumption patterns as a first step towards intelligent demand management.

Learning Objectives

– Perform basic data-quality checks on a real energy dataset.
– Handle missing values from a statistical perspective.
– Transform wide smart-meter data into a tidy daily dataset.
– Use k-means clustering to identify atypical (outlier) days.

Material

Dataset: We use the Household Data package from Open Power System Data (OPSD), 60‑minute resolution.

Starter notebook to open in Colab: https://drive.google.com/file/d/1cZdDM2zz9dnwd-oUzaczpPgExjSIdLxn/view?usp=sharing

Main Tasks

  1. Load the OPSD 60-minute household data CSV into a pandas DataFrame, parse the timestamp, and inspect the structure.
  2. Perform data-quality checks: compute missing-value ratios, identify problematic columns, and choose and justify a cleaning strategy (e.g. dropping high-NA columns, or imputing with forward-fill, backward-fill, or interpolation).
  3. Select several building-level grid-import variables (e.g. DE_KN_residential1_grid_import, DE_KN_residential3_grid_import, DE_KN_residential4_grid_import, DE_KN_public1_grid_import) and aggregate them from hourly to daily energy consumption using resampling.
  4. Transform the data from a wide format to a tidy long format with at least: date, building_id, daily_grid_import_kwh, building_type (e.g. residential or school), and location (e.g. urban or suburban).
  5. Apply k-means clustering to daily energy plus simple calendar features (e.g. day_of_week) to group days into clusters, and treat the days in the smallest cluster as potential outlier days.
  6. Store the resulting daily dataset as energy_daily_features.csv, either in your Google Drive or locally, for use in Challenge 3.
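
As a rough orientation for tasks 2 to 4, the following sketch runs the missing-value check, the daily resampling, and the wide-to-long reshape on a small synthetic frame instead of the real OPSD file. Everything in it is an assumption made for illustration: the random values, the mostly_missing_sensor column, the 50 % NA threshold, the location mapping, and the premise that each value is energy used within that hour.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly OPSD table (all values are invented).
rng = np.random.default_rng(0)
idx = pd.date_range("2016-01-01", periods=24 * 14,
                    freq=pd.Timedelta(hours=1), tz="UTC")
wide = pd.DataFrame(
    {
        "DE_KN_residential1_grid_import": rng.uniform(0.2, 1.5, len(idx)),
        "DE_KN_public1_grid_import": rng.uniform(1.0, 6.0, len(idx)),
        "mostly_missing_sensor": np.nan,  # hypothetical column that is all NA
    },
    index=idx,
)
wide.iloc[5:8, 0] = np.nan  # simulate a short three-hour gap

# Data-quality check: share of missing values per column.
na_ratio = wide.isna().mean()

# Drop columns above an assumed 50 % NA threshold, then fill short gaps
# by time-based interpolation; ffill/bfill covers gaps at the edges.
keep_cols = na_ratio[na_ratio <= 0.5].index
clean = wide[keep_cols].interpolate(method="time").ffill().bfill()

# Hourly -> daily totals (valid only if values are per-hour energy).
daily = clean.resample("D").sum()

# Wide -> tidy long format, one row per (date, building).
tidy = (
    daily.rename_axis("date")
    .reset_index()
    .melt(id_vars="date", var_name="building_id",
          value_name="daily_grid_import_kwh")
)
tidy["building_id"] = tidy["building_id"].str.replace(
    "_grid_import", "", regex=False)
tidy["building_type"] = np.where(
    tidy["building_id"].str.contains("public"), "school", "residential")

# Hypothetical location labels; replace with your own justified mapping.
location = {"DE_KN_residential1": "suburban", "DE_KN_public1": "urban"}
tidy["location"] = tidy["building_id"].map(location)
```

Calling tidy.to_csv("energy_daily_features.csv", index=False) would then write the file for task 6. If the meter columns turn out to hold cumulative readings rather than per-hour energy, the daily sum above is not appropriate; in that case the daily value would be the difference between consecutive end-of-day readings.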

Deliverables (Challenge 1)

  • Completed Jupyter/Colab notebook implementing the data cleaning, aggregation, and k-means clustering steps.
  • Tidy CSV file energy_daily_features.csv containing the daily energy dataset with metadata.
  • Short written justification (within the notebook) of the data-cleaning decisions and a brief interpretation of the detected outlier days.
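
To make the outlier step in task 5 concrete, here is a minimal sketch of the smallest-cluster heuristic using scikit-learn. It runs on invented daily totals for a single building, with two deliberately extreme days injected; the choice of three clusters, the feature scaling, and the random seeds are assumptions, not requirements of the challenge.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented daily totals for one building, plus two injected extreme days.
rng = np.random.default_rng(42)
dates = pd.date_range("2016-01-01", periods=60, freq="D")
kwh = rng.normal(20.0, 1.0, len(dates))
kwh[[10, 35]] = [80.0, 85.0]

days = pd.DataFrame({"date": dates, "daily_grid_import_kwh": kwh})
days["day_of_week"] = days["date"].dt.dayofweek  # Monday = 0

# Standardise so that neither feature dominates the Euclidean distance.
features = days[["daily_grid_import_kwh", "day_of_week"]]
X = StandardScaler().fit_transform(features)

# k = 3 is an assumed value; in the notebook, justify your own choice.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
days["cluster"] = km.fit_predict(X)

# Heuristic from the task: the smallest cluster holds the candidate outliers.
cluster_sizes = days["cluster"].value_counts()
outlier_cluster = cluster_sizes.idxmin()
days["is_outlier"] = days["cluster"] == outlier_cluster

print(days.loc[days["is_outlier"], ["date", "daily_grid_import_kwh"]])
```

The smallest cluster is only a candidate set. On real data it is worth checking whether the flagged days coincide with holidays, meter faults, or imputed gaps before interpreting them as unusual consumption.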