You are part of a smart-city analytics team. The city has installed smart meters in several residential buildings and a public school. The goal is to understand typical and atypical daily energy consumption patterns as a first step towards intelligent demand management.
Learning Objectives
- Perform basic data-quality checks on a real energy dataset.
- Handle missing values from a statistical perspective.
- Transform wide smart-meter data into a tidy daily dataset.
- Use k-means clustering to identify atypical (outlier) days.
Material
Dataset: We use the Household Data package from Open Power System Data (OPSD), 60‑minute resolution.
- Package page: https://data.open-power-system-data.org/household_data/2020-04-15/
- Direct CSV (60‑minute, single index): https://data.open-power-system-data.org/household_data/2020-04-15/household_data_60min_singleindex.csv
Starter notebook to open in Colab: https://drive.google.com/file/d/1cZdDM2zz9dnwd-oUzaczpPgExjSIdLxn/view?usp=sharing
Main Tasks
- Load the OPSD 60-minute household data CSV into a pandas DataFrame, parse the timestamp, and inspect the structure.
- Perform data-quality checks: compute missing-value ratios, identify problematic columns, and choose and justify a cleaning strategy (e.g. dropping high-NA columns, or imputing with forward-fill/backward-fill or interpolation).
- Select several building-level grid-import variables (e.g. DE_KN_residential1/3/4_grid_import, DE_KN_public1_grid_import) and aggregate them from hourly to daily energy consumption using resampling.
- Transform the data from a wide format to a tidy long format with at least: date, building_id, daily_grid_import_kwh, building_type (e.g. residential or school), and location (e.g. urban or suburban).
- Use k-means clustering on daily energy plus simple calendar features (e.g. day_of_week) to identify clusters of days, and treat the days in the smallest cluster as potential outlier days.
- Save the resulting daily dataset as energy_daily_features.csv, either in your Google Drive or locally, for use in Challenge 3.
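The cleaning, aggregation, and reshaping tasks above can be sketched as follows. This is a minimal illustration on synthetic data, not a reference solution. The index name utc_timestamp, the 50% missing-value threshold, the values in the metadata table, and the choice of time-based interpolation are all assumptions made for the example. The sketch also assumes that each column holds energy per hour in kWh; if the real columns turn out to be cumulative meter readings, they would need to be differenced (e.g. with .diff()) before summing.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the OPSD CSV: 14 days of hourly values (assumed kWh per hour).
rng = np.random.default_rng(0)
idx = pd.date_range("2017-01-01", periods=14 * 24, freq="h", tz="UTC")
wide = pd.DataFrame(
    {
        "DE_KN_residential1_grid_import": rng.uniform(0.2, 1.5, len(idx)),
        "DE_KN_public1_grid_import": rng.uniform(2.0, 8.0, len(idx)),
        "DE_KN_residential3_grid_import": rng.uniform(0.2, 1.5, len(idx)),
    },
    index=idx,
)
wide.index.name = "utc_timestamp"  # assumed name of the timestamp column

# Inject missing values: one short gap and one mostly-empty column.
wide.iloc[5:8, 0] = np.nan
wide.iloc[:250, 2] = np.nan

# Data-quality check: share of missing values per column.
na_ratio = wide.isna().mean()

# Cleaning strategy (illustrative): drop columns with more than 50% missing,
# interpolate short gaps along the time axis, then fill any remaining edges.
keep = na_ratio[na_ratio <= 0.5].index
clean = wide[keep].interpolate(method="time").ffill().bfill()

# Aggregate hourly values to daily totals.
daily = clean.resample("D").sum()

# Reshape from wide (one column per building) to tidy long format.
tidy = (
    daily.rename_axis("date")
    .reset_index()
    .melt(id_vars="date", var_name="building_id", value_name="daily_grid_import_kwh")
)
tidy["building_id"] = tidy["building_id"].str.replace("_grid_import", "", regex=False)
tidy["date"] = tidy["date"].dt.date

# Hypothetical metadata table; the location values are placeholders.
meta = pd.DataFrame(
    {
        "building_id": ["DE_KN_residential1", "DE_KN_public1"],
        "building_type": ["residential", "school"],
        "location": ["suburban", "urban"],
    }
)
tidy = tidy.merge(meta, on="building_id", how="left")

# To save the result for later use:
# tidy.to_csv("energy_daily_features.csv", index=False)
```

With the real data, the synthetic DataFrame would be replaced by a pd.read_csv call on the 60-minute CSV, with the timestamp column parsed as dates and set as the index.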
Deliverables (Challenge 1)
- Completed Jupyter/Colab notebook implementing the data cleaning, aggregation, and k-means clustering steps.
- Tidy CSV file energy_daily_features.csv containing the daily energy dataset with metadata.
- Short written justification (within the notebook) of the data-cleaning decisions and a brief interpretation of the detected outlier days.
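The clustering step described in the Main Tasks can be sketched as follows, again on synthetic data and under stated assumptions: a single building, three clusters, standardized features, and three artificially inflated days. None of these choices come from the dataset itself, and the number of clusters in particular should be chosen and justified in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic daily series for one building (values are illustrative only).
rng = np.random.default_rng(42)
dates = pd.date_range("2017-01-01", periods=60, freq="D")
daily = pd.DataFrame(
    {"date": dates, "daily_grid_import_kwh": rng.normal(20.0, 2.0, len(dates))}
)
daily.loc[[10, 31, 52], "daily_grid_import_kwh"] = 80.0  # injected atypical days

# Simple calendar feature: Monday=0 ... Sunday=6.
daily["day_of_week"] = daily["date"].dt.dayofweek

# Standardize so that energy and calendar features contribute on a similar scale.
features = daily[["daily_grid_import_kwh", "day_of_week"]]
X = StandardScaler().fit_transform(features)

# k=3 is an arbitrary choice for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
daily["cluster"] = kmeans.fit_predict(X)

# Treat the smallest cluster as the set of candidate outlier days.
cluster_sizes = daily["cluster"].value_counts()
smallest_cluster = cluster_sizes.idxmin()
outlier_days = daily[daily["cluster"] == smallest_cluster]

print(cluster_sizes)
print(outlier_days[["date", "daily_grid_import_kwh", "day_of_week"]])
```

Using day_of_week as a plain number places Sunday (6) far from Monday (0); a weekend flag or one-hot encoding may be a better fit, depending on the patterns in the data. The smallest cluster is only a heuristic for outliers, so the flagged days should still be inspected before interpreting them.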