Challenge 2 – Data Quality, Fairness & SQL Sampling with AIF360

You are designing an AI system that will help a city decide which households should be offered smart-city energy-efficiency support measures (e.g. subsidies or home retrofits), based on a socio-economic dataset. Before training any model, you must evaluate and mitigate potential bias in the data.

This challenge introduces fairness concepts using the Adult Census Income dataset and the IBM AI Fairness 360 (AIF360) library, and connects to data management via SQL-based fair sampling.

Material

To focus on fairness methods, we use the Adult Census Income dataset from the UCI Machine Learning Repository, which is bundled with IBM’s AI Fairness 360 (AIF360) toolkit.

– UCI Adult dataset page: https://archive.ics.uci.edu/dataset/2/adult
– Colab initial notebook: https://drive.google.com/file/d/1nrMxSxZ7pD9oX7uSH5TrZk8pZ_xs16Pl/view?usp=sharing

Think of the label “high income” as a proxy for the ability to invest in energy-saving technologies, and demographic attributes (sex, race, etc.) as potential sources of unfair bias in a smart-city programme.

Learning objectives

– Perform basic quality checks on a socio-economic dataset.
– Use AIF360 to compute fairness metrics for a binary outcome.
– Apply the Reweighing algorithm to reduce bias in the dataset.
– Implement fair sampling strategies in SQL (balanced sampling by group).

Main Tasks

  1. Load the Adult dataset via the AIF360 AdultDataset class and convert it to a pandas DataFrame.
  2. Perform basic data-quality checks: inspect shape, columns, missing values, and descriptive statistics, and comment on potential quality issues.
  3. Define a protected attribute (sex) with privileged and unprivileged groups (Male vs Female) and use the BinaryLabelDatasetMetric class to compute fairness metrics such as statistical parity difference and disparate impact.
  4. Apply the Reweighing preprocessing algorithm to obtain a reweighted dataset that aims to reduce bias; recompute fairness metrics and compare to the original dataset.
  5. Create an in-memory SQLite database from the DataFrame and implement a balanced sampling query using window functions (ROW_NUMBER() OVER (PARTITION BY sex ORDER BY RANDOM())) to build a training sample with equal representation of each sex.
  6. Convert the SQL-balanced sample back into an AIF360 BinaryLabelDataset, and recompute the fairness metrics to compare with the original and reweighted datasets.
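The two metrics in task 3 have simple definitions that are worth verifying by hand before calling AIF360. The sketch below computes them with plain pandas on a tiny synthetic table (the column names and values are illustrative, not the Adult schema); in the notebook, AIF360's BinaryLabelDatasetMetric produces the same quantities on the real data.

```python
import pandas as pd

# Toy data: label 1 = favourable outcome (high income);
# sex: 1 = Male (privileged), 0 = Female (unprivileged).
df = pd.DataFrame({
    "sex":   [1, 1, 1, 1, 0, 0, 0, 0],
    "label": [1, 1, 1, 0, 1, 0, 0, 0],
})

# P(label = 1 | group) for each group.
p_priv = df.loc[df["sex"] == 1, "label"].mean()    # 3/4 = 0.75
p_unpriv = df.loc[df["sex"] == 0, "label"].mean()  # 1/4 = 0.25

# Statistical parity difference: P(y=1 | unprivileged) - P(y=1 | privileged).
spd = p_unpriv - p_priv
# Disparate impact: the ratio of the same two rates
# (values between 0.8 and 1.25 are a commonly used fairness band).
di = p_unpriv / p_priv

print(f"SPD = {spd:.2f}, DI = {di:.2f}")  # SPD = -0.50, DI = 0.33
```

A negative SPD (or a DI well below 1) indicates the unprivileged group receives the favourable label less often.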
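Task 4's Reweighing algorithm (Kamiran & Calders, as implemented in AIF360) assigns each (group, label) cell the weight P(group)·P(label) / P(group, label), so that under the weights the label is statistically independent of the protected attribute. A minimal pandas sketch of that formula, on the same kind of toy table rather than the real Adult data:

```python
import pandas as pd

df = pd.DataFrame({
    "sex":   [1, 1, 1, 1, 0, 0, 0, 0],
    "label": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Marginal probabilities P(sex) and P(label).
p_sex = df["sex"].value_counts(normalize=True)
p_label = df["label"].value_counts(normalize=True)
# Observed joint probability P(sex, label).
p_joint = df.groupby(["sex", "label"]).size() / len(df)

# Reweighing weight per (sex, label) cell: expected-if-independent / observed.
weights = {(s, y): p_sex[s] * p_label[y] / p_joint[(s, y)]
           for s, y in p_joint.index}
df["w"] = [weights[(s, y)] for s, y in zip(df["sex"], df["label"])]

# Weighted favourable rates are now equal across groups.
rates = df.groupby("sex").apply(
    lambda g: (g["label"] * g["w"]).sum() / g["w"].sum())
print(rates)
```

On this toy table the over-represented cells (Male with high income, Female with low income) get weights below 1 and the under-represented cells get weights above 1, pulling both groups' weighted favourable rates to the same value.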
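Task 5's balanced query can be prototyped entirely with the standard library: sqlite3 ships with Python, and the window function requires SQLite 3.25 or newer. The table and column names below mirror the task wording but the rows are a deliberately imbalanced toy example:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adult (sex TEXT, income INTEGER)")
# Imbalanced toy rows: 6 Male, 3 Female.
rows = [("Male", 1)] * 4 + [("Male", 0)] * 2 + [("Female", 1)] + [("Female", 0)] * 2
conn.executemany("INSERT INTO adult VALUES (?, ?)", rows)

# Number the rows within each sex in random order, then keep the first k rows
# of each group, where k is the size of the smallest group. The result has
# equal representation of each sex.
sample = conn.execute("""
    WITH ranked AS (
        SELECT sex, income,
               ROW_NUMBER() OVER (PARTITION BY sex ORDER BY RANDOM()) AS rn
        FROM adult
    )
    SELECT sex, income
    FROM ranked
    WHERE rn <= (SELECT MIN(c)
                 FROM (SELECT COUNT(*) AS c FROM adult GROUP BY sex))
""").fetchall()

print(Counter(s for s, _ in sample))  # equal counts per sex
```

In the notebook the same query runs against the table created from the Adult DataFrame (e.g. via `DataFrame.to_sql`), and the sampled rows are then wrapped back into an AIF360 dataset for task 6.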

Deliverables

  • Completed Jupyter/Colab notebook implementing fairness metrics with AIF360 and SQL-based fair sampling.
  • Short written comparison of fairness metrics for three cases: original dataset, reweighted dataset, and SQL-balanced sample.
  • Brief reflection (within the notebook) on how these fairness techniques could be applied in energy-related decision-making (e.g. targeting energy-efficiency support).