Content

Part I (5 hours)

Introduction [ pdf ]

  • Principle and objective
  • Dealing with huge data collections
  • Practical exercise 1: thinking in map-reduce by counting words with Pig

Part II (10 hours)

Summarization patterns [ pdf ]

  • Numerical summarization
  • Inverted index summarization
  • Counting with counters

Filtering patterns

  • Filtering
  • Bloom filtering
  • Top ten
  • Distinct

Part III (12 hours)

Data organization patterns

  • Structured to hierarchical
  • Partitioning
  • Binning
  • Total order sorting
  • Shuffling

Part IV (12 hours)

Join patterns

  • Do you remember how to program joins?
  • Reduce side join
  • Replicated join
  • Composite join
  • Cartesian product

Part V (4 hours)

Wrapping up: analysis, limitations, perspectives [ pdf ]

  • Discussion material [ pdf ]