Intro to Hadoop and MapReduce

Notes I took during taking the Udacity course Introduction to Hadoop and MapReduce


##Lesson 1: Big Data

###3Vs of big data:

  • Volume
  • Variety
  • Velocity

Proof of Induction and DP

###Proof of induction

Try sometimes induction on portion of the whole problem.

Induction can be only a component of the whole proof, try simplify the problem then use induction as the last step to close the loop.

###Proof of DP

####3M: Meaning, Method, Meat

  • Meaning of each entry in the table
  • Method to fill the table (and sequence)
  • Meat: how to get the result (post-processing)

####Pattern(Must say in prooving DP)

  1. Consider an entry in the table
  2. Consider the optimal way to get there
  3. Consider the last choice you made in an optimal solution to get there

Competitive Analysis and Online Algorithm

###Competitive Analysis

OPT is not possible (eg. you cannot predict the future), but you want to be as close to OPT as possible. How do you benchmark it?- Competitve Analysis!

Try to performance within some constant range of OPT no matter how bad things were. (Try to perform as good as wizard no matter how evil the devil is.)

###Online Algorithm

You have to make decisions on the fly. And you have no idea what’s coming next.

###Algorithm

  • Ski Rental

    • Buy(500) or rent(50)?

    • The first 9 visits, rent; after 10th, buy.

  • Multiplicative weights update

    • Wrong experts get punished.

    • Try to be within a factor of the best single expert.

  • LRU

    The least recently used gets thrown out when the cache is full.

Map of Computer

###Components

  • Memory: size, latency, bandwidth
  • Cache
  • Cpu Pipeline - branch prediction

Latency Numbers Every Programmer Should Know

###Example

####Binary search vs. Linear search

#####Binary search:

if a[i] < key:
    go left
else:
    go right

The if statement is likely to fail

Runtime:

is the cost of incorrect prediction

#####Linear search

for i = 1 : n
    if a[i] = key:
        return i;

The if statement will only fail 1 time when it actually find the element. (Assumption: if branch prediction act like we want)

Runtime:


Chart showing Binary vs Linear search Picture source

Divide and Conquer

###Features:

  • Divide
  • Conquer
  • Merge

The time complexity for each part varies by different problems.

###Algorithms:

  1. K-selection problem(similar to quicksort)

    1. Select a random element x
    2. Count how many elements are less than x
    3. Decide which part of list to search

    De-randomize: divide into group of 5, pivot is median of medians

  2. FFT

    Group even/odd columns together and reduce the computation.

  3. K-th smallest in 2 sorted lists

    Binary search on 2 lists, knock out 1/4 each time.

  4. Find the closest pair in a plane

    Click me