Final Exam · Volume I

The crash course
revision sheet for CS-404.

Tight summaries of every lecture in the syllabus — core concepts, formulas, code patterns, and the kind of unseen scenarios the instructor is likely to ask. Optimized for the night-before grind.

Instructor Notes

All lectures are important — but you may skip Lectures 4–5 (Finding Similar Items) from the pre-mid portion.
Code in lecture slides is part of the syllabus. Expect code snippets paired with conceptual / analytical questions.
20–25% of HDFS & Spark questions are unseen scenarios — same concepts, new datasets and setups.

Week01

Introduction to Big Data Analytics

Foundation

Core. Big Data is data so large, fast, or messy that traditional DBMSs can’t handle it cost-effectively. Defined by the Five Vs: Volume, Variety, Velocity, Veracity, Value.

Key Points

5 Vs — Volume, Variety, Velocity, Veracity, Value.
Data types — Structured (~20%, RDBMS), Semi-structured (XML/JSON/HTML), Unstructured (audio/video/text, ~80%).
Sources — Machine (IoT/sensors), User (social media), Organization (transactional, scientific).
Why DBMS fails — schema rigidity, vertical-scaling limits, cost per TB, poor fit for unstructured data.
Stack — HDFS + MapReduce → Hive/Pig → Spark (≈100× faster, in-memory) → Python (Pandas, scikit-learn) + NoSQL (Mongo, Cassandra, HBase).
Applications — recommendations, fraud detection, real-time monitoring, churn prediction.

Likely Scenario Stems

Given a workload (e.g., Twitter firehose), identify which V dominates and justify tool choice.
Classify a dataset as structured / semi-structured / unstructured with reasoning.
Why is Spark preferred over Hadoop MapReduce for iterative ML?

Week02

Exploratory Data Analysis

Foundation

Core. EDA is the systematic probing of raw data with statistics and visuals to understand distributions, outliers, and relationships before any modeling.

Key Points

Univariate — distribution of one variable (histogram, box plot, mean / median / mode).
Bivariate — relationships between pairs (scatter, correlation, covariance).
Multivariate — 3+ variables (pair plots, heatmaps, PCA).
Missing data — detect → drop / impute (mean, median, forward-fill, model-based).
Outliers — IQR rule, z-score, visual (box plot whiskers).
Skew & spread — variance, std dev, quartiles, skewness; choose transforms (log, sqrt).

Formulas

Variance σ² = Σ(xᵢ − μ)² / n

z-score z = (x − μ) / σ

IQR Q3 − Q1 · outlier if x < Q1 − 1.5·IQR or x > Q3 + 1.5·IQR

Pearson r Σ((xᵢ−x̄)(yᵢ−ȳ)) / √(Σ(xᵢ−x̄)² · Σ(yᵢ−ȳ)²)

Code Idioms (Pandas)

EDA in one breathimport pandas as pd
df = pd.read_csv("data.csv")
df.info(); df.describe()           # shape, types, summary stats
df.isnull().sum()                  # missing per column
df.corr()                          # pairwise correlation
df.boxplot(column="price")        # spot outliers

Likely Scenario Stems

A column has 30% nulls — choose an imputation strategy and justify.
Compute IQR-based outlier bounds for a small dataset.
Interpret a correlation matrix — which feature pair would you drop and why?

Week03

Frequent Itemset Mining

Algorithmic

Core. Find sets of items that co-occur often in transactions, then derive association rules (X ⇒ Y) — the engine behind market-basket analysis.

Key Points

Apriori principle — any superset of an infrequent set is also infrequent (monotonicity) → prune aggressively.
Apriori algorithm — level-wise: generate Cₖ → scan DB → keep frequent Lₖ → join for Cₖ₊₁.
PCY — hash candidate pairs into buckets in pass 1; infrequent buckets eliminate many candidates before pass 2.
FP-Growth — build compressed FP-tree; mine via conditional pattern bases; no candidate generation → much faster.
Rules — filter frequent sets by confidence and lift.

Formulas

Support(X) = |{T : X ⊆ T}| / |T|

Confidence(X⇒Y) = Support(X ∪ Y) / Support(X)

Lift(X⇒Y) = Confidence(X⇒Y) / Support(Y) · >1 = positively correlated

Likely Scenario Stems

Given 8 transactions, compute support / confidence / lift for {Bread} ⇒ {Butter}.
Apply Apriori pruning over two passes; list the surviving candidates.
Why does PCY beat Apriori when memory is tight on pass 1?
Trace FP-tree construction for a small transaction set.

Week06

Graph Mining & PageRank

Algorithmic

Core. PageRank ranks nodes in a directed graph by simulating a random surfer. A page is important if many important pages link to it.

Key Points

Random surfer — at each step follow a random outlink; PageRank = stationary probability of being on that page.
Dead ends (no outlinks) leak rank → redistribute uniformly.
Spider traps (cycles with no exit) trap all rank → fix with teleportation.
Teleport β ≈ 0.85 — with prob (1−β) jump to a random page; handles both pathologies.
Power iteration — repeat r ← βMr + (1−β)/N until ‖rₜ₊₁ − rₜ‖₁ < ε (usually 20–30 iters).
Topic-sensitive PR — restrict teleport set to a topic → personalized rankings.

Formulas

Flow rⱼ = Σᵢ→ⱼ rᵢ / dᵢ · dᵢ = out-degree of i

Matrix r = M · r · Mᵢⱼ = 1/dⱼ if j→i else 0

With teleport rⱼ = β · Σᵢ→ⱼ rᵢ/dᵢ + (1−β)/N

Likely Scenario Stems

Given a 4-node graph, compute 2 iterations of PageRank by hand.
Identify dead ends and spider traps; explain how teleportation fixes each.
How does varying β between 0.8 and 0.95 affect convergence and final ranks?

Week07

Stream Data Processing

Algorithmic

Core. Streams are unbounded and seen once. Use space-efficient probabilistic sketches to approximate set membership, frequency, and counts.

Key Points

Constraints — single pass, bounded memory, respond before the next item arrives.
Bloom filter — k hash functions set k bits in an m-bit array; query returns "maybe" or "definitely not". No false negatives.
Count-Min sketch — d × w counter table; on update increment d cells; query = min across the d cells (over-estimates only).
Reservoir sampling — keep a uniform random sample of size k from a stream of unknown length.
DGIM — count 1s in last N bits of a binary stream in O(log²N) space.
Windows — tumbling, sliding, session; choose based on freshness vs cost.

Formulas

Bloom FP rate (1 − e^(−kn/m))^k

Optimal k k* = (m/n) · ln 2

Likely Scenario Stems

Size a Bloom filter for 1M URLs with ≤ 1% false positive rate — pick m and k.
Run a Count-Min sketch over a short stream and answer a frequency query.
Justify reservoir sampling vs storing the full stream for a given use case.

Week10

Recommender Systems

Algorithmic

Core. Predict user–item ratings using item features (content-based) or behavioral similarity across users / items (collaborative filtering).

Key Points

Content-based — build item feature vectors (TF-IDF, attributes); recommend similar items to user's history. No new-item cold start.
User-user CF — find k nearest users; predict from their ratings.
Item-item CF — find items similar to ones the user liked; usually more stable than user-user.
Cold start — new user (content / popularity), new item (content), new system (surveys / hybrid).
Evaluation — RMSE / MAE for ratings; precision@k, recall@k for top-N.
Hybrid & latent-factor (SVD / MF) — handle sparsity; scale better than memory-based CF.

Formulas

TF-IDF w_{ij} = TF_{ij} · log(|D| / df_i)

Cosine sim(x,y) = (x·y) / (‖x‖ · ‖y‖)

Pearson Σ((xᵢ−x̄)(yᵢ−ȳ)) / √(Σ(xᵢ−x̄)² · Σ(yᵢ−ȳ)²)

Predicted rating r̂_{u,i} = Σ_{j∈N(i;u)} sim(i,j)·r_{u,j} / Σ |sim(i,j)|

Likely Scenario Stems

Given a user–item matrix, compute cosine similarity between two users and predict a missing rating.
A platform launches with new users and items — design the recommender to mitigate cold start.
Choose between user-user and item-item CF for a sparse retail dataset; justify.

Week11

Hadoop & MapReduce

20–25% Scenario-based

Core. MapReduce splits work into independent map tasks (transform) and reduce tasks (aggregate), connected by a shuffle. Hadoop runs it atop HDFS for fault tolerance and scale.

Key Points

Phases — Input split → Map → (Combine) → Partition → Shuffle & Sort → Reduce → Output.
Map emits (key, value); Reduce sees all values for one key (grouped, sorted).
Combiner — mini reducer on the mapper's node; cuts shuffle traffic. Must be associative + commutative.
Partitioner — decides reducer per key (default = hash(key) mod R).
Data locality — schedule map task on the node already holding the block.
Fault tolerance — failed tasks rerun on healthy nodes; HDFS replicates blocks 3×.
Classic example — WordCount.

WordCount in MapReduce

Pseudocodemap(key, line):
  for word in line.split():
    emit(word, 1)

reduce(word, counts):
  emit(word, sum(counts))

Java Mapper / Reducer (Hadoop API)public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable k, Text v, Context ctx) {
    for (String w : v.toString().split("\\s+"))
      ctx.write(new Text(w), new IntWritable(1));
  }
}

public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text k, Iterable<IntWritable> vs, Context ctx) {
    int sum = 0;
    for (IntWritable v : vs) sum += v.get();
    ctx.write(k, new IntWritable(sum));
  }
}

Likely Scenario Stems

Keys are skewed (one word dominates) — how do reducers behave, and how do you fix it (combiner, custom partitioner, salting)?
Design map + reduce for: max temperature per year from weather logs / average rating per movie / inverted index.
Walk through the lifecycle of a single (key, value) pair from input split to final output.
Does a combiner always produce the same result as without one? Why / why not?

Week12

HDFS & Hadoop Architecture

20–25% Scenario-based

Core. HDFS splits files into large blocks, replicates them across DataNodes, and tracks everything via a single NameNode. Built for high-throughput sequential reads on commodity hardware.

Key Points

NameNode — master; holds metadata (namespace, block locations) in RAM. Single point of failure (mitigated by Standby NN).
DataNode — slaves; store actual blocks; send heartbeats + block reports to NN.
Block size — default 128 MB (was 64). Larger ⇒ less metadata, better sequential I/O, fewer map tasks.
Replication factor — default 3. Rack-aware: 1st local, 2nd different rack, 3rd same rack as 2nd → balance fault tolerance vs network cost.
Read path — Client asks NN for block locations → reads blocks from nearest DataNode in parallel.
Write path — Client → NN allocates blocks → Client streams to DN pipeline (DN1 → DN2 → DN3) → ack reverses.
Failure handling — missed heartbeat ⇒ DN marked dead ⇒ NN re-replicates lost blocks.

Driver Skeleton

Hadoop Job Driver (Java)Job job = Job.getInstance(conf, "wordcount");
job.setJarByClass(WordCount.class);
job.setMapperClass(WCMapper.class);
job.setCombinerClass(WCReducer.class);   // reuse reducer as combiner
job.setReducerClass(WCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);

HDFS Shell — basicshdfs dfs -mkdir /user/data
hdfs dfs -put local.txt /user/data/
hdfs dfs -ls /user/data
hdfs dfs -cat /user/data/local.txt
hdfs fsck /user/data/local.txt -files -blocks -locations

Likely Scenario Stems

A 1 GB file is written with RF=3 and block size 128 MB — how many blocks? How many replicas? Place them on 2 racks of 5 nodes.
The NameNode crashes — what happens to running jobs and to data? How does HA recover?
Why is HDFS bad for many small files? How would you mitigate it (HAR, SequenceFile, HBase)?
A DataNode goes silent for 10 minutes — describe NN's reaction step by step.

Week13

YARN & Cluster Resource Mgmt

20–25% Scenario-based

Core. YARN decouples resource management from job execution, so the same Hadoop cluster can run MapReduce, Spark, Flink, etc. side by side.

Key Points

ResourceManager (RM) — global; Scheduler + ApplicationsManager. Allocates containers.
NodeManager (NM) — per-node; launches/monitors containers, reports resources to RM.
ApplicationMaster (AM) — per-job; negotiates resources with RM, schedules tasks on NMs, handles task failures.
Container — bundle of (CPU, RAM) on a node; isolated JVM process per task.
Job flow — Client → RM (submit) → RM launches AM container → AM requests task containers → NMs run tasks → AM tracks → exit.
Schedulers — FIFO, Capacity (queues + guaranteed share), Fair (equal share over time).
vs Hadoop v1 — old JobTracker did scheduling AND coordination → bottleneck. YARN splits these → 4k+ nodes, multi-tenant.

Likely Scenario Stems

A cluster runs both Spark and MapReduce — explain how YARN allocates resources fairly using the Capacity Scheduler.
An ApplicationMaster crashes mid-job — does the job fail? Walk through recovery.
Container OOM-killed by NM — what does the AM do next? When does the whole job fail?
Compare JobTracker / TaskTracker vs RM / AM / NM — why was the redesign needed?

Week14

Spark Core — RDDs & Architecture

20–25% Scenario-based

Core. Spark is an in-memory distributed engine. Data is an RDD — immutable, partitioned, lineage-tracked. Transformations are lazy; actions trigger execution.

Key Points

RDD — Resilient (lineage-recoverable), Distributed (partitions across executors), Dataset (immutable collection).
Transformations (lazy) — map, filter, flatMap, groupByKey, reduceByKey, join, union, distinct.
Actions (eager) — collect, count, take, reduce, saveAsTextFile, foreach.
Narrow dep. — 1 parent partition → 1 child (map, filter); no shuffle.
Wide dep. — many parents → 1 child (groupByKey, join); shuffle, stage boundary.
Lineage — DAG of parents + transforms; on partition loss, recompute from lineage — no replication needed.
Architecture — Driver (DAG, scheduler) → Cluster Manager (YARN/Mesos/Standalone) → Executors (tasks, cached data).
Persistence — cache() = MEMORY_ONLY; persist(level) for disk / serialized / replicated.
vs MapReduce — keeps intermediate data in RAM ⇒ 10–100× faster for iterative / interactive workloads.
DataFrame — RDD + schema; optimized by Catalyst, executed via Tungsten.

Code You Should Recognize

PySpark — WordCount with RDDfrom pyspark import SparkContext
sc = SparkContext("local", "wc")

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("out")

DataFrame API + SQLfrom pyspark.sql import SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.read.json("orders.json")
df.filter("amount > 100") \
  .groupBy("user_id") \
  .count() \
  .orderBy("count", ascending=False) \
  .show()

df.createOrReplaceTempView("orders")
spark.sql("SELECT user_id, COUNT(*) FROM orders GROUP BY user_id").show()

Lineage & cachingrdd = sc.textFile("big.log").filter(lambda x: "ERROR" in x)
rdd.cache()                       # keep in memory across actions
rdd.count()                       # 1st action — materializes & caches
rdd.filter(lambda x: "DB" in x).count()  # reuses cached

Likely Scenario Stems

Given a Spark pipeline, mark each transformation as narrow / wide and identify stage boundaries.
An executor dies — does the job restart? What does Spark do with the lost partitions? Why is replication unnecessary?
Why is reduceByKey preferred over groupByKey for aggregation? Sketch what each shuffles.
When should you call .cache()? Give a wrong placement and a right placement.
Convert a given MapReduce WordCount into PySpark with the same semantics.

Week15

Spark Execution & Catalyst

20–25% Scenario-based

Core. Spark turns your DataFrame / SQL into an optimized physical plan via the Catalyst optimizer, then runs it as a DAG of stages and tasks. Lazy evaluation lets it optimize before executing.

Key Points

Catalyst pipeline — Unresolved Logical → Analyzed (catalog resolves names) → Optimized Logical (predicate pushdown, projection pruning, constant folding) → Physical Plans → Cost-based selection.
Tungsten — whole-stage code generation, off-heap memory, columnar in-memory format; avoids JVM virtual calls.
Stages — split by shuffle boundaries; tasks within a stage = #partitions, run in parallel.
Joins — Broadcast (small side ≤ spark.sql.autoBroadcastJoinThreshold, default 10 MB), Sort-Merge (both large + sorted), Shuffle-Hash.
Partitioning — repartition(n) (full shuffle, even sizes), coalesce(n) (narrow, shrink only). Tune to avoid skew & small files.
Inspect plans — df.explain(True) shows logical + physical plan; look for Exchange (= shuffle).
Adaptive Query Execution (AQE) — re-optimizes stages at runtime using actual stats.

Optimization Patterns

Broadcast a small dim tablefrom pyspark.sql.functions import broadcast
result = orders.join(broadcast(countries), "country_id")

Predicate pushdown / projection pruning — for freedf = spark.read.parquet("events")
df.filter("day = '2026-05-18'").select("user", "event").show()
# Spark reads only matching files + only those 2 columns

Likely Scenario Stems

Read an explain() plan — locate the shuffle and propose a rewrite (broadcast, filter earlier, repartition).
One reducer is much slower than the rest — diagnose the skew and fix it.
Joining a 5 GB fact table with a 5 MB dim table — choose & justify the join strategy.
How does Catalyst optimize SELECT a FROM t WHERE b = 1 over Parquet? Walk through plan stages.
repartition(200) vs coalesce(200) — when do you pick each?

Final-Stretch Strategy

A few high-leverage moves for the night before, ranked by ROI.

Memorize the formula box

Support / confidence / lift, PageRank, cosine, Bloom FP rate, TF-IDF — these come up almost verbatim.

ii.

Re-draw HDFS & YARN

Be able to sketch NameNode/DataNode and RM/AM/NM/Container from memory in 60 seconds.

iii.

Narrow vs wide

For any Spark transformation, instantly know if it shuffles. That alone unlocks half the scenario questions.

iv.

WordCount in three forms

MapReduce pseudocode, Java mapper/reducer, PySpark one-liner. One of these almost always appears.

Practice scenarios cold

For HDFS / Spark, attempt the Q stems on each card with slides closed — then check.

vi.

Skip wisely

Weeks 4–5 (Finding Similar Items) are officially droppable. Don't burn revision time there.

Pre-Mid

Post-Mid

Exam

Introduction to Big Data Analytics

Key Points

Likely Scenario Stems

Exploratory Data Analysis

Key Points

Formulas

Code Idioms (Pandas)

Likely Scenario Stems

Frequent Itemset Mining

Key Points

Formulas

Likely Scenario Stems

Graph Mining & PageRank

Key Points

Formulas

Likely Scenario Stems

Stream Data Processing

Key Points

Formulas

Likely Scenario Stems

Recommender Systems

Key Points

Formulas

Likely Scenario Stems

Hadoop & MapReduce

Key Points

WordCount in MapReduce

Likely Scenario Stems

HDFS & Hadoop Architecture

Key Points

Driver Skeleton

Likely Scenario Stems

YARN & Cluster Resource Mgmt

Key Points

Likely Scenario Stems

Spark Core — RDDs & Architecture

Key Points

Code You Should Recognize

Likely Scenario Stems

Spark Execution & Catalyst

Key Points

Optimization Patterns

Likely Scenario Stems

Final-Stretch Strategy

Memorize the formula box

Re-draw HDFS & YARN

Narrow vs wide

WordCount in three forms

Practice scenarios cold

Skip wisely