The crash course
revision sheet for CS-404.
Tight summaries of every lecture in the syllabus — core concepts, formulas, code patterns, and the kind of unseen scenarios the instructor is likely to ask. Optimized for the night-before grind.
Instructor Notes
- All lectures are important — but you may skip Lectures 4–5 (Finding Similar Items) from the pre-mid portion.
- Code in lecture slides is part of the syllabus. Expect code snippets paired with conceptual / analytical questions.
- 20–25% of HDFS & Spark questions are unseen scenarios — same concepts, new datasets and setups.
Introduction to Big Data Analytics
Core. Big Data is data so large, fast, or messy that traditional DBMSs can’t handle it cost-effectively. Defined by the Five Vs: Volume, Variety, Velocity, Veracity, Value.
Key Points
- 5 Vs — Volume, Variety, Velocity, Veracity, Value.
- Data types — Structured (~20%, RDBMS), Semi-structured (XML/JSON/HTML), Unstructured (audio/video/text, ~80%).
- Sources — Machine (IoT/sensors), User (social media), Organization (transactional, scientific).
- Why DBMS fails — schema rigidity, vertical-scaling limits, cost per TB, poor fit for unstructured data.
- Stack — HDFS + MapReduce → Hive/Pig → Spark (≈100× faster, in-memory) → Python (Pandas, scikit-learn) + NoSQL (Mongo, Cassandra, HBase).
- Applications — recommendations, fraud detection, real-time monitoring, churn prediction.
Likely Scenario Stems
- Given a workload (e.g., Twitter firehose), identify which V dominates and justify tool choice.
- Classify a dataset as structured / semi-structured / unstructured with reasoning.
- Why is Spark preferred over Hadoop MapReduce for iterative ML?
Exploratory Data Analysis
Core. EDA is the systematic probing of raw data with statistics and visuals to understand distributions, outliers, and relationships before any modeling.
Key Points
- Univariate — distribution of one variable (histogram, box plot, mean / median / mode).
- Bivariate — relationships between pairs (scatter, correlation, covariance).
- Multivariate — 3+ variables (pair plots, heatmaps, PCA).
- Missing data — detect → drop / impute (mean, median, forward-fill, model-based).
- Outliers — IQR rule, z-score, visual (box plot whiskers).
- Skew & spread — variance, std dev, quartiles, skewness; choose transforms (log, sqrt).
Formulas
Code Idioms (Pandas)
EDA in one breathimport pandas as pd
df = pd.read_csv("data.csv")
df.info(); df.describe() # shape, types, summary stats
df.isnull().sum() # missing per column
df.corr() # pairwise correlation
df.boxplot(column="price") # spot outliers
Likely Scenario Stems
- A column has 30% nulls — choose an imputation strategy and justify.
- Compute IQR-based outlier bounds for a small dataset.
- Interpret a correlation matrix — which feature pair would you drop and why?
Frequent Itemset Mining
Core. Find sets of items that co-occur often in transactions, then derive association rules (X ⇒ Y) — the engine behind market-basket analysis.
Key Points
- Apriori principle — any superset of an infrequent set is also infrequent (monotonicity) → prune aggressively.
- Apriori algorithm — level-wise: generate Cₖ → scan DB → keep frequent Lₖ → join for Cₖ₊₁.
- PCY — hash candidate pairs into buckets in pass 1; infrequent buckets eliminate many candidates before pass 2.
- FP-Growth — build compressed FP-tree; mine via conditional pattern bases; no candidate generation → much faster.
- Rules — filter frequent sets by confidence and lift.
Formulas
Likely Scenario Stems
- Given 8 transactions, compute support / confidence / lift for {Bread} ⇒ {Butter}.
- Apply Apriori pruning over two passes; list the surviving candidates.
- Why does PCY beat Apriori when memory is tight on pass 1?
- Trace FP-tree construction for a small transaction set.
Graph Mining & PageRank
Core. PageRank ranks nodes in a directed graph by simulating a random surfer. A page is important if many important pages link to it.
Key Points
- Random surfer — at each step follow a random outlink; PageRank = stationary probability of being on that page.
- Dead ends (no outlinks) leak rank → redistribute uniformly.
- Spider traps (cycles with no exit) trap all rank → fix with teleportation.
- Teleport β ≈ 0.85 — with prob (1−β) jump to a random page; handles both pathologies.
- Power iteration — repeat r ← βMr + (1−β)/N until ‖rₜ₊₁ − rₜ‖₁ < ε (usually 20–30 iters).
- Topic-sensitive PR — restrict teleport set to a topic → personalized rankings.
Formulas
Likely Scenario Stems
- Given a 4-node graph, compute 2 iterations of PageRank by hand.
- Identify dead ends and spider traps; explain how teleportation fixes each.
- How does varying β between 0.8 and 0.95 affect convergence and final ranks?
Stream Data Processing
Core. Streams are unbounded and seen once. Use space-efficient probabilistic sketches to approximate set membership, frequency, and counts.
Key Points
- Constraints — single pass, bounded memory, respond before the next item arrives.
- Bloom filter — k hash functions set k bits in an m-bit array; query returns "maybe" or "definitely not". No false negatives.
- Count-Min sketch — d × w counter table; on update increment d cells; query = min across the d cells (over-estimates only).
- Reservoir sampling — keep a uniform random sample of size k from a stream of unknown length.
- DGIM — count 1s in last N bits of a binary stream in O(log²N) space.
- Windows — tumbling, sliding, session; choose based on freshness vs cost.
Formulas
Likely Scenario Stems
- Size a Bloom filter for 1M URLs with ≤ 1% false positive rate — pick m and k.
- Run a Count-Min sketch over a short stream and answer a frequency query.
- Justify reservoir sampling vs storing the full stream for a given use case.
Recommender Systems
Core. Predict user–item ratings using item features (content-based) or behavioral similarity across users / items (collaborative filtering).
Key Points
- Content-based — build item feature vectors (TF-IDF, attributes); recommend similar items to user's history. No new-item cold start.
- User-user CF — find k nearest users; predict from their ratings.
- Item-item CF — find items similar to ones the user liked; usually more stable than user-user.
- Cold start — new user (content / popularity), new item (content), new system (surveys / hybrid).
- Evaluation — RMSE / MAE for ratings; precision@k, recall@k for top-N.
- Hybrid & latent-factor (SVD / MF) — handle sparsity; scale better than memory-based CF.
Formulas
Likely Scenario Stems
- Given a user–item matrix, compute cosine similarity between two users and predict a missing rating.
- A platform launches with new users and items — design the recommender to mitigate cold start.
- Choose between user-user and item-item CF for a sparse retail dataset; justify.
Hadoop & MapReduce
Core. MapReduce splits work into independent map tasks (transform) and reduce tasks (aggregate), connected by a shuffle. Hadoop runs it atop HDFS for fault tolerance and scale.
Key Points
- Phases — Input split → Map → (Combine) → Partition → Shuffle & Sort → Reduce → Output.
- Map emits (key, value); Reduce sees all values for one key (grouped, sorted).
- Combiner — mini reducer on the mapper's node; cuts shuffle traffic. Must be associative + commutative.
- Partitioner — decides reducer per key (default = hash(key) mod R).
- Data locality — schedule map task on the node already holding the block.
- Fault tolerance — failed tasks rerun on healthy nodes; HDFS replicates blocks 3×.
- Classic example — WordCount.
WordCount in MapReduce
Pseudocodemap(key, line):
for word in line.split():
emit(word, 1)
reduce(word, counts):
emit(word, sum(counts))
Java Mapper / Reducer (Hadoop API)public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable k, Text v, Context ctx) {
for (String w : v.toString().split("\\s+"))
ctx.write(new Text(w), new IntWritable(1));
}
}
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text k, Iterable<IntWritable> vs, Context ctx) {
int sum = 0;
for (IntWritable v : vs) sum += v.get();
ctx.write(k, new IntWritable(sum));
}
}
Likely Scenario Stems
- Keys are skewed (one word dominates) — how do reducers behave, and how do you fix it (combiner, custom partitioner, salting)?
- Design map + reduce for: max temperature per year from weather logs / average rating per movie / inverted index.
- Walk through the lifecycle of a single (key, value) pair from input split to final output.
- Does a combiner always produce the same result as without one? Why / why not?
HDFS & Hadoop Architecture
Core. HDFS splits files into large blocks, replicates them across DataNodes, and tracks everything via a single NameNode. Built for high-throughput sequential reads on commodity hardware.
Key Points
- NameNode — master; holds metadata (namespace, block locations) in RAM. Single point of failure (mitigated by Standby NN).
- DataNode — slaves; store actual blocks; send heartbeats + block reports to NN.
- Block size — default 128 MB (was 64). Larger ⇒ less metadata, better sequential I/O, fewer map tasks.
- Replication factor — default 3. Rack-aware: 1st local, 2nd different rack, 3rd same rack as 2nd → balance fault tolerance vs network cost.
- Read path — Client asks NN for block locations → reads blocks from nearest DataNode in parallel.
- Write path — Client → NN allocates blocks → Client streams to DN pipeline (DN1 → DN2 → DN3) → ack reverses.
- Failure handling — missed heartbeat ⇒ DN marked dead ⇒ NN re-replicates lost blocks.
Driver Skeleton
Hadoop Job Driver (Java)Job job = Job.getInstance(conf, "wordcount");
job.setJarByClass(WordCount.class);
job.setMapperClass(WCMapper.class);
job.setCombinerClass(WCReducer.class); // reuse reducer as combiner
job.setReducerClass(WCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
HDFS Shell — basicshdfs dfs -mkdir /user/data
hdfs dfs -put local.txt /user/data/
hdfs dfs -ls /user/data
hdfs dfs -cat /user/data/local.txt
hdfs fsck /user/data/local.txt -files -blocks -locations
Likely Scenario Stems
- A 1 GB file is written with RF=3 and block size 128 MB — how many blocks? How many replicas? Place them on 2 racks of 5 nodes.
- The NameNode crashes — what happens to running jobs and to data? How does HA recover?
- Why is HDFS bad for many small files? How would you mitigate it (HAR, SequenceFile, HBase)?
- A DataNode goes silent for 10 minutes — describe NN's reaction step by step.
YARN & Cluster Resource Mgmt
Core. YARN decouples resource management from job execution, so the same Hadoop cluster can run MapReduce, Spark, Flink, etc. side by side.
Key Points
- ResourceManager (RM) — global; Scheduler + ApplicationsManager. Allocates containers.
- NodeManager (NM) — per-node; launches/monitors containers, reports resources to RM.
- ApplicationMaster (AM) — per-job; negotiates resources with RM, schedules tasks on NMs, handles task failures.
- Container — bundle of (CPU, RAM) on a node; isolated JVM process per task.
- Job flow — Client → RM (submit) → RM launches AM container → AM requests task containers → NMs run tasks → AM tracks → exit.
- Schedulers — FIFO, Capacity (queues + guaranteed share), Fair (equal share over time).
- vs Hadoop v1 — old JobTracker did scheduling AND coordination → bottleneck. YARN splits these → 4k+ nodes, multi-tenant.
Likely Scenario Stems
- A cluster runs both Spark and MapReduce — explain how YARN allocates resources fairly using the Capacity Scheduler.
- An ApplicationMaster crashes mid-job — does the job fail? Walk through recovery.
- Container OOM-killed by NM — what does the AM do next? When does the whole job fail?
- Compare JobTracker / TaskTracker vs RM / AM / NM — why was the redesign needed?
Spark Core — RDDs & Architecture
Core. Spark is an in-memory distributed engine. Data is an RDD — immutable, partitioned, lineage-tracked. Transformations are lazy; actions trigger execution.
Key Points
- RDD — Resilient (lineage-recoverable), Distributed (partitions across executors), Dataset (immutable collection).
- Transformations (lazy) —
map, filter, flatMap, groupByKey, reduceByKey, join, union, distinct. - Actions (eager) —
collect, count, take, reduce, saveAsTextFile, foreach. - Narrow dep. — 1 parent partition → 1 child (map, filter); no shuffle.
- Wide dep. — many parents → 1 child (groupByKey, join); shuffle, stage boundary.
- Lineage — DAG of parents + transforms; on partition loss, recompute from lineage — no replication needed.
- Architecture — Driver (DAG, scheduler) → Cluster Manager (YARN/Mesos/Standalone) → Executors (tasks, cached data).
- Persistence —
cache()= MEMORY_ONLY;persist(level)for disk / serialized / replicated. - vs MapReduce — keeps intermediate data in RAM ⇒ 10–100× faster for iterative / interactive workloads.
- DataFrame — RDD + schema; optimized by Catalyst, executed via Tungsten.
Code You Should Recognize
PySpark — WordCount with RDDfrom pyspark import SparkContext
sc = SparkContext("local", "wc")
counts = (sc.textFile("input.txt")
.flatMap(lambda line: line.split())
.map(lambda w: (w, 1))
.reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("out")
DataFrame API + SQLfrom pyspark.sql import SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()
df = spark.read.json("orders.json")
df.filter("amount > 100") \
.groupBy("user_id") \
.count() \
.orderBy("count", ascending=False) \
.show()
df.createOrReplaceTempView("orders")
spark.sql("SELECT user_id, COUNT(*) FROM orders GROUP BY user_id").show()
Lineage & cachingrdd = sc.textFile("big.log").filter(lambda x: "ERROR" in x)
rdd.cache() # keep in memory across actions
rdd.count() # 1st action — materializes & caches
rdd.filter(lambda x: "DB" in x).count() # reuses cached
Likely Scenario Stems
- Given a Spark pipeline, mark each transformation as narrow / wide and identify stage boundaries.
- An executor dies — does the job restart? What does Spark do with the lost partitions? Why is replication unnecessary?
- Why is
reduceByKeypreferred overgroupByKeyfor aggregation? Sketch what each shuffles. - When should you call
.cache()? Give a wrong placement and a right placement. - Convert a given MapReduce WordCount into PySpark with the same semantics.
Spark Execution & Catalyst
Core. Spark turns your DataFrame / SQL into an optimized physical plan via the Catalyst optimizer, then runs it as a DAG of stages and tasks. Lazy evaluation lets it optimize before executing.
Key Points
- Catalyst pipeline — Unresolved Logical → Analyzed (catalog resolves names) → Optimized Logical (predicate pushdown, projection pruning, constant folding) → Physical Plans → Cost-based selection.
- Tungsten — whole-stage code generation, off-heap memory, columnar in-memory format; avoids JVM virtual calls.
- Stages — split by shuffle boundaries; tasks within a stage = #partitions, run in parallel.
- Joins — Broadcast (small side ≤
spark.sql.autoBroadcastJoinThreshold, default 10 MB), Sort-Merge (both large + sorted), Shuffle-Hash. - Partitioning —
repartition(n)(full shuffle, even sizes),coalesce(n)(narrow, shrink only). Tune to avoid skew & small files. - Inspect plans —
df.explain(True)shows logical + physical plan; look for Exchange (= shuffle). - Adaptive Query Execution (AQE) — re-optimizes stages at runtime using actual stats.
Optimization Patterns
Broadcast a small dim tablefrom pyspark.sql.functions import broadcast
result = orders.join(broadcast(countries), "country_id")
Predicate pushdown / projection pruning — for freedf = spark.read.parquet("events")
df.filter("day = '2026-05-18'").select("user", "event").show()
# Spark reads only matching files + only those 2 columns
Likely Scenario Stems
- Read an
explain()plan — locate the shuffle and propose a rewrite (broadcast, filter earlier, repartition). - One reducer is much slower than the rest — diagnose the skew and fix it.
- Joining a 5 GB fact table with a 5 MB dim table — choose & justify the join strategy.
- How does Catalyst optimize
SELECT a FROM t WHERE b = 1over Parquet? Walk through plan stages. repartition(200)vscoalesce(200)— when do you pick each?
Final-Stretch Strategy
A few high-leverage moves for the night before, ranked by ROI.
Memorize the formula box
Support / confidence / lift, PageRank, cosine, Bloom FP rate, TF-IDF — these come up almost verbatim.
Re-draw HDFS & YARN
Be able to sketch NameNode/DataNode and RM/AM/NM/Container from memory in 60 seconds.
Narrow vs wide
For any Spark transformation, instantly know if it shuffles. That alone unlocks half the scenario questions.
WordCount in three forms
MapReduce pseudocode, Java mapper/reducer, PySpark one-liner. One of these almost always appears.
Practice scenarios cold
For HDFS / Spark, attempt the Q stems on each card with slides closed — then check.
Skip wisely
Weeks 4–5 (Finding Similar Items) are officially droppable. Don't burn revision time there.