Dynamic Checkpointing in Apache Flink
2024 · Project · Internal / private
BU master's thesis. Static checkpoint intervals are a tax during idle periods and a stall during bursts. I built an adaptive controller that surfaces per-operator backpressure ratios from the Flink JobManager and uses them as a control signal: it shortens the checkpoint interval when load is low and lengthens it under sustained backpressure to avoid amplifying stalls. Validated on the NEXMARK streaming benchmark with the RocksDB state backend, quantifying write-amplification tradeoffs against in-memory state. On bursty workloads, tail latency improved over a fixed-interval baseline.
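The control loop described above can be sketched in plain Java. This is a minimal illustration, not the thesis implementation: the class name `AdaptiveCheckpointController`, the thresholds, the doubling/halving step, and the "sustained" sample count are all assumptions chosen for readability, and the sketch deliberately leaves out how the ratios are read from Flink or how the new interval is applied to a running job.

```java
import java.util.Collection;
import java.util.List;

/** Hypothetical controller sketch; all names and constants are illustrative assumptions. */
final class AdaptiveCheckpointController {
    private final long minIntervalMs;
    private final long maxIntervalMs;
    private final double lowThreshold;   // at or below this ratio, load counts as low
    private final double highThreshold;  // at or above this ratio, an operator counts as backpressured
    private final int sustainedSamples;  // consecutive high samples required before lengthening
    private long currentIntervalMs;
    private int consecutiveHigh = 0;

    AdaptiveCheckpointController(long minIntervalMs, long maxIntervalMs, long initialIntervalMs,
                                 double lowThreshold, double highThreshold, int sustainedSamples) {
        if (minIntervalMs <= 0 || minIntervalMs > maxIntervalMs) {
            throw new IllegalArgumentException("need 0 < min <= max");
        }
        if (lowThreshold >= highThreshold || sustainedSamples < 1) {
            throw new IllegalArgumentException("need low < high and sustainedSamples >= 1");
        }
        this.minIntervalMs = minIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
        this.lowThreshold = lowThreshold;
        this.highThreshold = highThreshold;
        this.sustainedSamples = sustainedSamples;
        this.currentIntervalMs = clamp(initialIntervalMs);
    }

    /** Feeds one sample of per-operator backpressure ratios (0.0 to 1.0); returns the interval to use next. */
    long update(Collection<Double> backpressureRatios) {
        // The most backpressured operator drives the decision.
        double worst = 0.0;
        for (double ratio : backpressureRatios) {
            worst = Math.max(worst, ratio);
        }
        if (worst >= highThreshold) {
            consecutiveHigh++;
            // Lengthen only once backpressure has been sustained, so a single spike is ignored.
            if (consecutiveHigh >= sustainedSamples) {
                currentIntervalMs = clamp(currentIntervalMs * 2);
            }
        } else {
            consecutiveHigh = 0;
            // Shorten under low load; between the two thresholds the interval is held.
            if (worst <= lowThreshold) {
                currentIntervalMs = clamp(currentIntervalMs / 2);
            }
        }
        return currentIntervalMs;
    }

    private long clamp(long valueMs) {
        return Math.max(minIntervalMs, Math.min(maxIntervalMs, valueMs));
    }

    public static void main(String[] args) {
        AdaptiveCheckpointController controller =
                new AdaptiveCheckpointController(1_000, 60_000, 10_000, 0.1, 0.5, 3);
        // Made-up samples: one idle reading followed by a sustained burst.
        List<List<Double>> samples = List.of(
                List.of(0.0, 0.05), List.of(0.9, 0.2), List.of(0.8, 0.3), List.of(0.95, 0.4));
        for (List<Double> sample : samples) {
            System.out.println(sample + " -> " + controller.update(sample) + " ms");
        }
    }
}
```

The dead band between the two thresholds and the sustained-sample requirement are one way to keep the interval from oscillating on noisy readings; a smoothed signal such as a moving average would be a reasonable alternative.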
Highlights
- Instrumented JobManager to surface per-operator backpressure ratios
- Adaptive checkpoint cadence: short intervals under low load, long intervals under sustained burst
- RocksDB state backend benchmarked vs. in-memory; write-amplification quantified
- Validated on NEXMARK queries with bursty event rates
- Tail-latency wins on bursty workloads vs. fixed-interval baseline
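For readers unfamiliar with tail-latency comparisons, the metric behind the last highlight can be computed as a high percentile over per-event latencies. The sketch below uses the nearest-rank definition; the class name `TailLatency` and the choice of nearest-rank are assumptions for illustration, not a description of the measurement harness used in the thesis.

```java
import java.util.Arrays;

/** Hypothetical helper; illustrates a nearest-rank percentile over latency samples. */
final class TailLatency {
    /**
     * Returns the smallest sample such that at least p percent of all samples
     * are less than or equal to it (nearest-rank method), for 0 < p <= 100.
     */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = latenciesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        // Multiply before dividing to limit floating-point rounding in the rank.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

Comparing `TailLatency.percentile(samples, 99.0)` for an adaptive run against the same call for a fixed-interval run on the same workload gives the kind of p99 comparison the highlight refers to.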
Tech
Java · Apache Flink · RocksDB · NEXMARK · JVM