Article: From Batch to Micro-Batch Streaming: Lessons Learned the Hard Way in a Delta Index Pipeline
Source: InfoQ AI/ML
Tags: Spark Structured Streaming, micro-batch, data pipelines, S3, data engineering, delta index
An InfoQ deep-dive documents a production migration from scheduled batch jobs to Spark Structured Streaming micro-batch for a search-and-ads delta index, arguing that scheduling delays, not compute cost, are the primary bottleneck in most real-world batch pipelines.
Details
Parveen Saini's InfoQ article documents a production migration at a search-and-ads retrieval system, where scheduled batch jobs were converted into continuously running micro-batch jobs on Spark Structured Streaming. The core finding: most engineering teams overengineer toward full record-level streaming, when micro-batching eliminates most of the latency penalty at far lower operational cost.

The pipeline ran on time-partitioned S3 data, which made Spark's native event-time watermarks unsuitable. Instead, the team built an external logical watermark that tracks the latest processed partition by timestamp (sketched below). This decoupling proved more reliable for snapshot-style batch sources, where S3 eventual consistency made "success file" completion markers unreliable.

A key operational insight: long-running streaming jobs should treat restarts as a normal mechanism, not a failure indicator. For freshness-driven pipelines, skipping to the latest partition after a restart often delivers more value than exhaustively replaying the historical backlog, a counterintuitive but practical trade-off for search and ads systems. The article also offers concrete guidance on lag handling, overlapping window semantics, and the operational risks of log-based ingestion for batch-oriented sources. Useful reading for any data engineering team considering a similar migration.
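The article describes the external logical watermark conceptually but does not publish code. A minimal Python sketch of the idea, assuming boto3, a hypothetical bucket layout (partitions under snapshots/dt=YYYY-MM-DD-HH/), and zero-padded, lexicographically sortable partition names; none of these names come from the article:

```python
# Sketch of an external logical watermark over time-partitioned S3 data.
# Bucket, prefix, key names, and partition layout are illustrative assumptions.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-delta-index"          # hypothetical bucket
DATA_PREFIX = "snapshots/dt="           # partitions like snapshots/dt=2024-05-01-13/
WATERMARK_KEY = "state/watermark.json"  # hypothetical external watermark location

def read_watermark() -> str:
    """Return the latest fully processed partition timestamp, e.g. '2024-05-01-13'."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=WATERMARK_KEY)
        return json.loads(obj["Body"].read())["latest_partition"]
    except s3.exceptions.NoSuchKey:
        return ""  # first run: nothing processed yet

def advance_watermark(partition: str) -> None:
    """Persist the watermark only after the partition is fully processed."""
    s3.put_object(
        Bucket=BUCKET,
        Key=WATERMARK_KEY,
        Body=json.dumps({"latest_partition": partition}),
    )

def discover_new_partitions(watermark: str) -> list[str]:
    """List partition directories strictly newer than the watermark
    (single listing page, for brevity; real code would paginate)."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=DATA_PREFIX, Delimiter="/")
    parts = [
        p["Prefix"].split("dt=")[1].rstrip("/")
        for p in resp.get("CommonPrefixes", [])
    ]
    # Zero-padded timestamps make lexicographic order equal chronological order.
    return sorted(p for p in parts if p > watermark)
```

The key design point the article stresses is the decoupling: progress is committed to an explicit external record rather than inferred from completion markers inside the data itself.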
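How such a watermark plugs into a continuously running micro-batch job is likewise not spelled out in the article. One hedged way to wire it up, reusing the helpers above, is to let a Structured Streaming rate source act as a clock tick and do partition discovery inside foreachBatch; the checkpoint path and the build_delta_index step are illustrative assumptions, not the article's implementation:

```python
# Hedged sketch: a rate source drives micro-batch ticks; each tick discovers
# and processes partitions newer than the external watermark defined above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-index-microbatch").getOrCreate()

def build_delta_index(df):
    # Placeholder for the real index-build step (not published in the article);
    # here we just materialize the partition to force a full read.
    df.count()

def process_tick(_tick_df, _batch_id):
    watermark = read_watermark()
    for partition in discover_new_partitions(watermark):
        path = f"s3a://{BUCKET}/snapshots/dt={partition}/"
        df = spark.read.parquet(path)   # snapshot-style batch read of one partition
        build_delta_index(df)
        advance_watermark(partition)    # commit progress per partition

(spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    .writeStream
    .trigger(processingTime="60 seconds")  # micro-batch cadence, not record-level
    .option("checkpointLocation", "s3a://example-delta-index/checkpoints/ticker/")
    .foreachBatch(process_tick)
    .start()
    .awaitTermination())
```

Because progress lives in the external watermark, Spark's own checkpoint only tracks the clock ticks, so the job can be killed and restarted freely, which matches the article's framing of restarts as routine.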
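The skip-to-latest restart policy can be expressed as a small startup check against the same helpers; the backlog threshold is an illustrative assumption:

```python
# Sketch of the skip-to-latest restart policy: if the backlog after a restart
# exceeds a freshness budget, jump the watermark past everything but the
# newest partition instead of replaying history.
MAX_BACKLOG_PARTITIONS = 3  # hypothetical freshness budget (e.g. 3 hourly partitions)

def maybe_skip_backlog() -> None:
    watermark = read_watermark()
    backlog = discover_new_partitions(watermark)  # ascending order
    if len(backlog) > MAX_BACKLOG_PARTITIONS:
        # Freshness beats completeness for a search/ads delta index: advance the
        # watermark to the second-newest partition so only the newest gets processed.
        advance_watermark(backlog[-2])
```

Running maybe_skip_backlog() once at startup, before the streaming query begins, trades completeness for freshness, which is the trade-off the article argues is the right default for search and ads systems.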