Large Scale Offline Data Handling: An Event Streaming Based Kappa Architecture
Processing massive volumes of historical offline data through complex, heavyweight service applications presents a significant engineering challenge. Traditional batch-processing methods often require refactoring sophisticated production logic, leading to code duplication and maintenance overhead. This paper proposes a methodology for large-scale offline data processing based on a Kappa Architecture. By treating offline data warehouses (DW) as streaming sources and wrapping complex service applications as asynchronous consumers, we enable high-throughput processing without rewriting core application logic. We compare this approach against Spark-native and micro-batch paradigms and show that it better preserves logical parity while maintaining engineering velocity. We also present a multi-region intelligent job dispatcher that improves availability and throughput.
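To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation): an offline DW partition is replayed as a stream, and a stand-in for a heavyweight service (here a hypothetical `service_handle` coroutine) is wrapped as a pool of asynchronous consumers behind a bounded queue, so the production logic itself is not rewritten. All names and the scoring logic are invented for illustration.

```python
import asyncio

async def service_handle(record: dict) -> dict:
    """Hypothetical stand-in for an existing heavyweight service call."""
    await asyncio.sleep(0)  # placeholder for real I/O or compute
    return {"id": record["id"], "score": record["value"] * 2}

def dw_partition_as_stream(rows):
    """Treat an offline DW partition as a replayable stream (Kappa style)."""
    for row in rows:
        yield row

async def run_pipeline(rows, concurrency: int = 4):
    # Bounded queue applies backpressure between the stream and consumers.
    queue: asyncio.Queue = asyncio.Queue(maxsize=concurrency * 2)
    results = []

    async def consumer():
        while True:
            record = await queue.get()
            if record is None:  # sentinel: stream exhausted
                queue.task_done()
                return
            results.append(await service_handle(record))
            queue.task_done()

    workers = [asyncio.create_task(consumer()) for _ in range(concurrency)]
    for row in dw_partition_as_stream(rows):
        await queue.put(row)
    for _ in workers:
        await queue.put(None)  # one sentinel per worker
    await asyncio.gather(*workers)
    return results

rows = [{"id": i, "value": i} for i in range(10)]
out = asyncio.run(run_pipeline(rows))
```

In a real deployment the in-memory queue would be an event-streaming system (e.g. a Kafka topic fed by a DW export job), but the shape is the same: the service is consumed asynchronously rather than refactored into a batch framework.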