Real-Time Analytics Pipeline
Stream processing, OLAP storage, and high-throughput data ingestion
An event ingestion and analytics pipeline processing 500K+ events per minute with sub-second query latency.
Python · Apache Kafka · ClickHouse · FastAPI · Docker
Problem
The analytics team needed real-time visibility into product metrics, but the existing batch ETL pipeline had a 6-hour delay and could not support ad-hoc queries.
Solution
Built a streaming pipeline using Kafka for event ingestion, a Python stream processor for transformation and enrichment, and ClickHouse as the OLAP store for sub-second analytical queries.
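The transformation-and-enrichment step can be sketched as a pure function applied to each consumed message. The event fields (`user_id`, `action`, `ts`) are assumptions for illustration; the real pipeline's schema is not published here.

```python
import json
from datetime import datetime, timezone

def enrich_event(raw: bytes) -> dict:
    """Parse a raw Kafka message body and add server-side metadata.

    Hypothetical enrichment: normalize the client's epoch-millisecond
    timestamp to UTC ISO-8601 and stamp ingestion time so consumer
    lag can be measured downstream.
    """
    event = json.loads(raw)
    ts = datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc)
    event["ts_utc"] = ts.isoformat()
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return event

raw = json.dumps({"user_id": 42, "action": "click", "ts": 1700000000000}).encode()
print(enrich_event(raw)["ts_utc"])  # 2023-11-14T22:13:20+00:00
```

In the real processors this function would sit inside the Kafka consumer loop, between `poll()` and the batch writer.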
Architecture
Events flow from client SDKs into a Kafka topic, where a pool of stream processors consumes, enriches, and validates them before batch-inserting the results into ClickHouse. A FastAPI service exposes a query interface supporting both predefined and ad-hoc queries.
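The batch-insert step matters because ClickHouse performs best with large, infrequent INSERTs rather than row-at-a-time writes. A minimal sketch of the buffering logic, assuming a `flush_fn` callback standing in for the actual ClickHouse client call (that name is an assumption, not the project's API):

```python
from typing import Callable

class BatchBuffer:
    """Accumulate enriched events and flush them to the store in bulk."""

    def __init__(self, flush_fn: Callable[[list], None], max_rows: int = 10_000):
        self.flush_fn = flush_fn  # hypothetical writer, e.g. a ClickHouse bulk INSERT
        self.max_rows = max_rows
        self.rows: list = []

    def add(self, row: dict) -> None:
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        # Drain the buffer; also called on shutdown and on a timer in practice.
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []

batches = []
buf = BatchBuffer(batches.append, max_rows=3)
for i in range(7):
    buf.add({"event_id": i})
buf.flush()  # drain the remainder
print([len(b) for b in batches])  # [3, 3, 1]
```

A production version would also flush on a time interval so low-traffic periods don't hold rows indefinitely.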
Key Decisions
- Selected ClickHouse over Druid for better SQL compatibility and simpler operations
- Used Kafka consumer groups for horizontal scaling of stream processors
- Implemented a schema registry to enforce event contracts
- Added a dead-letter queue for malformed events with alerting
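The last two decisions compose naturally: events are checked against the registered contract, and anything that fails routes to the dead-letter queue instead of the main table. A sketch under an assumed contract (the required fields here are illustrative, and in production the dead-letter path would produce to a separate Kafka topic and trigger an alert):

```python
# Assumed event contract; the real schema registry entries are not shown here.
REQUIRED_FIELDS = {"user_id": int, "action": str, "ts": int}

def route(event: dict, valid: list, dead_letter: list) -> None:
    """Append the event to `valid` if it satisfies the contract,
    otherwise to `dead_letter` (a stand-in for the DLQ topic)."""
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(event.get(field), ftype):
            dead_letter.append(event)
            return
    valid.append(event)

ok, dlq = [], []
route({"user_id": 1, "action": "click", "ts": 1700000000000}, ok, dlq)
route({"user_id": "oops", "action": "click"}, ok, dlq)  # missing ts, wrong type
print(len(ok), len(dlq))  # 1 1
```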