What is Apache Druid?
Apache Druid is a distributed, columnar analytics database designed for low-latency queries on large-scale datasets. It excels in scenarios that require real-time data ingestion, high-speed aggregation, and ad-hoc analytics.
Druid’s unique architecture combines the best of traditional databases, search systems, and data warehouses. It’s widely used in industries like finance, ad-tech, and telecommunications to power dashboards, exploratory data analysis, and operational reporting.
Why Use Apache Druid?
Apache Druid stands out due to its exceptional performance, scalability, and flexibility. Here’s why it might be the right choice for your data needs:
- Real-Time Data Ingestion:
- Druid can ingest and query data simultaneously, allowing you to analyze events as they happen.
- High Query Performance:
- Optimized for low-latency queries, Druid can process complex aggregations and filters on billions of rows within milliseconds.
- Scalability:
- Designed for distributed systems, Druid can scale horizontally to handle petabytes of data and thousands of queries per second.
- Flexible Data Model:
- Supports a variety of data types and offers powerful indexing capabilities to optimize query performance.
- Cost Efficiency:
- By storing data in a compressed columnar format and supporting tiered storage, Druid minimizes infrastructure costs.
Druid Services
Druid has several types of services:
- Coordinator: manages data availability on the cluster.
- Overlord: controls the assignment of data ingestion workloads.
- Broker: handles queries from external clients.
- Router: routes requests to Brokers, Coordinators, and Overlords.
- Historical: stores query-able data.
- Middle Manager and Peon: ingest data.
- Indexer: serves an alternative to the Middle Manager + Peon task execution system.
Architecture of Apache Druid
Understanding Druid’s architecture is key to leveraging its full potential. It consists of several components, each designed to handle specific tasks within the system:

Druid servers
You can deploy Druid services according to your preferences. For ease of deployment, we recommend organizing them into three server types: Master, Query, and Data.
- Master server
A Master server manages data ingestion and availability. It is responsible for starting new ingestion jobs and coordinating availability of data on the Data server.
Master servers divide operations between Coordinator and Overlord services.
a. Coordinator service
- Coordinator services watch over the Historical Services on the Data servers. They are responsible for assigning segments to specific servers, and for ensuring segments are well-balanced across Historical.
b. Overlord service
- Overlord services watch over the Middle Manager Services on the Data servers and are the controllers of data ingestion into Druid. They are responsible for assigning ingestion tasks to Middle Managers and for coordinating segment publishing.
- Query server
A Query server provides the endpoints that users and client applications interact with, routing queries to Data servers or other Query servers (and optionally proxied Master server requests).
Query servers divide operations between Broker and Router services.
a. Broker service
- Broker services receive queries from external clients and forward those queries to Data servers. When Brokers receive results from those subqueries, they merge those results and return them to the caller. Typically, you query Brokers rather than querying Historical or Middle Manager services on Data servers directly.
b. Router service
- Router services provide a unified API gateway in front of Brokers, Overlords, and Coordinators.
The Router service also runs the web console, a UI for loading data, managing data sources and tasks, and viewing server status and segment information.
- Data server
A Data server executes ingestion jobs and stores query-able data.
Data servers divide operations between Historical and Middle Manager services.
a. Historical service
- Middle Manager services handle ingestion of new data into the cluster. They are responsible for reading from external data sources and publishing new Druid segments.
c. Peon service
- Indexer services are an alternative to Middle Managers and Peons. Instead of forking separate JVM processes per-task, the Indexer runs tasks as individual threads within a single JVM process.
The Indexer is designed to be easier to configure and deploy compared to the Middle Manager + Peon system and to better enable resource sharing across tasks. The Indexer is a newer feature and is currently designated experimental due to the fact that its memory management system is still under development. It will continue to mature in future versions of Druid.
Typically, you would deploy either Middle Managers or Indexers, but not both.
Now, lets understand the different types of nodes in Druid:
- Data Nodes:
- Historical Nodes:
- Store immutable data and respond to query requests.
- Optimize data access with columnar storage and indexing.
- Realtime Nodes:
- Ingest and index incoming data streams, providing immediate queryability.
- Query Nodes:
- Broker Nodes:
- Coordinate query execution by routing user queries to the appropriate data nodes.
- Aggregate and merge results from multiple nodes.
- Router Nodes:
- Direct API calls to the appropriate Druid services.
- Coordination Nodes:
- Coordinator Nodes:
- Manage data distribution, balancing, and replication across historical nodes.
- Handle segment lifecycle management.
- Overlord Nodes:
- Manage task assignments and monitor real-time ingestion tasks.
External dependencies
- Deep Storage:
- Acts as a long-term storage layer for raw and pre-aggregated data. Popular options include HDFS, S3, and Google Cloud Storage.
- Metadata Store:
- Typically backed by a relational database (e.g., PostgreSQL or MySQL), it stores configuration, schema, and metadata.
Key Features of Apache Druid
- Columnar Storage:
- Optimized for analytical queries, Druid stores data in a columnar format to reduce I/O and improve aggregation speed.
- Real-Time Indexing:
- Supports continuous ingestion from streaming sources like Apache Kafka, Amazon Kinesis, or batch ingestion from files.
- Flexible Querying:
- Offers native JSON-based query language and SQL-like querying capabilities.
- Roll-Up Aggregations:
- Reduces storage footprint and query complexity by pre-aggregating data during ingestion.
- Advanced Indexing:
- Includes bitmap, inverted, and range indexes for efficient filtering and scanning.
- Fault Tolerance:
- Ensures high availability with replication, automatic failover, and distributed query execution.
- Time-based partitioning:
-
- Druid first partitions data by time. You can optionally implement additional partitioning based upon other fields. Time-based queries only access the partitions that match the time range of the query which leads to significant performance improvements.
- Approximate algorithms:
-
- Druid includes algorithms for approximate count-distinct, approximate ranking, and computation of approximate histograms and quantiles. These algorithms offer bounded memory usage and are often substantially faster than exact computations. For situations where accuracy is more important than speed, Druid also offers exact count-distinct and exact ranking.
- Automatic summarization at ingest time:
-
- Druid optionally supports data summarization at ingestion time. This summarization partially pre-aggregates your data, potentially leading to significant cost savings and performance boosts.
Common Use Cases for Apache Druid
- Operational Dashboards:
- Monitor KPIs and system performance in real-time.
- Ad-Tech Analytics:
- Power ad bidding platforms and user segmentation with real-time event data.
- IoT Monitoring:
- Process and visualize sensor data from connected devices.
- Clickstream Analysis:
- Track user behavior and website interactions for e-commerce platforms.
- Log Analysis:
- Aggregate and analyze server and application logs to identify issues quickly.
Challenges of Using Apache Druid
- Complex Deployment:
- Setting up and managing Druid’s distributed architecture requires expertise.
- High Memory Requirements:
- Druid’s performance is memory-intensive, which can increase infrastructure costs.
- Write Limitations:
- Not optimized for frequent updates or transactional workloads.
Conclusion
Apache Druid is a powerful tool for real-time data analytics, offering exceptional speed, scalability, and flexibility. Its distributed architecture and advanced indexing capabilities make it a preferred choice for businesses seeking actionable insights from high-velocity data streams. While it comes with a learning curve, the rewards of using Druid far outweigh the challenges.
Yatika Sheth