Druid has emerged as a powerful real-time analytics database, designed for fast slice-and-dice analytics on large-scale datasets. One of Druid’s core strengths is its flexible ingestion capabilities, which allow users to efficiently load data from various sources. In this blog post, we’ll explore Druid’s ingestion methods with a special focus on batch ingestion—how it works, its implementation approaches, and when to use it.
Druid supports two primary ingestion patterns:
- Streaming ingestion, which continuously ingests events as they arrive (for example, from Apache Kafka or Amazon Kinesis)
- Batch ingestion, which loads pre-existing data in bulk
While streaming ingestion is ideal for real-time analytics, batch ingestion remains crucial for historical data loading, periodic updates, and initial data population.
Batch ingestion in Druid involves loading pre-existing datasets in bulk. It’s designed for scenarios where the data is already at rest in files or external storage rather than arriving as a live stream.
When to Use Batch Ingestion?
Typical cases include loading historical data, running periodic bulk updates, and populating a new datasource for the first time.
Druid offers several approaches to batch ingestion, each with its own implementation details and use cases.
The most straightforward batch ingestion method in Druid is local batch ingestion using an index task.
How it works:
You submit a single index task to the Overlord. One task reads the input files from the local filesystem of the server it runs on, builds segments, and publishes them to deep storage. Because all the work happens in one task, this method is best suited to smaller datasets.
Implementation:
{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/data/events/",
        "filter": "*.json"
      }
    },
    "dataSchema": {
      "dataSource": "user_events",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "iso" },
          "dimensionsSpec": { "dimensions": ["user_id", "event_type"] }
        }
      },
      "metricsSpec": [{ "type": "count", "name": "events" }],
      "granularitySpec": { "segmentGranularity": "DAY" }
    }
  }
}
Usage:
Save the spec to a JSON file and submit it to the Overlord’s task API; the task’s progress can then be tracked in the Druid web console.
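A minimal submission sketch with curl, assuming the spec above is saved as index_task.json and the Overlord API is reachable at localhost:8081 (substitute your own host and port):

curl -X POST \
  -H 'Content-Type: application/json' \
  -d @index_task.json \
  http://localhost:8081/druid/indexer/v1/task

The response contains the task ID, and the task’s status is available at /druid/indexer/v1/task/<taskId>/status or in the web console.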
For larger datasets, Druid offers Hadoop-based batch ingestion, which leverages Hadoop’s distributed processing capabilities.
How it works:
The index_hadoop task launches a MapReduce job on your Hadoop cluster. The mappers and reducers read the input data (typically from HDFS), build Druid segments in a distributed fashion, and the task publishes the finished segments to deep storage.
Implementation:
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "paths": "/path/to/data/on/hdfs"
      }
    },
    "dataSchema": {
      // Similar to local batch ingestion
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.queuename": "your_queue"
      }
    }
  }
}
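One detail the placeholder dataSchema glosses over: Hadoop-based ingestion still uses the legacy parser-based schema, with the parser type hadoopyString rather than string, and its granularitySpec expects explicit intervals. A sketch reusing the user_events schema from the local example (the interval shown is only an illustrative value):

"dataSchema": {
  "dataSource": "user_events",
  "parser": {
    "type": "hadoopyString",
    "parseSpec": {
      "format": "json",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["user_id", "event_type"] }
    }
  },
  "metricsSpec": [{ "type": "count", "name": "events" }],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "intervals": ["2024-01-01/2024-02-01"]
  }
}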
Usage:
Submit the spec to the same task endpoint used for local batch ingestion. The Druid services running the task also need the Hadoop client configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on their classpath so the MapReduce job can be launched on your cluster.
Druid also supports batch ingestion through SQL statements, providing a familiar interface for data analysts and engineers.
How it works:
SQL-based ingestion uses Druid’s multi-stage query (MSQ) task engine. You write an INSERT (or REPLACE) statement that selects from external data, and Druid runs it as a query task that writes the results into segments.
Implementation:
-- Insert data from an external table
INSERT INTO "target_datasource"
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "dim1",
  "dim2",
  SUM("metric1") AS "metric1_sum"
FROM "external_table"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
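The "external_table" above assumes an external table has already been defined. If it hasn’t, the same INSERT can read external data inline with the EXTERN function, whose three arguments are the input source, input format, and row signature as JSON strings. A sketch assuming a hypothetical HTTP-hosted JSON file (the URI and column list are placeholders):

INSERT INTO "target_datasource"
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "dim1",
  "dim2",
  SUM("metric1") AS "metric1_sum"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/events.json"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "dim1", "type": "string"}, {"name": "dim2", "type": "string"}, {"name": "metric1", "type": "long"}]'
  )
)
GROUP BY 1, 2, 3
PARTITIONED BY DAY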
Usage:
Run the statement from the Query view of the Druid web console, or submit it programmatically to the SQL task API.
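Programmatic submission is a plain HTTP POST of the statement to the SQL task endpoint. A minimal sketch, assuming the Router listens on localhost:8888 (substitute your own host, port, and full statement):

curl -X POST http://localhost:8888/druid/v2/sql/task \
  -H 'Content-Type: application/json' \
  -d '{"query": "INSERT INTO \"target_datasource\" SELECT ... PARTITIONED BY DAY"}'

The response includes a taskId whose progress you can follow in the web console or through the task status API.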
For optimal performance with large datasets, Druid supports parallel batch ingestion through the index_parallel task.
Use Case: Large datasets (TBs), cloud-native environments.
How It Works:
A supervisor task splits the input into multiple subtasks and distributes them across the available MiddleManager or Indexer task slots. Each subtask reads a portion of the input and builds segments in parallel, and the supervisor publishes the results. The degree of parallelism is controlled by the maxNumConcurrentSubTasks tuning parameter.
Implementation:
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      // Similar to local batch ingestion, but without the legacy parser:
      // timestampSpec and dimensionsSpec sit directly under dataSchema
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/data",
        "filter": "file_pattern"
      },
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000
      }
    }
  }
}
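By default an index_parallel task runs its subtasks one at a time; to actually fan out across workers, raise maxNumConcurrentSubTasks in the tuningConfig. A small sketch, with the value 4 chosen purely as an example:

"tuningConfig": {
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 4,
  "partitionsSpec": {
    "type": "dynamic",
    "maxRowsPerSegment": 5000000
  }
}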
Batch ingestion is the ideal choice when:
- Your data already exists in files, HDFS, or cloud object storage
- You are backfilling or reloading historical data
- You run periodic bulk loads rather than continuous streams
- Sub-second data freshness is not a requirement
Batch ingestion remains a cornerstone of data loading strategies in Apache Druid. Whether you’re using local batch ingestion for smaller datasets, Hadoop-based ingestion for massive data volumes in Hadoop ecosystems, SQL-based ingestion for its familiar interface, or index_parallel for scalable cloud-native ingestion, Druid offers flexible options to meet your needs.
In our next blog, we’ll explore why index_parallel is revolutionizing batch ingestion and how it outshines index_hadoop.
Varsha S