Druid Ingestion Methods: A Deep Dive into Batch Ingestion

Druid has emerged as a powerful real-time analytics database, designed for fast slice-and-dice analytics on large-scale datasets. One of Druid’s core strengths is its flexible ingestion capabilities, which allow users to efficiently load data from various sources. In this blog post, we’ll explore Druid’s ingestion methods with a special focus on batch ingestion—how it works, its implementation approaches, and when to use it.

Why Choose Druid?

  • Sub-second queries on billions of rows
  • Real-time and batch ingestion in a unified system
  • Cloud-native with horizontal scalability
  • High compression and cost-efficient storage
  • Support for time-based and high-cardinality data

Types of Ingestion in Druid

Druid supports two primary ingestion patterns:

  1. Streaming Ingestion: For real-time data ingestion with minimal latency (typically seconds)
  2. Batch Ingestion: For loading large volumes of data at scheduled intervals

While streaming ingestion is ideal for real-time analytics, batch ingestion remains crucial for historical data loading, periodic updates, and initial data population.

Batch Ingestion in Druid

Batch ingestion in Druid involves loading pre-existing datasets in bulk rather than consuming a continuous stream.

When to Use Batch Ingestion?

  • You need to import large historical datasets
  • Data arrives in periodic batches (daily, hourly, etc.)
  • You require precise control over data segmentation
  • You’re performing initial data loading
  • Your data requires complex transformations or preprocessing before loading

Druid offers several approaches to batch ingestion, each with its own implementation details and use cases.

Native Batch Ingestion Methods

1. Local Batch Ingestion (Index Task)

The most straightforward batch ingestion method in Druid is local batch ingestion using an index task.

How it works:

  1. Data is read directly from local files or accessible cloud storage
  2. The ingestion process runs on Druid MiddleManager nodes
  3. Data is partitioned, indexed, and converted to Druid’s columnar format

Implementation:

{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/data/events/",
        "filter": "*.json"
      }
    },
    "dataSchema": {
      "dataSource": "user_events",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "iso" },
          "dimensionsSpec": { "dimensions": ["user_id", "event_type"] }
        }
      },
      "metricsSpec": [{ "type": "count", "name": "events" }],
      "granularitySpec": { "segmentGranularity": "DAY" }
    }
  }
}
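
The spec above uses the older firehose-based input model. More recent Druid releases describe inputs with inputSource and inputFormat and move timestampSpec and dimensionsSpec directly under dataSchema. A minimal sketch of the same job in that newer syntax (same hypothetical paths and columns as above):

{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "inputSource": { "type": "local", "baseDir": "/data/events/", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "user_events",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["user_id", "event_type"] },
      "metricsSpec": [{ "type": "count", "name": "events" }],
      "granularitySpec": { "segmentGranularity": "DAY" }
    }
  }
}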

Usage:

  • Submit the ingestion spec via Druid’s indexing service
  • Monitor task progress through Druid’s web console
  • Best suited for small to medium datasets (up to hundreds of GB)

2. Hadoop-Based Batch Ingestion

For larger datasets, Druid offers Hadoop-based batch ingestion, which leverages Hadoop’s distributed processing capabilities.

How it works:

  • Uses Hadoop MapReduce or Spark for distributed processing
  • Data is read from HDFS, S3, or other Hadoop-compatible storage
  • Generates Druid segments in a distributed manner

Implementation:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "paths": "/path/to/data/on/hdfs"
      }
    },
    "dataSchema": {
      // Similar to local batch ingestion
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.queuename": "your_queue"
      }
    }
  }
}

Usage:

  • Requires a Hadoop cluster (or Spark cluster)
  • Ideal for very large datasets (TB scale)
  • More complex setup but better resource utilization for big data

3. SQL-Based Batch Ingestion

Druid also supports batch ingestion through SQL statements, providing a familiar interface for data analysts and engineers.

How it works:

  • Uses standard SQL INSERT or REPLACE statements
  • Converts SQL queries into ingestion tasks
  • Supports both local and distributed execution

Implementation:

-- Insert data from an external table
INSERT INTO "target_datasource"
SELECT
  TIME_PARSE("timestamp") AS "__time",
  dim1, dim2,
  SUM(metric1) AS metric1_sum
FROM "external_table"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
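
The "external_table" above is a stand-in for wherever the source data lives. With Druid's multi-stage query (MSQ) engine, external data is typically described inline through the EXTERN table function. The following is a hedged sketch; the S3 URI and column list are hypothetical:

-- Replace the datasource from external JSON files (URI and columns are illustrative)
REPLACE INTO "target_datasource" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",
  dim1, dim2,
  SUM(metric1) AS metric1_sum
FROM TABLE(
  EXTERN(
    '{"type": "s3", "uris": ["s3://example-bucket/events/2025-08-01.json"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "dim1", "type": "string"}, {"name": "dim2", "type": "string"}, {"name": "metric1", "type": "long"}]'
  )
)
GROUP BY 1, 2, 3
PARTITIONED BY DAY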

Usage:

  • Execute via Druid’s SQL interface or web console
  • Best for users comfortable with SQL
  • Good for medium-sized datasets and incremental updates

Advanced Batch Ingestion Techniques

Parallel Batch Ingestion

For optimal performance with large datasets, Druid supports parallel batch ingestion through the index_parallel task type.

Use Case: Large datasets (TBs), cloud-native environments.
How It Works:

  • Distributed ingestion using Druid MiddleManagers.
  • Direct access to cloud storage (no Hadoop dependency); an S3 variant is sketched after the example below.
  • Dynamic partitioning and fault tolerance.

Implementation:

{
  "type": "index_parallel",
  "spec": {
    // Similar configuration to local batch
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/data",
        "filter": "file_pattern"
      },
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000
      }
    }
  }
}
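
Because index_parallel reads through pluggable input sources, the same spec can point directly at cloud storage instead of the local filesystem. A sketch of the ioConfig fragment with an S3 input source (the bucket and prefix are hypothetical, and the druid-s3-extensions extension must be loaded):

"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "s3",
    "prefixes": ["s3://example-bucket/events/2025/08/"]
  },
  "inputFormat": {
    "type": "json"
  }
}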

Best Practices for Batch Ingestion

  1. Segment Size Optimization: Target segments between 300 MB and 700 MB for optimal query performance (see the configuration sketch after this list)
  2. Partitioning Strategy: Choose appropriate segment granularity (DAY, HOUR, etc.) based on query patterns
  3. Resource Management: Allocate sufficient memory and processing power for ingestion tasks
  4. Data Validation: Implement pre-ingestion validation to ensure data quality
  5. Incremental Loading: Use delta ingestion for regular updates instead of full reloads
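
To illustrate practices 1 and 2, here is a hedged sketch of granularitySpec and tuningConfig fragments for an index_parallel task; the dimension and row-count values are illustrative, and range partitioning is only one way to keep segments near the recommended size:

// dataSchema fragment: day-level segments, hourly query granularity
"granularitySpec": {
  "segmentGranularity": "DAY",
  "queryGranularity": "HOUR"
}

// tuningConfig fragment: range partitioning toward roughly 5 million rows per segment
// (range/hashed partitioning requires perfect rollup and cannot be combined with appendToExisting)
"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "range",
    "partitionDimensions": ["user_id"],
    "targetRowsPerSegment": 5000000
  }
}

For practice 5, setting "appendToExisting": true in the ioConfig (with dynamic partitioning) lets periodic loads add segments to an interval instead of replacing it.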

When to Choose Batch Ingestion

Batch ingestion is the ideal choice when:

  • Loading historical data archives
  • Processing daily/hourly data dumps
  • Initial data population of a new Druid cluster
  • Data transformations are complex and require preprocessing
  • Precise control over data partitioning is needed

Conclusion

Batch ingestion remains a cornerstone of data loading strategies in Apache Druid. Whether you’re using local batch ingestion for smaller datasets, Hadoop-based ingestion for massive data volumes in existing Hadoop ecosystems, SQL-based ingestion for its familiar interface, or index_parallel for scalable cloud-native ingestion, Druid offers flexible options to meet your needs.
In our next blog, we’ll explore why index_parallel is revolutionizing batch ingestion and why it outshines index_hadoop.


Varsha S
