Druid has emerged as a powerful real-time analytics database, designed for fast slice-and-dice analytics on large-scale datasets. One of Druid’s core strengths is its flexible ingestion capabilities, which allow users to efficiently load data from various sources. In this blog post, we’ll explore Druid’s ingestion methods with a special focus on batch ingestion—how it works, its implementation approaches, and when to use it.
Druid supports two primary ingestion patterns:
- Streaming ingestion, which continuously ingests events as they arrive (for example, from Apache Kafka or Amazon Kinesis)
- Batch ingestion, which loads pre-existing data in bulk
While streaming ingestion is ideal for real-time analytics, batch ingestion remains crucial for historical data loading, periodic updates, and initial data population.
Batch ingestion in Druid involves loading pre-existing datasets in bulk. It’s designed for scenarios where the data is already at rest in files or external storage rather than arriving as a live stream.
When to Use Batch Ingestion?
Typical cases include loading historical data, running periodic bulk updates, and populating a new datasource for the first time.
Druid offers several approaches to batch ingestion, each with its own implementation details and use cases.
The most straightforward batch ingestion method in Druid is local batch ingestion using an index task.
How it works:
You submit a single index task to the Overlord. One task reads the input files from the local filesystem of the server it runs on, builds segments, and publishes them to deep storage. Because all the work happens in one task, this method is best suited to smaller datasets.
Implementation:
{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/data/events/",
        "filter": "*.json"
      }
    },
    "dataSchema": {
      "dataSource": "user_events",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "iso" },
          "dimensionsSpec": { "dimensions": ["user_id", "event_type"] }
        }
      },
      "metricsSpec": [{ "type": "count", "name": "events" }],
      "granularitySpec": { "segmentGranularity": "DAY" }
    }
  }
}
Usage:
Save the spec to a JSON file and submit it to the Overlord’s task API; the task’s progress can then be tracked in the Druid web console.
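A minimal submission sketch with curl, assuming the spec above is saved as index_task.json and the Overlord API is reachable at localhost:8081 (substitute your own host and port):

curl -X POST \
  -H 'Content-Type: application/json' \
  -d @index_task.json \
  http://localhost:8081/druid/indexer/v1/task

The response contains the task ID, and the task’s status is available at /druid/indexer/v1/task/<taskId>/status or in the web console.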
For larger datasets, Druid offers Hadoop-based batch ingestion, which leverages Hadoop’s distributed processing capabilities.
How it works:
The index_hadoop task launches a MapReduce job on your Hadoop cluster. The mappers and reducers read the input data (typically from HDFS), build Druid segments in a distributed fashion, and the task publishes the finished segments to deep storage.
Implementation:
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "paths": "/path/to/data/on/hdfs"
      }
    },
    "dataSchema": {
      // Similar to local batch ingestion
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.queuename": "your_queue"
      }
    }
  }
}
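One detail the placeholder dataSchema glosses over: Hadoop-based ingestion still uses the legacy parser-based schema, with the parser type hadoopyString rather than string, and its granularitySpec expects explicit intervals. A sketch reusing the user_events schema from the local example (the interval shown is only an illustrative value):

"dataSchema": {
  "dataSource": "user_events",
  "parser": {
    "type": "hadoopyString",
    "parseSpec": {
      "format": "json",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["user_id", "event_type"] }
    }
  },
  "metricsSpec": [{ "type": "count", "name": "events" }],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "intervals": ["2024-01-01/2024-02-01"]
  }
}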
Usage:
Submit the spec to the same task endpoint used for local batch ingestion. The Druid services running the task also need the Hadoop client configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on their classpath so the MapReduce job can be launched on your cluster.
Druid also supports batch ingestion through SQL statements, providing a familiar interface for data analysts and engineers.
How it works:
SQL-based ingestion uses Druid’s multi-stage query (MSQ) task engine. You write an INSERT (or REPLACE) statement that selects from external data, and Druid runs it as a query task that writes the results into segments.
Implementation:
-- Insert data from an external table
INSERT INTO "target_datasource"
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "dim1",
  "dim2",
  SUM("metric1") AS "metric1_sum"
FROM "external_table"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
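The "external_table" above assumes an external table has already been defined. If it hasn’t, the same INSERT can read external data inline with the EXTERN function, whose three arguments are the input source, input format, and row signature as JSON strings. A sketch assuming a hypothetical HTTP-hosted JSON file (the URI and column list are placeholders):

INSERT INTO "target_datasource"
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "dim1",
  "dim2",
  SUM("metric1") AS "metric1_sum"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/events.json"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "dim1", "type": "string"}, {"name": "dim2", "type": "string"}, {"name": "metric1", "type": "long"}]'
  )
)
GROUP BY 1, 2, 3
PARTITIONED BY DAY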
Usage:
Run the statement from the Query view of the Druid web console, or submit it programmatically to the SQL task API.
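Programmatic submission is a plain HTTP POST of the statement to the SQL task endpoint. A minimal sketch, assuming the Router listens on localhost:8888 (substitute your own host, port, and full statement):

curl -X POST http://localhost:8888/druid/v2/sql/task \
  -H 'Content-Type: application/json' \
  -d '{"query": "INSERT INTO \"target_datasource\" SELECT ... PARTITIONED BY DAY"}'

The response includes a taskId whose progress you can follow in the web console or through the task status API.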
For optimal performance with large datasets, Druid supports parallel batch ingestion through the index_parallel task.
Use Case: Large datasets (TBs), cloud-native environments.
How It Works:
A supervisor task splits the input into multiple subtasks and distributes them across the available MiddleManager or Indexer task slots. Each subtask reads a portion of the input and builds segments in parallel, and the supervisor publishes the results. The degree of parallelism is controlled by the maxNumConcurrentSubTasks tuning parameter.
Implementation:
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      // Similar to local batch ingestion, but without the legacy parser:
      // timestampSpec and dimensionsSpec sit directly under dataSchema
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/data",
        "filter": "file_pattern"
      },
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000
      }
    }
  }
}
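By default an index_parallel task runs its subtasks one at a time; to actually fan out across workers, raise maxNumConcurrentSubTasks in the tuningConfig. A small sketch, with the value 4 chosen purely as an example:

"tuningConfig": {
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 4,
  "partitionsSpec": {
    "type": "dynamic",
    "maxRowsPerSegment": 5000000
  }
}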
Batch ingestion is the ideal choice when:
- Your data already exists in files, HDFS, or cloud object storage
- You are backfilling or reloading historical data
- You run periodic bulk loads rather than continuous streams
- Sub-second data freshness is not a requirement
Batch ingestion remains a cornerstone of data loading strategies in Apache Druid. Whether you’re using local batch ingestion for smaller datasets, Hadoop-based ingestion for massive data volumes in Hadoop ecosystems, SQL-based ingestion for its familiar interface, or index_parallel for scalable cloud-native ingestion, Druid offers flexible options to meet your needs.
In our next blog, we’ll explore why index_parallel is revolutionizing batch ingestion and how it outshines index_hadoop.
Varsha S