Why Druid’s index_parallel is the Future of Batch Ingestion


Batch ingestion in Druid has evolved significantly. While index_hadoop was once the go-to for large datasets, index_parallel has emerged as the modern standard. In this blog, we’ll dissect why index_parallel outperforms index_hadoop through advanced features, tuning capabilities, failure scenarios, and real-world use cases – with GCS examples.


The Problem with index_hadoop

Despite its power, index_hadoop has critical limitations:

1. Infrastructure Complexity

  • Dependency on Hadoop: Requires a full Hadoop/Spark ecosystem (YARN, HDFS, MapReduce).
  • Operational Overhead: Managing JVM tuning, Hadoop configs, and cluster scaling.
  • Cost Inefficiency: Underutilized resources during ingestion lulls.

2. Performance Bottlenecks

  • Disk I/O Overhead: Data must be written to HDFS intermediate storage.
  • Serialization Costs: Converting data between Hadoop and Druid formats.
  • Slow Fault Recovery: Failed MapReduce tasks restart from scratch.

3. Cloud-Native Limitations

  • No Direct Cloud Access: Requires Hadoop GCS connectors (e.g., gcs-connector), adding latency.
  • Static Partitioning: Predefined partitions lead to uneven segment sizes.

4. Tuning Limitations

  • Manual Shard Tuning: Requires setting numShards statically, causing data skew.
  • Lack of Dynamic Scaling: Fixed reducer count at job submission.

The Rise of index_parallel

index_parallel eliminates these issues with a cloud-native, distributed architecture:

Key Advantages

  1. No Hadoop Dependency

    • Runs directly on Druid MiddleManagers.
    • Native cloud storage access (GCS, S3, Azure).
  2. Dynamic Partitioning

    • Automatically optimizes segment sizes (300-700MB).
    • Adapts to data skew in real-time.
  3. Fault Tolerance

    • Checkpoint-based recovery: Failed tasks resume from last checkpoint.
    • Worker isolation: One failed task doesn’t block the entire job.
  4. Cloud Optimization

    • Direct GCS access via Druid’s inputSource.
    • Native support for Parquet, ORC, JSON, CSV.
  5. Advanced Tuning Capabilities

    • Dynamic Scaling: Adjusts worker count based on data volume.
    • Memory Management: Fine-grained control over task memory.
    • Automatic Rollup: Optimizes data rollup without manual intervention.

Syntax Comparison: GCS Ingestion

index_hadoop Spec (Legacy)

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "paths": "gs://your-bucket/events/*.json"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": { "type": "hashed", "numShards": 32 },
      "jobProperties": {
        "fs.gs.project.id": "your-gcp-project",
        "fs.gs.auth.service.account.enable": "true",
        "fs.gs.auth.service.account.json.keyfile": "/path/to/key.json"
      }
    }
  }
}

Pain Points:

  • Requires Hadoop GCS connector and service account setup.
  • Manual numShards tuning leads to data skew.
  • Static settings don’t adapt to data volume.

index_parallel Spec (Modern)

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "google",
        "uris": ["gs://your-bucket/events/*.json"],
        "auth": {
          "type": "service_account",
          "clientEmail": "your-service@project.iam.gserviceaccount.com",
          "key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
        }
      },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": 20000000
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0
    }
  }
}

Advantages:

  • Direct GCS access with built-in auth (no Hadoop connector).
  • Dynamic partitioning auto-adjusts segment sizes.
  • Advanced tuning: maxRowsInMemory, maxBytesInMemory.
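
A related convenience: because the input format is declared separately from the input source, switching from JSON to a columnar format such as Parquet is a small change to the ioConfig rather than a new pipeline (assuming the corresponding Parquet extension is loaded on the cluster). A minimal sketch, with an illustrative object path:

{
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "google",
      "uris": ["gs://your-bucket/events/part-00000.parquet"]
    },
    "inputFormat": { "type": "parquet" }
  }
}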

Tuning Deep Dive: What index_parallel Can Do That index_hadoop Can’t

1. Dynamic Partitioning

"partitionsSpec": {
  "type": "dynamic",
  "maxRowsPerSegment": 5000000,
  "maxTotalRows": 20000000
}

  • What it does: Auto-splits data into optimal segments.
  • Why index_hadoop fails: Requires manual numShards tuning.
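
Dynamic partitioning is the best-effort mode, but index_parallel is not limited to it: when a workload needs perfect rollup, hashed (or single-dimension) partitioning can be requested in the same tuningConfig, still without a Hadoop cluster. A hedged sketch, with an illustrative target row count:

"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "hashed",
    "targetRowsPerSegment": 5000000
  }
}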

2. Memory Management

"maxRowsInMemory": 1000000,
"maxBytesInMemory": 0

  • What it does: Caps the rows (and optionally bytes) each task buffers before spilling to disk, avoiding OOM errors; setting maxBytesInMemory to 0 tells Druid to use its default, roughly one-sixth of the JVM heap.
  • Why index_hadoop fails: Relies on Hadoop’s opaque JVM settings.
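
If a byte-based cap is preferred over a row count, the same block can pin an explicit limit instead of relying on the default; the figure below (roughly 256 MB) is purely illustrative, not a recommendation:

"maxRowsInMemory": 1000000,
"maxBytesInMemory": 268435456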

3. Checkpoint-Based Recovery

  • What it does: Saves progress to persistent storage (GCS).
  • Why index_hadoop fails: No checkpointing; full restart on failure.
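
How aggressively failed sub-tasks are retried is also configurable: the parallel task's tuningConfig exposes a maxRetry setting. A minimal sketch; the value shown is for illustration only:

"tuningConfig": {
  "type": "index_parallel",
  "maxRetry": 3
}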

4. Auto-Scaling Workers

  • What it does: Adjusts worker count based on data volume.
  • Why index_hadoop fails: Fixed reducer count at job submission.
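
The ceiling on parallelism is set with maxNumConcurrentSubTasks: the supervisor task splits the input and runs up to that many sub-tasks at once, so the count can be sized to the MiddleManager slots actually available. A minimal sketch with an illustrative value:

"tuningConfig": {
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 8
}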

When to Still Use index_hadoop?

  • Legacy Hadoop Shops: If you have a mature Hadoop ecosystem.
  • Complex Transformations: When Spark UDFs are unavoidable (though Druid SQL often replaces this).
  • Very Large Clusters: If you have a massive Hadoop cluster with spare capacity.

Conclusion

index_parallel isn’t just an incremental improvement – it’s a paradigm shift. By eliminating Hadoop dependencies, embracing cloud-native design, and introducing dynamic partitioning and advanced tuning, it delivers faster, cheaper, and more resilient batch ingestion. For modern data teams, index_parallel is the undisputed champion.


Varsha S
