Why Druid’s index_parallel is the Future of Batch Ingestion


Batch ingestion in Druid has evolved significantly. While index_hadoop was once the go-to for large datasets, index_parallel has emerged as the modern standard. In this blog, we’ll dissect why index_parallel outperforms index_hadoop through advanced features, tuning capabilities, failure scenarios, and real-world use cases – with GCS examples.


The Problem with index_hadoop

Despite its power, index_hadoop has critical limitations:

1. Infrastructure Complexity

  • Dependency on Hadoop: Requires a full Hadoop/Spark ecosystem (YARN, HDFS, MapReduce).
  • Operational Overhead: Managing JVM tuning, Hadoop configs, and cluster scaling.
  • Cost Inefficiency: Underutilized resources during ingestion lulls.

2. Performance Bottlenecks

  • Disk I/O Overhead: Data must be written to HDFS intermediate storage.
  • Serialization Costs: Converting data between Hadoop and Druid formats.
  • Slow Fault Recovery: Failed MapReduce tasks restart from scratch.

3. Cloud-Native Limitations

  • No Direct Cloud Access: Requires Hadoop GCS connectors (e.g., gcs-connector), adding latency.
  • Static Partitioning: Predefined partitions lead to uneven segment sizes.

4. Tuning Limitations

  • Manual Shard Tuning: Requires setting numShards statically, causing data skew.
  • Lack of Dynamic Scaling: Fixed reducer count at job submission.

The Rise of index_parallel

index_parallel eliminates these issues with a cloud-native, distributed architecture:

Key Advantages

  1. No Hadoop Dependency

    • Runs directly on Druid MiddleManagers.
    • Native cloud storage access (GCS, S3, Azure).
  2. Dynamic Partitioning

    • Automatically optimizes segment sizes (300-700MB).
    • Adapts to data skew in real-time.
  3. Fault Tolerance

    • Checkpoint-based recovery: Failed tasks resume from last checkpoint.
    • Worker isolation: One failed task doesn’t block the entire job.
  4. Cloud Optimization

    • Direct GCS access via Druid’s inputSource.
    • Native support for Parquet, ORC, JSON, CSV.
  5. Advanced Tuning Capabilities

    • Dynamic Scaling: Adjusts worker count based on data volume.
    • Memory Management: Fine-grained control over task memory.
    • Automatic Rollup: Optimizes data rollup without manual intervention.

Syntax Comparison: GCS Ingestion

index_hadoop Spec (Legacy)

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "paths": "gs://your-bucket/events/*.json"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": { "type": "hashed", "numShards": 32 },
      "jobProperties": {
        "fs.gs.project.id": "your-gcp-project",
        "fs.gs.auth.service.account.enable": "true",
        "fs.gs.auth.service.account.json.keyfile": "/path/to/key.json"
      }
    }
  }
}

Pain Points:

  • Requires Hadoop GCS connector and service account setup.
  • Manual numShards tuning leads to data skew.
  • Static settings don’t adapt to data volume.

index_parallel Spec (Modern)

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "google",
        "uris": ["gs://your-bucket/events/*.json"],
        "auth": {
          "type": "service_account",
          "clientEmail": "your-service@project.iam.gserviceaccount.com",
          "key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
        }
      },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": 20000000
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0
    }
  }
}

Advantages:

  • Direct GCS access with built-in auth (no Hadoop connector).
  • Dynamic partitioning auto-adjusts segment sizes.
  • Advanced tuning: maxRowsInMemory, maxBytesInMemory.
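
A related convenience: because the input format is declared separately from the input source, switching from JSON to a columnar format such as Parquet is a small change to the ioConfig rather than a new pipeline (assuming the corresponding Parquet extension is loaded on the cluster). A minimal sketch, with an illustrative object path:

{
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "google",
      "uris": ["gs://your-bucket/events/part-00000.parquet"]
    },
    "inputFormat": { "type": "parquet" }
  }
}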

Tuning Deep Dive: What index_parallel Can Do That index_hadoop Can’t

1. Dynamic Partitioning

"partitionsSpec": {
  "type": "dynamic",
  "maxRowsPerSegment": 5000000,
  "maxTotalRows": 20000000
}

  • What it does: Auto-splits data into optimal segments.
  • Why index_hadoop fails: Requires manual numShards tuning.
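
Dynamic partitioning is the best-effort mode, but index_parallel is not limited to it: when a workload needs perfect rollup, hashed (or single-dimension) partitioning can be requested in the same tuningConfig, still without a Hadoop cluster. A hedged sketch, with an illustrative target row count:

"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "hashed",
    "targetRowsPerSegment": 5000000
  }
}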

2. Memory Management

"maxRowsInMemory": 1000000,
"maxBytesInMemory": 0

  • What it does: Caps the rows (and optionally bytes) each task buffers before spilling to disk, avoiding OOM errors; setting maxBytesInMemory to 0 tells Druid to use its default, roughly one-sixth of the JVM heap.
  • Why index_hadoop fails: Relies on Hadoop’s opaque JVM settings.
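
If a byte-based cap is preferred over a row count, the same block can pin an explicit limit instead of relying on the default; the figure below (roughly 256 MB) is purely illustrative, not a recommendation:

"maxRowsInMemory": 1000000,
"maxBytesInMemory": 268435456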

3. Checkpoint-Based Recovery

  • What it does: Saves progress to persistent storage (GCS).
  • Why index_hadoop fails: No checkpointing; full restart on failure.
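
How aggressively failed sub-tasks are retried is also configurable: the parallel task's tuningConfig exposes a maxRetry setting. A minimal sketch; the value shown is for illustration only:

"tuningConfig": {
  "type": "index_parallel",
  "maxRetry": 3
}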

4. Auto-Scaling Workers

  • What it does: Adjusts worker count based on data volume.
  • Why index_hadoop fails: Fixed reducer count at job submission.
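
The ceiling on parallelism is set with maxNumConcurrentSubTasks: the supervisor task splits the input and runs up to that many sub-tasks at once, so the count can be sized to the MiddleManager slots actually available. A minimal sketch with an illustrative value:

"tuningConfig": {
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 8
}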

When to Still Use index_hadoop?

  • Legacy Hadoop Shops: If you have a mature Hadoop ecosystem.
  • Complex Transformations: When Spark UDFs are unavoidable (though Druid SQL often replaces this).
  • Very Large Clusters: If you have a massive Hadoop cluster with spare capacity.

Conclusion

index_parallel isn’t just an incremental improvement – it’s a paradigm shift. By eliminating Hadoop dependencies, embracing cloud-native design, and introducing dynamic partitioning and advanced tuning, it delivers faster, cheaper, and more resilient batch ingestion. For modern data teams, index_parallel is the undisputed champion.


Varsha S
