

Batch ingestion in Druid has evolved significantly. While index_hadoop was once the go-to for large datasets, index_parallel has emerged as the modern standard. In this blog, we’ll dissect why index_parallel outperforms index_hadoop through advanced features, tuning capabilities, failure scenarios, and real-world use cases – with GCS examples.
index_hadoop
Despite its power, index_hadoop has critical limitations:
gcs-connector), adding latency.
numShards statically, causing data skew.
index_parallel
index_parallel eliminates these issues with a cloud-native, distributed architecture:
No Hadoop Dependency
Dynamic Partitioning
Fault Tolerance
Cloud Optimization
inputSource.
Advanced Tuning Capabilities
index_hadoop Spec (Legacy)
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "index_hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "paths": "gs://your-bucket/events/*.json"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": { "type": "hashed", "numShards": 32 },
      "jobProperties": {
        "fs.gs.project.id": "your-gcp-project",
        "fs.gs.auth.service.account.enable": "true",
        "fs.gs.auth.service.account.json.keyfile": "/path/to/key.json"
      }
    }
  }
}
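To see why a fixed numShards causes trouble, it helps to run the numbers. The sketch below assumes that every time chunk is hashed into the same 32 shards no matter how many rows it holds. The daily row counts are made-up illustrations, not measurements, and rows_per_shard is a hypothetical helper rather than anything Druid provides.

```python
import math


def rows_per_shard(rows_in_interval: int, num_shards: int) -> int:
    """Approximate rows per segment when an interval is hashed into a fixed shard count."""
    return math.ceil(rows_in_interval / num_shards)


# Hypothetical daily row counts (assumed for illustration only).
daily_rows = {
    "quiet_day": 2_000_000,
    "normal_day": 160_000_000,
    "peak_day": 900_000_000,
}

# numShards is fixed at 32, as in the legacy spec above.
sizes = {day: rows_per_shard(rows, 32) for day, rows in daily_rows.items()}

for day, size in sizes.items():
    print(f"{day}: ~{size:,} rows per segment")
```

Under these assumptions the same spec produces segments of roughly 62,500 rows on the quiet day and roughly 28 million rows on the peak day, so a shard count tuned for one volume fits the others poorly.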
Pain Points:
numShards tuning leads to data skew.
index_parallel Spec (Modern)
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "google",
        "uris": ["gs://your-bucket/events/*.json"],
        "auth": {
          "type": "service_account",
          "clientEmail": "your-service@project.iam.gserviceaccount.com",
          "key": "-----BEGIN PRIVATE KEY-----\n…\n-----END PRIVATE KEY-----\n"
        }
      },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": 20000000
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0
    }
  }
}
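For completeness, here is a rough sketch of how a spec like this could be handed to Druid. It assumes the Overlord's task submission endpoint (POST /druid/indexer/v1/task) and a hypothetical host name, and it only builds the HTTP request without sending it. The spec dict is trimmed for brevity; a real task would also need a dataSchema, which the spec above leaves out.

```python
import json
import urllib.request


def build_task_request(spec: dict, overlord_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits an ingestion spec."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url=overlord_url.rstrip("/") + "/druid/indexer/v1/task",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Trimmed stand-in for the full spec; host and port are assumptions.
spec = {
    "type": "index_parallel",
    "spec": {"ioConfig": {"type": "index_parallel"}},
}
request = build_task_request(spec, "http://overlord.example.com:8090")

print(request.get_method(), request.full_url)
# To actually submit, pass the request to urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes the spec easy to inspect or log before anything reaches the cluster.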
Advantages:
maxRowsInMemory, maxBytesInMemory.
index_parallel Can Do That index_hadoop Can’t
"partitionsSpec": {
  "type": "dynamic",
  "maxRowsPerSegment": 5000000,
  "maxTotalRows": 20000000
}
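A small model can make these two limits concrete. The sketch below is a simplification that assumes a single task writing one interval: segments are cut whenever they reach maxRowsPerSegment, and a push is due once either the current segment or the total of pending rows hits its limit. Both function names are hypothetical, and real tasks running in parallel will not produce sizes this tidy.

```python
def dynamic_segment_sizes(rows_in_interval: int,
                          max_rows_per_segment: int = 5_000_000) -> list:
    """Split an interval's rows into segments capped at max_rows_per_segment."""
    full, remainder = divmod(rows_in_interval, max_rows_per_segment)
    return [max_rows_per_segment] * full + ([remainder] if remainder else [])


def push_required(rows_in_segment: int,
                  total_pending_rows: int,
                  max_rows_per_segment: int = 5_000_000,
                  max_total_rows: int = 20_000_000) -> bool:
    """Simplified rule: push when either row limit has been reached."""
    return (rows_in_segment >= max_rows_per_segment
            or total_pending_rows >= max_total_rows)


small_day = dynamic_segment_sizes(2_000_000)
large_day = dynamic_segment_sizes(12_000_000)

print(small_day)
print(large_day)
print(push_required(rows_in_segment=1_000_000, total_pending_rows=20_000_000))
```

In this model the segment count follows the data volume (one segment for the small day, three for the large one), and no shard count has to be chosen in advance.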
index_hadoop fails: Requires manual numShards tuning.
"maxRowsInMemory": 1000000,
"maxBytesInMemory": 0
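How these two settings interact can be sketched in a few lines. The model below assumes that in-memory rows are persisted to disk as soon as either threshold is reached. It also assumes, based on my reading of Druid's behavior, that a maxBytesInMemory of 0 means "use the default of one sixth of the maximum JVM heap" and that a negative value disables the byte check. The function names and the 6 GiB heap are illustrative only.

```python
from typing import Optional


def effective_max_bytes(max_bytes_in_memory: int,
                        jvm_max_heap_bytes: int) -> Optional[int]:
    """Resolve the byte limit: 0 -> heap / 6 (assumed default), negative -> no limit."""
    if max_bytes_in_memory == 0:
        return jvm_max_heap_bytes // 6
    if max_bytes_in_memory < 0:
        return None
    return max_bytes_in_memory


def should_persist(rows_in_memory: int,
                   bytes_in_memory: int,
                   max_rows_in_memory: int,
                   max_bytes_in_memory: int,
                   jvm_max_heap_bytes: int) -> bool:
    """Persist when either the row limit or the resolved byte limit is reached."""
    limit = effective_max_bytes(max_bytes_in_memory, jvm_max_heap_bytes)
    over_rows = rows_in_memory >= max_rows_in_memory
    over_bytes = limit is not None and bytes_in_memory >= limit
    return over_rows or over_bytes


GIB = 1024 ** 3
heap = 6 * GIB  # hypothetical task heap size

byte_limit = effective_max_bytes(0, heap)
print(byte_limit == GIB)
print(should_persist(10, 2 * GIB, 1_000_000, 0, heap))
```

With the values from the spec, a task on this hypothetical heap would persist after one million rows or about 1 GiB of estimated row data, whichever comes first.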
index_hadoop fails: Relies on Hadoop’s opaque JVM settings.
index_hadoop fails: No checkpointing; full restart on failure.
index_hadoop fails: Fixed reducer count at job submission.
index_hadoop?
index_parallel isn’t just an incremental improvement – it’s a paradigm shift. By eliminating Hadoop dependencies, embracing cloud-native design, and introducing dynamic partitioning and advanced tuning, it delivers faster, cheaper, and more resilient batch ingestion. For modern data teams, index_parallel is the undisputed champion.
Varsha S