Batch ingestion in Druid has evolved significantly. While index_hadoop was once the go-to for large datasets, index_parallel has emerged as the modern standard. In this blog, we’ll dissect why index_parallel outperforms index_hadoop through advanced features, tuning capabilities, failure scenarios, and real-world use cases, with GCS examples.
index_hadoop
Despite its power, index_hadoop has critical limitations:
Reading from GCS requires a Hadoop connector (gcs-connector), adding latency.
Partitioning is configured through numShards statically, causing data skew.
index_parallel
index_parallel eliminates these issues with a cloud-native, distributed architecture:
No Hadoop Dependency
Dynamic Partitioning
Fault Tolerance
Cloud Optimization: cloud storage is read through inputSource.
Advanced Tuning Capabilities
index_hadoop Spec (Legacy)
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "paths": "gs://your-bucket/events/*.json"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": { "type": "hashed", "numShards": 32 },
      "jobProperties": {
        "fs.gs.project.id": "your-gcp-project",
        "fs.gs.auth.service.account.enable": "true",
        "fs.gs.auth.service.account.json.keyfile": "/path/to/key.json"
      }
    }
  }
}
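To see why a fixed numShards is hard to tune, here is a back-of-the-envelope sketch. It assumes the same shard count is applied to every time chunk and that hashing spreads rows evenly within a chunk. The daily row counts are hypothetical, and this is plain arithmetic, not anything Druid itself runs.

```python
# Assumption: numShards from the legacy spec is applied to every daily time chunk.
NUM_SHARDS = 32

# Hypothetical row counts per day; real volumes will differ.
daily_rows = {
    "2024-01-01": 3_200_000,    # quiet day
    "2024-01-02": 160_000_000,  # typical day
    "2024-01-03": 640_000_000,  # busy day
}


def avg_rows_per_shard(rows_in_chunk: int, num_shards: int = NUM_SHARDS) -> int:
    """Average rows per segment if rows are spread evenly over num_shards."""
    return rows_in_chunk // num_shards


# Map each day to the average segment size it would end up with.
report = {day: avg_rows_per_shard(rows) for day, rows in daily_rows.items()}

for day, rows in report.items():
    print(f"{day}: ~{rows:,} rows per segment")
```

With these made-up volumes, one shard count yields segments of very different sizes from day to day, which is the kind of mismatch manual numShards tuning tries to chase.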
Pain Points:
Manual numShards tuning leads to data skew.
index_parallel Spec (Modern)
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "google",
        "uris": ["gs://your-bucket/events/*.json"],
        "auth": {
          "type": "service_account",
          "clientEmail": "your-service@project.iam.gserviceaccount.com",
          "key": "-----BEGIN PRIVATE KEY-----\n…\n-----END PRIVATE KEY-----\n"
        }
      },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": 20000000
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0
    }
  }
}
Advantages:
Memory use is tunable through maxRowsInMemory and maxBytesInMemory.
What index_parallel Can Do That index_hadoop Can’t
"partitionsSpec": {
  "type": "dynamic",
  "maxRowsPerSegment": 5000000,
  "maxTotalRows": 20000000
}
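The two limits in this partitionsSpec can be pictured with a small simulation. It is a simplified model, assuming a single task that closes a segment once it holds maxRowsPerSegment rows and pushes all pending segments once they total maxTotalRows rows. The function name and return shape are hypothetical, and Druid's real bookkeeping is more involved.

```python
def simulate_dynamic_partitioning(
    total_rows: int,
    max_rows_per_segment: int = 5_000_000,
    max_total_rows: int = 20_000_000,
):
    """Return (segment_sizes, push_count) under a simplified dynamic model."""
    segments = []      # row count of every segment created
    pushes = 0         # how many times pending segments were pushed
    pending_rows = 0   # rows sitting in segments not yet pushed
    remaining = total_rows

    while remaining > 0:
        # Fill the next segment up to the per-segment row limit.
        seg_rows = min(remaining, max_rows_per_segment)
        segments.append(seg_rows)
        remaining -= seg_rows
        pending_rows += seg_rows

        # Push once the pending segments reach the total-row limit.
        if pending_rows >= max_total_rows:
            pushes += 1
            pending_rows = 0

    # Whatever is left over is pushed when the task finishes.
    if pending_rows > 0:
        pushes += 1
    return segments, pushes


segments, pushes = simulate_dynamic_partitioning(23_000_000)
print(len(segments), "segments,", pushes, "pushes")
```

In this model the segment count follows the data volume, so nobody has to pick a shard count up front.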
index_hadoop fails: Requires manual numShards tuning.
"maxRowsInMemory": 1000000,
"maxBytesInMemory": 0
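A rough way to picture how these two settings interact is sketched below. It assumes, based on our reading of the Druid documentation, that a maxBytesInMemory of 0 falls back to one-sixth of the JVM max heap and that a negative value disables the byte check. The function names and the 6 GiB heap are hypothetical, and the real persist logic has more moving parts.

```python
GIB = 1024 ** 3


def effective_byte_limit(max_bytes_in_memory: int, jvm_max_heap_bytes: int):
    """Resolve the byte limit; None means the byte check is disabled."""
    if max_bytes_in_memory == 0:
        # Assumption: 0 selects the default of one-sixth of the max heap.
        return jvm_max_heap_bytes // 6
    if max_bytes_in_memory < 0:
        # Assumption: a negative value turns the byte check off.
        return None
    return max_bytes_in_memory


def should_persist(
    rows_in_memory: int,
    est_bytes_in_memory: int,
    max_rows_in_memory: int = 1_000_000,
    max_bytes_in_memory: int = 0,
    jvm_max_heap_bytes: int = 6 * GIB,  # hypothetical task heap size
) -> bool:
    """True when either the row limit or the byte limit has been reached."""
    if rows_in_memory >= max_rows_in_memory:
        return True
    limit = effective_byte_limit(max_bytes_in_memory, jvm_max_heap_bytes)
    return limit is not None and est_bytes_in_memory >= limit


# Wide rows can hit the byte limit long before the row limit.
print(should_persist(rows_in_memory=200_000, est_bytes_in_memory=2 * GIB))
```

Whichever limit is reached first triggers the persist, so very wide rows are bounded by bytes and narrow rows by row count.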
index_hadoop fails: Relies on Hadoop’s opaque JVM settings.
index_hadoop fails: No checkpointing; full restart on failure.
index_hadoop fails: Fixed reducer count at job submission.
index_hadoop?
index_parallel isn’t just an incremental improvement; it’s a paradigm shift. By eliminating Hadoop dependencies, embracing cloud-native design, and introducing dynamic partitioning and advanced tuning, it delivers faster, cheaper, and more resilient batch ingestion. For modern data teams, index_parallel is the undisputed champion.
Varsha S