Improve the storage partitioning scheme to be flexible and configurable at runtime. #126
Comments
If you didn't have the topic name, how would you separate out directories by topic partition? What exactly is your use case that cannot be solved by a regex router or a custom partitioner that you add to your own classpath?
Without the topic name, one will specify the complete desired path in And, we recommend using this parameter only for connectors that sink a single topic. Sinking multiple topics in the same directory and not separated by
For us, this is particularly interesting because it facilitates a smoother migration path (from Camus to Kafka Connect) and keeps all HDFS output paths (along with Hive tables) fully backward compatible. Further, if we use a regex router or a custom partitioner, this will break the EOS semantics (offset recovery) achieved by the HDFS file naming conventions. Hence, we propose this change along with PR confluentinc/kafka-connect-hdfs#516 to retain the EOS semantics.
So, you've created some other system to maintain this directory structure?
I've not heard of Kafka Connect being an exact migration path (used to use Camus as well). Apache Gobblin is the next iteration of Camus.
We have external Hive tables based on this HDFS directory structure. So, with this patch, we were able to retain the same directory structure.
Yes, agreed. But we chose Kafka Connect, as it was more aligned with our needs. P.S. I updated my original comments to clarify them, without noticing your latest comments.
I am also looking for a way to not strictly require a topic name to be present in the path (e.g. have a custom mapping managed by the partitioner). More info is in confluentinc/kafka-connect-hdfs#544. What is holding up progress, or a decision on whether this feature can get in in some form?
Currently, every storage partitioning scheme includes the topic name in the output path by default. It would be more flexible to make this configurable and let users choose.
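To make the request concrete, here is a minimal sketch of a path builder where including the topic name is a switch. It is written in Python purely for illustration and is not the connector's code; `build_partition_dir`, `include_topic`, and the example paths are all hypothetical names.

```python
from pathlib import PurePosixPath


def build_partition_dir(base_dir, topic, encoded_partition, include_topic=True):
    """Join the output directory for one partition.

    Hypothetical helper: when include_topic is False, base_dir is treated
    as the complete desired path and the topic name is left out.
    """
    parts = [base_dir]
    if include_topic:
        parts.append(topic)
    parts.append(encoded_partition)
    return str(PurePosixPath(*parts))


# Default behaviour described in the issue: topic name is part of the path.
with_topic = build_partition_dir("/topics", "orders", "year=2019/month=05")

# Proposed behaviour: the user supplies the full path, no topic directory.
without_topic = build_partition_dir(
    "/data/camus/orders_hourly", "orders", "year=2019/month=05",
    include_topic=False,
)
```

With the switch off, two topics written under the same base directory would share partition directories, which is why the sketch only makes sense for a sink that handles a single topic.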
Along the same lines, the partitioning scheme also assumes by default that the 'timestamp' column is in seconds. It would be more flexible to make this configurable and let users specify a scale factor that scales the value up or down to seconds, so that timestamps in other units are supported.
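A rough sketch of what such a scale factor could look like, again in Python for illustration only. The function name `time_partition`, the `scale` parameter, and the path format are assumptions, not existing connector settings. The factor is held as an exact fraction so that a millisecond value is not distorted by floating-point rounding.

```python
import math
from datetime import datetime, timezone
from fractions import Fraction


def time_partition(raw_timestamp, scale, fmt="year=%Y/month=%m/day=%d/hour=%H"):
    """Scale a raw timestamp to seconds and format it as a partition path.

    scale is the hypothetical user-supplied factor: 1 for seconds,
    Fraction(1, 1000) for milliseconds, 60 for minutes.
    """
    seconds = math.floor(raw_timestamp * Fraction(scale))
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


# The same instant expressed in three different units.
from_seconds = time_partition(1557889200, 1)
from_millis = time_partition(1557889200123, Fraction(1, 1000))
from_minutes = time_partition(25964820, 60)
```

All three calls resolve to the same hourly partition, which is the behaviour the scale factor is meant to provide.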
This is similar to #122, but a little more generic.