Improve the storage partitioning scheme to be flexible and configurable at runtime. #126

Open
hariprasad-k opened this issue Mar 5, 2020 · 5 comments

Comments

@hariprasad-k

Currently, all storage partitioning schemes include the topic name in the path by default. It would be more flexible to make this configurable and let users choose.

Along the same lines, the partitioning scheme also assumes that the 'timestamp' column is in seconds by default. It would be more flexible to make this configurable and let users specify a scale factor that scales the value up or down to seconds, so that timestamps in other units are supported.
This is similar to #122, but a little more generic.
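
To illustrate the scale-factor idea, here is a minimal sketch. It assumes the factor is given as an integer multiplier and divisor, which keeps the arithmetic exact. The class, the method `toEpochSeconds`, and both parameters are hypothetical names, not existing connector options.

```java
import java.time.Instant;

class TimestampScale {
    // Hypothetical helper: rescales a raw timestamp field to epoch seconds.
    // seconds = raw * multiplier / divisor, rounded toward negative infinity.
    static long toEpochSeconds(long rawTimestamp, long multiplier, long divisor) {
        if (multiplier <= 0 || divisor <= 0) {
            throw new IllegalArgumentException("multiplier and divisor must be positive");
        }
        // multiplyExact throws ArithmeticException on overflow instead of wrapping silently
        return Math.floorDiv(Math.multiplyExact(rawTimestamp, multiplier), divisor);
    }

    public static void main(String[] args) {
        // A field stored in milliseconds: shrink by 1000 to reach seconds
        long fromMillis = toEpochSeconds(1583366400000L, 1L, 1000L);
        // A field stored in minutes: boost by 60 to reach seconds
        long fromMinutes = toEpochSeconds(26389440L, 60L, 1L);
        System.out.println(Instant.ofEpochSecond(fromMillis));  // 2020-03-05T00:00:00Z
        System.out.println(Instant.ofEpochSecond(fromMinutes)); // 2020-03-05T00:00:00Z
    }
}
```

An integer ratio is used here instead of a floating-point factor because multiplying by a value such as 0.001 can land just below a whole second and truncate to the wrong value.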

@OneCricketeer

If you didn't have the topic name, how would you separate out directories by topic partition? What exactly is your use case that cannot be solved by a regex router or a custom partitioner that you add to your own classpath?

@hariprasad-k
Author

hariprasad-k commented Sep 9, 2020

If you didn't have the topic name, how would you separate out directories by topic partition?

Without the topic name, one would specify the complete desired path in the topics.dir config. We consider this the more flexible option, as the exact topic name is not always desired in the path (sometimes a path with different delimiters or a different directory structure in HDFS is preferred).

We also recommend using this parameter only for connectors that sink a single topic. Sinking multiple topics into the same directory, without topic-name subfolders to separate them, is generally not desirable.
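
As a rough sketch of the behaviour described above, the following shows how an output directory could be assembled with and without the topic name. The `includeTopic` flag stands in for the proposed option and is an assumption, as are the class name, the method name, and the example paths. This is not the connector's actual code.

```java
class PartitionPath {
    // Hypothetical sketch: builds the output directory for a record.
    // includeTopic models the proposed switch; it is not an existing setting.
    static String directory(String topicsDir, String topic,
                            String encodedPartition, boolean includeTopic) {
        StringBuilder path = new StringBuilder(topicsDir);
        if (includeTopic) {
            path.append('/').append(topic); // default layout: topic name as a subfolder
        }
        path.append('/').append(encodedPartition);
        return path.toString();
    }

    public static void main(String[] args) {
        String partition = "year=2020/month=03/day=05";
        // Default layout: the topic name is always part of the path
        System.out.println(directory("/topics", "web.clicks", partition, true));
        // Proposed layout: topics.dir carries the complete desired path
        System.out.println(directory("/data/web/clicks", "web.clicks", partition, false));
    }
}
```

With the flag off, two topics sharing the same topics.dir value would write into the same directories, which is why the single-topic recommendation matters.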

What exactly is your use case that cannot be solved by a regex router or a custom partitioner that you add to your own classpath?

For us, this is particularly interesting because it facilitates a smoother migration track (from Camus to Kafka Connect) and keeps all HDFS output paths (along with the Hive tables built on them) fully backward compatible.

Furthermore, using a regex router or a custom partitioner would break the exactly-once (EOS) semantics (offset recovery) that rely on the HDFS file naming conventions. Hence, we propose this change along with PR confluentinc/kafka-connect-hdfs#516 to retain the EOS semantics.
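
To make the offset-recovery concern concrete, here is a minimal sketch that reads the last committed offset back from a file name. It assumes committed files are named `<topic>+<partition>+<startOffset>+<endOffset>.<extension>`. The class, the method, and the example names are hypothetical, and the connector's real recovery logic may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommittedFileName {
    // Assumed layout: <topic>+<partition>+<startOffset>+<endOffset>.<extension>
    private static final Pattern NAME =
            Pattern.compile("(.+)\\+(\\d+)\\+(\\d+)\\+(\\d+)\\.(\\w+)");

    // Returns the end offset encoded in the file name if the file belongs to
    // the given topic and partition, or -1 if the name does not match.
    static long endOffset(String fileName, String topic, int partition) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return -1L;
        }
        boolean sameTopic = m.group(1).equals(topic);
        boolean samePartition = Integer.parseInt(m.group(2)) == partition;
        return (sameTopic && samePartition) ? Long.parseLong(m.group(4)) : -1L;
    }

    public static void main(String[] args) {
        String file = "clicks+0+0000000000+0000000041.avro";
        // Looked up under the name written into the file: offset is recovered
        System.out.println(endOffset(file, "clicks", 0));
        // Looked up under a rewritten topic name: no match, offset is lost
        System.out.println(endOffset(file, "clicks_routed", 0));
    }
}
```

Under this assumption, any transformation that makes the topic name in the file differ from the name used at lookup leaves the connector unable to find where it stopped.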

@OneCricketeer

already consists topic name included with different delimiters or directory hierarchies based on topic names

So, you've created some other system to maintain this directory structure?

migration track (from Camus to Kafka Connect)

I've not heard of Kafka Connect being an exact migration path (used to use Camus as well). Apache Gobblin is the next iteration of Camus

@hariprasad-k
Author

already consists topic name included with different delimiters or directory hierarchies based on topic names

So, you've created some other system to maintain this directory structure?

We have external Hive tables based on this HDFS directory structure. So, with this patch, we were able to retain the same directory structure.

migration track (from Camus to Kafka Connect)

I've not heard of Kafka Connect being an exact migration path (used to use Camus as well). Apache Gobblin is the next iteration of Camus

Yes, agreed. But we chose Kafka Connect, as it was more aligned with our needs.

P.S. I updated my original comment to clarify it, without noticing your latest comments.

@JozoVilcek

I am also looking for a way to not strictly require a topic name to be present in the path (e.g. have custom mapping managed by partitioner) - more info in confluentinc/kafka-connect-hdfs#544

What is holding up progress, or a decision on whether this feature can get in in some form?

3 participants