Improve the storage partitioning scheme to be flexible and configurable at runtime. #126

hariprasad-k · 2020-03-05T10:24:17Z

Currently, every storage partitioning scheme includes the topic name in the output path by default. It would be more flexible to make this configurable and let users choose.

Along the same lines, the partitioning scheme assumes by default that the 'timestamp' column is in seconds. It would be more flexible to make this configurable as well, letting users specify a scale factor that scales the value up or down to seconds, so that timestamps in other units are supported.
This is similar to #122, but a little more generic.
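
As an illustration of the scale-factor idea, here is a minimal sketch in Python. It is not connector code: the function names, the `scale_factor` parameter, and the `year=/month=/day=/hour=` layout are all hypothetical and chosen only to show how a multiplier could bring a timestamp in another unit back to seconds before the time-based path is derived.

```python
from datetime import datetime, timezone


def to_epoch_seconds(raw_timestamp: int, scale_factor: float = 1.0) -> float:
    """Scale a raw timestamp to seconds (hypothetical helper).

    scale_factor = 1.0 means the value is already in seconds;
    scale_factor = 0.001 means the value is in milliseconds.
    """
    return raw_timestamp * scale_factor


def time_partition_path(raw_timestamp: int, scale_factor: float = 1.0) -> str:
    """Derive an hourly partition path (illustrative layout) in UTC."""
    seconds = to_epoch_seconds(raw_timestamp, scale_factor)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("year=%Y/month=%m/day=%d/hour=%H")


# The same instant expressed in seconds and in milliseconds
# should land in the same hourly partition.
in_seconds = time_partition_path(1583403857)
in_millis = time_partition_path(1583403857000, scale_factor=0.001)
```

With a multiplier like this, a single setting would cover seconds, milliseconds, or any other unit, rather than one option per unit.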

OneCricketeer · 2020-04-10T14:28:39Z

If you didn't have the topic name, how would you separate out directories by topic partition? What exactly is your use case that cannot be solved by a regex router or a custom partitioner that you add to your own classpath?

hariprasad-k · 2020-09-09T14:03:41Z

> If you didn't have the topic name, how would you separate out directories by topic partition?

Without the topic name, one would specify the complete desired path in the topics.dir config. We consider this the more flexible option, as the exact topic name is not always desired in the path (sometimes a path with different delimiters or a different directory structure in HDFS is preferred).

We recommend using this parameter only for connectors that sink a single topic. Sinking multiple topics into the same directory, without topic-name subfolders to separate them, would generally not be desirable.
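
To make the proposal concrete, here is a rough sketch of the path layout with and without the topic name. It is written in Python purely for illustration. The `include_topic` flag and the function name are hypothetical, and the default layout of `topics.dir`, then topic, then encoded partition is an assumption about the current behaviour rather than a quote from the code.

```python
def partitioned_path(
    topics_dir: str,
    topic: str,
    encoded_partition: str,
    include_topic: bool = True,  # hypothetical switch, not an existing config
    delim: str = "/",
) -> str:
    """Build the output directory for one encoded partition."""
    parts = [topics_dir.rstrip(delim)]
    if include_topic:
        # Assumed default layout: the topic name is a subfolder of topics.dir.
        parts.append(topic)
    parts.append(encoded_partition)
    return delim.join(parts)


# Assumed default: the topic name appears in the path.
default_path = partitioned_path("/data/topics", "orders", "year=2020/month=03")

# Proposed: topics.dir carries the complete desired prefix, no topic subfolder.
custom_path = partitioned_path(
    "/warehouse/orders_v1", "orders", "year=2020/month=03", include_topic=False
)
```

In the second call nothing separates one topic from another, which is why the option only makes sense for a connector that sinks a single topic.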

> What exactly is your use case that cannot be solved by a regex router or a custom partitioner that you add to your own classpath?

For us, this is particularly interesting because it allows a smoother migration from Camus to Kafka Connect while keeping all HDFS output paths (along with the Hive tables) fully backward compatible.

Furthermore, if we use a regex router or a custom partitioner, this breaks the EOS semantics (offset recovery) that are achieved through the HDFS file naming conventions. Hence, we propose this change together with the PR confluentinc/kafka-connect-hdfs#516 to retain the EOS semantics.
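
For readers unfamiliar with the naming convention mentioned here, the following Python sketch shows the general idea of recovering an offset from committed file names. It assumes committed files are named `<topic>+<partition>+<startOffset>+<endOffset>.<ext>`. Both the regular expression and the `next_offset` helper are illustrative approximations, not the connector's actual code.

```python
import re
from typing import Iterable, Optional

# Assumed committed-file naming: <topic>+<partition>+<start>+<end>[.<ext>]
COMMITTED_FILE = re.compile(
    r"^(?P<topic>[A-Za-z0-9._\-]+)\+(?P<partition>\d+)"
    r"\+(?P<start>\d+)\+(?P<end>\d+)(\.\w+)?$"
)


def next_offset(filenames: Iterable[str], topic: str, partition: int) -> Optional[int]:
    """Return the offset to resume from, or None if no committed file matches."""
    latest_end = None
    for name in filenames:
        match = COMMITTED_FILE.match(name)
        if match is None:
            continue  # not a committed data file
        if match["topic"] != topic or int(match["partition"]) != partition:
            continue  # belongs to another topic partition
        end = int(match["end"])
        if latest_end is None or end > latest_end:
            latest_end = end
    # Resume right after the highest offset already written.
    return None if latest_end is None else latest_end + 1


files = [
    "orders+0+0000000000+0000000099.avro",
    "orders+0+0000000100+0000000149.avro",
    "orders+1+0000000000+0000000010.avro",
    "_SUCCESS",
]
resume_at = next_offset(files, "orders", 0)
```

In this sketch the topic and partition are read from the file name alone, so the directory in which the files are found plays no part in the lookup.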

OneCricketeer · 2020-09-09T14:16:31Z

> already consists topic name included with different delimiters or directory hierarchies based on topic names

So, you've created some other system to maintain this directory structure?

> migration track (from Camus to Kafka Connect)

I've not heard of Kafka Connect being an exact migration path (used to use Camus as well). Apache Gobblin is the next iteration of Camus

hariprasad-k · 2020-09-09T14:48:56Z

> > already consists topic name included with different delimiters or directory hierarchies based on topic names
>
> So, you've created some other system to maintain this directory structure?

We have external Hive tables based on this HDFS directory structure. So, with this patch, we were able to retain the same directory structure.

> > migration track (from Camus to Kafka Connect)
>
> I've not heard of Kafka Connect being an exact migration path (used to use Camus as well). Apache Gobblin is the next iteration of Camus

Yes, agreed. But we chose Kafka Connect, as it was better aligned with our needs.

P.S. I updated my original comment to clarify it, without noticing your latest comments.

JozoVilcek · 2021-02-04T12:34:49Z

I am also looking for a way to not strictly require a topic name to be present in the path (e.g. to have a custom mapping managed by the partitioner). More info in confluentinc/kafka-connect-hdfs#544

What is holding up progress, or a decision on whether this feature can get in in some form?

This was referenced Sep 9, 2020

Exactly-once should also work without topic name included in the path confluentinc/kafka-connect-hdfs#515 (Open)

Recover the offset from HDFS, even if topic name is not present in storage path confluentinc/kafka-connect-hdfs#516 (Closed)
