Changes to allow the use of cascading.avro in Cascalog #11

Open
kkrugler opened this issue Aug 21, 2013 · 21 comments

Comments

@kkrugler
Member

I made a number of changes, most notably an overloaded constructor that adds support for Avro field names that differ from the tuple field names. Cascalog uses prefixes such as ? and ! in tuple field names, and these characters are not allowed in Avro field names. For example, someone can name the tuple field "?name" and the Avro field "name".

The README has details about other changes.

In short, this pull request provides:

  • new constructor to support tuple/avro field name mapping
  • updated avro to 1.6.3 (and made some adjustments as needed)
  • added change that ensured avro.codec is written out in the file
  • added automatic conversion of date
  • changes to POM to indicate new version/fork
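The tuple/Avro field name mapping described above could be derived automatically for the common case. The following is only a minimal sketch, assuming a single leading ? or ! marker is the only thing that needs to be removed; `CascalogFieldNames` and its methods are hypothetical names used for illustration and are not part of cascading.avro or of the patch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of cascading.avro. */
final class CascalogFieldNames {
    private CascalogFieldNames() {}

    /** Strips one leading Cascalog marker ('?' or '!') from a tuple field name. */
    static String toAvroName(String tupleFieldName) {
        if (tupleFieldName.isEmpty()) {
            throw new IllegalArgumentException("empty field name");
        }
        char first = tupleFieldName.charAt(0);
        String name = (first == '?' || first == '!')
                ? tupleFieldName.substring(1)
                : tupleFieldName;
        // The Avro specification requires names to match [A-Za-z_][A-Za-z0-9_]*.
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("not a valid Avro name: " + name);
        }
        return name;
    }

    /** Builds an ordered tuple-name to Avro-name mapping. */
    static Map<String, String> mapping(List<String> tupleFieldNames) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String tupleName : tupleFieldNames) {
            result.put(tupleName, toAvroName(tupleName));
        }
        return result;
    }
}
```

An explicit mapping passed to a constructor, as in the patch, remains more flexible than this convention, since the Avro name does not have to resemble the tuple name at all.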

I really just tried to stick with the coding style as much as possible, but feel this whole thing can be cleaned up a bit.

Pull as you please.

@kkrugler
Member Author

Thanks - I'll take a look and see if all/some of the mods you have made can be merged into the project.

@kkrugler
Member Author

Great. Let me know if you have any questions.

On Wed, May 9, 2012 at 9:52 AM, vmagotra wrote:

Thanks - I'll take a look and see if all/some of the mods you have made
can be merged into the project.



@kkrugler
Member Author

Hi,

Updated to Avro 1.6.3. This had some interesting effects though. The newer Avro no longer supports nested Enums, so I needed to split out the test. I also ran into trouble with nullSchema and Map and Enum fields. So these are no longer "optional" if used.
<<
Just to clarify the above: do you mean that if you define a field in Avro that's a Map or an Enum, then it has to exist in the data being written (i.e. you can't rely on nullSchema to fill in a null value for you)?

I think the other changes are good to be rolled in...

@kkrugler
Member Author

Hi Mike - we just created a 2.0 branch, and merged most of the cascading-avro code (Sven's fork/modifications) with some of the cascading.avro code.

Once that gets merged into trunk, then I'll need to find time to look through your changes and figure out which ones to cherry-pick. E.g. I know cascading-avro had some support for Cascalog field renaming, but I haven't looked at how they implemented that.

-- Ken

@kkrugler
Member Author

Hi Mike,

If you're still interested in this, can you have a look at 2.0-develop and see if it will do what you need it to? If not, can you make a pull request on that branch?

@kkrugler
Member Author

I'm definitely still interested in this and have been reliably using my fork for a little while now. I'm actually in the process of doing a wider upgrade in my own projects, and moving up to Cascading 2.0 in the process. This led me back to this thread, as I wanted to see where you guys were and whether you had made any progress upgrading to 2.x. I now see the 2.0-develop branch and will check it out. I'll take a look and get back to you shortly.

@kkrugler
Member Author

I'm going to need to migrate some of my changes onto this branch, as there are things that I need and that I don't think are supported here (correct me if I'm wrong). I will need:

  • support for specifying the Avro output codec (null, deflate, etc.). In my patch I was passing this in as part of the job conf.
  • support for renaming fields (e.g. Cascading tuple field names mapped to Avro field names). Cascalog uses field names like '?a_field', which is not a valid Avro field name, so I needed a way to map ?a_field to an_avro_field. I had an overloaded constructor in my patch that handled this. I haven't seen how to do this yet on your branch (but could be missing it).
  • support for Date class types (auto-convert to Long for the Avro file and back to Date on the Java side). This was the least important of my changes, but I found it convenient as I was working with a mixture of DB taps and Avro taps.

Those three changes are important for my use case. I'd be happy to provide a patch.
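To make the Date item concrete, here is a rough sketch of the kind of conversion meant, assuming the Avro field is declared as a long holding milliseconds since the epoch. The class and method names are hypothetical, and this is not the code from the patch.

```java
import java.util.Date;

/** Hypothetical illustration of Date <-> long conversion; not the patch itself. */
final class DateLongConversion {
    private DateLongConversion() {}

    /** On write: turn a Date into epoch milliseconds, pass other values through. */
    static Object toAvroValue(Object tupleValue) {
        if (tupleValue instanceof Date) {
            return ((Date) tupleValue).getTime(); // boxed to Long
        }
        return tupleValue;
    }

    /** On read: rebuild the Date from the stored epoch milliseconds. */
    static Date fromAvroValue(long epochMillis) {
        return new Date(epochMillis);
    }
}
```

Note that the reading side cannot tell a date from any other long by looking at the value alone, so it has to know from the requested field type (or some other hint) which fields to convert back.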

@kkrugler
Member Author

Hi Mike,

All those changes sound great. I can do the output codec if you don't want to worry about it but the others are probably best submitted as a patch. I think 2.0-develop will become master any day now and my guess is we'll have a separate develop branch where we can add new things.

One thing I just added which might be interesting for you is the ability to get the unpacked Avro record (similar to how SequenceFile support works) and also pass a packed Avro record to write out. I'm adding this to make it easier to use with the Scalding typed API but it might be useful for Cascalog too.

Regards,
Chris

@kkrugler
Member Author

Thanks Chris. I'll take a look at the change you mentioned.

Another thing I remembered adding/needing is the ability to set fields as nullable. I hacked it so most/all my fields were nullable, and can hack it for Cascalog based on naming conventions. But it would probably be best to come up with a more direct approach to specifying nullable fields in the Avro output/schema.

@kkrugler
Member Author

Hi Mike,

On Oct 26, 2012, at 9:22pm, Mike Stanley wrote:

Thanks Chris. I'll take a look at the change you mentioned.

Another thing I remembered adding/needing is the ability to set fields as nullable. I hacked it so most/all my fields were nullable, and can hack it for Cascalog based on naming conventions. But it would probably be best to come up with a more direct approach to specifying nullable fields in the Avro output/schema.

cascading.avro supports unions, where the "other" field value is nullable.

Are you suggesting an option to automagically add that to all fields?

And would this be for reading, writing, or both?

Thanks,

-- Ken


Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

@kkrugler
Member Author

Hi all,

I'm going to look at merging 2.0-dev into master this weekend, with whatever is in that branch.

Then I'll do a 2.1.0 release to Conjars.

After that we can add in Mike's changes (hopefully as a pull request). I guess those would be a 2.2 release, since it's new functionality vs. just bug fixes.

Makes sense?

Thanks,

-- Ken

On Oct 26, 2012, at 9:17pm, Chris Severs wrote:

Hi Mike,

All those changes sound great. I can do the output codec if you don't want to worry about it but the others are probably best submitted as a patch. I think 2.0-develop will become master any day now and my guess is we'll have a separate develop branch where we can add new things.

One thing I just added which might be interesting for you is the ability to get the unpacked Avro record (similar to how SequenceFile support works) and also pass a packed Avro record to write out. I'm adding this to make it easier to use with the Scalding typed API but it might be useful for Cascalog too.

Regards,
Chris



@kkrugler
Member Author

Sounds good to me.


Chris

@kkrugler
Member Author

On my fork, I simply made all fields nullable, but wouldn't recommend that
for a general solution. It's the writing side where it matters, I think.

In Cascalog, nullable fields are named !field instead of ?field. It would be nice
to infer nullability from that. I need to look at the changes in the dev branch more
closely before making any recommendation with regard to this enhancement.
It may be easier to do with the new implementation, or it may be more
appropriate in some sort of lightweight Cascalog wrapper.
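
As a rough illustration of inferring nullability from the Cascalog prefix, the sketch below emits an Avro field declaration as JSON, using a union with "null" for !fields and the bare type for ?fields. The class and method names are hypothetical, the Avro type is passed in as a plain string, and nothing here reflects how cascading.avro actually builds schemas.

```java
/** Hypothetical sketch: derive an Avro field declaration from a Cascalog name. */
final class NullableFieldSketch {
    private NullableFieldSketch() {}

    /** In Cascalog, !field marks a nullable variable and ?field a non-nullable one. */
    static boolean isNullable(String cascalogName) {
        return cascalogName.startsWith("!");
    }

    /** Returns the JSON for one Avro record field of the given primitive type. */
    static String avroFieldJson(String cascalogName, String avroType) {
        // Drop a single leading '?' or '!' to get a legal Avro name.
        String name = cascalogName.replaceFirst("^[?!]", "");
        if (isNullable(cascalogName)) {
            // "null" comes first in the union so that the default can be null.
            return "{\"name\": \"" + name + "\", \"type\": [\"null\", \""
                    + avroType + "\"], \"default\": null}";
        }
        return "{\"name\": \"" + name + "\", \"type\": \"" + avroType + "\"}";
    }
}
```

For example, `avroFieldJson("!age", "int")` yields `{"name": "age", "type": ["null", "int"], "default": null}`, while `avroFieldJson("?name", "string")` yields `{"name": "name", "type": "string"}`.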

I will let you know.

... Mike
Please excuse typos (fat thumbing an iPad)


@kkrugler
Member Author

Sounds good to me too. I will come back around with patches, once I have a
chance to take the 2.1.0 release for a spin.

... Mike

@kkrugler
Member Author

Hi all,

a. I tagged master in GitHub as 1.0

b. I merged in the 2.1-develop branch

c. I set the version to be 2.1.0 in both the scheme and maven-plugin sub-project pom.xml files

d. I added a section to both pom.xml files:

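    <distributionManagement>
      <repository>
        <id>conjars</id>
        <name>Concurrent Conjars repository</name>
        <url>http://conjars.org/repo</url>
      </repository>
    </distributionManagement>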

If you then add an appropriate section to your ~/.m2/settings.xml file, you too can deploy to Conjars:

    <server>
      <id>conjars</id>
      <username>a registered username</username>
      <password>the password</password>
    </server>

e. I was able to deploy the scheme without any issues.

One oddity, though, is that since we're using cascading.avro as the groupId, this means it shows up in Conjars at http://conjars.org/repo/cascading/avro/

So it's in the Cascading namespace (for the Maven repo). I assume that's OK with Chris Wensel/Concurrent, but I should double-check.

f. I had an issue with deploying the maven-plugin

"mvn deploy" kind of worked here - it uploaded the jar/pom and associated files, but I got this error:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:2.5:deploy (default-deploy) on project avro-maven-plugin: Failed to deploy metadata: Could not transfer metadata org.apache.maven.artifact.repository.metadata.MetadataBridge@6c5bdfae from/to conjars (http://conjars.org/repo): Failed to transfer file: http://conjars.org/repo/cascading/avro/maven-metadata.xml. Return code is: 401 -> [Help 1]

I'm not sure why it's trying to write out a maven-metadata.xml file at the root of the cascading.avro package - probably something in the pom.xml would tell me, but I'm out of time today.

And I'm also not sure why this was rejected, but I assume it's a config setting for Conjars, where you can only create directories and then write files out to specific release dirs.

g. I tagged this version of the code as 2.1.0

h. I edited the pom.xml versions to be 2.2-SNAPSHOT, and pushed.

So we should be ready for further development.

Take a look, and if it seems good then we can post something to the mailing list.

Thanks!

-- Ken


@kkrugler
Member Author

This is from https://github.com/bixolabs/cascading.avro/pull/7

I'm hoping Mike Stanley can get back in sync with his modification.

@dkincaid

Is there any hope of getting these changes in? Right now it seems that this is not usable at all from Cascalog.

@mikestanley

I will take a look this week. No guarantees. It's been a long time since I needed anything further from this particular code, and it's literally just been running on autopilot. I'm probably years behind the latest code. That said, I'm guessing the changes are still pretty relevant. I will happily look to see if I can bring it forward as a pull request.

@dkincaid

Thanks, Mike. I took a look at it today, but couldn't figure out where the change was needed myself.

@dkincaid

dkincaid commented Sep 7, 2016

Could someone at least point me to where in the code the changes to support Cascalog field names would need to be made?

@kkrugler
Member Author

kkrugler commented Sep 7, 2016

Hi Dave. From what I can tell, the bulk of @mikestanley's changes are at mikestanley@330d1f0. There is also a (small) change at mikestanley@47bc6c7.
