Downgrade Avro dependency to 1.7.4? #39
Any updates on this issue? We are hitting exactly this problem in our environment. I see it is pretty old, almost a year now. Is cascading-avro still active?
I found a workaround, but it requires relatively recent versions of Hadoop. It involves two settings:
The first setting causes remote MapReduce tasks to put your application's dependencies ahead of Hadoop's. The second causes the master process to do the same. I have verified that this works around the Avro 1.7.4 vs. 1.7.7 issues. I have not verified that it works past the Guava version issues, however, and there's a good chance it doesn't, given Guava's very aggressive deprecation policies.
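For reference, a minimal sketch of applying the task-side half of that workaround through a Hadoop JobConf, assuming the `mapreduce.job.user.classpath.first` property named later in this thread; the second, master-side setting is not named here and is left as a comment:

```java
import org.apache.hadoop.mapred.JobConf;

public class UserClasspathFirst {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Make task containers load the job's jars before Hadoop's own,
    // so the application's Avro 1.7.7 wins over Hadoop's bundled 1.7.4.
    conf.setBoolean("mapreduce.job.user.classpath.first", true);
    // The equivalent property for the master process is not named in
    // this thread; check your Hadoop version's mapred-default.xml.
  }
}
```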
I finally got it working, but I'm hitting another problem (it might require a different issue ticket, but I'm adding it here for the sake of completeness): an Avro file that was generated with a different version (most likely 1.7.4) is not readable in 1.7.6. I am planning to clone and build your code to establish whether that is indeed the issue.
If I recall correctly, you are right: in some situations, the Avro files generated by 1.7.4 are not readable in 1.7.7 (I did not test 1.7.5 or 1.7.6, but I think they're the same). If I recall correctly, this is the relevant Avro bug ticket: https://issues.apache.org/jira/browse/AVRO-1295 The problem is that 1.7.4 incorrectly serializes schemas to JSON in some cases. As I experienced it, the problem was Avro record schemas with two fields of the same enum type; in 1.7.4, one of the fields in the generated JSON schema had the namespace specified and the other didn't. I didn't have the burden of having to read old files under the new schema, but if that's a problem there should be a workaround for it, possibly by writing a raw MapReduce job that reads the old schema and outputs the new one.
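A minimal illustration of the schema shape described above, with invented record and enum names; per the description, Avro 1.7.4 could re-serialize such a schema with the enum's namespace on one field reference but not the other (AVRO-1295):

```java
import org.apache.avro.Schema;

public class SchemaRoundTrip {
  public static void main(String[] args) {
    // A record with two fields sharing one enum type; all names here
    // are invented purely for illustration.
    String json =
        "{\"type\":\"record\",\"name\":\"Pair\",\"namespace\":\"example\","
      + "\"fields\":["
      + "{\"name\":\"first\",\"type\":{\"type\":\"enum\",\"name\":\"Color\","
      + "\"namespace\":\"example\",\"symbols\":[\"RED\",\"GREEN\"]}},"
      + "{\"name\":\"second\",\"type\":\"example.Color\"}"
      + "]}";
    Schema schema = new Schema.Parser().parse(json);
    // Re-serialize to JSON; under 1.7.4 the two references to Color
    // could come out with inconsistent namespace qualification.
    System.out.println(schema.toString(true));
  }
}
```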
Thank you, Luis Casillas, for persisting with me on solving this issue. I tried many other combinations and finally did some breakpoint debugging. I think the problem in reading Avro files lies with the Cascading code and has nothing to do with cascading.avro. Here is the possible theory:
I suspect this is the reason the job is not able to read the Avro files: the MultiInputFormat class doesn't know how to read and interpret Avro files at job execution time. I am not sure how to convey this to the right team and see if this sequence of thoughts is indeed correct. Any pointers?
By the way, I have tried both Cascading 3.0.0 and 2.0.0, with cascading-avro:avro-scheme 2.5.0.
Finally, I could solve it for our case, and cascading.avro:avro-scheme is working like a charm. The piece of code causing us the problem is the file-listing logic in AvroInputFormat (see the sketch below): it requires Avro files to carry the .avro extension, or they won't be added to the splits on the input paths. I tried renaming our Avro files accordingly and things started working fine. There is one process in our pipeline which emits Avro files without this extension, and because of that the Cascading steps fail. Thanks for all the help so far in solving this case for us.
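A paraphrase (not the verbatim Avro source) of the filtering behavior described above; in Avro 1.7.x, `org.apache.avro.mapred.AvroInputFormat` lists input files roughly like this, silently dropping anything without the .avro extension:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;

public class ExtensionFilterSketch {
  // Sketch of the effect of AvroInputFormat.listStatus(): only paths
  // ending in ".avro" survive; everything else is dropped without
  // any warning, so those files never become input splits.
  static List<FileStatus> keepAvroOnly(FileStatus[] candidates) {
    List<FileStatus> result = new ArrayList<FileStatus>();
    for (FileStatus file : candidates) {
      if (file.getPath().getName().endsWith(".avro")) {
        result.add(file);
      }
    }
    return result;
  }
}
```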
Hi Arun - so for your issue regarding files not ending with ".avro", is there a suggested change? Or do you agree that Avro input files should be required to end with ".avro"?
One way to check whether a file is Avro is to open it and call DataFileReader.getSchema() instead of depending on the filename extension.
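A minimal sketch of that check for a local file; constructing the reader fails on a non-Avro file because the Avro magic bytes are missing:

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroSniffer {
  // Returns true if the file opens as an Avro container file,
  // regardless of its filename extension.
  static boolean looksLikeAvro(File f) {
    try {
      DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
          f, new GenericDatumReader<GenericRecord>());
      System.out.println("Schema: " + reader.getSchema());
      reader.close();
      return true;
    } catch (IOException e) {
      return false;
    }
  }
}
```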
Hi KKrugler, no suggested change; the only thing we could add to the Avro documentation is a note about this .avro extension requirement.
@ksranga: the code that checks for the extension is part of AvroInputFormat, and avro-scheme or Cascading has very little say in it. The only option is setting up a property; see the sketch below.
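The comment above doesn't name the property; one candidate, assuming a later Avro 1.7.x release, is `avro.mapred.ignore.inputs.without.extension`, which is supposed to disable the extension check. Treat the name as an assumption and verify it against your Avro version:

```java
import org.apache.hadoop.mapred.JobConf;

public class DisableExtensionCheck {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Assumed property name -- verify that your Avro version's
    // AvroInputFormat actually honors it before relying on this.
    conf.setBoolean("avro.mapred.ignore.inputs.without.extension", false);
  }
}
```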
hi @vmagotra - I think the latest Cascading 3.1 targets Hadoop 2.7, which still uses Avro 1.7.4. So I agree we should switch back to that version.
As mentioned in issue #33, the 2.5.x branch of avro-scheme depends on a newer Avro version (1.7.7) than the one that ships with Hadoop 2.6.x (Avro 1.7.4). In addition to that, avro-scheme calls methods that exist in Avro 1.7.7 but not in 1.7.4. This means that an application that uses avro-scheme must set the Hadoop `mapreduce.job.user.classpath.first` configuration option to `true` to work reliably; otherwise it can get `NoSuchMethodError`s, as detailed in issue #33.

However, setting `mapreduce.job.user.classpath.first=true` can cause problems with other components or libraries that the application uses. For example, after modifying my application to use that setting, I had to downgrade the Guava library to 14.0.1, because the setting causes Hadoop to put my application's newer Guava ahead of its own, and current Hadoop versions contain code that depends on methods that were removed in recent Guava versions.

Therefore, it seems that it would be prudent for `avro-scheme` to be more conservative about which Avro version it requires, and to try not to use a version newer than whatever Hadoop ships with. Note that this isn't just a matter of downgrading the Avro dependency, since avro-scheme currently does use methods that were introduced in Avro 1.7.5 and later (thus the `NoSuchMethodError` detailed in issue #33).
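When diagnosing this kind of `NoSuchMethodError`, it helps to confirm which copy of Avro actually won on the runtime classpath; a small probe using only standard Java APIs:

```java
import org.apache.avro.Schema;

public class AvroVersionProbe {
  public static void main(String[] args) {
    // The jar that Schema was loaded from shows whether the application's
    // Avro or Hadoop's bundled Avro is first on the classpath.
    System.out.println(Schema.class.getProtectionDomain()
        .getCodeSource().getLocation());
  }
}
```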