
Downgrade Avro dependency to 1.7.4? #39

Open
ldcasillas-progreso opened this issue Mar 16, 2015 · 12 comments

@ldcasillas-progreso

As mentioned in issue #33, the 2.5.x branch of avro-scheme depends on a newer Avro version (1.7.7) than the one that ships with Hadoop 2.6.x (Avro 1.7.4). In addition, avro-scheme calls methods that exist in Avro 1.7.7 but not in 1.7.4. This means that an application using avro-scheme must set the Hadoop mapreduce.job.user.classpath.first configuration option to true to work reliably; otherwise it can get NoSuchMethodErrors, as detailed in issue #33.

However, setting mapreduce.job.user.classpath.first=true can cause problems with other components or libraries that the application uses. For example, after modifying my application to use that setting, I had to downgrade the Guava library to 14.0.1, because the setting causes Hadoop to put my application's newer Guava ahead of its own, and current Hadoop versions contain code that depends on methods that were removed in recent Guava versions.

Therefore, it seems that it would be prudent for avro-scheme to be more conservative about which Avro version is required, and try not to use a version newer than whatever Hadoop ships with.

Note that this isn't just a matter of downgrading the Avro dependency, since avro-scheme currently does use methods that were introduced in Avro 1.7.5 and later (thus the NoSuchMethodError detailed in issue #33).

@arun-agarwal

Any updates on this issue? We are hitting exactly this problem in our environment. I see it is pretty old, almost a year. Is cascading-avro still active?

@ldcasillas-progreso
Author

I found a workaround, but it requires relatively recent versions of Hadoop. It involves two settings:

  1. Have your Cascading application set mapreduce.job.user.classpath.first=true in the properties that it uses to create its Hadoop configuration.
  2. Set the environment variable HADOOP_USE_CLIENT_CLASSLOADER=true when you run the hadoop jar command to launch your application. (I think this was introduced in Hadoop 2.6.0, but double check that!)

The first setting causes remote MapReduce tasks to put your application's dependencies ahead of Hadoop's. The second causes the master process to do the same.
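For the first setting, here is a minimal sketch of what this can look like in the Cascading application (Cascading 2.x's HadoopFlowConnector is assumed; the class name is only illustrative):

    import java.util.Properties;

    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.property.AppProps;

    public class ClasspathFirstExample {
      public static void main(String[] args) {
        Properties properties = new Properties();
        // Setting 1: make remote MapReduce tasks load the application's jars before Hadoop's.
        properties.setProperty("mapreduce.job.user.classpath.first", "true");
        AppProps.setApplicationJarClass(properties, ClasspathFirstExample.class);
        FlowConnector flowConnector = new HadoopFlowConnector(properties);
        // ... build and complete flows with flowConnector ...
        // Setting 2 is an environment variable for the client JVM, exported in the shell
        // before running `hadoop jar`: HADOOP_USE_CLIENT_CLASSLOADER=true
      }
    }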

I have verified that this works around the Avro 1.7.4 vs. 1.7.7 issues. I have not verified that it works past the Guava version issues, however—and there's a good chance it doesn't, given Guava's very aggressive deprecation policies.

@arun-agarwal

I finally got it working, but I am now hitting another problem (it might require a different issue ticket, but I am adding it here for the sake of completeness): an Avro file that was generated with a different version (most likely 1.7.4) is not readable in 1.7.6. I am planning to clone and build your project to establish whether that is indeed the issue.
Any idea if Avro 1.7.6 breaks binary compatibility with 1.7.4? I checked https://github.com/apache/avro/blob/master/CHANGES.txt but couldn't find any explicitly called-out change. I will keep this space posted.

@ldcasillas-progreso
Author

If I recall correctly, you are correct: in some situations, the Avro files generated by 1.7.4 are not readable in 1.7.7 (I did not test 1.7.5 or 1.7.6, but I think they behave the same).

If I recall correctly, this is the relevant Avro bug ticket:

https://issues.apache.org/jira/browse/AVRO-1295

The problem is that 1.7.4 incorrectly serializes schemas to JSON in some cases. The problem as I experienced it involved Avro record schemas with two fields of the same enum type; in 1.7.4, one of the fields in the generated JSON schema had the namespace specified and the other didn't.
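For illustration only, here is a sketch of the kind of schema that triggered the problem for me (compiled against Avro 1.7.4; the names are made up): a record whose two fields share the same enum type.

    import java.util.Arrays;

    import org.apache.avro.Schema;
    import org.apache.avro.Schema.Field;

    public class SharedEnumSchemaExample {
      public static void main(String[] args) {
        Schema color = Schema.createEnum("Color", null, "com.example",
            Arrays.asList("RED", "GREEN", "BLUE"));
        Schema record = Schema.createRecord("Pair", null, "com.example", false);
        record.setFields(Arrays.asList(
            new Field("first", color, null, null),
            new Field("second", color, null, null)));
        // Per the description above, Avro 1.7.4 could emit the namespace on only one of
        // the two "Color" references in the JSON below; 1.7.7 serializes it consistently.
        System.out.println(record.toString(true));
      }
    }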

I didn't have the burden of having to read old files under the new schema, but if that's a problem there should be a workaround for it—possibly by writing a raw MapReduce job that is able to read the old schema and output the new one.
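The core of such a conversion is Avro's schema resolution: open the old file with the corrected schema as the reader schema and copy the records out. A local, non-MapReduce sketch of the idea (the file names and the corrected schema file are placeholders):

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class RewriteUnderNewSchema {
      public static void main(String[] args) throws IOException {
        Schema newSchema = new Schema.Parser().parse(new File("corrected.avsc"));
        // The reader resolves the writer schema embedded in old.avro against newSchema.
        try (DataFileReader<GenericRecord> in = new DataFileReader<GenericRecord>(
                 new File("old.avro"), new GenericDatumReader<GenericRecord>(null, newSchema));
             DataFileWriter<GenericRecord> out = new DataFileWriter<GenericRecord>(
                 new GenericDatumWriter<GenericRecord>(newSchema))) {
          out.create(newSchema, new File("new.avro"));
          for (GenericRecord record : in) {
            out.append(record);
          }
        }
      }
    }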

@arun-agarwal

Thank you, Luis Casillas, for persisting with me on solving this issue. I tried many other combinations and finally resorted to breakpoint debugging. I think the problem in reading Avro files lies in the Cascading code and has nothing to do with cascading.avro. Here is the possible theory:

  1. AvroScheme implements sourceConfInit correctly.
  2. It gets called from the right place: HadoopFlowStep.getInitializedConfig.
  3. That in turn calls this.initFromSources(flowProcess, conf), i.e. it initializes the source taps (the record sources, which in our case use AvroScheme).
  4. initFromSources correctly delegates to all the source taps one by one and calls
    tap.sourceConfInit(flowProcess, accumulatedJob);
  5. AvroScheme gets called to populate the source configuration on this flow process / job config, via
    this.getScheme().sourceConfInit(flowProcess, this, conf);
  6. That sets the job config correctly. Now here comes the interesting turn:
    once the source tap configuration is finished, the initFromSources method overrides this input format by unconditionally calling
    MultiInputFormat.addInputFormat(conf, streamedJobs);
    which internally calls
    toJob.setInputFormat(MultiInputFormat.class);
    This overrides what the AvroScheme class has done with conf.setInputFormat(AvroInputFormat.class);

I suspect this is why the job is not able to read the Avro files: the MultiInputFormat class doesn't know how to read and interpret Avro files at job execution time.

I am not sure how to convey this to the right team and check whether this sequence of thoughts is indeed correct. Any pointers?

@arun-agarwal

By the way, I have tried both Cascading 3.0.0 and 2.0.0, with cascading.avro:avro-scheme 2.5.0.

@arun-agarwal

I could finally solve it for our case, and cascading.avro:avro-scheme is working like a charm. This is the piece of code that was causing us the problem:

Class: org.apache.avro.mapred.AvroInputFormat
Method: protected FileStatus[] listStatus(JobConf job) throws IOException

Code piece:

    if (file.getPath().getName().endsWith(AvroOutputFormat.EXT))
      result.add(file);

This requires the Avro files to have the .avro extension; otherwise they aren't added to the splits on the input paths. I tried renaming our Avro files accordingly and things started working fine. There is one process in our pipeline that emits Avro files without this extension, which is why the Cascading steps were failing.

Thanks for all the help so far in solving this case for us.

@kkrugler
Member

kkrugler commented May 6, 2016

Hi Arun - so for your issue re files not ending with ".avro", is there a suggested change? Or do you agree that Avro input files should be required to end with ".avro"?

@ksranga

ksranga commented May 12, 2016

One way to check whether a file is Avro is to read it and use DataFileReader.getSchema() instead of depending on the filename extension.
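A sketch of that idea (the file path is an assumption): opening the file with Avro's DataFileReader fails unless the file is a genuine Avro container, and otherwise exposes the embedded schema.

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class AvroFileCheck {
      public static void main(String[] args) {
        try (DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                 new File("part-00000"), new GenericDatumReader<GenericRecord>())) {
          System.out.println("Avro container with schema: " + reader.getSchema());
        } catch (IOException e) {
          System.out.println("Not an Avro container file: " + e.getMessage());
        }
      }
    }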

@arun-agarwal

Hi kkrugler, no suggested change. The only thing we could add to the documentation is that setting the following property in the application does the trick:
properties.setProperty("avro.mapred.ignore.inputs.without.extension", "false");
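For reference, a minimal sketch of where that property would sit in a Cascading application (HadoopFlowConnector assumed, as in the earlier sketch):

    import java.util.Properties;

    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop.HadoopFlowConnector;

    public class IgnoreExtensionExample {
      public static void main(String[] args) {
        Properties properties = new Properties();
        // Let AvroInputFormat also pick up input files that do not end in ".avro".
        properties.setProperty("avro.mapred.ignore.inputs.without.extension", "false");
        FlowConnector flowConnector = new HadoopFlowConnector(properties);
        // ... build and complete flows with flowConnector ...
      }
    }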

@arun-agarwal

@ksranga: the code that checks for the extension is part of AvroInputFormat, and avro-scheme or Cascading has very little say in it. The only option is setting the avro.mapred.ignore.inputs.without.extension property so that files without the extension are not ignored, so this small point could probably be added to our README.md or other avro-scheme documentation.

@kkrugler
Member

Hi @vmagotra - I think the latest Cascading 3.1 targets Hadoop 2.7, which still uses Avro 1.7.4. So I agree we should switch back to that version.

kkrugler added this to the 3.1 milestone Jul 12, 2016