
Downgrade Avro dependency to 1.7.4? #39

Open
ldcasillas-progreso opened this issue Mar 16, 2015 · 12 comments

@ldcasillas-progreso

As mentioned in issue #33, the 2.5.x branch of avro-scheme depends on a newer Avro version (1.7.7) than the one that ships with Hadoop 2.6.x (Avro 1.7.4). In addition, avro-scheme calls methods that exist in Avro 1.7.7 but not in 1.7.4. This means that an application using avro-scheme must set the Hadoop mapreduce.job.user.classpath.first configuration option to true to work reliably; otherwise it can get NoSuchMethodErrors, as detailed in issue #33.

However, setting mapreduce.job.user.classpath.first=true can cause problems with other components or libraries that the application uses. For example, after modifying my application to use that setting, I had to downgrade the Guava library to 14.0.1, because the setting causes Hadoop to put my application's newer Guava ahead of its own, and current Hadoop versions contain code that depends on methods that were removed in recent Guava versions.

Therefore, it seems that it would be prudent for avro-scheme to be more conservative about which Avro version is required, and try not to use a version newer than whatever Hadoop ships with.

Note that this isn't just a matter of downgrading the Avro dependency, since avro-scheme currently does use methods that were introduced in Avro 1.7.5 and later (thus the NoSuchMethodError detailed in issue #33).

@arun-agarwal

Any updates on this issue? We are hitting exactly this problem in our environment. I see it is pretty old, almost a year. Is cascading-avro still active?

@ldcasillas-progreso
Author

I found a workaround, but it requires relatively recent versions of Hadoop. It involves two settings:

  1. Have your Cascading application set mapreduce.job.user.classpath.first=true in the properties that it uses to create its Hadoop configuration.
  2. Set the environment variable HADOOP_USE_CLIENT_CLASSLOADER=true when you run the hadoop jar command to launch your application. (I think this was introduced in Hadoop 2.6.0, but double check that!)

The first setting causes remote MapReduce tasks to put your application's dependencies ahead of Hadoop's. The second causes the master process to do the same.
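For the first setting, here is a minimal sketch of what this can look like in the Cascading application (Cascading 2.x's HadoopFlowConnector is assumed; the class name is only illustrative):

    import java.util.Properties;

    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.property.AppProps;

    public class ClasspathFirstExample {
      public static void main(String[] args) {
        Properties properties = new Properties();
        // Setting 1: make remote MapReduce tasks load the application's jars before Hadoop's.
        properties.setProperty("mapreduce.job.user.classpath.first", "true");
        AppProps.setApplicationJarClass(properties, ClasspathFirstExample.class);
        FlowConnector flowConnector = new HadoopFlowConnector(properties);
        // ... build and complete flows with flowConnector ...
        // Setting 2 is an environment variable for the client JVM, exported in the shell
        // before running `hadoop jar`: HADOOP_USE_CLIENT_CLASSLOADER=true
      }
    }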

I have verified that this works around the Avro 1.7.4 vs. 1.7.7 issues. I have not verified that it works past the Guava version issues, however—and there's a good chance it doesn't, given Guava's very aggressive deprecation policies.

@arun-agarwal

I finally got it working, but I am now hitting another problem (it might require a different issue ticket, but I am adding it here for the sake of completeness): an Avro file that was generated with a different version (most likely 1.7.4) is not readable in 1.7.6. I am planning to clone and build your project to establish whether that is indeed the issue.
Any idea if Avro 1.7.6 breaks binary compatibility with 1.7.4? I checked https://github.com/apache/avro/blob/master/CHANGES.txt but couldn't find any explicitly called-out change. I will keep this space posted.

@ldcasillas-progreso
Author

If I recall correctly, you are correct: in some situations, the Avro files generated by 1.7.4 are not readable in 1.7.7 (I did not test 1.7.5 or 1.7.6, but I think they behave the same).

If I recall correctly, this is the relevant Avro bug ticket:

https://issues.apache.org/jira/browse/AVRO-1295

The problem is that 1.7.4 incorrectly serializes schemas to JSON in some cases. The problem as I experienced it involved Avro record schemas with two fields of the same enum type; in 1.7.4, one of the fields in the generated JSON schema had the namespace specified and the other didn't.
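For illustration only, here is a sketch of the kind of schema that triggered the problem for me (compiled against Avro 1.7.4; the names are made up): a record whose two fields share the same enum type.

    import java.util.Arrays;

    import org.apache.avro.Schema;
    import org.apache.avro.Schema.Field;

    public class SharedEnumSchemaExample {
      public static void main(String[] args) {
        Schema color = Schema.createEnum("Color", null, "com.example",
            Arrays.asList("RED", "GREEN", "BLUE"));
        Schema record = Schema.createRecord("Pair", null, "com.example", false);
        record.setFields(Arrays.asList(
            new Field("first", color, null, null),
            new Field("second", color, null, null)));
        // Per the description above, Avro 1.7.4 could emit the namespace on only one of
        // the two "Color" references in the JSON below; 1.7.7 serializes it consistently.
        System.out.println(record.toString(true));
      }
    }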

I didn't have the burden of having to read old files under the new schema, but if that's a problem there should be a workaround for it—possibly by writing a raw MapReduce job that is able to read the old schema and output the new one.
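The core of such a conversion is Avro's schema resolution: open the old file with the corrected schema as the reader schema and copy the records out. A local, non-MapReduce sketch of the idea (the file names and the corrected schema file are placeholders):

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class RewriteUnderNewSchema {
      public static void main(String[] args) throws IOException {
        Schema newSchema = new Schema.Parser().parse(new File("corrected.avsc"));
        // The reader resolves the writer schema embedded in old.avro against newSchema.
        try (DataFileReader<GenericRecord> in = new DataFileReader<GenericRecord>(
                 new File("old.avro"), new GenericDatumReader<GenericRecord>(null, newSchema));
             DataFileWriter<GenericRecord> out = new DataFileWriter<GenericRecord>(
                 new GenericDatumWriter<GenericRecord>(newSchema))) {
          out.create(newSchema, new File("new.avro"));
          for (GenericRecord record : in) {
            out.append(record);
          }
        }
      }
    }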

@arun-agarwal

Thank you, Luis Casillas, for persisting with me on solving this issue. I tried many other combinations and finally resorted to breakpoint debugging. I think the problem in reading Avro files lies in the Cascading code and has nothing to do with cascading.avro. Here is the possible theory:

  1. AvroScheme implements sourceConfInit correctly.
  2. It gets called from the right place: HadoopFlowStep.getInitializedConfig.
  3. That in turn calls this.initFromSources(flowProcess, conf), i.e. it initializes the source taps (the record sources, which in our case use AvroScheme).
  4. initFromSources correctly delegates to all the source taps one by one and calls
    tap.sourceConfInit(flowProcess, accumulatedJob);
  5. AvroScheme gets called to populate the source configuration on this flow process / job config, via
    this.getScheme().sourceConfInit(flowProcess, this, conf);
  6. That sets the job config correctly. Now here comes the interesting turn:
    once the source tap configuration is finished, the initFromSources method overrides this input format by unconditionally calling
    MultiInputFormat.addInputFormat(conf, streamedJobs);
    which internally calls
    toJob.setInputFormat(MultiInputFormat.class);
    This overrides what the AvroScheme class has done with conf.setInputFormat(AvroInputFormat.class);

I suspect this is why the job is not able to read the Avro files: the MultiInputFormat class doesn't know how to read and interpret Avro files at job execution time.

I am not sure how to convey this to the right team and check whether this sequence of thoughts is indeed correct. Any pointers?

@arun-agarwal

By the way, I have tried both Cascading 3.0.0 and 2.0.0, with cascading.avro:avro-scheme 2.5.0.

@arun-agarwal

I could finally solve it for our case, and cascading.avro:avro-scheme is working like a charm. This is the piece of code that was causing us the problem:

Class: org.apache.avro.mapred.AvroInputFormat
Method: protected FileStatus[] listStatus(JobConf job) throws IOException

Code piece:

    if (file.getPath().getName().endsWith(AvroOutputFormat.EXT))
      result.add(file);

This requires the Avro files to have the .avro extension; otherwise they aren't added to the splits on the input paths. I tried renaming our Avro files accordingly and things started working fine. There is one process in our pipeline that emits Avro files without this extension, which is why the Cascading steps were failing.

Thanks for all the help so far in solving this case for us.

@kkrugler
Member

kkrugler commented May 6, 2016

Hi Arun - so for your issue re files not ending with ".avro", is there a suggested change? Or do you agree that Avro input files should be required to end with ".avro"?

@ksranga

ksranga commented May 12, 2016

One way to check whether a file is Avro is to read it and use DataFileReader.getSchema() instead of depending on the filename extension.
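A sketch of that idea (the file path is an assumption): opening the file with Avro's DataFileReader fails unless the file is a genuine Avro container, and otherwise exposes the embedded schema.

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class AvroFileCheck {
      public static void main(String[] args) {
        try (DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                 new File("part-00000"), new GenericDatumReader<GenericRecord>())) {
          System.out.println("Avro container with schema: " + reader.getSchema());
        } catch (IOException e) {
          System.out.println("Not an Avro container file: " + e.getMessage());
        }
      }
    }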

@arun-agarwal

Hi kkrugler, no suggested change. The only thing we could add to the documentation is that setting the following property in the application does the trick:
properties.setProperty("avro.mapred.ignore.inputs.without.extension", "false");
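For reference, a minimal sketch of where that property would sit in a Cascading application (HadoopFlowConnector assumed, as in the earlier sketch):

    import java.util.Properties;

    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop.HadoopFlowConnector;

    public class IgnoreExtensionExample {
      public static void main(String[] args) {
        Properties properties = new Properties();
        // Let AvroInputFormat also pick up input files that do not end in ".avro".
        properties.setProperty("avro.mapred.ignore.inputs.without.extension", "false");
        FlowConnector flowConnector = new HadoopFlowConnector(properties);
        // ... build and complete flows with flowConnector ...
      }
    }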

@arun-agarwal

@ksranga: the code that checks for the extension is part of AvroInputFormat, and avro-scheme or Cascading has very little say in it. The only option is setting the avro.mapred.ignore.inputs.without.extension property so that files without the extension are not ignored, so this small point could probably be added to our README.md or other avro-scheme documentation.

@kkrugler
Member

Hi @vmagotra - I think the latest Cascading 3.1 targets Hadoop 2.7, which still uses Avro 1.7.4. So I agree we should switch back to that version.

kkrugler added this to the 3.1 milestone Jul 12, 2016