
Inferring schema from a source with other files or paths fails #7

Open
kkrugler opened this issue Aug 21, 2013 · 15 comments

@kkrugler
Member

Trying to infer the schema from a path which has other files, such as _SUCCESS, will likely fail.

There is a good example of how to do this properly here:
https://github.com/josephadler/fast-avro-storage/blob/master/src/main/java/com/linkedin/pig/FastAvroStorage.java#L142

We should do something similar.

@kkrugler
Member Author

So the change is basically to do:

```java
FileStatus[] statusArray = fs.globStatus(p, visibleFiles);
```

when enumerating files, with:

```java
public static final PathFilter visibleFiles = new PathFilter() {
    @Override
    public boolean accept(Path p) {
        return !p.getName().startsWith("_");
    }
};
```

Yes?

He's also got some code to try to do a one-deep traverse of directories found inside of the target directory. It seems like we'd need to leverage the enclosing Tap to do this, as otherwise the filtering of files that can be done at that level wouldn't be properly applied.
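[Editor's note] The one-deep walk described above might look like the following sketch. It uses `java.nio.file` as a plain-Java stand-in for Hadoop's `FileSystem` API, and the names (`OneDeepWalk`, `isVisible`, `findDataFile`) are hypothetical, not from the project:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OneDeepWalk {
    // Hypothetical helper: true for entries Hadoop would treat as data,
    // i.e. names not prefixed with "_" (e.g. _SUCCESS) or "." (e.g. .crc files).
    static boolean isVisible(Path p) {
        String name = p.getFileName().toString();
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Look for a data file directly in dir, then exactly one level down,
    // mirroring Hadoop's one-deep input enumeration.
    static Optional<Path> findDataFile(Path dir) throws IOException {
        List<Path> entries;
        try (Stream<Path> s = Files.list(dir)) {
            entries = s.filter(OneDeepWalk::isVisible).sorted().collect(Collectors.toList());
        }
        for (Path p : entries) {
            if (Files.isRegularFile(p)) {
                return Optional.of(p);
            }
            if (Files.isDirectory(p)) {
                try (Stream<Path> sub = Files.list(p)) {
                    Optional<Path> hit = sub.filter(OneDeepWalk::isVisible)
                                            .filter(Files::isRegularFile)
                                            .findFirst();
                    if (hit.isPresent()) {
                        return hit;
                    }
                }
            }
        }
        return Optional.empty();
    }
}
```

With a directory containing only `_SUCCESS` and a subdirectory of part files, `findDataFile` skips the job-control file and returns the first part file it finds one level down.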

-- Ken

On Nov 2, 2012, at 2:23pm, Chris Severs wrote:

Trying to infer the schema from a path which has other files, such as _SUCCESS, will likely fail.

There is a good example of how to do this properly here:
https://github.com/josephadler/fast-avro-storage/blob/master/src/main/java/com/linkedin/pig/FastAvroStorage.java#L142

We should do something similar.




Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

@kkrugler
Member Author

Would it also make sense to ignore hidden files?

@kkrugler
Member Author

I tried essentially what Ken suggested. It works when given a glob path but not a plain path, though that's an easy fix.

What is the default behavior for Hadoop when you specify a directory as the input? Does it grab every file under that directory, regardless of how deep? If so, we should just go down the path until we find some good files and infer from those.

I think we should also ignore hidden files. Is there a good filter mask that would do so?
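[Editor's note] Hadoop's own `FileInputFormat` uses a hidden-file filter that rejects names starting with `_` or `.`. A filter mask with the same behavior could be sketched as follows, using a plain-Java `Predicate` as a stand-in for a Hadoop `PathFilter` (the class and field names are hypothetical):

```java
import java.util.function.Predicate;

public class VisibleNames {
    // Same rule as FileInputFormat's hidden-file filter: reject
    // job-control files (_SUCCESS, _logs) and hidden files
    // (e.g. .part-00000.avro.crc).
    public static final Predicate<String> VISIBLE =
        name -> !name.startsWith("_") && !name.startsWith(".");
}
```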

@kkrugler
Member Author

I added a bit that walks down the directories and finds a good avro file. This is also included in the new test.

@kkrugler
Member Author

On Nov 2, 2012, at 9:04pm, Chris Severs wrote:

I added a bit that walks down the directories and finds a good avro file. This is also included in the new test.

What happens if the directory is empty?

I remember Matt saying that this would cause a failure - I think he filed an issue (sorry, not online right now).

-- Ken



@kkrugler
Member Author

On Nov 2, 2012, at 4:52pm, Chris Severs wrote:

I tried essentially what Ken suggested. It works when given a glob path but not a plain path, though that's an easy fix.

What is the default behavior for Hadoop when you specify a directory as the input? Does it grab every file under that directory regardless of how deep?

No, it just goes one deep.

The same is true for GlobHfs.

If so we should just go down the path until we find some good files and infer off that.

I think we should also ignore hidden files. Is there a good filter mask that would do so?

I haven't verified this, but I think the HDFS call being used will ignore invisible files (e.g. the .part-xxxx.crc files)

-- Ken



@kkrugler
Member Author

For "empty", do you mean empty in the sense that there is nothing at all in the directory that was provided, or empty in the sense that there is an empty directory among other directories which have avro files? If the former, a runtime error is thrown since there are no files to infer from. If the latter, it will keep looking until it finds a file.

For the depth, I can bound it to go only one deep with no problem. I'll do that and push it up.

@kkrugler
Member Author

Hi Chris,

On Nov 3, 2012, at 2:25pm, Chris Severs wrote:

For empty, do you mean empty in the sense that there is nothing in the directory that was provided at all

Correct.
or empty in the sense that there is an empty directory among other directories which have avro files? If the former, a runtime error is thrown since there are no files to infer from.

I believe this is what Matt filed an issue about, since (for a regular Cascading Flow) normally having an empty file is OK.

So I imagine he'd prefer having cascading.avro return Fields.UNKNOWN for the empty directory source fields, versus an exception.

CCing Matt so he can chime in here.

-- Ken

If the latter, it will keep looking until it finds a file.

For the depth I can bound it to go only 1 deep with no problem. I'll do that and push it up.





@kkrugler
Member Author

That sounds reasonable. I can make a finer check to see if any files (not hidden or starting with "_") are present and, if so, try to get the schema from them. If no files are present, I can just set the schema to Schema.NULL. I think we should still throw an error if we can't get the schema from the first file we find, however.
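[Editor's note] The check described above could be sketched as follows. Avro and Hadoop types are replaced with plain-Java stand-ins, and all names (`SchemaChoice`, `firstDataFile`) are hypothetical; an empty result is where the caller would fall back to Schema.NULL rather than throwing:

```java
import java.util.List;
import java.util.Optional;

public class SchemaChoice {
    // Pick the first data file to infer the schema from. An empty
    // result means no usable files were found, and the caller should
    // fall back to Schema.NULL (and Fields.UNKNOWN for the source
    // fields, per the discussion below) instead of throwing.
    static Optional<String> firstDataFile(List<String> names) {
        return names.stream()
                    .filter(n -> !n.startsWith("_") && !n.startsWith("."))
                    .findFirst();
    }
}
```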

@kkrugler
Member Author

I have usually manually created an empty Avro file when a job produced an empty result, to avoid the exception. However, yes, typically we would not want an exception for empty input, so we can avoid having to create a fake file.

@kkrugler
Member Author

On Nov 4, 2012, at 12:05pm, Chris Severs wrote:

That sounds reasonable. I can make a finer check to see if any files (not hidden or starting with _) are present and if so try to get the schema from them. If no files are present then I can just set the schema to be Schema.NULL. I think we should still throw an error if we can't get the schema from the first file we get however.

If the schema is set to Schema.NULL, will this ensure that retrieveSourceFields() sets the source fields to be Fields.UNKNOWN?

Also, I think Fields.UNKNOWN should be used in place of Fields.ALL for setSourceFields, in the main AvroScheme constructor here:

```java
else if (schema == null) {
    setSinkFields(Fields.ALL);
    setSourceFields(Fields.ALL);
}
```

I don't think you can use Fields.ALL as the field specification in a source tap.

Anyway, let me know when you've got something pushed and I'll take a look at it.

-- Ken



@kkrugler
Member Author

You're right that Fields.UNKNOWN is the correct thing to do in the constructor. I think for Schema.NULL we can probably make a special case and set it to Fields.UNKNOWN as well (I can't think of a time when Schema.NULL would be used for the data schema and would have a problem with Fields.UNKNOWN). I'm out of town tomorrow, so I'll probably get to this on Wednesday at the soonest.

@kkrugler
Member Author

I haven't had a chance to get this completely done yet. I'm out of town for a week so I'll pick it up when I get back.

@kkrugler
Member Author

This is fixed in the 2.2-wip branch if someone wants to take a look. I ended up setting the source fields to Fields.NONE in the case of Schema.NULL. Does that sound reasonable? No exception is thrown in this case. I need to write a quick test for this.

Matt, does this work for your use case?

@kkrugler
Member Author

Hi Chris,

On Dec 23, 2012, at 10:38am, Chris Severs wrote:

This is fixed in the 2.2-wip branch if someone wants to take a look. I ended up setting the source fields to Fields.NONE in the case of Schema.NULL. Does that sound reasonable? No exception is thrown in this case. I need to write a quick test for this.

Matt, does this work for your use case?

I don't have a good use case to test this out right now.

But CCing Matt, as I think he should.

-- Ken



@ghost ghost assigned ccsevers Aug 21, 2013
@kkrugler kkrugler added this to the 3.1 milestone Jul 12, 2016