
Inferring schema from a source with other files or paths fails #7

Open
kkrugler opened this issue Aug 21, 2013 · 15 comments

@kkrugler
Member

Trying to infer the schema from a path which has other files, such as _SUCCESS, will likely fail.

There is a good example of how to do this properly here:
https://github.com/josephadler/fast-avro-storage/blob/master/src/main/java/com/linkedin/pig/FastAvroStorage.java#L142

We should do something similar.

@kkrugler
Member Author

So the change is basically to do:

```java
FileStatus[] statusArray = fs.globStatus(p, visibleFiles);
```

when enumerating files, with:

```java
public static final PathFilter visibleFiles = new PathFilter() {
    @Override
    public boolean accept(Path p) {
        return !p.getName().startsWith("_");
    }
};
```

Yes?

He's also got some code to try to do a one-deep traverse of directories found inside of the target directory. It seems like we'd need to leverage the enclosing Tap to do this, as otherwise the filtering of files that can be done at that level wouldn't be properly applied.
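[Editor's note] The one-deep walk described above might look like the following sketch. It uses `java.nio.file` as a plain-Java stand-in for Hadoop's `FileSystem` API, and the names (`OneDeepWalk`, `isVisible`, `findDataFile`) are hypothetical, not from the project:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OneDeepWalk {
    // Hypothetical helper: true for entries Hadoop would treat as data,
    // i.e. names not prefixed with "_" (e.g. _SUCCESS) or "." (e.g. .crc files).
    static boolean isVisible(Path p) {
        String name = p.getFileName().toString();
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Look for a data file directly in dir, then exactly one level down,
    // mirroring Hadoop's one-deep input enumeration.
    static Optional<Path> findDataFile(Path dir) throws IOException {
        List<Path> entries;
        try (Stream<Path> s = Files.list(dir)) {
            entries = s.filter(OneDeepWalk::isVisible).sorted().collect(Collectors.toList());
        }
        for (Path p : entries) {
            if (Files.isRegularFile(p)) {
                return Optional.of(p);
            }
            if (Files.isDirectory(p)) {
                try (Stream<Path> sub = Files.list(p)) {
                    Optional<Path> hit = sub.filter(OneDeepWalk::isVisible)
                                            .filter(Files::isRegularFile)
                                            .findFirst();
                    if (hit.isPresent()) {
                        return hit;
                    }
                }
            }
        }
        return Optional.empty();
    }
}
```

With a directory containing only `_SUCCESS` and a subdirectory of part files, `findDataFile` skips the job-control file and returns the first part file it finds one level down.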

-- Ken

On Nov 2, 2012, at 2:23pm, Chris Severs wrote:

Trying to infer the schema from a path which has other files, such as _SUCCESS, will likely fail.

There is a good example of how to do this properly here:
https://github.com/josephadler/fast-avro-storage/blob/master/src/main/java/com/linkedin/pig/FastAvroStorage.java#L142

We should do something similar.




Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

@kkrugler
Member Author

Would it also make sense to ignore hidden files?

@kkrugler
Member Author

I tried essentially what Ken suggested. It works when given a glob path but not a plain path, though that's an easy fix.

What is the default behavior for Hadoop when you specify a directory as the input? Does it grab every file under that directory, regardless of how deep? If so, we should just go down the path until we find some good files and infer from those.

I think we should also ignore hidden files. Is there a good filter mask that would do so?
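[Editor's note] Hadoop's own `FileInputFormat` uses a hidden-file filter that rejects names starting with `_` or `.`. A filter mask with the same behavior could be sketched as follows, using a plain-Java `Predicate` as a stand-in for a Hadoop `PathFilter` (the class and field names are hypothetical):

```java
import java.util.function.Predicate;

public class VisibleNames {
    // Same rule as FileInputFormat's hidden-file filter: reject
    // job-control files (_SUCCESS, _logs) and hidden files
    // (e.g. .part-00000.avro.crc).
    public static final Predicate<String> VISIBLE =
        name -> !name.startsWith("_") && !name.startsWith(".");
}
```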

@kkrugler
Member Author

I added a bit that walks down the directories and finds a good avro file. This is also included in the new test.

@kkrugler
Member Author

On Nov 2, 2012, at 9:04pm, Chris Severs wrote:

I added a bit that walks down the directories and finds a good avro file. This is also included in the new test.

What happens if the directory is empty?

I remember Matt saying that this would cause a failure - I think he filed an issue (sorry, not online right now).

-- Ken



@kkrugler
Member Author

On Nov 2, 2012, at 4:52pm, Chris Severs wrote:

I tried essentially what Ken suggested. It works when given a glob path but not a plain path, though that's an easy fix.

What is the default behavior for Hadoop when you specify a directory as the input? Does it grab every file under that directory regardless of how deep?

No, it just goes one deep.

The same is true for GlobHfs.

If so we should just go down the path until we find some good files and infer off that.

I think we should also ignore hidden files. Is there a good filter mask that would do so?

I haven't verified this, but I think the HDFS call being used will ignore invisible files (e.g. the .part-xxxx.crc files)

-- Ken



@kkrugler
Member Author

For "empty", do you mean empty in the sense that there is nothing at all in the directory that was provided, or empty in the sense that there is an empty directory among other directories which have avro files? If the former, a runtime error is thrown since there are no files to infer from. If the latter, it will keep looking until it finds a file.

For the depth, I can bound it to go only one deep with no problem. I'll do that and push it up.

@kkrugler
Member Author

Hi Chris,

On Nov 3, 2012, at 2:25pm, Chris Severs wrote:

For empty, do you mean empty in the sense that there is nothing in the directory that was provided at all

Correct.
or empty in the sense that there is an empty directory among other directories which have avro files? If the former, a runtime error is thrown since there are no files to infer from.

I believe this is what Matt filed an issue about, since (for a regular Cascading Flow) normally having an empty file is OK.

So I imagine he'd prefer having cascading.avro return Fields.UNKNOWN for the empty directory source fields, versus an exception.

CCing Matt so he can chime in here.

-- Ken

If the latter, it will keep looking until it finds a file.

For the depth I can bound it to go only 1 deep with no problem. I'll do that and push it up.





@kkrugler
Member Author

That sounds reasonable. I can make a finer check to see if any files (not hidden or starting with "_") are present and, if so, try to get the schema from them. If no files are present, I can just set the schema to Schema.NULL. I think we should still throw an error if we can't get the schema from the first file we find, however.
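[Editor's note] The check described above could be sketched as follows. Avro and Hadoop types are replaced with plain-Java stand-ins, and all names (`SchemaChoice`, `firstDataFile`) are hypothetical; an empty result is where the caller would fall back to Schema.NULL rather than throwing:

```java
import java.util.List;
import java.util.Optional;

public class SchemaChoice {
    // Pick the first data file to infer the schema from. An empty
    // result means no usable files were found, and the caller should
    // fall back to Schema.NULL (and Fields.UNKNOWN for the source
    // fields, per the discussion below) instead of throwing.
    static Optional<String> firstDataFile(List<String> names) {
        return names.stream()
                    .filter(n -> !n.startsWith("_") && !n.startsWith("."))
                    .findFirst();
    }
}
```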

@kkrugler
Member Author

I have usually manually created an empty Avro file when a job produced an empty result, to avoid the exception. However, yes, typically we would not want an exception for empty input, so we can avoid having to create a fake file.

@kkrugler
Member Author

On Nov 4, 2012, at 12:05pm, Chris Severs wrote:

That sounds reasonable. I can make a finer check to see if any files (not hidden or starting with _) are present and if so try to get the schema from them. If no files are present then I can just set the schema to be Schema.NULL. I think we should still throw an error if we can't get the schema from the first file we get however.

If the schema is set to Schema.NULL, will this ensure that retrieveSourceFields() sets the source fields to be Fields.UNKNOWN?

Also, I think Fields.UNKNOWN should be used in place of Fields.ALL for setSourceFields, in the main AvroScheme constructor here:

```java
else if (schema == null) {
    setSinkFields(Fields.ALL);
    setSourceFields(Fields.ALL);
}
```

I don't think you can use Fields.ALL as the field specification in a source tap.

Anyway, let me know when you've got something pushed and I'll take a look at it.

-- Ken



@kkrugler
Member Author

You're right that Fields.UNKNOWN is the correct thing to do in the constructor. I think for Schema.NULL we can probably make a special case and set it to Fields.UNKNOWN as well (I can't think of a time when Schema.NULL would be used for the data schema and would have a problem with Fields.UNKNOWN). I'm out of town tomorrow, so I'll probably get to this on Wednesday at the soonest.

@kkrugler
Member Author

I haven't had a chance to get this completely done yet. I'm out of town for a week so I'll pick it up when I get back.

@kkrugler
Member Author

This is fixed in the 2.2-wip branch if someone wants to take a look. I ended up setting the source fields to Fields.NONE in the case of Schema.NULL. Does that sound reasonable? No exception is thrown in this case. I need to write a quick test for this.

Matt, does this work for your use case?

@kkrugler
Member Author

Hi Chris,

On Dec 23, 2012, at 10:38am, Chris Severs wrote:

This is fixed in the 2.2-wip branch if someone wants to take a look. I ended up setting the source fields to Fields.NONE in the case of Schema.NULL. Does that sound reasonable? No exception is thrown in this case. I need to write a quick test for this.

Matt, does this work for your use case?

I don't have a good use case to test this out right now.

But CCing Matt, as I think he should.

-- Ken



@ghost ghost assigned ccsevers Aug 21, 2013
@kkrugler kkrugler added this to the 3.1 milestone Jul 12, 2016