Inferring schema from a source with other files or paths fails #7
Comments
So the change is basically to do:
When enumerating files, with: public static final PathFilter visibleFiles = new PathFilter() { Yes? He's also got some code to try to do a one-deep traverse of directories found inside of the target directory. It seems like we'd need to leverage the enclosing Tap to do this, as otherwise the filtering of files that can be done at that level wouldn't be properly applied. -- Ken On Nov 2, 2012, at 2:23pm, Chris Severs wrote:
Ken Krugler |
Would it also make sense to ignore hidden files?
I tried essentially what Ken suggested. It works when given a glob path but not a regular path, though that's an easy fix. What is the default behavior for Hadoop when you specify a directory as the input? Does it grab every file under that directory regardless of depth? If so, we should just walk down the path until we find some good files and infer the schema from those. I think we should also ignore hidden files. Is there a good filter mask that would do so?
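For illustration, here is a minimal sketch of the "ignore hidden files" check being discussed. It assumes the common Hadoop convention that names starting with `_` (such as `_SUCCESS`) or `.` are side files rather than data. The class and method names are hypothetical, and it works on plain file names so it runs without Hadoop on the classpath; in the real code the same test would presumably live in a `PathFilter` whose `accept(Path)` checks `path.getName()`.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper, not part of cascading.avro.
public class VisibleFiles {

    // Assumed convention: a leading "_" or "." marks a side file
    // (e.g. _SUCCESS, _logs, .part-00000.crc), not a data file.
    public static final Predicate<String> VISIBLE =
        name -> !name.isEmpty() && !name.startsWith("_") && !name.startsWith(".");

    // Keeps only the names that look like data files, preserving order.
    public static List<String> visibleOnly(List<String> names) {
        return names.stream().filter(VISIBLE).collect(Collectors.toList());
    }
}
```

Schema inference would then only consider the names that survive `visibleOnly`.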
I added a bit that walks down the directories and finds a good avro file. This is also included in the new test. |
On Nov 2, 2012, at 9:04pm, Chris Severs wrote:
I remember Matt saying that this would cause a failure - I think he filed an issue (sorry, not online right now). -- Ken Ken Krugler |
On Nov 2, 2012, at 4:52pm, Chris Severs wrote:
The same is true for GlobHfs
-- Ken Ken Krugler |
For empty, do you mean empty in the sense that there is nothing at all in the directory that was provided, or empty in the sense that there is an empty directory among other directories which have Avro files? If the former, a runtime error is thrown since there are no files to infer from. If the latter, it will keep looking until it finds a file. For the depth, I can bound it to go only one level deep with no problem. I'll do that and push it up.
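The bounded search described above could look roughly like the following. This is only a sketch under stated assumptions: it uses `java.io.File` on a local file system instead of Hadoop's `FileSystem` API, the class name `FirstDataFile` and the `maxDepth` parameter are hypothetical, and it treats names starting with `_` or `.` as hidden. Empty subdirectories are skipped rather than treated as errors, and `maxDepth = 1` corresponds to going only one level deep.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper, not the actual cascading.avro implementation.
public class FirstDataFile {

    // Assumed convention: a leading "_" or "." marks a hidden entry.
    static boolean isVisible(File f) {
        String name = f.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Returns the first visible regular file under dir, descending
    // at most maxDepth levels into visible subdirectories.
    public static Optional<File> find(File dir, int maxDepth) {
        File[] children = dir.listFiles();
        if (children == null) {
            return Optional.empty(); // not a directory, or not readable
        }
        Arrays.sort(children); // deterministic order across platforms

        // Prefer files directly in this directory.
        for (File child : children) {
            if (child.isFile() && isVisible(child)) {
                return Optional.of(child);
            }
        }
        // Otherwise look one level further down, if allowed.
        if (maxDepth > 0) {
            for (File child : children) {
                if (child.isDirectory() && isVisible(child)) {
                    Optional<File> found = find(child, maxDepth - 1);
                    if (found.isPresent()) {
                        return found;
                    }
                }
            }
        }
        return Optional.empty();
    }
}
```

Returning an empty `Optional` instead of throwing leaves the caller free to decide what an input with no data files should mean.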
Hi Chris, On Nov 3, 2012, at 2:25pm, Chris Severs wrote:
So I imagine he'd prefer having cascading.avro return Fields.UNKNOWN for the empty directory source fields, versus an exception. CCing Matt so he can chime in here. -- Ken
Ken Krugler |
That sounds reasonable. I can make a finer check to see if any files (not hidden or starting with _) are present and, if so, try to get the schema from them. If no files are present then I can just set the schema to be Schema.NULL. However, I think we should still throw an error if we can't get the schema from the first file we find.
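As a rough sketch of the decision logic proposed above: no visible files yields a "no schema" result, while a first file whose schema cannot be read is still an error. Everything here is a stand-in for illustration. Schemas are plain strings, `NO_SCHEMA` plays the role that `Schema.NULL` has in the discussion, and the `readSchema` function is a hypothetical hook for whatever actually opens the Avro file.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch; not the cascading.avro code.
public class SchemaInference {

    // Stand-in for the "null schema" used when there is nothing to infer from.
    public static final String NO_SCHEMA = "null";

    // visibleFiles: candidate data files, hidden ones already filtered out.
    // readSchema: returns the schema of a file, or empty if it cannot be read.
    public static String infer(List<String> visibleFiles,
                               Function<String, Optional<String>> readSchema) {
        if (visibleFiles.isEmpty()) {
            return NO_SCHEMA; // empty input is not treated as an error
        }
        String first = visibleFiles.get(0);
        // A file exists but yields no schema: fail loudly.
        return readSchema.apply(first).orElseThrow(
            () -> new IllegalStateException("Could not read a schema from " + first));
    }
}
```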
I have usually manually created an empty Avro file when an empty job result was produced to avoid the exception. However, yes typically we would not want an exception for empty input to avoid having to create a fake file. |
On Nov 4, 2012, at 12:05pm, Chris Severs wrote:
Also, I think Fields.UNKNOWN should be used in place of Fields.ALL for setSourceFields, in the main AvroScheme constructor here:
I don't think you can use Fields.ALL as the field specification in a source tap. Anyway, let me know when you've got something pushed and I'll take a look at it. -- Ken Ken Krugler |
You're right that Fields.UNKNOWN is the correct thing to use in the constructor. I think for Schema.NULL we can probably make a special case and set the source fields to Fields.UNKNOWN as well (I can't think of a time when Schema.NULL would be used for the data schema and cause a problem with Fields.UNKNOWN). I'm out of town tomorrow, so I'll probably get to this on Wednesday at the earliest.
I haven't had a chance to get this completely done yet. I'm out of town for a week so I'll pick it up when I get back. |
This is fixed in the 2.2-wip branch if someone wants to take a look. I ended up setting the source fields to Fields.NONE in the case of Schema.NULL. Does that sound reasonable? No exception is thrown in this case. I need to write a quick test for this. Matt, does this work for your use case? |
Hi Chris, On Dec 23, 2012, at 10:38am, Chris Severs wrote:
But CCing Matt, as I think he should. -- Ken Ken Krugler |
Trying to infer the schema from a path which has other files, such as _SUCCESS, will likely fail.
There is a good example of how to do this properly here:
https://github.com/josephadler/fast-avro-storage/blob/master/src/main/java/com/linkedin/pig/FastAvroStorage.java#L142
We should do something similar.