Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend join APIs with "map-side" joins #90

Open
blever opened this issue Jun 6, 2012 · 2 comments
Open

Extend join APIs with "map-side" joins #90

blever opened this issue Jun 6, 2012 · 2 comments
Milestone

Comments

@blever
Copy link
Contributor

blever commented Jun 6, 2012

Now that Scoobi has DObjects, it should be possible to implement "map-side" joins. That is, joins where one data set is small enough to fit into memory.

@ghost ghost assigned espringe Jun 6, 2012
@ghost ghost assigned tonymorris Dec 8, 2012
@ghost ghost assigned raronson Jan 25, 2013
@blever
Copy link
Contributor Author

blever commented Jan 25, 2013

@raronson - please provide some requirements on the functionality of the types of map-side joins we'd like to see. Appropriate names and pointers to implementation approaches would be great.

@blever
Copy link
Contributor Author

blever commented Jan 31, 2013

This comment is related to #175.

Often the smaller table in the map-side join is some data set that is present on disk. In order to perform the map-side join, it needs to be ingested and placed inside a DObject. For example:

val records =
  Source.fromInputStream(configuration.fs.open(new Path("my-small-table.txt"))).getLines()
             .map { _.split("\t").toList }
             .collect { case id :: name :: _ => (id, name) }
val small_table: DObject [Map[String, String]] = DObject(records.toMap)

val large_table: DList[(String, ...)] = ...
small_table join large_table map { ...

The way this works is that the reading of the file and placing it into a Map is all done client side. Then, a serialised version of the Map is pushed to the distributed cache from which each mapper task reads and deserialises as part of the join with the DList.

The problem with this approach is that it may be more efficient to move all of this work (reading the file and creating the Map) into the setup phase of the mapper (or reducer) task. And I don't believe there is a way to make this happen with the APIs Scoobi currently exposes ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants