Extend join APIs with "map-side" joins #90

blever · 2012-06-06T06:20:58Z

Now that Scoobi has DObjects, it should be possible to implement "map-side" joins. That is, joins where one data set is small enough to fit into memory.

blever · 2013-01-25T05:12:32Z

@raronson - please provide some requirements on the functionality of the types of map-side joins we'd like to see. Appropriate names and pointers to implementation approaches would be great.

blever · 2013-01-31T23:31:43Z

This comment is related to #175.

Often the smaller table in the map-side join is some data set that is present on disk. In order to perform the map-side join, it needs to be ingested and placed inside a DObject. For example:

val records =
  Source.fromInputStream(configuration.fs.open(new Path("my-small-table.txt"))).getLines()
             .map { _.split("\t").toList }
             .collect { case id :: name :: _ => (id, name) }
val small_table: DObject [Map[String, String]] = DObject(records.toMap)

val large_table: DList[(String, ...)] = ...
small_table join large_table map { ...

The way this works is that the reading of the file and placing it into a Map is all done client side. Then, a serialised version of the Map is pushed to the distributed cache from which each mapper task reads and deserialises as part of the join with the DList.

The problem with this approach is that it may be more efficient to move all of this work (reading the file and creating the Map) into the setup phase of the mapper (or reducer) task. And I don't believe there is a way to make this happen with the APIs Scoobi currently exposes ...

ghost assigned espringe Jun 6, 2012

ghost assigned tonymorris Dec 8, 2012

ghost assigned raronson Jan 25, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend join APIs with "map-side" joins #90

Extend join APIs with "map-side" joins #90

blever commented Jun 6, 2012

blever commented Jan 25, 2013

blever commented Jan 31, 2013

Extend join APIs with "map-side" joins #90

Extend join APIs with "map-side" joins #90

Comments

blever commented Jun 6, 2012

blever commented Jan 25, 2013

blever commented Jan 31, 2013