-
Notifications
You must be signed in to change notification settings - Fork 58
Faunus and Hadoop
Faunus has two particular use cases: graph derivations and simple graph statistics. Faunus is not a general-purpose graph processing/algorithms package for Hadoop. Instead, due to the constraints imposed by the Hadoop computing model (MapReduce), Faunus is leveraged for what Hadoop is best at: parallel processing a massive number of atomic records. With Faunus, these atomic records are the vertices of a graph and the processing is global graph derivations and/or global graph statistics.
-
Graph derivation: Given an input graph, derive a new graph based upon the input graph’s structure and semantics. Other terms include graph rewriting and graph transformations.
- deriving a cousin-graph from father and brother edges.
- deriving a youth friend-graph by removing all vertices with age greater than 30.
-
Graph statistic: A graph is a complex structure that can best be understood when the nature of its structure is distilled down to a manageable set of numbers. An analogous term is graph/network analysis.
- counting the number of vertices and edges in the graph.
- determining the degree distribution of the graph.
- determining the distribution of edges labels in the graph.
- determine the assortative mixing in the graph.
Faunus’ statistics library is for simple global graph computations. Hadoop is not optimized for iterative algorithms such as the popular centrality algorithms and/or community detection algorithms.
Hadoop is a distributed storage and processing system that greatly simplifies the creation of distributed computing jobs. The default storage layer for Hadoop is HDFS. HDFS is similar to any other file system in that it can be used to store arbitrarily formatted files (e.g. binary, text, etc.). However, HDFS allows for the storage of files so large that they can not be represented on a single machine. As such, HDFS is a distributed file system. Given files distributed over HDFS, it is possible to process these files using Hadoop’s distributed processing framework, MapReduce. MapReduce represents a computation as a series of parallel/atomic key/value pair computations. There are two steps to MapReduce:
-
Map: For every
<key1,value1>
input, yield<key2,value2>
outputs. -
Reduce: For every
<key2,list<value2>>
input, yield<key3, value3>
outputs.
In this way, the map-step can be seen as a parallel analysis of all the <key1,value1>
pairs represented in the source file. The reduce-step aggregates all the values (value2
) emitted by the previous map-step that have the same key (key2
). Some algorithm over those values (value2
) is run that then ultimately emits <key3,value3>
outputs which are then written back to HDFS.
A graph is a data structure composed of vertices (dots,things) and edges (lines,links). A vertex has a set of incoming edges and a set of outgoing edges. There are numerous types of graphs and the one used by Faunus is the property graph model of the Blueprints API. A property graph is multi-relational in that edges are labeled to denote different types of relationships between vertices. Moreover, ever vertex/edge can have an arbitrary number of key/value pairs associated with it (note that these key/value pairs are not to be confused with the key/value pairs of Hadoop). Graphs are typically processed using traversals. A traversal is a algorithmic walk over the graph in order to arrive at a particular destination (e.g. search or derivation) or to yield some side-effect in the process (e.g. a ranking or recommendation or statistic).
A graph can be represented by an adjacency list. Each “row” represents a vertex along with its incident edges. For property graphs, the vertex and edge properties are stored in the row as well. Faunus interprets a graph from this perspective. Every key/value in Hadoop is a single vertex along with its properties and its incoming and outgoing edges. As such, a Faunus MapReduce job operates on each vertex in parallel.
There are two types of “steps” in Faunus.
-
Map-Only: Filters, Mutations
-
Map: For every
<null,vertex>
input, yield a<null,vertex>
output.
-
Map: For every
-
MapReduce: Traversals
-
Map: For every
<null,vertex>
input, yield a<id,tagged_element>
output. -
Reduce: For every
<id,list<tagged_element>>
input, yield a<null,vertex>
output.
-
Map: For every
For every vertex in a map-only step, the vertex is either allowed or filtered from the next processing step or the vertex is mutated in some way (e.g. properties removed/added, edges transposed, etc.). For every vertex in a map-reduce step, the vertex’s edges are analyzed and messages are sent to the adjacent vertices. These messages are then aggregated at the reduce step and the message receiving vertex is updated in some way.
Given that the input and output of both map-only and map-reduce steps is <null,vertex>
, it is possible to chain these steps together to yield a more complex derivation and/or statistic.
A theory of multi-relational graph derivations for the purpose of applying single-relational graph algorithms is described in the article below and forms the primary inspiration for Faunus.
- Rodriguez M.A., Shinavier, J., Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms, Journal of Informetrics, 4(1), pp. 29—41, December 2009.
The following articles provided numerous insights into various theoretical and technical issues encountered during Faunus’ design and development.
- Lin, J., Dyer, C., Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.
- Kepner, J., Gilbert, J., Graph Algorithms in the Language of Linear Algebra, Society for Industrial & Applied Mathematics, 2011.
Finally, a collection of useful Hadoop related blog posts are provided below.