-
Notifications
You must be signed in to change notification settings - Fork 58
Faunus and Hadoop
Faunus has two particular use cases: graph derivations and simple graph statistics. Faunus is not a general-purpose graph processing/algorithms package for Hadoop. Instead, due to the constraints imposed by the Hadoop computing model (MapReduce), Faunus is leveraged for what Hadoop is best at: parallel processing a massive number atomic items. With Faunus, these atomic items are the vertices of a graph and the processing is global graph derivations and/or global graph statistics.
-
Graph derivation: Given an input graph, derive a new graph based upon the input graph’s structure and semantics. Examples include:
- deriving a cousin-graph from father and brother edges.
- removing all vertices in the graph that have an age less than 30.
-
Graph statistic: A graph is a complex structure that can best be understood when the nature of its structure is distilled down to a manageable set of numbers. Examples include:
- counting the number of vertices and edges in the graph.
- determining the degree distribution of the graph.
- determining the distribution of edges labels in the graph.
- determine the assortative mixing in the graph.
Faunus’ statistics library is for simple global graph computations. Hadoop is not optimized for iterative algorithms such as the popular centrality algorithms and/or community detection algorithms.
Hadoop is a distributed storage and processing system that greatly simplifies the creation of distributed computing jobs. The default storage layer for Hadoop is HDFS. HDFS is similar to any other file system in that it can be used to store arbitrarily formatted files (e.g. binary, text, etc.). However, HDFS allows for the storage of files so large that they can not be represented on a single machine. As such, HDFS is a distributed file system. Given files distributed over HDFS, it is possible to process these files using Hadoop’s distributed processing framework, MapReduce. MapReduce represents a computation as a series of parallel/atomic key/value pair computations. There are two steps to MapReduce:
-
Map: For every
<key1,value1>
input, yield<key2,value2>
outputs. -
Reduce: For every
<key2,list<value2>>
input, yield<key3, value3>
outputs.
In this way, the map-step can be seen as a parallel analysis of all the <key1,value1>
pairs represented in the source file. The reduce-step aggregates all the values (value2
) emitted by the previous map-step that have the same key (key2
). Some algorithm over those values (value2
) is run that then ultimately emits <key3,value3>
outputs which are then written back to HDFS.
A graph is a data structure composed of vertices and edges. A vertex has a set of incoming edges and a set of outgoing edges. The property graph data model supports key/value pairs associated with vertices and edges and edges also have a label (or type). Graphs are processed using traversals. A traversal is a algorithmic walk over the graph in order to arrive at a particular destination (e.g. search) or to yield some side-effect in the process (e.g. a ranking or recommendation).
A graph can be represented by an adjacency list. Each row represents a vertex along with its incident edges. For the property graphs, the vertex and edge properties are stored in the row as well. Faunus interprets a graph from this perspective. Every key/value in Hadoop is a single vertex along with its properties and its incoming and outgoing edges. As such, a Faunus MapReduce job operates on each vertex in parallel.
There are two types of “steps” in Faunus.
- Map-Only: Filters, Mutations
- MapReduce: Traversals
For every vertex in a map-only step, the vertex is either allowed or filtered from the next processing step or the vertex is mutated in some way (e.g. properties removed/added, etc.).
For every vertex in a mapReduce step, the vertex’s edges are analyzed an messages are sent to the adjacent vertices. These messages are then aggregated at the reduce step and the message receiving vertex is updated in some way.
The primary use of Faunus is to leverage the provided steps in order to yield derived graphs. A graph can be composed of numerous types of relationships — father, mother, liveIn, etc. With Faunus, these explicit edges can be used to derive more semantically rich edges. For example: grandfather, grandchild, cohabitants, etc. In this way, Faunus allows for a graph represented in HDFS (or some other distributed storage environment) to have “semantically-rich” subgraphs isolated it from it and used to update the original graph or for processing by some other graph processing framework. The following paper describes the algebra implemented by Faunus.
Rodriguez M.A., Shinavier, J., Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms, Journal of Informetrics, 4(1), pp. 29—41, December 2009.