Gotchas and Limitations
This is the documentation for Faunus 0.4.
Faunus was merged into Titan and renamed Titan-Hadoop in version 0.5.
Documentation for the latest Titan version is available at http://s3.thinkaurelius.com/docs/titan/current.
This section presents a list of outstanding issues and likely problems that users should be aware of. A design limitation is inherent to the foundation of Faunus and, as such, will not be rectified in a future release. Temporary limitations, by contrast, are incurred by the current implementation, and future versions should provide solutions.
Faunus is built atop Hadoop. Hadoop is not a real-time processing framework. All Hadoop jobs require a costly setup (even for small input data) that takes around 15 seconds to initiate. For real-time graph processing, use Titan (also see Distributed Graph Computing with Gremlin).
The Blueprints API states that an element (vertex/edge) can have any Object as a property value (e.g. Vertex.setProperty(String,Object)). Faunus only supports integer, float, double, long, string, and boolean property values.
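A pre-load check for the type restriction above might look like the following minimal sketch. The helper names are hypothetical and not part of Faunus. Python has no separate long or double types, so `int` and `float` stand in for them here.

```python
# Hypothetical helpers, not part of Faunus: flag property values that fall
# outside the primitive types listed above (integer, float, double, long,
# string, boolean). Python's int and float stand in for long and double.
SUPPORTED_TYPES = (bool, int, float, str)


def is_supported_property_value(value):
    """Return True if value is one of the supported primitive types."""
    return isinstance(value, SUPPORTED_TYPES)


def unsupported_properties(properties):
    """Return the sorted keys whose values would need converting before loading."""
    return sorted(
        key for key, value in properties.items()
        if not is_supported_property_value(value)
    )


# Illustrative data only: the list-valued "tags" property is flagged.
vertex_properties = {
    "name": "marko",
    "age": 29,
    "weight": 0.5,
    "active": True,
    "tags": ["a", "b"],
}
print(unsupported_properties(vertex_properties))
```

Flagged properties would have to be flattened or serialized to a string before the graph is handed to Faunus.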
Note that rdfs:label is a common predicate in RDF. If use-localname=true, then the edge label would be label, which is legal in Blueprints. However, edge labels and vertex keys are in the same address space, and thus this causes problems.
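To spot such clashes before loading RDF data, a check along the following lines could be run over the predicates in a data set. This is only a sketch: the local-name rule (everything after the last `#`, `/` or `:`) and the set of reserved names are assumptions made for illustration, not Faunus's actual parsing logic.

```python
import re

# Assumption for illustration: names treated as clashing with existing keys.
RESERVED_NAMES = {"label"}


def local_name(uri):
    """Return the text after the last '#', '/' or ':' in a URI or prefixed name."""
    return re.split(r"[#/:]", uri)[-1]


def colliding_predicates(predicates, reserved_names=RESERVED_NAMES):
    """Map each predicate whose local name is reserved to that local name."""
    return {
        predicate: local_name(predicate)
        for predicate in predicates
        if local_name(predicate) in reserved_names
    }


predicates = [
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://xmlns.com/foaf/0.1/knows",
]
print(colliding_predicates(predicates))
```

Predicates reported by such a check could be loaded with their full URI instead of the local name.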
There is no easy way to serialize a Groovy closure and thus propagate it to the Hadoop jobs running on different machines. As such, until a solution is found, a closure must be provided as a String. For example: filter('{it.degree > 10}') instead of filter{it.degree > 10}.
A single vertex must be able to fit within the -Xmx upper bound of memory. This means that a graph containing a vertex with tens of millions of edges might not fit within a reasonable machine. In the future, “vertex splitting” or streaming is a potential solution to this problem. For now, realize that a vertex with 1 million incident edges (no properties) is approximately 15 megabytes. The default mapper/reducer -Xmx for Hadoop is 250 megabytes.
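The two figures above allow a rough back-of-the-envelope check. The sketch below assumes that vertex size grows linearly with edge count, which the text does not state, and it ignores everything else that shares the heap, so treat the result as an optimistic bound. The function names are hypothetical.

```python
MB_PER_MILLION_EDGES = 15  # from the text: 1 million edges, no properties
DEFAULT_HEAP_MB = 250      # from the text: default mapper/reducer -Xmx


def estimated_vertex_mb(edge_count):
    """Estimate vertex size in megabytes, assuming linear growth per edge."""
    return edge_count / 1_000_000 * MB_PER_MILLION_EDGES


def fits_in_heap(edge_count, heap_mb=DEFAULT_HEAP_MB):
    """Return True if the estimated vertex size is below the heap bound."""
    return estimated_vertex_mb(edge_count) < heap_mb


for edges in (1_000_000, 10_000_000, 20_000_000):
    print(edges, estimated_vertex_mb(edges), fits_in_heap(edges))
```

Under these assumptions, a vertex with 20 million edges (about 300 megabytes) already exceeds the default heap, so the -Xmx setting would need to be raised for such graphs.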
The current implementation of the GraphSON InputFormat is inefficient. As it stands, the full String representation of a vertex is held in memory, then its JSON Map representation, and finally its FaunusVertex representation. This can be fixed with a smarter, streaming parser in the future.
The Gremlin implementation that is currently distributed with Faunus is not identical to Gremlin/Pipes. Besides not all steps being implemented, the general rule is that once “the graph is left” (e.g. by traversing to data that is not vertices or edges), the traversal ends. This ending is represented as a pipeline lock in the Gremlin/Faunus compiler.
The binary sequence files supported by Hadoop are the primary means by which graph and traversal data is moved between MapReduce jobs in a Faunus chain. If a sequence file is saved to disk, be conscious of the traversal metadata it contains (e.g. path calculations).
Faunus is optimized to work best with Rexster 2.2.0+. Faunus will work with Rexster 2.1.0, but with some limitations. Due to some default and non-configurable settings in Rexster 2.1.0, at most four Faunus map tasks can connect to it. If more are configured, Faunus will throw a SocketException and fail.
To avoid such problems in Rexster 2.2.0+, the thread pool must be configured with enough threads to service each of the expected long-running requests from the mapper tasks. Striking a careful balance among the number of map tasks, Rexster memory availability, and Rexster thread pool size will largely determine the speed at which Faunus operates. It will likely take some experimentation to achieve the most efficient configuration.
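As a starting point for that experimentation, the relationship between map tasks and threads can be sketched as below. The sketch assumes that each map task occupies exactly one Rexster thread for the duration of its request and that a small headroom is reserved for other clients. Both the one-thread-per-task model and the headroom value are assumptions, not documented Rexster behavior.

```python
def min_thread_pool_size(map_tasks, headroom=2):
    """Smallest thread pool that serves every map task plus spare headroom.

    Assumes one long-running request (one thread) per map task; headroom is
    a hypothetical allowance for other Rexster clients.
    """
    if map_tasks < 0 or headroom < 0:
        raise ValueError("map_tasks and headroom must be non-negative")
    return map_tasks + headroom


def max_map_tasks(thread_pool_size, headroom=2):
    """Largest number of map tasks a given thread pool can serve."""
    if thread_pool_size < 0 or headroom < 0:
        raise ValueError("thread_pool_size and headroom must be non-negative")
    return max(0, thread_pool_size - headroom)


print(min_thread_pool_size(16))  # threads suggested for 16 map tasks
print(max_map_tasks(10))         # map tasks a 10-thread pool can serve
```

The numbers produced this way are only a lower bound on the pool size. Rexster memory availability still has to be balanced against them by measurement.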