From Nothing to Eye Candy

Deploy a Cluster

This step is quick. The easiest way to get a GeoWave cluster is to use EMR with the GeoWave bootstrap actions (see http://ngageoint.github.io/geowave/documentation.html#running-from-emr-2 ). Following those steps will give you a GeoWave cluster on AWS within minutes.

In fact, the baseline here includes a set of scripts that will automatically run through this example as an EMR bootstrap action (cheating).

Ingest Data

Pick a large, interesting dataset of your choice. OSM GPX data contains approximately 2.8 billion points, and there are some interesting use cases for aggregating all of that data, but in this tutorial let's walk through using some GDELT data.

wget http://data.gdeltproject.org/events/md5sums
for file in `cat md5sums | cut -d' ' -f3` ; do wget http://data.gdeltproject.org/events/$file ; done
md5sum -c md5sums 2>&1
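The download loop above starts from scratch if it is interrupted. As a minimal sketch, assuming the manifest keeps the usual `md5sum` layout of `<hash>  <filename>`, the file list can be pulled out by a small helper and each file fetched with `wget -c` so a rerun can resume. The function names `gdelt_files` and `fetch_gdelt` are hypothetical and not part of GeoWave or GDELT.

```shell
# Print the file names listed in an md5sum-style manifest, one per line.
# Each manifest line is expected to look like "<md5>  <filename>".
gdelt_files() {
  awk 'NF >= 2 { print $2 }' "$1"
}

# Download every file named in the manifest ($1) from a base URL ($2).
# The default base URL is the one used in the commands above.
# wget -c continues a partially downloaded file instead of restarting it.
fetch_gdelt() {
  gdelt_files "$1" | while read -r file; do
    wget -c "${2:-http://data.gdeltproject.org/events}/$file"
  done
}

# Usage (not run here):
#   fetch_gdelt md5sums && md5sum -c md5sums
```

Running `md5sum -c md5sums` afterwards remains the actual integrity check; the helper only makes the download restartable.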

Now that we have the data, let's make sure Accumulo is configured properly. Let's add a `geowave` user and a `geowave` namespace within Accumulo. We'll associate the `geowave` namespace with GeoWave's library for Accumulo. When you install from RPM it will be located at `hdfs://${HOSTNAME}:${HDFS_PORT}/accumulo/classpath/geowave/${GEOWAVE_VERSION}-apache/geowave-accumulo.jar`. The following shell script will, in essence, allow all tables prefixed by `geowave.` to use the GeoWave library.


## configure accumulo
cat <
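The steps described above (create a user, create a namespace, grant table creation, and bind the namespace to a classpath context) can be sketched as a list of Accumulo shell commands. This is an illustration under assumptions, not the original script: the helper name `geowave_accumulo_setup` is hypothetical, the context name `geowave` is chosen to match the namespace, and the property names `general.vfs.context.classpath.<name>` and `table.classpath.context` should be checked against the Accumulo version in use.

```shell
# Print Accumulo shell commands that set up the geowave user and namespace.
# $1: HDFS URL of the GeoWave Accumulo jar (the path given in the text above).
# Assumption: the classpath context is named "geowave", matching the namespace.
geowave_accumulo_setup() {
  jar_url=$1
  cat <<EOF
createuser geowave
createnamespace geowave
grant Namespace.CREATE_TABLE -ns geowave -u geowave
config -s general.vfs.context.classpath.geowave=${jar_url}
config -ns geowave -s table.classpath.context=geowave
EOF
}

# Usage (not run here; assumes an Accumulo root login):
#   geowave_accumulo_setup "hdfs://${HOSTNAME}:${HDFS_PORT}/accumulo/classpath/geowave/${GEOWAVE_VERSION}-apache/geowave-accumulo.jar" > setup.txt
#   accumulo shell -u root -f setup.txt
```

Note that `createuser` prompts for the new user's password, so the commands may need to be run interactively or with the password supplied on standard input.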

Now it's just a matter of running geowave commands to configure a store that points to the new namespace. (gwNamespace defines a table prefix, so by using geowave.gdelt we leverage the Accumulo namespace set up in the previous step.) The first command below configures a named entity "gdelt-accumulo" that can be referenced on ingest to supply all the connection parameters. The second configures a named entity "gdelt-spatial": an indexing strategy that uses spatial dimensional indexing with a partitioning strategy to avoid hotspotting (each partition is structured spatially, but data is assigned to an arbitrary partition, and the Accumulo table is pre-split based on that partitioning approach). Lastly, the data is ingested by referencing the store and the index strategy that were configured. The optional --cql parameter is applied in this case to restrict the ingest to a bounding box.


geowave config addstore -t accumulo gdelt-accumulo --gwNamespace geowave.gdelt --zookeeper $HOSTNAME:2181 --instance $INSTANCE --user geowave --password geowave
geowave config addindex -t spatial gdelt-spatial --partitionStrategy round_robin --numPartitions $NUM_PARTITIONS
geowave ingest localtogw $STAGING_DIR/gdelt gdelt-accumulo gdelt-spatial -f gdelt --gdelt.cql "BBOX(geometry,${WEST},${SOUTH},${EAST},${NORTH})"
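The ingest command interpolates `WEST`, `SOUTH`, `EAST`, and `NORTH` into the CQL filter without checking them, so a swapped pair only shows up as an unexpectedly empty ingest. A minimal sketch of a guard, assuming the bounds are decimal degrees and the box does not cross the antimeridian (the helper name `gdelt_bbox_cql` is hypothetical):

```shell
# Build the CQL BBOX filter from west, south, east, north (decimal degrees).
# Returns non-zero, printing nothing, if the bounds are out of range or inverted.
# Assumption: boxes that cross the antimeridian (west > east) are rejected.
gdelt_bbox_cql() {
  awk -v w="$1" -v s="$2" -v e="$3" -v n="$4" 'BEGIN {
    if (w < -180 || e > 180 || s < -90 || n > 90 || w >= e || s >= n) exit 1
    printf "BBOX(geometry,%s,%s,%s,%s)\n", w, s, e, n
  }'
}

# Usage (not run here):
#   CQL=$(gdelt_bbox_cql "$WEST" "$SOUTH" "$EAST" "$NORTH") || exit 1
#   geowave ingest localtogw $STAGING_DIR/gdelt gdelt-accumulo gdelt-spatial -f gdelt --gdelt.cql "$CQL"
```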

Run Analytics

Now let's show an example distributed process that may be run after that data has been ingested. In this case we will choose to run a Kernel Density Estimate (KDE) that can be used with a supplied color ramp to display a heatmap. We will configure a new store, just so we can store the data in entirely separate tables from the original dataset. Keep in mind this step is unnecessary if you'd prefer to keep the data in the same tables. We are configuring "gdelt-accumulo-out" as a store that we reference in the KDE command as the output of the analytic.


geowave config addstore -t accumulo gdelt-accumulo-out --gwNamespace geowave.kde_gdelt --zookeeper $HOSTNAME:2181 --instance $INSTANCE --user geowave --password geowave
hadoop jar ${GEOWAVE_TOOLS_HOME}/geowave-tools.jar analytic kde --featureType gdeltevent --minLevel 5 --maxLevel 26 --minSplits $NUM_PARTITIONS --maxSplits $NUM_PARTITIONS --coverageName gdeltevent_kde --hdfsHostPort ${HOSTNAME}:${HDFS_PORT} --jobSubmissionHostPort ${HOSTNAME}:${RESOURCE_MAN_PORT} --tileSize 1 gdelt-accumulo gdelt-accumulo-out
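The commands on this page depend on several shell variables (`INSTANCE`, `NUM_PARTITIONS`, `HDFS_PORT`, `RESOURCE_MAN_PORT`, `GEOWAVE_TOOLS_HOME`, and others) that the tutorial does not set, and an empty one can produce a confusing failure partway through a long job. As a sketch, a small pre-flight check can list whatever is missing before anything is submitted. The helper name `require_vars` is hypothetical.

```shell
# Report every named variable that is unset or empty on stderr.
# Returns 1 if at least one is missing, 0 otherwise.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (not run here), before the KDE job:
#   require_vars INSTANCE NUM_PARTITIONS HDFS_PORT RESOURCE_MAN_PORT GEOWAVE_TOOLS_HOME || exit 1
```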
