| page_type | languages | products |
|---|---|---|
| sample |  |  |
Bank transactions have traditionally been stored in transactional databases and analysed with SQL queries; to increase scale, they are now analysed in distributed systems using Apache Spark. While SQL is great for analysing this data, finding relationships between transactions and accounts can be challenging. In this scenario we want to visualize two levels of customer relationships: if A sends to B and B sends to C, then when we look at the transactions made by A we also want to identify C, and vice versa.
Graphs help solve complex problems by utilizing the power of relationships between objects. Some of these relationships can be modeled as SQL statements, but the Gremlin API provides a more concise way to express and search them. In this solution we use the Azure Cosmos DB Gremlin (graph) API to store the transaction data, with customer account IDs as vertices, transactions as edges, and transaction amounts as edge properties. Since running fan-out queries on Cosmos DB is not ideal, we leverage Azure Cognitive Search to index the data in Cosmos DB and use its search API to perform full scan/search queries. Additionally, Azure Cognitive Search gives us the flexibility to search for an account as either sender or receiver. This provides a solution that scales to any number of transactions while keeping the RU requirement for Cosmos DB queries low.
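As an illustration of the fan-out lookup, the hedged sketch below uses the `azure-search-documents` Python SDK to find every indexed transaction in which a given account appears as either sender or receiver. The index name and the field names (`sourceAccount`, `targetAccount`) are hypothetical; they stand for whatever fields your Cosmos DB indexer maps the edge's source and target account ids into, so adjust them to match your own index.

```python
# Hedged sketch: querying the Cognitive Search index that mirrors the Cosmos DB
# graph container. Index name and field names are assumptions; replace them with
# the fields produced by your Cosmos DB indexer.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="transactions-index",          # assumed index name
    credential=AzureKeyCredential("<query-key>"),
)

account_id = "C1231006815"                    # placeholder account id (PaySim-style)

# Match the account whether it appears as the edge source or target.
results = search_client.search(
    search_text=account_id,
    search_fields=["sourceAccount", "targetAccount"],
    select=["sourceAccount", "targetAccount", "amount"],
)

# Collect the counterparties; this becomes the vertices_list used in the Gremlin query below.
vertices_list = set()
for doc in results:
    vertices_list.update([doc["sourceAccount"], doc["targetAccount"]])
print(vertices_list)
```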
- Synapse Spark is used to bulk load the data into the Gremlin graph through the SQL (Core) API. NOTE: Cosmos DB Gremlin expects certain JSON fields in the edge properties. Since Cosmos DB billing is charged per hour, adjust the RU/s accordingly to minimize cost; with a 4-node Spark cluster and Cosmos DB throughput at 20,000 RU/s (single region), both the edges (9 million records) and the vertices (6 million records) can be ingested in an hour.
- All search fan-out queries are done using the Azure Cognitive Search API; the Cosmos DB indexer can be scheduled at regular intervals to keep the index up to date.
- To keep the RUs low, the Gremlin query is constructed to include an account list. For example, when you search for account `xyz`, all accounts that sent to or received from `xyz` are collected as `vertices_list`, and the Gremlin query to get two levels of transactions is executed as `g.V().has('accountId',within({vertices_list})).optional(both().both()).bothE().as('e').inV().as('v').select('e', 'v')`. You can customize this query based on your use case; a sketch of running it from Python follows this list.
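The hedged sketch below shows one way to submit that two-level query from Python with the `gremlinpython` driver against the Cosmos DB Gremlin endpoint. The account endpoint, database/graph names, key, and the `vertices_list` values are placeholders; Cosmos DB Gremlin currently accepts string (script) queries with GraphSON v2 serialization, which is what the sketch uses.

```python
# Hedged sketch: running the two-level relationship query against the Cosmos DB
# Gremlin endpoint. Endpoint, database, graph and key values are placeholders.
from gremlin_python.driver import client, serializer

gremlin_client = client.Client(
    "wss://<your-cosmos-account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/<database>/colls/<graph>",
    password="<primary-key>",
    # Cosmos DB Gremlin currently supports GraphSON v2.
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# vertices_list comes from the Cognitive Search fan-out step above (placeholder ids here).
vertices_list = ["C1231006815", "C1666544295"]
within_args = ",".join(f"'{v}'" for v in vertices_list)

query = (
    "g.V().has('accountId', within({vl}))"
    ".optional(both().both())"
    ".bothE().as('e').inV().as('v')"
    ".select('e', 'v')"
).format(vl=within_args)

# Cosmos DB Gremlin only accepts string (script) queries, not bytecode.
results = gremlin_client.submit(query).all().result()
for row in results:
    print(row)

gremlin_client.close()
```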
Deploying this demo requires the following:

- Azure subscription-level role assignments for both `Contributor` and `User Access Administrator`.
- An Azure Service Principal with a client ID and secret (see How to create Service Principal).
There are three deployment options for this demo:
- **Option 1:**
  - Click on the link to deploy the template.
- **Option 2:**
  - Open a browser to https://shell.azure.com. Azure Cloud Shell is an interactive, authenticated, browser-accessible shell for managing Azure resources. It provides the flexibility of choosing the shell experience that best suits the way you work, either Bash or PowerShell.
  - Select the Cloud Shell icon on the Azure portal.
  - Select Bash.
  - `git clone` this repo and `cd` into the `infra` directory.
  - Update the `settings.sh` file with the required values; use the `code` command in the Bash shell to open the file in VS Code.
  - Run `./infra-deployment.sh` to deploy the infrastructure.

  The above deployment should create a container instance with a sample dashboard.
- **Option 3:**
  - Use GitHub Actions to deploy the services. See github_action_infra_deployment for how to deploy the services.
- **Add your client IP to allow access to the Synapse workspace.** Navigate to the resource group -> Synapse workspace -> Networking -> click "Add client IP" and Save.
- **Add yourself as a user to the Synapse workspace.** Navigate to Synapse workspace -> Manage -> Access control -> Add -> scope "Workspace" -> role "Synapse Administrator" -> select your user -> Apply.
- **Add yourself as a Synapse Apache Spark administrator.** Navigate to Synapse workspace -> Manage -> Access control -> Add -> scope "Workspace" -> role "Synapse Apache Spark Administrator" -> select your user -> Apply.
- **Create a data container.** Navigate to the storage account, create a container, e.g. "data", and upload the CSV file into this container.
- **Assign read/write access to the storage account.** Navigate to Synapse workspace -> select the "Data" section -> select and expand the "Linked" storage -> select the primary storage account and the container, e.g. "data" -> right-click on the "data" container and click "Manage access" -> Add -> search for and select your user -> assign Read and Write -> click Apply.
- Upload the CSV file PS_20174392719_1491204439457_log.csv into the Synapse default storage account. Data source: Kaggle Fraud Transaction Detection. (NOTE: you need to use git-lfs to download the CSV file locally.)
- Import the notebook `Load_Bank_transact_data.ipynb`.
- Update `linkedService`, `cosmosEndpoint`, `cosmosMasterKey`, `cosmosDatabaseName` and `cosmosContainerName` in the notebook.
- Run the notebook and monitor the progress of the data load from the Cosmos DB Insights view. (NOTE: Cosmos DB billing is per hour, so adjust your RU/s accordingly to minimize cost.) A hedged sketch of the kind of edge write the notebook performs follows this list.
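For orientation, here is a hedged sketch (not the notebook itself) of how such a bulk load can shape transaction rows into the edge documents that the Cosmos DB Gremlin API stores internally and write them through the Cosmos DB Spark (SQL API / OLTP) connector. The internal field names (`_isEdge`, `_vertexId`, `_vertexLabel`, `_sink`, `_sinkLabel`) follow the graph-backend-json reference linked at the end of this README, and the source column names match the Kaggle PaySim CSV; the partition key column and connector options are assumptions to verify against your own container and the actual notebook.

```python
# Hedged sketch: shaping PaySim transactions into Cosmos DB Gremlin edge
# documents and bulk-writing them with the Cosmos DB Spark (SQL API) connector.
from pyspark.sql import functions as F

# transactions_df is assumed to be the PaySim CSV loaded from the "data" container,
# with columns nameOrig (sender), nameDest (receiver) and amount.
edges = transactions_df.select(
    F.expr("uuid()").alias("id"),             # unique edge id
    F.lit("transaction").alias("label"),
    F.lit(True).alias("_isEdge"),             # marks the document as an edge
    F.col("nameOrig").alias("_vertexId"),     # source vertex id
    F.lit("account").alias("_vertexLabel"),
    F.col("nameDest").alias("_sink"),         # target vertex id
    F.lit("account").alias("_sinkLabel"),
    F.col("amount").alias("amount"),          # transaction amount as an edge property
    F.col("nameOrig").alias("accountId"),     # assumed partition key of the graph container
)

# cosmosEndpoint, cosmosMasterKey, cosmosDatabaseName and cosmosContainerName are
# the values configured in the notebook.
(edges.write.format("cosmos.oltp")
    .option("spark.cosmos.accountEndpoint", cosmosEndpoint)
    .option("spark.cosmos.accountKey", cosmosMasterKey)
    .option("spark.cosmos.database", cosmosDatabaseName)
    .option("spark.cosmos.container", cosmosContainerName)
    .mode("APPEND")
    .save())
```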
A sample Python web app is deployed as part of the infrastructure deployment. Navigate to the public URL from the container instance and start exploring the data.

*Screenshot of dashboard*
- User authentication is not yet implemented for the dashboard app.
- https://tinkerpop.apache.org/docs/current/tutorials/getting-started/
- https://tinkerpop.apache.org/docs/current/reference/#a-note-on-lambdas
- https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/cosmos-db/graph/gremlin-limits.md
- https://github.com/LuisBosquez/azure-cosmos-db-graph-working-guides/blob/master/graph-backend-json.md
- https://syedhassaanahmed.github.io/2018/10/28/writing-apache-spark-graphframes-to-azure-cosmos-db.html