| page_type | languages | products |
|---|---|---|
| sample |  |  |
Bank transactions have traditionally been stored in transactional databases and analysed with SQL queries; to increase scale, they are now analysed in distributed systems using Apache Spark. While SQL is great for analysing this data, finding relationships between transactions and accounts can be challenging. In this scenario we want to visualize two levels of customer relationships: if A sends to B and B sends to C, then when we look at the transactions made by A we also want to identify C, and vice versa.
Graphs help solve complex problems by utilizing the power of relationships between objects. Some of these relationships can be modeled as SQL statements, but the Gremlin API provides a more concise way to express and search them. In this solution we use the Azure Cosmos DB Gremlin (graph) API to store the transaction data, with customer account IDs as vertices, transactions as edges, and transaction amounts as edge properties. Since running fan-out queries on Cosmos DB is not ideal, we leverage Azure Cognitive Search to index the data in Cosmos DB and use its search API to perform full scan/search queries. Additionally, Azure Cognitive Search gives us the flexibility to search for an account as either sender or receiver. This provides a solution that scales to any number of transactions while keeping the RU requirement for Cosmos DB queries low.
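As an illustration of the fan-out lookup, the hedged sketch below uses the `azure-search-documents` Python SDK to find every indexed transaction in which a given account appears as either sender or receiver. The index name and the field names (`sourceAccount`, `targetAccount`) are hypothetical; they stand for whatever fields your Cosmos DB indexer maps the edge's source and target account ids into, so adjust them to match your own index.

```python
# Hedged sketch: querying the Cognitive Search index that mirrors the Cosmos DB
# graph container. Index name and field names are assumptions; replace them with
# the fields produced by your Cosmos DB indexer.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="transactions-index",          # assumed index name
    credential=AzureKeyCredential("<query-key>"),
)

account_id = "C1231006815"                    # placeholder account id (PaySim-style)

# Match the account whether it appears as the edge source or target.
results = search_client.search(
    search_text=account_id,
    search_fields=["sourceAccount", "targetAccount"],
    select=["sourceAccount", "targetAccount", "amount"],
)

# Collect the counterparties; this becomes the vertices_list used in the Gremlin query below.
vertices_list = set()
for doc in results:
    vertices_list.update([doc["sourceAccount"], doc["targetAccount"]])
print(vertices_list)
```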
- Synapse Spark is used to bulk load the data into the Gremlin graph through the SQL (Core) API. NOTE: Cosmos DB Gremlin expects certain JSON fields in the edge properties. Since Cosmos DB billing is charged per hour, adjust the RU/s accordingly to minimize cost; with a 4-node Spark cluster and Cosmos DB throughput at 20,000 RU/s (single region), both the edges (9 million records) and the vertices (6 million records) can be ingested in an hour.
- All search fan-out queries are done using the Azure Cognitive Search API; the Cosmos DB indexer can be scheduled at regular intervals to keep the index up to date.
- To keep the RUs low, the Gremlin query is constructed to include an account list. For example, when you search for account `xyz`, all accounts that sent to or received from `xyz` are collected as `vertices_list`, and the Gremlin query to get two levels of transactions is executed as `g.V().has('accountId',within({vertices_list})).optional(both().both()).bothE().as('e').inV().as('v').select('e', 'v')`. You can customize this query based on your use case; a sketch of running it from Python follows this list.
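The hedged sketch below shows one way to submit that two-level query from Python with the `gremlinpython` driver against the Cosmos DB Gremlin endpoint. The account endpoint, database/graph names, key, and the `vertices_list` values are placeholders; Cosmos DB Gremlin currently accepts string (script) queries with GraphSON v2 serialization, which is what the sketch uses.

```python
# Hedged sketch: running the two-level relationship query against the Cosmos DB
# Gremlin endpoint. Endpoint, database, graph and key values are placeholders.
from gremlin_python.driver import client, serializer

gremlin_client = client.Client(
    "wss://<your-cosmos-account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/<database>/colls/<graph>",
    password="<primary-key>",
    # Cosmos DB Gremlin currently supports GraphSON v2.
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# vertices_list comes from the Cognitive Search fan-out step above (placeholder ids here).
vertices_list = ["C1231006815", "C1666544295"]
within_args = ",".join(f"'{v}'" for v in vertices_list)

query = (
    "g.V().has('accountId', within({vl}))"
    ".optional(both().both())"
    ".bothE().as('e').inV().as('v')"
    ".select('e', 'v')"
).format(vl=within_args)

# Cosmos DB Gremlin only accepts string (script) queries, not bytecode.
results = gremlin_client.submit(query).all().result()
for row in results:
    print(row)

gremlin_client.close()
```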
Deploying this demo requires the following:

- Azure subscription-level role assignments for both `Contributor` and `User Access Administrator`.
- An Azure Service Principal with a client ID and secret (see How to create Service Principal).
There are three deployment options for this demo:
- **Option 1:**
  - Click on the link to deploy the template.
- **Option 2:**
  - Open a browser to https://shell.azure.com. Azure Cloud Shell is an interactive, authenticated, browser-accessible shell for managing Azure resources. It provides the flexibility of choosing the shell experience that best suits the way you work, either Bash or PowerShell.
  - Select the Cloud Shell icon on the Azure portal.
  - Select Bash.
  - `git clone` this repo and `cd` into the `infra` directory.
  - Update the `settings.sh` file with the required values; use the `code` command in the Bash shell to open the file in VS Code.
  - Run `./infra-deployment.sh` to deploy the infrastructure.

  The above deployment should create a container instance with a sample dashboard.
- **Option 3:**
  - Use GitHub Actions to deploy the services. See github_action_infra_deployment for how to deploy the services.
- **Add your client IP to allow access to the Synapse workspace.** Navigate to the resource group -> Synapse workspace -> Networking -> click "Add client IP" and Save.
- **Add yourself as a user to the Synapse workspace.** Navigate to Synapse workspace -> Manage -> Access control -> Add -> scope "Workspace" -> role "Synapse Administrator" -> select your user -> Apply.
- **Add yourself as a Synapse Apache Spark administrator.** Navigate to Synapse workspace -> Manage -> Access control -> Add -> scope "Workspace" -> role "Synapse Apache Spark Administrator" -> select your user -> Apply.
- **Create a data container.** Navigate to the storage account, create a container, e.g. "data", and upload the CSV file into this container.
- **Assign read/write access to the storage account.** Navigate to Synapse workspace -> select the "Data" section -> select and expand the "Linked" storage -> select the primary storage account and the container, e.g. "data" -> right-click on the "data" container and click "Manage access" -> Add -> search for and select your user -> assign Read and Write -> click Apply.
- Upload the CSV file PS_20174392719_1491204439457_log.csv into the Synapse default storage account. Data source: Kaggle Fraud Transaction Detection. (NOTE: you need to use git-lfs to download the CSV file locally.)
- Import the notebook `Load_Bank_transact_data.ipynb`.
- Update `linkedService`, `cosmosEndpoint`, `cosmosMasterKey`, `cosmosDatabaseName` and `cosmosContainerName` in the notebook.
- Run the notebook and monitor the progress of the data load from the Cosmos DB Insights view. (NOTE: Cosmos DB billing is per hour, so adjust your RU/s accordingly to minimize cost.) A hedged sketch of the kind of edge write the notebook performs follows this list.
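For orientation, here is a hedged sketch (not the notebook itself) of how such a bulk load can shape transaction rows into the edge documents that the Cosmos DB Gremlin API stores internally and write them through the Cosmos DB Spark (SQL API / OLTP) connector. The internal field names (`_isEdge`, `_vertexId`, `_vertexLabel`, `_sink`, `_sinkLabel`) follow the graph-backend-json reference linked at the end of this README, and the source column names match the Kaggle PaySim CSV; the partition key column and connector options are assumptions to verify against your own container and the actual notebook.

```python
# Hedged sketch: shaping PaySim transactions into Cosmos DB Gremlin edge
# documents and bulk-writing them with the Cosmos DB Spark (SQL API) connector.
from pyspark.sql import functions as F

# transactions_df is assumed to be the PaySim CSV loaded from the "data" container,
# with columns nameOrig (sender), nameDest (receiver) and amount.
edges = transactions_df.select(
    F.expr("uuid()").alias("id"),             # unique edge id
    F.lit("transaction").alias("label"),
    F.lit(True).alias("_isEdge"),             # marks the document as an edge
    F.col("nameOrig").alias("_vertexId"),     # source vertex id
    F.lit("account").alias("_vertexLabel"),
    F.col("nameDest").alias("_sink"),         # target vertex id
    F.lit("account").alias("_sinkLabel"),
    F.col("amount").alias("amount"),          # transaction amount as an edge property
    F.col("nameOrig").alias("accountId"),     # assumed partition key of the graph container
)

# cosmosEndpoint, cosmosMasterKey, cosmosDatabaseName and cosmosContainerName are
# the values configured in the notebook.
(edges.write.format("cosmos.oltp")
    .option("spark.cosmos.accountEndpoint", cosmosEndpoint)
    .option("spark.cosmos.accountKey", cosmosMasterKey)
    .option("spark.cosmos.database", cosmosDatabaseName)
    .option("spark.cosmos.container", cosmosContainerName)
    .mode("APPEND")
    .save())
```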
A sample Python web app is deployed as part of the infrastructure deployment. Navigate to the public URL from the container instance and start exploring the data.

*Screenshot of dashboard*
- User authentication is not yet implemented for the dashboard app.
- https://tinkerpop.apache.org/docs/current/tutorials/getting-started/
- https://tinkerpop.apache.org/docs/current/reference/#a-note-on-lambdas
- https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/cosmos-db/graph/gremlin-limits.md
- https://github.com/LuisBosquez/azure-cosmos-db-graph-working-guides/blob/master/graph-backend-json.md
- https://syedhassaanahmed.github.io/2018/10/28/writing-apache-spark-graphframes-to-azure-cosmos-db.html