# Databricks notebook source
# MAGIC %md
# MAGIC You may find this series of notebooks at https://github.com/databricks-industry-solutions/anti-money-laundering. For more information about this solution accelerator, visit https://www.databricks.com/blog/2021/07/16/aml-solutions-at-scale-using-databricks-lakehouse-platform.html.
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC # Address Validation
# MAGIC A pattern we want to briefly touch upon is matching address text to actual street view imagery. Oftentimes, there is a need to validate the addresses that have been linked to entities on file, yet obtaining a visualization of an address, cleaning it, and validating it can be a tedious, time-consuming, manual process. Luckily, the Lakehouse architecture lets us automate all of this using Python, machine learning runtimes with PyTorch, and pre-trained open-source models. Below is an example of an address that looks valid to the human eye.
# COMMAND ----------
# MAGIC %md
# MAGIC <img src="https://brysmiwasb.blob.core.windows.net/demos/aml/aml_image_matching.png" width=600>
# COMMAND ----------
# MAGIC %run ./config/aml_config
# COMMAND ----------
from pyspark.sql.functions import *

addresses = (
    spark
    .table(config['db_entities'])
    .filter(col("address").isNotNull())
    # Strip commas and turn spaces into '+' so each address drops straight into a URL query string
    .withColumn("address", translate(translate(col("address"), ',', ''), ' ', '+'))
    .select('address')
    .toPandas()
)

addresses.head(5)
# COMMAND ----------
# MAGIC %md
# MAGIC ### Using the Google Maps API
# MAGIC Using the Street View API, analysts can quickly access property pictures for investigation purposes. Although there are limitations in terms of request rate (see the Google Maps API terms and conditions), having access to some property data is of great value for an AML investigation.
# MAGIC In order to safely pass credentials to the notebook, we make use of the [secrets API](https://docs.databricks.com/security/secrets/index.html).
# COMMAND ----------
goog_api_key = dbutils.secrets.get(scope="solution-accelerator-cicd", key="google-api")
# COMMAND ----------
import time
import requests
# Download a street view image for each of the first 100 addresses
for index, row in addresses[0:100].iterrows():
    url = f"https://maps.googleapis.com/maps/api/streetview?size=640x640&fov=50&location={row['address']}&key={goog_api_key}"
    req = requests.get(url)
    with open(f'{temp_directory}/img_{index}.jpg', 'wb') as file:
        file.write(req.content)
    req.close()
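# COMMAND ----------
# MAGIC %md
# MAGIC The loop above fires requests back to back, which can trip the request-rate limits mentioned earlier. As a sketch of how downloads could be made more robust (the `fetch_streetview` helper and its `max_retries` / `pause` parameters are illustrative, not part of the original accelerator), each request can be wrapped with a simple retry and exponential backoff:
# COMMAND ----------
def fetch_streetview(address, out_path, max_retries=3, pause=1.0):
    """Illustrative helper: download one street view image, backing off on throttling or transient errors."""
    url = f"https://maps.googleapis.com/maps/api/streetview?size=640x640&fov=50&location={address}&key={goog_api_key}"
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code == 200:
            with open(out_path, 'wb') as file:
                file.write(resp.content)
            return True
        # 429 / 5xx responses typically indicate throttling or transient failures: wait, then retry
        time.sleep(pause * (2 ** attempt))
    return False

# Example usage: fetch_streetview(addresses.loc[0, 'address'], f'{temp_directory}/img_0.jpg')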
# COMMAND ----------
# MAGIC %md
# MAGIC Using matplotlib, we can inspect a given address through the comfort of a Databricks notebook. It would appear that this specific address does not refer to any building / property. This additional context may have to be reported as part of an ongoing suspicious activity report (SAR).
# COMMAND ----------
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

plt.figure(figsize=(10, 10))
img = mpimg.imread(f'{temp_directory}/img_0.jpg')
imgplot = plt.imshow(img)
plt.show()
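# COMMAND ----------
# MAGIC %md
# MAGIC The same inspection scales to a quick visual sweep over several addresses at once. Below is a minimal sketch that renders the first nine downloaded images in a grid; the 3x3 layout is an arbitrary illustrative choice.
# COMMAND ----------
fig, axes = plt.subplots(3, 3, figsize=(12, 12))
for i, ax in enumerate(axes.flat):
    # img_0.jpg .. img_8.jpg were written by the download loop above
    ax.imshow(mpimg.imread(f'{temp_directory}/img_{i}.jpg'))
    ax.set_title(f'img_{i}')
    ax.axis('off')
plt.show()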
# COMMAND ----------
# MAGIC %md
# MAGIC ### Augmented intelligence
# MAGIC In addition to the complexity of investigating random pictures, the sheer amount of data analysts would have to inspect manually often hinders an efficient investigation process. Could we programmatically access pictures from Google Maps and train a model to detect a property or building automatically? Whilst AI is often seen as an automated black-box decision engine, we consider AI in the context of AML as an augmented intelligence process that provides analysts with all the necessary context to conduct AML investigations faster and more effectively.
# COMMAND ----------
# MAGIC %md
# MAGIC Instead of training our own computer vision model, which would be expensive in terms of infrastructure, skill set, and the manual effort of labeling data, we demonstrate how analysts can use a [pretrained](http://pytorch.org/docs/master/torchvision/models.html) model in PyTorch to make predictions. The model we'll use is a [VGG16](https://pytorch.org/hub/pytorch_vision_vgg/) convolutional network, trained on the [ImageNet](http://www.image-net.org/) dataset.
# COMMAND ----------
import torch
import requests
from PIL import Image
import torchvision.models as models
import torchvision.transforms as transforms

# Class labels used when training VGG, as JSON: {"0": ["n01440764", "tench"], ...}
LABELS_URL = 'https://raw.githubusercontent.com/raghakot/keras-vis/master/resources/imagenet_class_index.json'
response = requests.get(LABELS_URL)  # Make an HTTP GET request and store the response.
labels = {int(key): value for key, value in response.json().items()}

# We can do all the preprocessing using a transform pipeline.
min_img_size = 224  # The min size, as noted in the PyTorch pretrained models doc, is 224 px.
transform_pipeline = transforms.Compose([transforms.Resize(min_img_size),
                                         transforms.ToTensor(),
                                         transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                              std=[0.229, 0.224, 0.225])])

# Load the pretrained model once, outside the loop, and switch it to inference mode.
vgg = models.vgg16(pretrained=True)  # This may take a few minutes the first time.
vgg.eval()

img_and_labels = {}
for i in range(100):
    img = Image.fromarray(mpimg.imread(f'{temp_directory}/img_{i}.jpg'))
    img = transform_pipeline(img)
    # PyTorch pretrained models expect tensor dims (num input imgs, num color channels, height, width).
    # Currently we have (num color channels, height, width); insert the batch axis at index 0.
    img = img.unsqueeze(0)
    with torch.no_grad():
        prediction = vgg(img)  # Returns a Tensor of shape (batch, num class labels)
    # Our prediction is the index of the class label with the largest value.
    img_and_labels[i] = labels[prediction.argmax().item()]
# COMMAND ----------
# MAGIC %md
# MAGIC As reported below, a pre-trained model already offers a lot of information out of the box that can be really useful for understanding how legitimate a given address is.
# COMMAND ----------
img_and_labels
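# COMMAND ----------
# MAGIC %md
# MAGIC The top-1 label alone hides how confident the model actually is. Softmaxing the raw logits turns them into class probabilities, so an analyst can tell a confident call from a near tie. A minimal sketch for a single image, reusing the `transform_pipeline`, `vgg`, and `labels` objects defined above:
# COMMAND ----------
import torch

img = Image.fromarray(mpimg.imread(f'{temp_directory}/img_0.jpg'))
batch = transform_pipeline(img).unsqueeze(0)
with torch.no_grad():
    probs = torch.nn.functional.softmax(vgg(batch)[0], dim=0)
# Report the five most likely ImageNet classes with their probabilities
top_probs, top_idxs = probs.topk(5)
for p, idx in zip(top_probs.tolist(), top_idxs.tolist()):
    print(f"{labels[idx][1]}: {p:.1%}")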
# COMMAND ----------
# MAGIC %md
# MAGIC The power of Delta Lake allows us to store a reference to our unstructured data along with a label for simple querying in our classification breakdown below.
# COMMAND ----------
import pandas as pd

# Each dict value is an [imagenet_class_id, label] pair; keep the image number as an explicit column
pdf = pd.DataFrame.from_dict(img_and_labels, orient='index', columns=['class_id', 'label'])
pdf = pdf.reset_index().rename(columns={'index': 'image_number'})
# Drop 'envelope' predictions (likely Google's "no imagery available" placeholder tiles) before persisting
spark.createDataFrame(pdf).filter(col("label") != 'envelope').write.mode('overwrite').saveAsTable(config['db_streetview'])
# COMMAND ----------
display(sql("select label, count(1) from {} group by label".format(config['db_streetview'])))
# COMMAND ----------
# MAGIC %md
# MAGIC ### Closing thoughts
# MAGIC Although image classification is oftentimes a data science / data engineering exercise, storing the results back to a Delta table allows AML analysts to investigate those cases further through simple dashboarding or SQL capabilities. The ability to query unstructured data alongside its labels with Delta Lake is an enormous time saver for analysts, cutting the validation process down from days or weeks to minutes.