<!doctype html>
<html>
<head>
<title>American Sign Language recognition</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="css/frame.css" media="screen" rel="stylesheet" type="text/css" />
<link href="css/controls.css" media="screen" rel="stylesheet" type="text/css" />
<link href="css/custom.css" media="screen" rel="stylesheet" type="text/css" />
<link href='https://fonts.googleapis.com/css?family=Open+Sans:400,700' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700' rel='stylesheet' type='text/css'>
<link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,700" rel="stylesheet">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<script src="js/menu.js"></script>
<style>
.menu-index {
color: rgb(255, 255, 255) !important;
opacity: 1 !important;
font-weight: 700 !important;
}
</style>
</head>
<body>
<div class="menu-container"></div>
<div class="content-container">
<div class="banner" style="background: url('img/hello.jpg') no-repeat center; background-size: cover; height: 200px;"></div>
<div class="banner">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2 class="add-top-margin-small">What problem are we trying to solve?</h2>
<p class="text">
Speech-to-text software exists for many languages around the world and
enables us to communicate more easily with others in the era of digital communication.
However, similar technologies are largely missing for sign languages. We plan to take a
step toward solving this by creating a pipeline that detects and identifies the letters of
the alphabet in American Sign Language (ASL).
</p>
</div>
</div>
</div>
</div>
<div class="content">
<div class="content-table flex-column">
<!-------------------------------------------------------------------------------------------->
<!--Start Intro-->
<div class="flex-row">
<div class="flex-item flex-column">
<img class="image" src="img/cover.png">
</div>
<div class="flex-item flex-column">
<p class="text text-large">
Sparsh Binjrajka, sbinj (at) cs.washington.edu<br><br>
Duncan Du, wenyudu (at) cs.washington.edu<br><br>
Aldrich Fan, longxf (at) uw.edu<br><br>
Josh Ning, long2000 (at) cs.washington.edu<br><br>
<a target="_blank" href="http://cs.washington.edu/">Paul G. Allen School<br>of Computer Science & Engineering</a><br>
<a target="_blank" href="http://www.washington.edu/">University of Washington</a><br><br>
185 E Stevens Way NE<br>
Seattle, WA 98195-2350
</p>
</div>
</div>
<div class="flex-row">
<div class="flex-item flex-column">
<p class="text add-top-margin">
We are creating a machine learning pipeline that detects and identifies the 26 letters of the ASL alphabet in video or webcam footage.
Few models have successfully classified signs in real time. One that did show promising training results
turned out to be heavily biased and performed poorly when we tested it. As such, we want to create our own dataset to validate and test existing models.
</p>
</div>
</div>
<!--End Intro-->
<!-------------------------------------------------------------------------------------------->
<!--Start Text Only-->
<div class="flex-row">
<div class="flex-item flex-column">
<h2>Existing works</h2>
<hr>
<p class="text">
We found an existing publication on identifying the ASL alphabet using the YOLOv5 model. However, upon further investigation,
we were uncertain whether the dataset used was trustworthy. The training and testing images were taken in the same setting and
from similar angles, and most showed the same hand. The reported test results were therefore likely biased and not representative of how
the model actually performs. This was confirmed when we recreated the model and tested it ourselves, getting poor results.
</p>
</div>
</div>
<!--End Text Only-->
<!-------------------------------------------------------------------------------------------->
<!--Start Text around Image-->
<div class="content">
<div class="flex-row">
<div class="flex-item flex-column">
<h2>Methods</h2>
<hr>
<div class="flex-row">
<div class="flex-item flex-item-stretch flex-column">
<iframe width="800" height="550" src="https://www.youtube.com/embed/GwvBAgATCo0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div>
<div class="flex-item flex-column">
<p class="text">
This video demonstrates the final YOLOv8 model trained on the dataset we built. The rest of this section
describes how we evaluated existing work, created that dataset, and trained and tested the model.
</p>
</div>
</div>
<p class="text">
In exploring existing work on ASL letter detection, we found a <a target="_blank" href="https://public.roboflow.com/object-detection/american-sign-language-letters">pre-trained YOLOv5 model by David Lee</a>.
A video demonstrating the results of the model shows almost all signs being detected with a confidence of at least 0.8.
</p>
</div>
</div>
<div class="flex-row">
<div class="flex-item flex-item-stretch-2 flex-column">
<p class="text">
We decided to replicate this model by downloading its dataset and training the YOLOv8 model on a 70-20-10 train/validation/test split.
The results were quite underwhelming: it didn’t recognise most signs correctly, and for the ones it did get right, the
confidence was only around 65%. This is not surprising, since the dataset includes signs from the signer’s perspective and
not from the viewer’s perspective. Further, all of the pictures show David Lee’s hand in the same environment,
so the model is probably biased.
</p>
</div>
<div class="flex-item flex-item-stretch flex-column">
<iframe width="336" height="200" src="https://www.youtube.com/embed/ldqTpmufxUM" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div>
</div>
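<div class="flex-row">
<div class="flex-item flex-column">
<p class="text">
For readers who want to reproduce this step, the sketch below shows roughly how a YOLOv8 model can be trained on a YOLO-format export of the dataset with the Ultralytics Python package. The paths, model size, and epoch count are illustrative placeholders, not our exact settings.
</p>
<pre><code># Rough sketch: train and evaluate YOLOv8 on a YOLO-format dataset export.
# "asl_letters/data.yaml" and the hyperparameters are placeholders.
from ultralytics import YOLO

# Start from a small pretrained Ultralytics checkpoint.
model = YOLO("yolov8s.pt")

# data.yaml lists the train/val/test image folders and the 26 letter classes.
model.train(data="asl_letters/data.yaml", epochs=100, imgsz=640)

# Evaluate on the held-out test split (the 10% part of the 70-20-10 split).
metrics = model.val(split="test")
print(metrics.box.map50)
</code></pre>
</div>
</div>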
<div class="flex-row">
<div class="flex-item flex-column">
<p class="text">
The limitations of the previous dataset were that it did not include different signers, some of the training data was either incorrectly signed or photographed from the signer’s perspective, and the same background was used for all pictures.
</p>
</div>
</div>
<div class="flex-row">
<div class="flex-item flex-column">
<p class="text">
We then decided to look for YouTube videos that we could break down into a series of frames (sampled at 2 fps) and annotate to create our dataset.
The idea behind this was to build a dataset that not only had correct signs from the viewer’s
perspective but also a variety of signers with varying backgrounds, which would make our model more accurate and less biased.
</p>
</div>
</div>
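<div class="flex-row">
<div class="flex-item flex-column">
<p class="text">
As a rough illustration of the frame-sampling step (not our exact script), the snippet below pulls about 2 frames per second out of a downloaded video with OpenCV; the file paths are placeholders.
</p>
<pre><code># Rough sketch: sample frames from a downloaded video at about 2 fps.
import os
import cv2

video_path = "downloads/asl_fingerspelling.mp4"  # placeholder path
os.makedirs("frames", exist_ok=True)

cap = cv2.VideoCapture(video_path)
native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
step = max(1, round(native_fps / 2))  # keep roughly 2 frames per second

saved, index = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:
        cv2.imwrite(f"frames/frame_{saved:05d}.jpg", frame)
        saved += 1
    index += 1

cap.release()
print(f"Saved {saved} frames for annotation")
</code></pre>
</div>
</div>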
<div class="flex-row">
<div class="flex-item flex-column">
<p class="text">
When we trained this model, we also included augmentations: a horizontal flip (to include left-handed signers) and rotation and shearing by ±5 degrees, to make the predictions invariant to small changes in hand position.
</p>
</div>
</div>
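<div class="flex-row">
<div class="flex-item flex-column">
<p class="text">
The augmentations themselves were configured in the training platform, but an equivalent can be expressed in code with the Albumentations library; the sketch below is only illustrative and is not the exact configuration we used.
</p>
<pre><code># Rough sketch: bounding-box-safe augmentations similar to the ones described
# above (horizontal flip, rotation and shear within +/- 5 degrees).
import albumentations as A

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                          # left-handed signers
        A.Affine(rotate=(-5, 5), shear=(-5, 5), p=0.5),   # small pose changes
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: pass an image (numpy array) plus its YOLO-format boxes and labels.
# augmented = transform(image=image, bboxes=boxes, class_labels=labels)
</code></pre>
</div>
</div>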
<div class="flex-row">
<div class="flex-item flex-item-stretch flex-column">
<iframe width="500" height="375" src="https://www.youtube.com/embed/ZxMw5Bs-zuM" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div>
<div class="flex-item flex-item-stretch flex-column">
<p class="text">
This new model gave superior results compared to the previous one, but it has its limitations as well (as can be seen in the video).
The model identifies signs much better when they are placed a bit further away from the camera than when they are placed closer.
This result made sense when we looked at our dataset again: most of it consisted of pictures that resembled the setup that
Duncan had with the webcam, i.e. located a bit further away from the camera. Hence, the signs made by Duncan were more accurately
identified than the signs made by Sparsh, who was located closer to the webcam. Another limitation of this model was that we could
not find very clear and sharp images for many of the signs, since a lot of the frames we extracted from the videos either included
much of the signer’s face within the ROI box or did not show the signs clearly.
</p>
</div>
</div>
<div class="flex-row">
<div class="flex-item flex-column">
<p class="text">
We created our own dataset to generate pictures with the following features:
</p>
<ol class="nested">
<li>Correct signs, shown from the perspective of the viewer.</li>
<li>Seven different signers in varying lighting and background settings.</li>
<li>Varying camera angles with respect to the signer.</li>
<li>A mix of pictures taken close to the camera as well as further away.</li>
</ol>
</div>
</div>
<div class="flex-row">
<div class="flex-item flex-item-stretch flex-column">
<iframe width="800" height="600" src="https://www.youtube.com/embed/jvAcieWxq8A" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div>
<div class="flex-item flex-column">
<p class="text">
The idea behind this was to create a dataset that more closely resembles the environment in which an ASL
recognition model might be deployed. We used our dataset to train the YOLOv8 model and, as you can see from
the video, it does well for signs at varying distances from the camera and is fairly accurate across
all categories. A few letters, like “M” and “N”, don’t work quite as well yet, but that can be improved.
</p>
</div>
</div>
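<div class="flex-row">
<div class="flex-item flex-column">
<p class="text">
A detector trained this way can be run live against a webcam with the Ultralytics package, which is how demos like the one above can be reproduced; the checkpoint path below is a placeholder for whatever a training run produces.
</p>
<pre><code># Rough sketch: run the trained letter detector live on a webcam.
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # placeholder checkpoint

# source=0 opens the default webcam; show=True draws the predicted letters
# and confidences on each frame as it streams in.
for result in model.predict(source=0, stream=True, show=True, conf=0.5):
    pass  # inspect result.boxes here if the predictions are needed in code
</code></pre>
</div>
</div>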
<div class="flex-row">
<div class="flex-item flex-column">
<h2>Our dataset</h2>
<hr>
<p class="text">
Here's the link to our <a target="_blank" href="https://universe.roboflow.com/asl-classification/video-call-asl-signs">dataset</a>.
</p>
</div>
</div>
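<div class="flex-row">
<div class="flex-item flex-column">
<p class="text">
The dataset can also be pulled programmatically with the roboflow Python package. In the sketch below, the workspace and project names come from the link above, while the API key and version number are placeholders.
</p>
<pre><code># Rough sketch: download the dataset in YOLOv8 format via the roboflow package.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder key
project = rf.workspace("asl-classification").project("video-call-asl-signs")
dataset = project.version(1).download("yolov8")  # version number is a placeholder
print(dataset.location)  # local folder with images, labels and data.yaml
</code></pre>
</div>
</div>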
<div class="flex-row">
<div class="flex-item flex-column">
<h2>Future Works</h2>
<hr>
<h3>Word-Level Detection with Motion</h3>
<p class="text">
Currently, this model is only capable of recognising static signs, i.e. signs without any motion. However,
the vast majority of ASL signs involve motion. A more complicated neural network, such as an LSTM, would be
needed to extend this model to word-level detection (as in WLASL) and other motion-based signs, but the same
dataset-creation techniques can be used for that dataset as well.
</p>
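<p class="text">
As a rough illustration of that direction (not a finished design), per-frame features such as hand keypoints or detector embeddings could be fed to an LSTM that classifies whole clips into a word vocabulary; all names and sizes below are placeholder assumptions.
</p>
<pre><code># Rough sketch: an LSTM over per-frame features for word-level classification.
# Feature extraction is assumed to happen elsewhere; sizes are illustrative.
import torch
import torch.nn as nn

class SignSequenceClassifier(nn.Module):
    def __init__(self, feature_dim=128, hidden_dim=256, num_words=100):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_words)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, feature_dim)
        _, (hidden, _) = self.lstm(frame_features)
        return self.head(hidden[-1])  # logits over the word vocabulary

# Example: a batch of 4 clips, 30 frames each, 128-dim features per frame.
logits = SignSequenceClassifier()(torch.randn(4, 30, 128))
</code></pre>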
<h3>Dataset Diversity</h3>
<p class="text">
The dataset can be further improved by including more skin tones and different background settings.
The seven signers in the data are not a representative group of all ASL signers; a more complete dataset
should include more variety in terms of age, skin tone, etc. In addition, the background and lighting
conditions in the dataset are limited. If a model were to be deployed in an arbitrary setting, the dataset should
contain more variety in background features such as furniture and differently colored walls.
</p>
</div>
</div>
<div class="banner">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2>Credits</h2>
<hr>
<p class="text add-bottom-margin-large">
David Lee’s American Sign Language Letters Dataset — Link: <br>https://public.roboflow.com/object-detection/american-sign-language-letters
<br>YouTube Videos:
<br>https://www.youtube.com/watch?v=tkMg8g8vVUo
<br>https://www.youtube.com/watch?v=sHyG7iz3ork
<br>https://www.youtube.com/watch?v=bFv_mLwBvHc
<br>https://www.youtube.com/watch?v=lYhAAMDQl-Q
<br>Roboflow platform for data annotation and YOLOv8 model training
<br>https://roboflow.com/
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>