1.Metric-Based Few-Shot Learning for Video Action Recognition ⬇️
In the few-shot scenario, a learner must effectively generalize to unseen classes given a small support set of labeled examples. While a relatively large amount of research has gone into few-shot learning for image classification, little work has been done on few-shot video classification. In this work, we address the task of few-shot video action recognition with a set of two-stream models. We evaluate the performance of a set of convolutional and recurrent neural network video encoder architectures used in conjunction with three popular metric-based few-shot algorithms. We train and evaluate using a few-shot split of the Kinetics 600 dataset. Our experiments confirm the importance of the two-stream setup, and find prototypical networks and pooled long short-term memory network embeddings to give the best performance as few-shot method and video encoder, respectively. For a 5-shot 5-way task, this setup obtains 84.2% accuracy on the test set and 59.4% on a special "challenge" test set, composed of highly confusable classes.
2.Street Crossing Aid Using Light-weight CNNs for the Visually Impaired ⬇️
In this paper, we address an issue that the visually impaired commonly face while crossing intersections and propose a solution that takes form as a mobile application. The application utilizes a deep learning convolutional neural network model, LytNetV2, to output necessary information that the visually impaired may lack when without human companions or guide-dogs. A prototype of the application runs on iOS devices of versions 11 or above. It is designed for comprehensiveness, concision, accuracy, and computational efficiency through delivering the two most important pieces of information, pedestrian traffic light color and direction, required to cross the road in real-time. Furthermore, it is specifically aimed to support those facing financial burden as the solution takes the form of a free mobile application. Through the modification and utilization of key principles in MobileNetV3 such as depthwise seperable convolutions and squeeze-excite layers, the deep neural network model achieves a classification accuracy of 96% and average angle error of 6.15 degrees, while running at a frame rate of 16.34 frames per second. Additionally, the model is trained as an image classifier, allowing for a faster and more accurate model. The network is able to outperform other methods such as object detection and non-deep learning algorithms in both accuracy and thoroughness. The information is delivered through both auditory signals and vibrations, and it has been tested on seven visually impaired and has received above satisfactory responses.
3.Mining Minimal Map-Segments for Visual Place Classifiers ⬇️
In visual place recognition (VPR), map segmentation (MS) is a preprocessing technique used to partition a given view-sequence map into place classes (i.e., map segments) so that each class has good place-specific training images for a visual place classifier (VPC). Existing approaches to MS implicitly/explicitly suppose that map segments have a certain size, or individual map segments are balanced in size. However, recent VPR systems showed that very small important map segments (minimal map segments) often suffice for VPC, and the remaining large unimportant portion of the map should be discarded to minimize map maintenance cost. Here, a new MS algorithm that can mine minimal map segments from a large view-sequence map is presented. To solve the inherently NP hard problem, MS is formulated as a video-segmentation problem and the efficient point-trajectory based paradigm of video segmentation is used. The proposed map representation was implemented with three types of VPC: deep convolutional neural network, bag-of-words, and object class detector, and each was integrated into a Monte Carlo localization algorithm (MCL) within a topometric VPR framework. Experiments using the publicly available NCLT dataset thoroughly investigate the efficacy of MS in terms of VPR performance.
4.Fault-Diagnosing SLAM for Varying Scale Change Detection ⬇️
In this paper, we present a new fault diagnosis (FD) -based approach for detection of imagery changes that can detect significant changes as inconsistencies between different sub-modules (e.g., self-localizaiton) of visual SLAM. Unlike classical change detection approaches such as pairwise image comparison (PC) and anomaly detection (AD), neither the memorization of each map image nor the maintenance of up-to-date place-specific anomaly detectors are required in this FD approach. A significant challenge that is encountered when incorporating different SLAM sub-modules into FD involves dealing with the varying scales of objects that have changed (e.g., the appearance of small dangerous obstacles on the floor). To address this issue, we reconsider the bag-of-words (BoW) image representation, by exploiting its recent advances in terms of self-localization and change detection. As a key advantage, BoW image representation can be reorganized into any different scaling by simply cropping the original BoW image. Furthermore, we propose to combine different self-localization modules with strong and weak BoW features with different discriminativity, and to treat inconsistency between strong and weak self-localization as an indicator of change. The efficacy of the proposed approach for FD with/without AD and/or PC was experimentally validated.
5.Target-Specific Action Classification for Automated Assessment of Human Motor Behavior from Video ⬇️
Objective monitoring and assessment of human motor behavior can improve the diagnosis and management of several medical conditions. Over the past decade, significant advances have been made in the use of wearable technology for continuously monitoring human motor behavior in free-living conditions. However, wearable technology remains ill-suited for applications which require monitoring and interpretation of complex motor behaviors (e.g. involving interactions with the environment). Recent advances in computer vision and deep learning have opened up new possibilities for extracting information from video recordings. In this paper, we present a hierarchical vision-based behavior phenotyping method for classification of basic human actions in video recordings performed using a single RGB camera. Our method addresses challenges associated with tracking multiple human actors and classification of actions in videos recorded in changing environments with different fields of view. We implement a cascaded pose tracker that uses temporal relationships between detections for short-term tracking and appearance-based tracklet fusion for long-term tracking. Furthermore, for action classification, we use pose evolution maps derived from the cascaded pose tracker as low-dimensional and interpretable representations of the movement sequences for training a convolutional neural network. The cascaded pose tracker achieves an average accuracy of 88% in tracking the target human actor in our video recordings, and overall system achieves average test accuracy of 84% for target-specific action classification in untrimmed video recordings.
6.Multi-user Augmented Reality Application for Video Communication in Virtual Space ⬇️
Communication is the most useful tool to impart knowledge, understand ideas, clarify thoughts and expressions, organize plan and manage every single day-to-day activity. Although there are different modes of communication, physical barrier always affects the clarity of the message due to the absence of body language and facial expressions. These barriers are overcome by video calling, which is technically the most advance mode of communication at present. The proposed work concentrates around the concept of video calling in a more natural and seamless way using Augmented Reality (AR). AR can be helpful in giving the users an experience of physical presence in each other's environment. Our work provides an entirely new platform for video calling, wherein the users can enjoy the privilege of their own virtual space to interact with the individual's environment. Moreover, there is no limitation of sharing the same screen space. Any number of participants can be accommodated over a single conference without having to compromise the screen size.
7.Adversarial Learning with Margin-based Triplet Embedding Regularization ⬇️
The Deep neural networks (DNNs) have achieved great success on a variety of computer vision tasks, however, they are highly vulnerable to adversarial attacks. To address this problem, we propose to improve the local smoothness of the representation space, by integrating a margin-based triplet embedding regularization term into the classification objective, so that the obtained model learns to resist adversarial examples. The regularization term consists of two steps optimizations which find potential perturbations and punish them by a large margin in an iterative way. Experimental results on MNIST, CASIA-WebFace, VGGFace2 and MS-Celeb-1M reveal that our approach increases the robustness of the network against both feature and label adversarial attacks in simple object classification and deep face recognition.
8.Document Rectification and Illumination Correction using a Patch-based CNN ⬇️
We propose a novel learning method to rectify document images with various distortion types from a single input image. As opposed to previous learning-based methods, our approach seeks to first learn the distortion flow on input image patches rather than the entire image. We then present a robust technique to stitch the patch results into the rectified document by processing in the gradient domain. Furthermore, we propose a second network to correct the uneven illumination, further improving the readability and OCR accuracy. Due to the less complex distortion present on the smaller image patches, our patch-based approach followed by stitching and illumination correction can significantly improve the overall accuracy in both the synthetic and real datasets.
9.Retro-Actions: Learning 'Close' by Time-Reversing 'Open' Videos ⬇️
We investigate video transforms that result in class-homogeneous label-transforms. These are video transforms that consistently maintain or modify the labels of all videos in each class. We propose a general approach to discover invariant classes, whose transformed examples maintain their label; pairs of equivariant classes, whose transformed examples exchange their labels; and novel-generating classes, whose transformed examples belong to a new class outside the dataset. Label transforms offer additional supervision previously unexplored in video recognition benefiting data augmentation and enabling zero-shot learning opportunities by learning a class from transformed videos of its counterpart.
Amongst such video transforms, we study horizontal-flipping, time-reversal, and their composition. We highlight errors in naively using horizontal-flipping as a form of data augmentation in video. Next, we validate the realism of time-reversed videos through a human perception study where people exhibit equal preference for forward and time-reversed videos. Finally, we test our approach on two datasets, Jester and Something-Something, evaluating the three video transforms for zero-shot learning and data augmentation. Our results show that gestures such as zooming in can be learnt from zooming out in a zero-shot setting, as well as more complex actions with state transitions such as digging something out of something from burying something in something.
10.Deep Aggregation of Regional Convolutional Activations for Content Based Image Retrieval ⬇️
One of the key challenges of deep learning based image retrieval remains in aggregating convolutional activations into one highly representative feature vector. Ideally, this descriptor should encode semantic, spatial and low level information. Even though off-the-shelf pre-trained neural networks can already produce good representations in combination with aggregation methods, appropriate fine tuning for the task of image retrieval has shown to significantly boost retrieval performance. In this paper, we present a simple yet effective supervised aggregation method built on top of existing regional pooling approaches. In addition to the maximum activation of a given region, we calculate regional average activations of extracted feature maps. Subsequently, weights for each of the pooled feature vectors are learned to perform a weighted aggregation to a single feature vector. Furthermore, we apply our newly proposed NRA loss function for deep metric learning to fine tune the backbone neural network and to learn the aggregation weights. Our method achieves state-of-the-art results for the INRIA Holidays data set and competitive results for the Oxford Buildings and Paris data sets while reducing the training time significantly.
11.Weakly Supervised Semantic Segmentation Using Constrained Dominant Sets ⬇️
The availability of large-scale data sets is an essential pre-requisite for deep learning based semantic segmentation schemes. Since obtaining pixel-level labels is extremely expensive, supervising deep semantic segmentation networks using low-cost weak annotations has been an attractive research problem in recent years. In this work, we explore the potential of Constrained Dominant Sets (CDS) for generating multi-labeled full mask predictions to train a fully convolutional network (FCN) for semantic segmentation. Our experimental results show that using CDS's yields higher-quality mask predictions compared to methods that have been adopted in the literature for the same purpose.
12.ACFNet: Attentional Class Feature Network for Semantic Segmentation ⬇️
Recent works have made great progress in semantic segmentation by exploiting richer context, most of which are designed from a spatial perspective. In contrast to previous works, we present the concept of class center which extracts the global context from a categorical perspective. This class-level context describes the overall representation of each class in an image. We further propose a novel module, named Attentional Class Feature (ACF) module, to calculate and adaptively combine different class centers according to each pixel. Based on the ACF module, we introduce a coarse-to-fine segmentation network, called Attentional Class Feature Network (ACFNet), which can be composed of an ACF module and any off-the-shell segmentation network (base network). In this paper, we use two types of base networks to evaluate the effectiveness of ACFNet. We achieve new state-of-the-art performance of 81.85% mIoU on Cityscapes dataset with only finely annotated data used for training.
13.EATEN: Entity-aware Attention for Single Shot Visual Text Extraction ⬇️
Extracting entity from images is a crucial part of many OCR applications, such as entity recognition of cards, invoices, and receipts. Most of the existing works employ classical detection and recognition paradigm. This paper proposes an Entity-aware Attention Text Extraction Network called EATEN, which is an end-to-end trainable system to extract the entities without any post-processing. In the proposed framework, each entity is parsed by its corresponding entity-aware decoder, respectively. Moreover, we innovatively introduce a state transition mechanism which further improves the robustness of entity extraction. In consideration of the absence of public benchmarks, we construct a dataset of almost 0.6 million images in three real-world scenarios (train ticket, passport and business card), which is publicly available at this https URL. To the best of our knowledge, EATEN is the first single shot method to extract entities from images. Extensive experiments on these benchmarks demonstrate the state-of-the-art performance of EATEN.
14.Learning Lightweight Pedestrian Detector with Hierarchical Knowledge Distillation ⬇️
It remains very challenging to build a pedestrian detection system for real world applications, which demand for both accuracy and speed. This work presents a novel hierarchical knowledge distillation framework to learn a lightweight pedestrian detector, which significantly reduces the computational cost and still holds the high accuracy at the same time. Following the `teacher--student' diagram that a stronger, deeper neural network can teach a lightweight network to learn better representations, we explore multiple knowledge distillation architectures and reframe this approach as a unified, hierarchical distillation framework. In particular, the proposed distillation is performed at multiple hierarchies, multiple stages in a modern detector, which empowers the student detector to learn both low-level details and high-level abstractions simultaneously. Experiment result shows that a student model trained by our framework, with 6 times compression in number of parameters, still achieves competitive performance as the teacher model on the widely used pedestrian detection benchmark.
15.CNN-based RGB-D Salient Object Detection: Learn, Select and Fuse ⬇️
The goal of this work is to present a systematic solution for RGB-D salient object detection, which addresses the following three aspects with a unified framework: modal-specific representation learning, complementary cue selection and cross-modal complement fusion. To learn discriminative modal-specific features, we propose a hierarchical cross-modal distillation scheme, in which the well-learned source modality provides supervisory signals to facilitate the learning process for the new modality. To better extract the complementary cues, we formulate a residual function to incorporate complements from the paired modality adaptively. Furthermore, a top-down fusion structure is constructed for sufficient cross-modal interactions and cross-level transmissions. The experimental results demonstrate the effectiveness of the proposed cross-modal distillation scheme in zero-shot saliency detection and pre-training on a new modality, as well as the advantages in selecting and fusing cross-modal/cross-level complements.
16.A nonlocal feature-driven exemplar-based approach for image inpainting ⬇️
We present a nonlocal variational image completion technique which admits simultaneous inpainting of multiple structures and textures in a unified framework. The recovery of geometric structures is achieved by using general convolution operators as a measure of behavior within an image. These are combined with a nonlocal exemplar-based approach to exploit the self-similarity of an image in the selected feature domains and to ensure the inpainting of textures. We also introduce an anisotropic patch distance metric to allow for better control of the feature selection within an image and present a nonlocal energy functional based on this metric. Finally, we derive an optimization algorithm for the proposed variational model and examine its validity experimentally with various test images.
17.Making the Invisible Visible: Action Recognition Through Walls and Occlusions ⬇️
Understanding people's actions and interactions typically depends on seeing them. Automating the process of action recognition from visual data has been the topic of much research in the computer vision community. But what if it is too dark, or if the person is occluded or behind a wall? In this paper, we introduce a neural network model that can detect human actions through walls and occlusions, and in poor lighting conditions. Our model takes radio frequency (RF) signals as input, generates 3D human skeletons as an intermediate representation, and recognizes actions and interactions of multiple people over time. By translating the input to an intermediate skeleton-based representation, our model can learn from both vision-based and RF-based datasets, and allow the two tasks to help each other. We show that our model achieves comparable accuracy to vision-based action recognition systems in visible scenarios, yet continues to work accurately when people are not visible, hence addressing scenarios that are beyond the limit of today's vision-based action recognition.
18.Spherical Kernel for Efficient Graph Convolution on 3D Point Clouds ⬇️
We propose a spherical kernel for efficient graph convolution of 3D point clouds. Our metric-based kernels systematically quantize the local 3D space to identify distinctive geometric relationships in the data. Similar to the regular grid CNN kernels, the spherical kernel maintains translation-invariance and asymmetry properties, where the former guarantees weight sharing among similar local structures in the data and the latter facilitates fine geometric learning. The proposed kernel is applied to graph neural networks without edge-dependent filter generation, making it computationally attractive for large point clouds. In our graph networks, each vertex is associated with a single point location and edges connect the neighborhood points within a defined range. The graph gets coarsened in the network with farthest point sampling. Analogous to the standard CNNs, we define pooling and unpooling operations for our network. We demonstrate the effectiveness of the proposed spherical kernel with graph neural networks for point cloud classification and semantic segmentation using ModelNet, ShapeNet, RueMonge2014, ScanNet and S3DIS datasets. The source code and the trained models can be downloaded from this https URL.
19.Coupled Generative Adversarial Network for Continuous Fine-grained Action Segmentation ⬇️
We propose a novel conditional GAN (cGAN) model for continuous fine-grained human action segmentation, that utilises multi-modal data and learned scene context information. The proposed approach utilises two GANs: termed Action GAN and Auxiliary GAN, where the Action GAN is trained to operate over the current RGB frame while the Auxiliary GAN utilises supplementary information such as depth or optical flow. The goal of both GANs is to generate similar `action codes', a vector representation of the current action. To facilitate this process a context extractor that incorporates data and recent outputs from both modes is used to extract context information to aid recognition. The result is a recurrent GAN architecture which learns a task specific loss function from multiple feature modalities. Extensive evaluations on variants of the proposed model to show the importance of utilising different information streams such as context and auxiliary information in the proposed network; and show that our model is capable of outperforming state-of-the-art methods for three widely used datasets: 50 Salads, MERL Shopping and Georgia Tech Egocentric Activities, comprising both static and dynamic camera settings.
20.Forecasting Future Action Sequences with Neural Memory Networks ⬇️
We propose a novel neural memory network based framework for future action sequence forecasting. This is a challenging task where we have to consider short-term, within sequence relationships as well as relationships in between sequences, to understand how sequences of actions evolve over time. To capture these relationships effectively, we introduce neural memory networks to our modelling scheme. We show the significance of using two input streams, the observed frames and the corresponding action labels, which provide different information cues for our prediction task. Furthermore, through the proposed method we effectively map the long-term relationships among individual input sequences through separate memory modules, which enables better fusion of the salient features. Our method outperforms the state-of-the-art approaches by a large margin on two publicly available datasets: Breakfast and 50 Salads.
21.Fourier-CPPNs for Image Synthesis ⬇️
Compositional Pattern Producing Networks (CPPNs) are differentiable networks that independently map (x, y) pixel coordinates to (r, g, b) colour values. Recently, CPPNs have been used for creating interesting imagery for creative purposes, e.g., neural art. However their architecture biases generated images to be overly smooth, lacking high-frequency detail. In this work, we extend CPPNs to explicitly model the frequency information for each pixel output, capturing frequencies beyond the DC component. We show that our Fourier-CPPNs (F-CPPNs) provide improved visual detail for image synthesis.
22.Learning 3D-aware Egocentric Spatial-Temporal Interaction via Graph Convolutional Networks ⬇️
To enable intelligent automated driving systems, a promising strategy is to understand how human drives and interacts with road users in complicated driving situations. In this paper, we propose a 3D-aware egocentric spatial-temporal interaction framework for automated driving applications. Graph convolution networks (GCN) is devised for interaction modeling. We introduce three novel concepts into GCN. First, we decompose egocentric interactions into ego-thing and ego-stuff interaction, modeled by two GCNs. In both GCNs, ego nodes are introduced to encode the interaction between thing objects (e.g., car and pedestrian), and interaction between stuff objects (e.g., lane marking and traffic light). Second, objects' 3D locations are explicitly incorporated into GCN to better model egocentric interactions. Third, to implement ego-stuff interaction in GCN, we propose a MaskAlign operation to extract features for irregular objects.
We validate the proposed framework on tactical driver behavior recognition. Extensive experiments are conducted using Honda Research Institute Driving Dataset, the largest dataset with diverse tactical driver behavior annotations. Our framework demonstrates substantial performance boost over baselines on the two experimental settings by 3.9% and 6.0%, respectively. Furthermore, we visualize the learned affinity matrices, which encode ego-thing and ego-stuff interactions, to showcase the proposed framework can capture interactions effectively.
23.Fine-grained Action Segmentation using the Semi-Supervised Action GAN ⬇️
In this paper we address the problem of continuous fine-grained action segmentation, in which multiple actions are present in an unsegmented video stream. The challenge for this task lies in the need to represent the hierarchical nature of the actions and to detect the transitions between actions, allowing us to localise the actions within the video effectively. We propose a novel recurrent semi-supervised Generative Adversarial Network (GAN) model for continuous fine-grained human action segmentation. Temporal context information is captured via a novel Gated Context Extractor (GCE) module, composed of gated attention units, that directs the queued context information through the generator model, for enhanced action segmentation. The GAN is made to learn features in a semi-supervised manner, enabling the model to perform action classification jointly with the standard, unsupervised, GAN learning procedure. We perform extensive evaluations on different architectural variants to demonstrate the importance of the proposed network architecture, and show that it is capable of outperforming current state-of-the-art on three challenging datasets: 50 Salads, MERL Shopping and Georgia Tech Egocentric Activities dataset.
24.Propagated Perturbation of Adversarial Attack for well-known CNNs: Empirical Study and its Explanation ⬇️
Deep Neural Network based classifiers are known to be vulnerable to perturbations of inputs constructed by an adversarial attack to force misclassification. Most studies have focused on how to make vulnerable noise by gradient based attack methods or to defense model from adversarial attack. The use of the denoiser model is one of a well-known solution to reduce the adversarial noise although classification performance had not significantly improved. In this study, we aim to analyze the propagation of adversarial attack as an explainable AI(XAI) point of view. Specifically, we examine the trend of adversarial perturbations through the CNN architectures. To analyze the propagated perturbation, we measured normalized Euclidean Distance and cosine distance in each CNN layer between the feature map of the perturbed image passed through denoiser and the non-perturbed original image. We used five well-known CNN based classifiers and three gradient-based adversarial attacks. From the experimental results, we observed that in most cases, Euclidean Distance explosively increases in the final fully connected layer while cosine distance fluctuated and disappeared at the last layer. This means that the use of denoiser can decrease the amount of noise. However, it failed to defense accuracy degradation.
25.Triplet-Aware Scene Graph Embeddings ⬇️
Scene graphs have become an important form of structured knowledge for tasks such as for image generation, visual relation detection, visual question answering, and image retrieval. While visualizing and interpreting word embeddings is well understood, scene graph embeddings have not been fully explored. In this work, we train scene graph embeddings in a layout generation task with different forms of supervision, specifically introducing triplet super-vision and data augmentation. We see a significant performance increase in both metrics that measure the goodness of layout prediction, mean intersection-over-union (mIoU)(52.3% vs. 49.2%) and relation score (61.7% vs. 54.1%),after the addition of triplet supervision and data augmentation. To understand how these different methods affect the scene graph representation, we apply several new visualization and evaluation methods to explore the evolution of the scene graph embedding. We find that triplet supervision significantly improves the embedding separability, which is highly correlated with the performance of the layout prediction model.
26.Gaze Estimation for Assisted Living Environments ⬇️
Effective assisted living environments must be able to perform inferences on how their occupants interact with one another as well as with surrounding objects. To accomplish this goal using a vision-based automated approach, multiple tasks such as pose estimation, object segmentation and gaze estimation must be addressed. Gaze direction in particular provides some of the strongest indications of how a person interacts with the environment. In this paper, we propose a simple neural network regressor that estimates the gaze direction of individuals in a multi-camera assisted living scenario, relying only on the relative positions of facial keypoints collected from a single pose estimation model. To handle cases of keypoint occlusion, our model exploits a novel confidence gated unit in its input layer. In addition to the gaze direction, our model also outputs an estimation of its own prediction uncertainty. Experimental results on a public benchmark demonstrate that our approach performs on pair with a complex, dataset-specific baseline, while its uncertainty predictions are highly correlated to the actual angular error of corresponding estimations. Finally, experiments on images from a real assisted living environment demonstrate the higher suitability of our model for its final application.
27.Unsupervised Learning for Real-World Super-Resolution ⬇️
Most current super-resolution methods rely on low and high resolution image pairs to train a network in a fully supervised manner. However, such image pairs are not available in real-world applications. Instead of directly addressing this problem, most works employ the popular bicubic downsampling strategy to artificially generate a corresponding low resolution image. Unfortunately, this strategy introduces significant artifacts, removing natural sensor noise and other real-world characteristics. Super-resolution networks trained on such bicubic images therefore struggle to generalize to natural images. In this work, we propose an unsupervised approach for image super-resolution. Given only unpaired data, we learn to invert the effects of bicubic downsampling in order to restore the natural image characteristics present in the data. This allows us to generate realistic image pairs, faithfully reflecting the distribution of real-world images. Our super-resolution network can therefore be trained with direct pixel-wise supervision in the high resolution domain, while robustly generalizing to real input. We demonstrate the effectiveness of our approach in quantitative and qualitative experiments.
28.Understanding Architectures Learnt by Cell-based Neural Architecture Search ⬇️
Neural architecture search (NAS) generates architectures automatically for given tasks, e.g., image classification and language modeling. Recently, various NAS algorithms have been proposed to improve search efficiency and effectiveness. However, little attention is paid to understand the generated architectures, including whether they share any commonality. In this paper, we analyze the generated architectures and give our explanations of their superior performance. We firstly uncover that the architectures generated by NAS algorithms share a common connection pattern, which contributes to their fast convergence. Consequently, these architectures are selected during architecture search. We further empirically and theoretically show that the fast convergence is the consequence of smooth loss landscape and accurate gradient information conducted by the common connection pattern. Contracting to universal recognition, we finally observe that popular NAS architectures do not always generalize better than the candidate architectures, encouraging us to re-think about the state-of-the-art NAS algorithms.
29.Defending Against Physically Realizable Attacks on Image Classification ⬇️
We study the problem of defending deep neural network approaches for image classification from physically realizable attacks. First, we demonstrate that the two most scalable and effective methods for learning robust models, adversarial training with PGD attacks and randomized smoothing, exhibit very limited effectiveness against three of the highest profile physical attacks. Next, we propose a new abstract adversarial model, rectangular occlusion attacks, in which an adversary places a small adversarially crafted rectangle in an image, and develop two approaches for efficiently computing the resulting adversarial examples. Finally, we demonstrate that adversarial training using our new attack yields image classification models that exhibit high robustness against the physically realizable attacks we study, offering the first effective generic defense against such attacks.
30.An Efficient Sampling-based Method for Online Informative Path Planning in Unknown Environments ⬇️
The ability to plan informative paths online is essential to robot autonomy. In particular, sampling-based approaches are often used as they are capable of using arbitrary information gain formulations. However, they are prone to local minima, resulting in sub-optimal trajectories, and sometimes do not reach global coverage. In this paper, we present a new RRT*-inspired online informative path planning algorithm. Our method continuously expands a single tree of candidate trajectories and rewires segments to maintain the tree and refine intermediate trajectories. This allows the algorithm to achieve global coverage and maximize the utility of a path in a global context, using a single objective function. We demonstrate the algorithm's capabilities in the applications of autonomous indoor exploration as well as accurate Truncated Signed Distance Field (TSDF)-based 3D reconstruction on-board a Micro Aerial vehicle (MAV). We study the impact of commonly used information gain and cost formulations in these scenarios and propose a novel TSDF-based 3D reconstruction gain and cost-utility formulation. Detailed evaluation in realistic simulation environments show that our approach outperforms state of the art methods in these tasks. Experiments on a real MAV demonstrate the ability of our method to robustly plan in real-time, exploring an indoor environment solely with on-board sensing and computation. We make our framework available for future research.
31.A Transfer Learning Approach for Automated Segmentation of Prostate Whole Gland and Transition Zone in Diffusion Weighted MRI ⬇️
The segmentation of prostate whole gland and transition zone in Diffusion Weighted MRI (DWI) are the first step in designing computer-aided detection algorithms for prostate cancer. However, variations in MRI acquisition parameters and scanner manufacturing result in different appearances of prostate tissue in the images. Convolutional neural networks (CNNs) which have shown to be successful in various medical image analysis tasks including segmentation are typically sensitive to the variations in imaging parameters. This sensitivity leads to poor segmentation performance of CNNs trained on a source cohort and tested on a target cohort from a different scanner and hence, it limits the applicability of CNNs for cross-cohort training and testing. Contouring prostate whole gland and transition zone in DWI images are time-consuming and expensive. Thus, it is important to enable CNNs pretrained on images of source domain, to segment images of target domain with minimum requirement for manual segmentation of images from the target domain. In this work, we propose a transfer learning method based on a modified U-net architecture and loss function, for segmentation of prostate whole gland and transition zone in DWIs using a CNN pretrained on a source dataset and tested on the target dataset. We explore the effect of the size of subset of target dataset used for fine-tuning the pre-trained CNN on the overall segmentation accuracy. Our results show that with a fine-tuning data as few as 30 patients from the target domain, the proposed transfer learning-based algorithm can reach dice score coefficient of 0.80 for both prostate whole gland and transition zone segmentation. Using a fine-tuning data of 115 patients from the target domain, dice score coefficient of 0.85 and 0.84 are achieved for segmentation of whole gland and transition zone, respectively, in the target domain.
32.Underwater Image Super-Resolution using Deep Residual Multipliers ⬇️
We present a deep residual network-based generative model for single image super-resolution (SISR) of underwater imagery for use by autonomous underwater robots. We also provide an adversarial training pipeline for learning SISR from paired data. In order to supervise the training, we formulate an objective function that evaluates the perceptual quality of an image based on its global content, color, and local style information. Additionally, we present USR-248, a large-scale dataset of three sets of underwater images of high (640x480) and low (80x60, 160x120, and 320x240) resolution. USR-248 contains over 7K paired instances in each set of data for supervised training of 2x, 4x, or 8x SISR models. Furthermore, we validate the effectiveness of our proposed model through qualitative and quantitative experiments and compare the results with several state-of-the-art models' performances. We also analyze its practical feasibility for applications such as scene understanding and attention modeling in noisy visual conditions.
33.Genetic Neural Architecture Search for automatic assessment of human sperm images ⬇️
Male infertility is a disease which affects approximately 7% of men. Sperm morphology analysis (SMA) is one of the main diagnosis methods for this problem. Manual SMA is an inexact, subjective, non-reproducible, and hard to teach process. As a result, in this paper, we introduce a novel automatic SMA based on a neural architecture search algorithm termed Genetic Neural Architecture Search (GeNAS). For this purpose, we used a collection of images called MHSMA dataset contains 1,540 sperm images which have been collected from 235 patients with infertility problems. GeNAS is a genetic algorithm that acts as a meta-controller which explores the constrained search space of plain convolutional neural network architectures. Every individual of the genetic algorithm is a convolutional neural network trained to predict morphological deformities in different segments of human sperm (head, vacuole, and acrosome), and its fitness is calculated by a novel proposed method named GeNAS-WF especially designed for noisy, low resolution, and imbalanced datasets. Also, a hashing method is used to save each trained neural architecture fitness, so we could reuse them during fitness evaluation and speed up the algorithm. Besides, in terms of running time and computation power, our proposed architecture search method is far more efficient than most of the other existing neural architecture search algorithms. Additionally, other proposed methods have been evaluated on balanced datasets, whereas GeNAS is built specifically for noisy, low quality, and imbalanced datasets which are common in the field of medical imaging. In our experiments, the best neural architecture found by GeNAS has reached an accuracy of 92.66%, 77.33%, and 77.66% in the vacuole, head, and acrosome abnormality detection, respectively. In comparison to other proposed algorithms for MHSMA dataset, GeNAS achieved state-of-the-art results.
34.Brain Tumor Segmentation and Survival Prediction ⬇️
The paper demonstrates the use of the fully convolutional neural network for glioma segmentation on the BraTS 2019 dataset. Three-layers deep encoder-decoder architecture is used along with dense connection at encoder part to propagate the information from coarse layer to deep layers. This architecture is used to train three tumor sub-components separately. Subcomponent training weights are initialized with whole tumor weights to get the localization of the tumor within the brain. At the end, three segmentation results were merged to get the entire tumor segmentation. Dice Similarity of training dataset with focal loss implementation for whole tumor, tumor core and enhancing tumor is 0.92, 0.90 and 0.79 respectively. Radiomic features along with segmentation results and age are used to predict the overall survival of patients using random forest regressor to classify survival of patients in long, medium and short survival classes. 55.4% of classification accuracy is reported for training dataset with the scans whose resection status is gross-total resection.
35.Deep 3D-Zoom Net: Unsupervised Learning of Photo-Realistic 3D-Zoom ⬇️
The 3D-zoom operation is the positive translation of the camera in the Z-axis, perpendicular to the image plane. In contrast, the optical zoom changes the focal length and the digital zoom is used to enlarge a certain region of an image to the original image size. In this paper, we are the first to formulate an unsupervised 3D-zoom learning problem where images with an arbitrary zoom factor can be generated from a given single image. An unsupervised framework is convenient, as it is a challenging task to obtain a 3D-zoom dataset of natural scenes due to the need for special equipment to ensure camera movement is restricted to the Z-axis. In addition, the objects in the scenes should not move when being captured, which hinders the construction of a large dataset of outdoor scenes. We present a novel unsupervised framework to learn how to generate arbitrarily 3D-zoomed versions of a single image, not requiring a 3D-zoom ground truth, called the Deep 3D-Zoom Net. The Deep 3D-Zoom Net incorporates the following features: (i) transfer learning from a pre-trained disparity estimation network via a back re-projection reconstruction loss; (ii) a fully convolutional network architecture that models depth-image-based rendering (DIBR), taking into account high-frequency details without the need for estimating the intermediate disparity; and (iii) incorporating a discriminator network that acts as a no-reference penalty for unnaturally rendered areas. Even though there is no baseline to fairly compare our results, our method outperforms previous novel view synthesis research in terms of realistic appearance on large camera baselines. We performed extensive experiments to verify the effectiveness of our method on the KITTI and Cityscapes datasets.
36.Infusing Learned Priors into Model-Based Multispectral Imaging ⬇️
We introduce a new algorithm for regularized reconstruction of multispectral (MS) images from noisy linear measurements. Unlike traditional approaches, the proposed algorithm regularizes the recovery problem by using a prior specified \emph{only} through a learned denoising function. More specifically, we propose a new accelerated gradient method (AGM) variant of regularization by denoising (RED) for model-based MS image reconstruction. The key ingredient of our approach is the three-dimensional (3D) deep neural net (DNN) denoiser that can fully leverage spationspectral correlations within MS images. Our results suggest the generalizability of our MS-RED algorithm, where a single trained DNN can be used to solve several different MS imaging problems.
37.Learning Your Way Without Map or Compass: Panoramic Target Driven Visual Navigation ⬇️
We present a robot navigation system that uses an imitation learning framework to successfully navigate in complex environments. Our framework takes a pre-built 3D scan of a real environment and trains an agent from pre-generated expert trajectories to navigate to any position given a panoramic view of the goal and the current visual input without relying on map, compass, odometry, GPS or relative position of the target at runtime. Our end-to-end trained agent uses RGB and depth (RGBD) information and can handle large environments (up to
$1031m^2$ ) across multiple rooms (up to$40$ ) and generalizes to unseen targets. We show that when compared to several baselines using deep reinforcement learning and RGBD SLAM, our method (1) requires fewer training examples and less training time, (2) reaches the goal location with higher accuracy, (3) produces better solutions with shorter paths for long-range navigation tasks, and (4) generalizes to unseen environments given an RGBD map of the environment.
38.HyperLearn: A Distributed Approach for Representation Learning in Datasets With Many Modalities ⬇️
Multimodal datasets contain an enormous amount of relational information, which grows exponentially with the introduction of new modalities. Learning representations in such a scenario is inherently complex due to the presence of multiple heterogeneous information channels. These channels can encode both (a) inter-relations between the items of different modalities and (b) intra-relations between the items of the same modality. Encoding multimedia items into a continuous low-dimensional semantic space such that both types of relations are captured and preserved is extremely challenging, especially if the goal is a unified end-to-end learning framework. The two key challenges that need to be addressed are: 1) the framework must be able to merge complex intra and inter relations without losing any valuable information and 2) the learning model should be invariant to the addition of new and potentially very different modalities. In this paper, we propose a flexible framework which can scale to data streams from many modalities. To that end we introduce a hypergraph-based model for data representation and deploy Graph Convolutional Networks to fuse relational information within and across modalities. Our approach provides an efficient solution for distributing otherwise extremely computationally expensive or even unfeasible training processes across multiple-GPUs, without any sacrifices in accuracy. Moreover, adding new modalities to our model requires only an additional GPU unit keeping the computational time unchanged, which brings representation learning to truly multimodal datasets. We demonstrate the feasibility of our approach in the experiments on multimedia datasets featuring second, third and fourth order relations.
39.Learning Sparse Mixture of Experts for Visual Question Answering ⬇️
There has been a rapid progress in the task of Visual Question Answering with improved model architectures. Unfortunately, these models are usually computationally intensive due to their sheer size which poses a serious challenge for deployment. We aim to tackle this issue for the specific task of Visual Question Answering (VQA). A Convolutional Neural Network (CNN) is an integral part of the visual processing pipeline of a VQA model (assuming the CNN is trained along with entire VQA model). In this project, we propose an efficient and modular neural architecture for the VQA task with focus on the CNN module. Our experiments demonstrate that a sparsely activated CNN based VQA model achieves comparable performance to a standard CNN based VQA model architecture.
40.Learning to Conceal: A Deep Learning Based Method for Preserving Privacy and Avoiding Prejudice ⬇️
In this paper, we introduce a learning model able to conceals personal information (e.g. gender, age, ethnicity, etc.) from an image, while maintaining any additional information present in the image (e.g. smile, hair-style, brightness). Our trained model is not provided the information that it is concealing, and does not try learning it either. Namely, we created a variational autoencoder (VAE) model that is trained on a dataset including labels of the information one would like to conceal (e.g. gender, ethnicity, age). These labels are directly added to the VAE's sampled latent vector. Due to the limited number of neurons in the latent vector and its appended noise, the VAE avoids learning any relation between the given images and the given labels, as those are given directly. Therefore, the encoded image lacks any of the information one wishes to conceal. The encoding may be decoded back into an image according to any provided properties (e.g. a 40 year old woman).
The proposed architecture can be used as a mean for privacy preserving and can serve as an input to systems, which will become unbiased and not suffer from prejudice. We believe that privacy and discrimination are two of the most important aspects in which the community should try and develop methods to prevent misuse of technological advances.
41.Classifying Topological Charge in SU(3) Yang-Mills Theory with Machine Learning ⬇️
We apply a machine learning technique for identifying the topological charge of quantum gauge configurations in four-dimensional SU(3) Yang-Mills theory. The topological charge density measured on the original and smoothed gauge configurations with and without dimensional reduction is used for inputs of the neural networks (NN) with and without convolutional layers. The gradient flow is used for the smoothing of the gauge field. We find that the topological charge determined at a large flow time can be predicted with high accuracy from the data at small flow times by the trained NN; the accuracy exceeds
$99%$ with the data at$t/a^2\le0.3$ . High robustness against the change of simulation parameters is also confirmed. We find that the best performance is obtained when the spatial coordinates of the topological charge density are fully integrated out as a preprocessing, which implies that our convolutional NN does not find characteristic structures in multi-dimensional space relevant for the determination of the topological charge.