1.Deep Predictive Motion Tracking in Magnetic Resonance Imaging: Application to Fetal Imaging ⬇️
Fetal magnetic resonance imaging (MRI) is challenged by uncontrollable, large, and irregular fetal movements. Fetal MRI is performed in a fully interactive manner in which a technologist monitors motion to prescribe slices at right angles to the anatomy of interest. Current practice involves repeated acquisitions to ensure that diagnostic-quality images are acquired, and the scans are retrospectively registered slice-by-slice to reconstruct 3D images. Nonetheless, manual monitoring of 3D fetal motion based on displayed 2D slices and navigation at the level of stacks-of-slices (instead of slices) is sub-optimal and inefficient. The current process is highly operator-dependent, requires extensive training, and significantly increases the length of fetal MRI scans, which makes them difficult for pregnant women and costly. With that motivation, we present a new real-time, image-based motion tracking technique for MRI using deep learning that can significantly improve on the state of the art. Through a combination of spatial and temporal encoder-decoder networks, our system learns to predict the 3D pose of the fetal head from the dynamics of motion inferred directly from sequences of acquired slices. Compared to recent works that estimate the static 3D pose of the subject from slices, our method learns to predict the dynamics of 3D motion. We compared our trained network on held-out test sets (including data with different characteristics, e.g., different age ranges, and motion trajectories recorded from volunteer subjects) against networks designed for estimation as well as methods adapted to make predictions. The results of all estimation and prediction tasks show that we achieve reliable motion tracking in fetal MRI. This technique can be augmented with deep learning based fast anatomy detection, segmentation, and image registration techniques to build real-time motion tracking and navigation systems.
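To make the spatial-temporal encoder idea concrete, below is a minimal, hypothetical PyTorch sketch (not the paper's architecture): a small CNN encodes each acquired 2D slice, an LSTM aggregates the slice sequence, and a regression head predicts 6-DOF rigid pose parameters for a few future time steps. All layer sizes, the pose parameterization, and the prediction horizon are illustrative assumptions.

```python
# Hypothetical sketch: per-slice CNN encoder + LSTM predicting future 3D pose
# (6 rigid-body parameters) from a sequence of 2D slices. Sizes are illustrative.
import torch
import torch.nn as nn

class SliceEncoder(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):                      # x: (B, 1, H, W)
        return self.fc(self.conv(x).flatten(1))

class PoseTracker(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, horizon=3):
        super().__init__()
        self.encoder = SliceEncoder(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon * 6)   # 6-DOF pose per future step
        self.horizon = horizon

    def forward(self, slices):                 # slices: (B, T, 1, H, W)
        B, T = slices.shape[:2]
        feats = self.encoder(slices.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1]).view(B, self.horizon, 6)

poses = PoseTracker()(torch.randn(2, 8, 1, 64, 64))   # -> (2, 3, 6)
```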
2.A closer look at domain shift for deep learning in histopathology ⬇️
Domain shift is a significant problem in histopathology. There can be large differences in the data characteristics of whole-slide images between medical centers and scanners, making generalization of deep learning to unseen data difficult. To gain a better understanding of the problem, we present a study of convolutional neural networks trained for tumor classification of H&E stained whole-slide images. We analyze how augmentation and normalization strategies affect performance and learned representations, and what features a trained model responds to. Most centrally, we present a novel measure for evaluating the distance between domains in the context of the learned representation of a particular model. This measure can reveal how sensitive a model is to domain variations, and can be used to detect new data that a model will have problems generalizing to. The results show how learning is heavily influenced by the preparation of training data, and that the latent representation used for classification is sensitive to changes in data distribution, especially when training without augmentation or normalization.
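The paper's distance measure is not reproduced here; as a simple, hedged proxy for the same idea, one can compare the feature distributions a trained model produces for two domains, e.g., the L2 distance between mean penultimate-layer features. The `model.features()` hook below is a hypothetical interface, not from the paper.

```python
# A simple proxy (not the paper's measure): compare the feature
# distributions that a trained model produces for two domains.
import torch

@torch.no_grad()
def domain_feature_distance(model, loader_a, loader_b, device="cpu"):
    """L2 distance between mean penultimate-layer features of two domains."""
    def mean_features(loader):
        feats = []
        for x, _ in loader:
            # assumes a hypothetical model.features() returning feature maps
            feats.append(model.features(x.to(device)).flatten(1))
        return torch.cat(feats).mean(dim=0)
    return torch.norm(mean_features(loader_a) - mean_features(loader_b)).item()
```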
3.MIC: Mining Interclass Characteristics for Improved Metric Learning ⬇️
Metric learning seeks to embed images of objects such that class-defined relations are captured by the embedding space. However, variability in images is not just due to different depicted object classes, but also depends on other latent characteristics such as viewpoint or illumination. In addition to these structured properties, random noise further obstructs the visual relations of interest. The common approach to metric learning is to enforce a representation that is invariant under all factors but the ones of interest. In contrast, we propose to explicitly learn the latent characteristics that are shared by and go across object classes. We can then directly explain away structured visual variability, rather than assuming it to be unknown random noise. We propose a novel surrogate task to learn visual characteristics shared across classes with a separate encoder. This encoder is trained jointly with the encoder for class information by reducing their mutual information. On five standard image retrieval benchmarks the approach significantly improves upon the state of the art.
4.Deep Learning for Deepfakes Creation and Detection ⬇️
Deep learning has been successfully applied to solve various complex problems ranging from big data analytics to computer vision and human-level control. Deep learning advances, however, have also been employed to create software that can threaten privacy, democracy, and national security. One such deep learning-powered application that has recently emerged is the "deepfake". Deepfake algorithms can create fake images and videos that humans cannot distinguish from authentic ones. Technologies that can automatically detect and assess the integrity of digital visual media are therefore indispensable. This paper presents a survey of algorithms used to create deepfakes and, more importantly, methods proposed to detect deepfakes in the literature to date. We present extensive discussions on challenges, research trends, and directions related to deepfake technologies. By reviewing the background of deepfakes and state-of-the-art deepfake detection methods, this study provides a comprehensive overview of deepfake techniques and facilitates the development of new and more robust methods to deal with increasingly challenging deepfakes.
5.Dual Adaptive Pyramid Network for Cross-Stain Histopathology Image Segmentation ⬇️
Supervised semantic segmentation normally assumes that the test data come from a domain similar to the training data. In practice, however, the domain mismatch between the training and unseen data can lead to a significant performance drop. Obtaining accurate pixel-wise labels for images in different domains is tedious and labor intensive, especially for histopathology images. In this paper, we propose a dual adaptive pyramid network (DAPNet) for histopathological gland segmentation that adapts from one stain domain to another. We tackle the domain adaptation problem on two levels: 1) the image level considers differences in image color and style; 2) the feature level addresses the spatial inconsistency between the two domains. The two components are implemented as domain classifiers with adversarial training. We evaluate our new approach using two gland segmentation datasets with H&E and DAB-H stains, respectively. Extensive experiments and an ablation study demonstrate the effectiveness of our approach on the domain adaptive segmentation task. We show that the proposed approach performs favorably against other state-of-the-art methods.
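For illustration, a common way to implement such an adversarially trained domain classifier is a gradient-reversal layer in front of a small discriminator; this is a generic sketch, not DAPNet's exact image-level and feature-level classifiers or losses.

```python
# Minimal sketch of adversarial domain alignment with a gradient-reversal layer;
# DAPNet's exact classifier placement and losses are not reproduced here.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None   # reverse gradients flowing to the features

class DomainClassifier(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, feat, lam=1.0):
        return self.net(GradReverse.apply(feat, lam))  # logit: source vs. target domain
```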
6.Gated Channel Transformation for Visual Recognition ⬇️
In this work, we propose a generally applicable transformation unit for visual recognition with deep convolutional neural networks. This transformation explicitly models channel relationships with explainable control variables. These variables determine the neuron behaviors of competition or cooperation, and they are jointly optimized with the convolutional weights towards more accurate recognition. In Squeeze-and-Excitation (SE) Networks, the channel relationships are implicitly learned by fully connected layers, and the SE block is integrated at the block level. We instead introduce a channel normalization layer to reduce the number of parameters and the computational complexity. This lightweight layer incorporates a simple L2 normalization, making our transformation unit applicable at the operator level without much increase in parameters. Extensive experiments demonstrate the effectiveness of our unit with clear margins on many vision tasks, i.e., image classification on ImageNet, object detection and instance segmentation on COCO, and video classification on Kinetics.
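A minimal sketch of such a gated channel unit, following the abstract's description (channel embeddings from spatial L2 norms, an L2 normalization across channels, and learned gating variables); the paper's exact formulation may differ from this approximation.

```python
# Sketch of a gated channel transformation unit: spatial L2-norm channel
# embeddings, L2 normalization across channels, and learned gating variables.
# This is an approximation of the abstract's description, not the paper's code.
import torch
import torch.nn as nn

class GatedChannelUnit(nn.Module):
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):                                   # x: (B, C, H, W)
        embed = self.alpha * x.pow(2).sum((2, 3), keepdim=True).add(self.eps).sqrt()
        norm = embed * (embed.shape[1] ** 0.5) / embed.pow(2).sum(1, keepdim=True).add(self.eps).sqrt()
        gate = 1.0 + torch.tanh(self.gamma * norm + self.beta)
        return x * gate
```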
7.The Good, the Bad and the Ugly: Evaluating Convolutional Neural Networks for Prohibited Item Detection Using Real and Synthetically Composited X-ray Imagery ⬇️
Detecting prohibited items in X-ray security imagery is pivotal in maintaining border and transport security against a wide range of threat profiles. Convolutional Neural Networks (CNN), with the support of a significant volume of data, have brought advances in such automated prohibited object detection and classification. However, collating such large volumes of X-ray security imagery remains a significant challenge. This work opens up the possibility of using synthetically composited imagery, avoiding the need to collate such large volumes of hand-annotated real-world imagery. Here we investigate the difference in detection performance achieved using real and synthetic X-ray training imagery for a CNN architecture detecting three exemplar prohibited items, {Firearm, Firearm Parts, Knives}, within cluttered and complex X-ray security baggage imagery. We achieve a mean average precision (mAP) of 0.88 with a Faster R-CNN and ResNet-101 architecture for this 3-class object detection task using real X-ray imagery. While the performance with synthetically composited X-ray imagery is comparable (0.78 mAP), our extended evaluation demonstrates both the challenges and the promise of using synthetically composited images to diversify X-ray security training imagery for automated detection algorithm training.
8.mustGAN: Multi-Stream Generative Adversarial Networks for MR Image Synthesis ⬇️
Multi-contrast MRI protocols increase the level of morphological information available for diagnosis. Yet, the number and quality of contrasts are limited in practice by various factors including scan time and patient motion. Synthesis of missing or corrupted contrasts can alleviate this limitation and improve clinical utility. Common approaches for multi-contrast MRI involve either one-to-one or many-to-one synthesis methods. One-to-one methods take as input a single source contrast, and they learn a latent representation sensitive to unique features of the source. Meanwhile, many-to-one methods receive multiple distinct sources, and they learn a shared latent representation more sensitive to common features across sources. For enhanced image synthesis, here we propose a multi-stream approach that aggregates information across multiple source images via a mixture of multiple one-to-one streams and a joint many-to-one stream. The shared feature maps generated in the many-to-one stream and the complementary feature maps generated in the one-to-one streams are combined with a fusion block. The location of the fusion block is adaptively modified to maximize task-specific performance. Qualitative and quantitative assessments on T1-, T2-, PD-weighted and FLAIR images clearly demonstrate the superior performance of the proposed method compared to previous state-of-the-art one-to-one and many-to-one methods.
9.Non-imaging single-pixel sensing with optimized binary modulation ⬇️
Conventional high-level sensing tasks such as image classification require high-fidelity images as input to extract target features, which are produced by either complex imaging hardware or high-complexity reconstruction algorithms. In this letter, we propose single-pixel sensing (SPS), which performs sensing tasks directly from the coupled measurements of a single-pixel detector, without the conventional image acquisition and reconstruction process. We build a deep convolutional neural network that comprises a target encoder and a sensing decoder. The encoder simulates the single-pixel detection and employs binary modulation that can be physically implemented at ~22kHz. Both the encoder and decoder are trained together for optimal sensing precision. The effectiveness of SPS is demonstrated on handwritten digit classification on the MNIST dataset, achieving 96.68% classification accuracy at ~1kHz. Compared to the conventional imaging-sensing framework, the reported SPS technique requires fewer measurements, enabling a fast sensing rate, maintains low computational complexity, a wide working spectrum, and a high signal-to-noise ratio, and is further beneficial for communication and encryption.
10.CAT: Compression-Aware Training for bandwidth reduction ⬇️
Convolutional neural networks (CNNs) have become the dominant neural network architecture for solving visual processing tasks. One of the major obstacles hindering the ubiquitous use of CNNs for inference is their relatively high memory bandwidth requirements, which can be a main energy consumer and throughput bottleneck in hardware accelerators. Accordingly, an efficient feature map compression method can result in substantial performance gains. Inspired by quantization-aware training approaches, we propose a compression-aware training (CAT) method that trains the model in a way that allows better compression of feature maps during inference. Our method trains the model to achieve low-entropy feature maps, which enables efficient compression at inference time using classical transform coding methods. CAT significantly improves the state-of-the-art results reported for quantization. For example, on ResNet-34 we achieve 73.1% accuracy (0.2% degradation from the baseline) with an average representation of only 1.79 bits per value. A reference implementation accompanies the paper at this https URL.
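As an illustration of the low-entropy objective (a simplified stand-in, not CAT's actual loss or entropy estimator), one can add a differentiable entropy estimate of the activation values to the task loss:

```python
# Illustrative entropy penalty on activations: lower-entropy feature maps
# compress better with transform coding. A simplified stand-in for CAT's loss.
import torch
import torch.nn.functional as F

def soft_entropy(x, n_bins=16, temperature=10.0):
    """Differentiable estimate of the entropy (bits) of activation values in x."""
    x = x.flatten()
    centers = torch.linspace(x.min().item(), x.max().item(), n_bins, device=x.device)
    # soft assignment of each value to histogram bins
    assign = F.softmax(-temperature * (x.unsqueeze(1) - centers).abs(), dim=1)
    probs = assign.mean(dim=0).clamp_min(1e-12)
    return -(probs * probs.log2()).sum()

# total_loss = task_loss + lambda_cat * soft_entropy(feature_map)
```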
11.Multi-modal segmentation with missing MR sequences using pre-trained fusion networks ⬇️
Missing data is a common problem in machine learning, and in retrospective imaging research it is often encountered in the form of missing imaging modalities. We propose to take missing modalities into account in the design and training of neural networks, to ensure that they are capable of providing the best possible prediction even when multiple images are not available. The proposed network combines three modifications to the standard 3D UNet architecture: a training scheme with dropout of modalities, a multi-pathway architecture with a fusion layer in the final stage, and separate pre-training of these pathways. These modifications are evaluated incrementally in terms of performance on full and missing data, using the BraTS multi-modal segmentation challenge. The final model shows significant improvement with respect to the state of the art on missing data and requires less memory during training.
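A minimal sketch of the modality-dropout training scheme (a generic illustration; the tensor layout and dropout rate are assumptions): entire input modalities are randomly zeroed during training so the network learns to cope with missing sequences.

```python
# Generic modality-dropout sketch: randomly zero whole input modalities during
# training; the layout (B, M, D, H, W) and rate p are illustrative assumptions.
import torch

def drop_modalities(x, p=0.25):
    """x: (B, M, D, H, W) multi-modal input; each modality kept with prob 1 - p,
    but at least one modality is always retained per sample."""
    B, M = x.shape[:2]
    keep = (torch.rand(B, M, device=x.device) > p).float()
    # make sure no sample loses every modality
    empty = keep.sum(dim=1) == 0
    keep[empty, torch.randint(0, M, (int(empty.sum()),), device=x.device)] = 1.0
    return x * keep.view(B, M, 1, 1, 1)
```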
12.Efficient Residual Dense Block Search for Image Super-Resolution ⬇️
Although remarkable progress has been made on single image super-resolution due to the revival of deep convolutional neural networks, deep learning methods are confronted with the challenges of computation and memory consumption in practice, especially on mobile devices. Focusing on this issue, we propose an efficient residual dense block search algorithm with multiple objectives to find fast, lightweight, and accurate networks for image super-resolution. Firstly, to accelerate the super-resolution network, we fully exploit the variation of feature scale with the proposed efficient residual dense blocks. In the proposed evolutionary algorithm, the locations of the pooling and upsampling operators are searched automatically. Secondly, the network architecture is evolved with the guidance of block credits to obtain an accurate super-resolution network. The block credit reflects the effect of the current block and is earned during the model evaluation process. It guides the evolution by weighting the mutation sampling probability to favor effective blocks. Extensive experimental results demonstrate the effectiveness of the proposed search method, and the discovered efficient super-resolution models achieve better performance than state-of-the-art methods with a limited number of parameters and FLOPs.
13.Beyond image classification: zooplankton identification with deep vector space embeddings ⬇️
Zooplankton images, like many other real world data types, have intrinsic properties that make the design of effective classification systems difficult. For instance, the number of classes encountered in practical settings is potentially very large, and classes can be ambiguous or overlap. In addition, the choice of taxonomy often differs between researchers and between institutions. Although high accuracy has been achieved in benchmarks using standard classifier architectures, biases caused by an inflexible classification scheme can have profound effects when the output is used in ecosystem assessments and monitoring.
Here, we propose using a deep convolutional network to construct a vector embedding of zooplankton images. The system maps (embeds) each image into a high-dimensional Euclidean space so that distances between vectors reflect semantic relationships between images. We show that the embedding can be used to derive classifications with accuracy comparable to a dedicated classifier, while simultaneously revealing important structures in the data. Furthermore, we apply the embedding to new classes previously unseen by the system, and evaluate its classification performance in such cases.
Traditional neural network classifiers perform well when the classes are clearly defined a priori and have sufficiently large labeled data sets available. For practical cases in ecology as well as in many other fields this is not the case, and we argue that the vector embedding method presented here is a more appropriate approach.
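As a hedged illustration of deriving classifications from such an embedding (not the paper's exact procedure), one simple option is to assign each image to the nearest class centroid in the embedding space; a previously unseen class only requires computing a centroid from a few labeled examples.

```python
# Nearest-centroid classification in an embedding space (illustrative only).
import numpy as np

def nearest_centroid_predict(train_emb, train_labels, test_emb):
    """train_emb: (N, D) embeddings; train_labels: (N,); test_emb: (M, D)."""
    classes = np.unique(train_labels)
    centroids = np.stack([train_emb[train_labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_emb[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]
```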
14.Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization ⬇️
Fine-grained visual categorization (FGVC) is an important but challenging task due to high intra-class variance and low inter-class variance caused by deformation, occlusion, illumination, etc. An attention convolutional binary neural tree architecture is presented to address these problems for weakly supervised FGVC. Specifically, we incorporate convolutional operations along the edges of the tree structure, and use the routing functions in each node to determine the root-to-leaf computational paths within the tree. The final decision is computed as the summation of the predictions from the leaf nodes. The deep convolutional operations learn to capture the representations of objects, and the tree structure characterizes the coarse-to-fine hierarchical feature learning process. In addition, we use an attention transformer module to force the network to capture discriminative features. The negative log-likelihood loss is used to train the entire network end-to-end by SGD with back-propagation. Experiments on the CUB-200-2011, Stanford Cars, and Aircraft datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
15.Accurate and Compact Convolutional Neural Networks with Trained Binarization ⬇️
Although convolutional neural networks (CNNs) are now widely used in various computer vision applications, their huge demands on parameter storage and computation make deployment on mobile and embedded devices difficult. Recently, binary convolutional neural networks have been explored to help alleviate this issue by quantizing both weights and activations with only a single bit. However, they can suffer a noticeable accuracy degradation compared with full-precision models. In this paper, we propose an improved training approach towards compact binary CNNs with higher accuracy. Trainable scaling factors for both weights and activations are introduced to increase the value range. These scaling factors are trained jointly with the other parameters via backpropagation. In addition, a dedicated training algorithm is developed, including a tight approximation of the derivative of the discontinuous binarization function and $L_2$ regularization acting on the weight scaling factors. With these improvements, the binary CNN achieves 92.3% accuracy on CIFAR-10 with the VGG-Small network. On ImageNet, our method also obtains 46.1% top-1 accuracy with AlexNet and 54.2% with ResNet-18, surpassing previous works.
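A minimal sketch of binarization with trainable scaling factors, using a plain straight-through estimator; the paper's tighter derivative approximation is not reproduced, and the per-channel/per-layer placement of the scaling factors is an assumption.

```python
# Binary convolution with trainable scaling factors and a straight-through
# estimator (STE); a generic sketch, not the paper's exact training algorithm.
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad):
        (x,) = ctx.saved_tensors
        return grad * (x.abs() <= 1).float()       # clip gradient outside [-1, 1]

class BinaryConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.weight_scale = nn.Parameter(torch.ones(self.out_channels, 1, 1, 1))
        self.act_scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        # an L2 penalty on weight_scale can be added to the loss, as in the paper
        w = self.weight_scale * BinarizeSTE.apply(self.weight)
        x = self.act_scale * BinarizeSTE.apply(x)
        return nn.functional.conv2d(x, w, self.bias, self.stride,
                                    self.padding, self.dilation, self.groups)
```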
16.Balancing Specialization, Generalization, and Compression for Detection and Tracking ⬇️
We propose a method for specializing deep detectors and trackers to restricted settings. Our approach is designed with the following goals in mind: (a) improving accuracy in restricted domains; (b) preventing overfitting to new domains and forgetting of generalized capabilities; (c) aggressive model compression and acceleration. To this end, we propose a novel loss that balances compression and acceleration of a deep learning model against loss of generalization capabilities. We apply our method to existing tracker and detector models. We report detection results on the VIRAT and CAVIAR data sets. These results show that our method offers unprecedented compression rates along with improved detection. We also apply our loss for tracker compression at test time, as the tracker processes each video. Our tests on the OTB2015 benchmark show that applying compression during test time actually improves tracking performance.
17.FALCON: Fast and Lightweight Convolution for Compressing and Accelerating CNN ⬇️
How can we efficiently compress Convolutional Neural Networks (CNN) while retaining their accuracy on classification tasks? A promising direction is based on depthwise separable convolution, which replaces a standard convolution with a depthwise convolution and a pointwise convolution. However, previous works based on depthwise separable convolution are limited since 1) they are mostly heuristic approaches without a precise understanding of their relation to standard convolution, and 2) their accuracies do not match that of standard convolution. In this paper, we propose FALCON, an accurate and lightweight method for compressing CNNs. FALCON is derived by interpreting existing convolution methods based on depthwise separable convolution using EHP, our proposed mathematical formulation for approximating the standard convolution kernel. This interpretation leads to a generalized version, rank-k FALCON, which further improves accuracy while sacrificing some compression and computation reduction. In addition, we propose FALCON-branch by fitting FALCON into the previous state-of-the-art convolution unit ShuffleUnitV2, which gives even better accuracy. Experiments show that FALCON and FALCON-branch outperform 1) existing methods based on depthwise separable convolution and 2) standard CNN models, with up to 8x compression and 8x computation reduction while ensuring similar accuracy. We also demonstrate that rank-k FALCON provides even better accuracy than standard convolution in many cases, while using fewer parameters and floating-point operations.
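For reference, this is the standard depthwise separable building block that FALCON interprets and generalizes (a baseline sketch under common conventions, not FALCON's EHP-based formulation):

```python
# Depthwise separable replacement for a standard convolution: a per-channel
# spatial (depthwise) convolution followed by a 1x1 (pointwise) convolution.
import torch.nn as nn

def separable_conv(in_ch, out_ch, kernel_size=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, groups=in_ch, bias=False),   # depthwise
        nn.Conv2d(in_ch, out_ch, 1, bias=False),                         # pointwise
    )
```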
18.Cross-View Kernel Similarity Metric Learning Using Pairwise Constraints for Person Re-identification ⬇️
Person re-identification is the task of matching pedestrian images across non-overlapping cameras. In this paper, we propose a non-linear cross-view similarity metric learning method for handling small training sets in practical re-ID systems. The method employs non-linear mappings combined with cross-view discriminative subspace learning and cross-view distance metric learning based on pairwise similarity constraints. It is a natural extension of XQDA from linear to non-linear mappings using kernels, and learns non-linear transformations for efficiently handling the complex non-linearity of person appearance across camera views. Importantly, the proposed method is very computationally efficient. Extensive experiments on four challenging datasets show that our method attains competitive performance against state-of-the-art methods.
19.Conditional Transferring Features: Scaling GANs to Thousands of Classes with 30% Less High-quality Data for Training ⬇️
Generative adversarial networks (GANs) have greatly improved the quality of unsupervised image generation. Previous GAN-based methods often require a large amount of high-quality training data while producing a small number (e.g., tens) of classes. This work aims to scale up GANs to thousands of classes while reducing the use of high-quality data in training. We propose an image generation method based on conditional transferring features, which can capture pixel-level semantic changes when transforming low-quality images into high-quality ones. Moreover, self-supervised learning is integrated into our GAN architecture to provide more label-free semantic supervisory information observed from the training data. As such, training our GAN architecture requires far fewer high-quality images, supplemented by a small number of additional low-quality images. The experiments on CIFAR-10 and STL-10 show that even after removing 30% of the high-quality images from the training set, our method can still outperform previous ones. Its scalability over object classes has been experimentally validated: with 30% fewer high-quality images, our method obtains the best quality in generating 1,000 ImageNet classes, as well as in generating all 3,755 classes of CASIA-HWDB1.0 Chinese handwriting characters.
20.Guided Attention Network for Object Detection and Counting on Drones ⬇️
Object detection and counting are related but challenging problems, especially for drone-based scenes with small objects and cluttered backgrounds. In this paper, we propose a new Guided Attention Network (GANet) to deal with both object detection and counting tasks based on the feature pyramid. Different from previous methods relying on unsupervised attention modules, we fuse different scales of feature maps using the proposed weakly-supervised Background Attention (BA) between the background and objects for more semantic feature representation. Then, the Foreground Attention (FA) module is developed to consider both the global and local appearance of the object to facilitate accurate localization. Moreover, a new data augmentation strategy is designed to train a robust model for various complex scenes. Extensive experiments on three challenging benchmarks (i.e., UAVDT, CARPK and PUCPR+) show the state-of-the-art detection and counting performance of the proposed method compared with existing methods.
21.Stochastic Conditional Generative Networks with Basis Decomposition ⬇️
While generative adversarial networks (GANs) have revolutionized machine learning, a number of open questions remain before we can fully understand them and exploit their power. One of these questions is how to efficiently achieve proper diversity and sampling of the multi-mode data space. To address this, we introduce BasisGAN, a stochastic conditional multi-mode image generator. By exploiting the observation that a convolutional filter can be well approximated as a linear combination of a small set of basis elements, we learn a plug-and-play basis generator to stochastically generate basis elements, with just a few hundred parameters, to fully embed stochasticity into convolutional filters. By sampling basis elements instead of filters, we dramatically reduce the cost of modeling the parameter space without sacrificing image diversity or fidelity. To illustrate this plug-and-play framework, we construct variants of BasisGAN based on state-of-the-art conditional image generation networks, and train the networks by simply plugging in a basis generator, without additional auxiliary components, hyperparameters, or training objectives. The experimental success is complemented by theoretical results indicating how the perturbations introduced by the proposed sampling of basis elements propagate to the appearance of generated images.
22.Towards Automated Biometric Identification of Sea Turtles (Chelonia mydas) ⬇️
Passive biometric identification enables wildlife monitoring with minimal disturbance. Using a motion-activated camera placed at an elevated position and facing downwards, we collected images of sea turtle carapaces, each belonging to one of sixteen Chelonia mydas juveniles. We then learned covariant and robust image descriptors from these images, enabling indexing and retrieval. In this work, we present several classification results for sea turtle carapaces using the learned image descriptors. We found that a template-based descriptor, the Histogram of Oriented Gradients (HOG), performed considerably better in classification than keypoint-based descriptors. For our dataset, a high-dimensional descriptor is a must due to the minimal gradient and color information inside the carapace images. Using HOG, we obtained an average classification accuracy of 65%.
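For illustration, a HOG-based classification pipeline might look like the sketch below; the HOG parameters and the linear SVM are placeholders, not the study's exact settings.

```python
# Illustrative HOG descriptor + linear classifier pipeline (parameters are
# placeholders, not the study's settings).
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    """images: list of same-sized 2D grayscale arrays."""
    return np.stack([hog(im, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2), block_norm='L2-Hys')
                     for im in images])

# clf = LinearSVC().fit(hog_features(train_imgs), train_labels)
# preds = clf.predict(hog_features(test_imgs))
```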
23.Rescan: Inductive Instance Segmentation for Indoor RGBD Scans ⬇️
In depth-sensing applications ranging from home robotics to AR/VR, it will be common to acquire 3D scans of interior spaces repeatedly at sparse time intervals (e.g., as part of regular daily use). We propose an algorithm that analyzes these "rescans" to infer a temporal model of a scene with semantic instance information. Our algorithm operates inductively by using the temporal model resulting from past observations to infer an instance segmentation of a new scan, which is then used to update the temporal model. The model contains object instance associations across time and thus can be used to track individual objects, even though there are only sparse observations. In experiments on a new benchmark for this task, our algorithm outperforms alternative approaches based on state-of-the-art networks for semantic instance segmentation.
24.Learning Propagation for Arbitrarily-structured Data ⬇️
Processing an input signal that contains arbitrary structures, e.g., superpixels and point clouds, remains a major challenge in computer vision. Linear diffusion, an effective model for image processing, has recently been integrated with deep learning algorithms. In this paper, we propose to learn pairwise relations among data points in a global fashion to improve semantic segmentation with arbitrarily-structured data, through spatial generalized propagation networks (SGPN). The network propagates information over a group of graphs, which represent the arbitrarily-structured data, through a learned, linear diffusion process. The module can be flexibly embedded into and jointly trained with many types of networks, e.g., CNNs. We experiment with semantic segmentation networks, where we use our propagation module to jointly train on different data -- images, superpixels, and point clouds. We show that SGPN consistently improves the performance of both pixel and point cloud segmentation, compared to networks that do not contain this module. Our method suggests an effective way to model the global pairwise relations of arbitrarily-structured data.
25.Pretraining boosts out-of-domain robustness for pose estimation ⬇️
Deep neural networks are highly effective tools for human and animal pose estimation. However, robustness to out-of-domain data remains a challenge. Here, we probe the transfer and generalization ability for pose estimation with two architecture classes (MobileNetV2s and ResNets) pretrained on ImageNet. We generated a novel dataset of 30 horses that allowed for both within-domain and out-of-domain (unseen horse) testing. We find that pretraining on ImageNet strongly improves out-of-domain performance. Moreover, we show that for both pretrained networks and networks trained from scratch, better ImageNet-performing architectures perform better for pose estimation, with a substantial improvement on out-of-domain data when pretrained. Collectively, our results demonstrate that transfer learning is particularly beneficial for out-of-domain robustness.
26.Intelligent image synthesis to attack a segmentation CNN using adversarial learning ⬇️
Deep learning approaches based on convolutional neural networks (CNNs) have been successful in solving a number of problems in medical imaging, including image segmentation. In recent years, it has been shown that CNNs are vulnerable to attacks in which the input image is perturbed by relatively small amounts of noise so that the CNN is no longer able to perform a segmentation of the perturbed image with sufficient accuracy. Therefore, exploring how to attack CNN-based models as well as how to defend models against attacks has become a popular topic, as it also provides insights into the performance and generalization abilities of CNNs. However, most of the existing work assumes unrealistic attack models, i.e. the resulting attacks were specified in advance. In this paper, we propose a novel approach for generating adversarial examples to attack CNN-based segmentation models for medical images. Our approach has three key features: 1) the generated adversarial examples exhibit anatomical variations (in the form of deformations) as well as appearance perturbations; 2) the adversarial examples attack segmentation models so that the Dice scores decrease by a pre-specified amount; 3) the attack is not required to be specified beforehand. We have evaluated our approach on CNN-based approaches for the multi-organ segmentation problem in 2D CT images. We show that the proposed approach can be used to attack different CNN-based segmentation models.
27.Anchor Loss: Modulating Loss Scale based on Prediction Difficulty ⬇️
We propose a novel loss function that dynamically rescales the cross entropy based on the prediction difficulty of a sample. Deep neural network architectures in image classification tasks struggle to disambiguate visually similar objects. Likewise, in human pose estimation, symmetric body parts often confuse the network, which assigns indiscriminative scores to them. This is due to the output prediction, in which only the highest-confidence label is selected without taking a measure of uncertainty into consideration. In this work, we define the prediction difficulty as a relative property derived from the confidence score gap between positive and negative labels. More precisely, the proposed loss function penalizes the network to prevent the score of a false prediction from becoming significant. To demonstrate the efficacy of our loss function, we evaluate it on two different domains: image classification and human pose estimation. We find improvements in both applications, achieving higher accuracy than the baseline methods.
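A hedged sketch of a difficulty-modulated loss in the spirit of this description: per-class sigmoid cross entropy in which negative classes whose scores approach (or exceed) the true-class score are up-weighted. The exact functional form and the exponent are assumptions, not the paper's definition.

```python
# Difficulty-modulated cross entropy sketch: negatives close to the positive
# score are up-weighted. Assumed form, not the paper's exact loss.
import torch
import torch.nn.functional as F

def anchor_style_loss(logits, target, gamma=0.5):
    probs = torch.sigmoid(logits)                            # (B, C)
    pos = probs.gather(1, target.unsqueeze(1))               # true-class score, (B, 1)
    onehot = F.one_hot(target, logits.shape[1]).float()
    # modulator grows as a negative score approaches or exceeds the positive score
    modulator = (1.0 + probs - pos).clamp_min(0).pow(gamma)
    bce = F.binary_cross_entropy(probs, onehot, reduction="none")
    weight = torch.where(onehot.bool(), torch.ones_like(modulator), modulator)
    return (weight * bce).sum(dim=1).mean()
```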
28.Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks ⬇️
It has been widely recognized that adversarial examples can be easily crafted to fool deep networks, which mainly stems from the locally non-linear behavior near input examples. Applying mixup in training provides an effective mechanism to improve generalization performance and model robustness against adversarial perturbations, as it introduces globally linear behavior in between training examples. However, in previous work, mixup-trained models only passively defend against adversarial attacks at inference by directly classifying the inputs, so the induced global linearity is not well exploited. Namely, given the locality of adversarial perturbations, it would be more efficient to actively break that locality via the globality of the model predictions. Inspired by simple geometric intuition, we develop an inference principle, named mixup inference (MI), for mixup-trained models. MI mixes the input with other random clean samples, which can shrink and transfer the equivalent perturbation if the input is adversarial. Our experiments on CIFAR-10 and CIFAR-100 demonstrate that MI can further improve the adversarial robustness of models trained with mixup and its variants.
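A minimal sketch of the MI idea (generic; the mixing ratio, number of repeats, and the choice of clean pool follow the paper only loosely): mix the test input with random clean samples and average the resulting predictions.

```python
# Mixup inference sketch: mix the input with random clean samples and average
# predictions, shrinking any adversarial perturbation. Hyperparameters are assumptions.
import torch

@torch.no_grad()
def mixup_inference(model, x, clean_pool, lam=0.5, n_repeats=15):
    """x: (B, C, H, W) inputs to classify; clean_pool: (N, C, H, W) clean samples."""
    probs = 0.0
    for _ in range(n_repeats):
        idx = torch.randint(0, clean_pool.shape[0], (x.shape[0],), device=x.device)
        mixed = lam * x + (1.0 - lam) * clean_pool[idx]
        probs = probs + model(mixed).softmax(dim=1)
    return probs / n_repeats
```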
29.Synthetic Data for Deep Learning ⬇️
Synthetic data is an increasingly popular tool for training deep learning models, especially in computer vision but also in other areas. In this work, we attempt to provide a comprehensive survey of the various directions in the development and application of synthetic data. First, we discuss synthetic datasets for basic computer vision problems, both low-level (e.g., optical flow estimation) and high-level (e.g., semantic segmentation), synthetic environments and datasets for outdoor and urban scenes (autonomous driving), indoor scenes (indoor navigation), aerial navigation, simulation environments for robotics, applications of synthetic data outside computer vision (in neural programming, bioinformatics, NLP, and more); we also survey the work on improving synthetic data development and alternative ways to produce it such as GANs. Second, we discuss in detail the synthetic-to-real domain adaptation problem that inevitably arises in applications of synthetic data, including synthetic-to-real refinement with GAN-based models and domain adaptation at the feature/model level without explicit data transformations. Third, we turn to privacy-related applications of synthetic data and review the work on generating synthetic datasets with differential privacy guarantees. We conclude by highlighting the most promising directions for further work in synthetic data studies.
30.Towards continuous learning for glioma segmentation with elastic weight consolidation ⬇️
When finetuning a convolutional neural network (CNN) on data from a new domain, catastrophic forgetting will reduce performance on the original training data. Elastic Weight Consolidation (EWC) is a recent technique to prevent this, which we evaluated while training and re-training a CNN to segment glioma on two different datasets. The network was trained on the public BraTS dataset and finetuned on an in-house dataset with non-enhancing low-grade glioma. EWC was found to decrease catastrophic forgetting in this case, but was also found to restrict adaptation to the new domain.
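For reference, a minimal sketch of the standard EWC penalty (the exact weighting used in this work may differ): a quadratic term, weighted by a diagonal Fisher information estimate from the original BraTS training, keeps parameters close to their earlier values during finetuning.

```python
# Standard EWC penalty sketch: quadratic anchoring to the previous parameters,
# weighted by a diagonal Fisher estimate computed after the first training stage.
import torch

def ewc_penalty(model, old_params, fisher, lam=1.0):
    """old_params / fisher: dicts of tensors keyed by parameter name."""
    loss = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * loss

# total_loss = segmentation_loss + ewc_penalty(model, old_params, fisher)
```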
31.Message Scheduling for Performant, Many-Core Belief Propagation ⬇️
Belief Propagation (BP) is a message-passing algorithm for approximate inference over Probabilistic Graphical Models (PGMs), with many applications in computer vision, error-correcting codes, and protein folding. While general, the convergence and speed of the algorithm have limited its practical use on difficult inference problems. As an algorithm that is highly amenable to parallelization, many-core Graphics Processing Units (GPUs) could significantly improve BP performance. Improving BP through many-core systems is non-trivial: the scheduling of messages in the algorithm strongly affects performance. We present a study of message scheduling for BP on GPUs. We demonstrate that BP exhibits a tradeoff between speed and convergence based on parallelism, and show that existing message schedules are unable to exploit this tradeoff. To this end, we present a novel randomized message scheduling approach, Randomized BP (RnBP), which outperforms existing methods on the GPU.
32.Deep learning vessel segmentation and quantification of the foveal avascular zone using commercial and prototype OCT-A platforms ⬇️
Automatic quantification of perifoveal vessel densities in optical coherence tomography angiography (OCT-A) images faces challenges such as variable intra- and inter-image signal-to-noise ratios, projection artefacts from outer vasculature layers, and motion artefacts. This study demonstrates the utility of deep neural networks for automatic quantification of foveal avascular zone (FAZ) parameters and perifoveal vessel density in OCT-A images of healthy and diabetic eyes. OCT-A images of the foveal region were acquired using three OCT-A systems: a 1060nm Swept Source (SS)-OCT prototype, the RTVue XR Avanti (Optovue Inc., Fremont, CA), and the ZEISS Angioplex (Carl Zeiss Meditec, Dublin, CA). Automated segmentation was then performed using a deep neural network. Four FAZ morphometric parameters (area, min/max diameter, and eccentricity) and perifoveal vessel density were used as outcome measures. The accuracy, sensitivity, and specificity of the DNN vessel segmentations were comparable across all three device platforms. No significant difference between the means of the measurements from automated and manual segmentations was found for any of the outcome measures on any system. The intraclass correlation coefficient (ICC) was also good (> 0.51) for all measurements. Automated deep learning vessel segmentation of OCT-A may be suitable for both commercial and research purposes, enabling better quantification of the retinal circulation.
33.Domain-invariant Learning using Adaptive Filter Decomposition ⬇️
Domain shifts are frequently encountered in real-world scenarios. In this paper, we consider the problem of domain-invariant deep learning by explicitly modeling domain shifts with only a small amount of domain-specific parameters in a Convolutional Neural Network (CNN). By exploiting the observation that a convolutional filter can be well approximated as a linear combination of a small set of basis elements, we show for the first time, both empirically and theoretically, that domain shifts can be effectively handled by decomposing a regular convolutional layer into a domain-specific basis layer and a domain-shared basis coefficient layer, while both remain convolutional. An input channel will now first convolve spatially only with each respective domain-specific basis to "absorb" domain variations, and then output channels are linearly combined using common basis coefficients trained to promote shared semantics across domains. We use toy examples, rigorous analysis, and real-world examples to show the framework's effectiveness in cross-domain performance and domain adaptation. With the proposed architecture, we need only a small set of basis elements to model each additional domain, which brings a negligible amount of additional parameters, typically a few hundred.
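A rough sketch of this decomposition (illustrative only; the paper's basis construction and training details are not reproduced): each input channel is convolved spatially with a small domain-specific set of basis kernels, and the outputs are linearly combined by a 1x1 convolution holding the domain-shared coefficients.

```python
# Sketch of a decomposed convolution: per-domain spatial basis (grouped conv)
# followed by shared 1x1 coefficients. Sizes and basis count are illustrative.
import torch.nn as nn

class DecomposedConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, n_basis=6, n_domains=2):
        super().__init__()
        # one small spatial basis layer per domain
        self.basis = nn.ModuleList([
            nn.Conv2d(in_ch, in_ch * n_basis, kernel_size,
                      padding=kernel_size // 2, groups=in_ch, bias=False)
            for _ in range(n_domains)])
        # basis coefficients shared across all domains
        self.coeff = nn.Conv2d(in_ch * n_basis, out_ch, 1, bias=False)

    def forward(self, x, domain=0):
        return self.coeff(self.basis[domain](x))
```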
34.Sign Language Recognition Analysis using Multimodal Data ⬇️
Voice-controlled personal and home assistants (such as the Amazon Echo and Apple Siri) are becoming increasingly popular for a variety of applications. However, the benefits of these technologies are not readily accessible to Deaf or Hard-of-Hearing (DHH) users. The objective of this study is to develop and evaluate a sign recognition system using multiple modalities that can be used by DHH signers to interact with voice-controlled devices. With the advancement of depth sensors, skeletal data is used for applications like video analysis and activity recognition. Despite its similarity to well-studied human activity recognition, the use of 3D skeleton data in sign language recognition is rare. This is because, unlike activity recognition, sign language depends mostly on hand shape patterns. In this work, we investigate the feasibility of using skeletal and RGB video data for sign language recognition with a combination of different deep learning architectures. We validate our results on a large-scale American Sign Language (ASL) dataset of 12 users and 13,107 samples across 51 signs, named GMUASL51. We collected the dataset over 6 months, and it will be publicly released in the hope of spurring further machine learning research towards improved accessibility for digital assistants.
35.Carving out the low surface brightness universe with NoiseChisel ⬇️
NoiseChisel is a program to detect very low signal-to-noise ratio (S/N) features with minimal assumptions about their morphology. It was introduced in 2015 and released within a collection of data analysis programs and libraries known as GNU Astronomy Utilities (Gnuastro). Over the last ten stable releases of Gnuastro, NoiseChisel has improved significantly: it detects even fainter signal, gives the user better control over its inner workings, and includes many bug fixes. Perhaps the most important change is that NoiseChisel's segmentation features have been moved into a new program called Segment. Another major change is the final growth strategy of its true detections; for example, NoiseChisel is able to detect the outer wings of M51 down to an S/N of 0.25, or 28.27 mag/arcsec^2, on a single-exposure SDSS r-band image. Segment is also able to detect localized HII regions as "clumps" much more successfully. Finally, to orchestrate a controlled analysis, the concept of a "reproducible paper" is discussed: this paper itself is exactly reproducible (snapshot v4-0-g8505cfd).
36.Augmenting the Pathology Lab: An Intelligent Whole Slide Image Classification System for the Real World ⬇️
The standard-of-care diagnostic procedure for suspected skin cancer is microscopic examination of hematoxylin & eosin stained tissue by a pathologist. Areas of high inter-pathologist discordance and rising biopsy rates necessitate higher efficiency and diagnostic reproducibility. We present and validate a deep learning system that classifies digitized dermatopathology slides into 4 categories. The system is developed using 5,070 images from a single lab, and tested on an uncurated set of 13,537 images from 3 test labs, using whole slide scanners manufactured by 3 different vendors. Using deep-learning-based confidence scoring as a criterion for accepting a result, the system yields an accuracy of up to 98%, making it adoptable in a real-world setting. Without confidence scoring, the system achieved an accuracy of 78%. We anticipate that our deep learning system will serve as a foundation enabling faster diagnosis of skin cancer, identification of cases for specialist review, and targeted diagnostic classifications.
37.Accept Synthetic Objects as Real: End-to-End Training of Attentive Deep Visuomotor Policies for Manipulation in Clutter ⬇️
Recent research has demonstrated that it is feasible to train multi-task deep visuomotor policies for robotic manipulation end-to-end using variations of learning from demonstration (LfD) and reinforcement learning (RL). In this paper, we extend the capabilities of end-to-end LfD architectures to object manipulation in clutter. We start by introducing a data augmentation procedure called Accept Synthetic Objects as Real (ASOR). Using ASOR we develop two network architectures: implicit attention ASOR-IA and explicit attention ASOR-EA. Both architectures use the same training data (demonstrations in uncluttered environments) as previous approaches. Experimental results show that ASOR-IA and ASOR-EA succeed in a significant fraction of trials in cluttered environments where previous approaches never succeed. In addition, we find that both ASOR-IA and ASOR-EA outperform previous approaches even in uncluttered environments, with ASOR-EA in clutter performing better than the previous best baseline in an uncluttered environment.