1.Search to Distill: Pearls are Everywhere but not the Eyes ⬇️
Standard Knowledge Distillation (KD) approaches distill the knowledge of a cumbersome teacher model into the parameters of a student model with a pre-defined architecture. However, the knowledge of a neural network, which is represented by the network's output distribution conditioned on its input, depends not only on its parameters but also on its architecture. Hence, a more generalized approach for KD is to distill the teacher's knowledge into both the parameters and architecture of the student. To achieve this, we present a new Architecture-aware Knowledge Distillation (AKD) approach that finds student models (pearls for the teacher) that are best for distilling the given teacher model. In particular, we leverage Neural Architecture Search (NAS), equipped with our KD-guided reward, to search for the best student architectures for a given teacher. Experimental results show our proposed AKD consistently outperforms the conventional NAS plus KD approach, and achieves state-of-the-art results on the ImageNet classification task under various latency settings. Furthermore, the best AKD student architecture for the ImageNet classification task also transfers well to other tasks such as million level face recognition and ensemble learning.
2.Exploring the Origins and Prevalence of Texture Bias in Convolutional Neural Networks ⬇️
Recent work has indicated that, unlike humans, ImageNet-trained CNNs tend to classify images by texture rather than shape. How pervasive is this bias, and where does it come from? We find that, when trained on datasets of images with conflicting shape and texture, the inductive bias of CNNs often favors shape; in general, models learn shape at least as easily as texture. Moreover, although ImageNet training leads to classifier weights that classify ambiguous images according to texture, shape is decodable from the hidden representations of ImageNet networks. Turning to the question of the origin of texture bias, we identify consistent effects of task, architecture, preprocessing, and hyperparameters. Different self-supervised training objectives and different architectures have significant and largely independent effects on the shape bias of the learned representations. Among modern ImageNet architectures, we find that shape bias is positively correlated with ImageNet accuracy. Random-crop data augmentation encourages reliance on texture: Models trained without crops have lower accuracy but higher shape bias. Finally, hyperparameter combinations that yield similar accuracy are associated with vastly different levels of shape bias. Our results suggest general strategies to reduce texture bias in neural networks.
3.EfficientDet: Scalable and Efficient Object Detection ⬇️
Model efficiency has become increasingly important in computer vision. In this paper, we systematically study various neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations, we have developed a new family of object detectors, called EfficientDet, which consistently achieve an order-of-magnitude better efficiency than prior art across a wide spectrum of resource constraints. In particular, without bells and whistles, our EfficientDet-D7 achieves stateof-the-art 51.0 mAP on COCO dataset with 52M parameters and 326B FLOPS1 , being 4x smaller and using 9.3x fewer FLOPS yet still more accurate (+0.3% mAP) than the best previous detector.
4.Fine-grained Synthesis of Unrestricted Adversarial Examples ⬇️
We propose a novel approach for generating unrestricted adversarial examples by manipulating fine-grained aspects of image generation. Unlike existing unrestricted attacks that typically hand-craft geometric transformations, we learn stylistic and stochastic modifications leveraging state-of-the-art generative models. This allows us to manipulate an image in a controlled, fine-grained manner without being bounded by a norm threshold. Our model can be used for both targeted and non-targeted unrestricted attacks. We demonstrate that our attacks can bypass certified defenses, yet our adversarial images look indistinguishable from natural images as verified by human evaluation. Adversarial training can be used as an effective defense without degrading performance of the model on clean images. We perform experiments on LSUN and CelebA-HQ as high resolution datasets to validate efficacy of our proposed approach.
5.Learning Cross-modal Context Graph for Visual Grounding ⬇️
Visual grounding is a ubiquitous building block in many vision-language tasks and yet remains challenging due to large variations in visual and linguistic features of grounding entities, strong context effect and the resulting semantic ambiguities. Prior works typically focus on learning representations of individual phrases with limited context information. To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develop a cross-modal graph matching strategy for the multiple-phrase visual grounding task. In particular, we introduce a modular graph neural network to compute context-aware representations of phrases and object proposals respectively via message propagation, followed by a graph-based matching module to generate globally consistent localization of grounding phrases. We train the entire graph neural network jointly in a two-stage strategy and evaluate it on the Flickr30K Entities benchmark. Extensive experiments show that our method outperforms the prior state of the arts by a sizable margin, evidencing the efficacy of our grounding framework. Code is available at this https URL.
6.Weak Supervision for Generating Pixel-Level Annotations in Scene Text Segmentation ⬇️
Providing pixel-level supervisions for scene text segmentation is inherently difficult and costly, so that only few small datasets are available for this task. To face the scarcity of training data, previous approaches based on Convolutional Neural Networks (CNNs) rely on the use of a synthetic dataset for pre-training. However, synthetic data cannot reproduce the complexity and variability of natural images. In this work, we propose to use a weakly supervised learning approach to reduce the domain-shift between synthetic and real data. Leveraging the bounding-box supervision of the COCO-Text and the MLT datasets, we generate weak pixel-level supervisions of real images. In particular, the COCO-Text-Segmentation (COCO_TS) and the MLT-Segmentation (MLT_S) datasets are created and released. These two datasets are used to train a CNN, the Segmentation Multiscale Attention Network (SMANet), which is specifically designed to face some peculiarities of the scene text segmentation task. The SMANet is trained end-to-end on the proposed datasets, and the experiments show that COCO_TS and MLT_S are a valid alternative to synthetic images, allowing to use only a fraction of the training samples and improving significantly the performances.
7.Experimental Exploration of Compact Convolutional Neural Network Architectures for Non-temporal Real-time Fire Detection ⬇️
In this work we explore different Convolutional Neural Network (CNN) architectures and their variants for non-temporal binary fire detection and localization in video or still imagery. We consider the performance of experimentally defined, reduced complexity deep CNN architectures for this task and evaluate the effects of different optimization and normalization techniques applied to different CNN architectures (spanning the Inception, ResNet and EfficientNet architectural concepts). Contrary to contemporary trends in the field, our work illustrates a maximum overall accuracy of 0.96 for full frame binary fire detection and 0.94 for superpixel localization using an experimentally defined reduced CNN architecture based on the concept of InceptionV4. We notably achieve a lower false positive rate of 0.06 compared to prior work in the field presenting an efficient, robust and real-time solution for fire region detection.
8.Unsupervised Monocular Depth Prediction for Indoor Continuous Video Streams ⬇️
This paper studies unsupervised monocular depth prediction problem. Most of existing unsupervised depth prediction algorithms are developed for outdoor scenarios, while the depth prediction work in the indoor environment is still very scarce to our knowledge. Therefore, this work focuses on narrowing the gap by firstly evaluating existing approaches in the indoor environments and then improving the state-of-the-art design of architecture. Unlike typical outdoor training dataset, such as KITTI with motion constraints, data for indoor environment contains more arbitrary camera movement and short baseline between two consecutive images, which deteriorates the network training for the pose estimation. To address this issue, we propose two methods: Firstly, we propose a novel reconstruction loss function to constraint pose estimation, resulting in accuracy improvement of the predicted disparity map; secondly, we use an ensemble learning with a flipping strategy along with a median filter, directly taking operation on the output disparity map. We evaluate our approaches on the TUM RGB-D and self-collected datasets. The results have shown that both approaches outperform the previous state-of-the-art unsupervised learning approaches.
9.Evaluating the Transferability and Adversarial Discrimination of Convolutional Neural Networks for Threat Object Detection and Classification within X-Ray Security Imagery ⬇️
X-ray imagery security screening is essential to maintaining transport security against a varying profile of threat or prohibited items. Particular interest lies in the automatic detection and classification of weapons such as firearms and knives within complex and cluttered X-ray security imagery. Here, we address this problem by exploring various end-to-end object detection Convolutional Neural Network (CNN) architectures. We evaluate several leading variants spanning the Faster R-CNN, Mask R-CNN, and RetinaNet architectures to explore the transferability of such models between varying X-ray scanners with differing imaging geometries, image resolutions and material colour profiles. Whilst the limited availability of X-ray threat imagery can pose a challenge, we employ a transfer learning approach to evaluate whether such inter-scanner generalisation may exist over a multiple class detection problem. Overall, we achieve maximal detection performance using a Faster R-CNN architecture with a ResNet$_{101}$ classification network, obtaining 0.88 and 0.86 of mean Average Precision (mAP) for a three-class and two class item from varying X-ray imaging sources. Our results exhibit a remarkable degree of generalisability in terms of cross-scanner performance (mAP: 0.87, firearm detection: 0.94 AP). In addition, we examine the inherent adversarial discriminative capability of such networks using a specifically generated adversarial dataset for firearms detection - with a variable low false positive, as low as 5%, this shows both the challenge and promise of such threat detection within X-ray security imagery.
10.MetH: A family of high-resolution and variable-shape image challenges ⬇️
High-resolution and variable-shape images have not yet been properly addressed by the AI community. The approach of down-sampling data often used with convolutional neural networks is sub-optimal for many tasks, and has too many drawbacks to be considered a sustainable alternative. In sight of the increasing importance of problems that can benefit from exploiting high-resolution (HR) and variable-shape, and with the goal of promoting research in that direction, we introduce a new family of datasets (MetH). The four proposed problems include two image classification, one image regression and one super resolution task. Each of these datasets contains thousands of art pieces captured by HR and variable-shape images, labeled by experts at the Metropolitan Museum of Art. We perform an analysis, which shows how the proposed tasks go well beyond current public alternatives in both pixel size and aspect ratio variance. At the same time, the performance obtained by popular architectures on these tasks shows that there is ample room for improvement. To wrap up the relevance of the contribution we review the fields, both in AI and high-performance computing, that could benefit from the proposed challenges.
11.Real-time Scene Text Detection with Differentiable Binarization ⬇️
Recently, segmentation-based methods are quite popular in scene text detection, as the segmentation results can more accurately describe scene text of various shapes such as curve text. However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, which consistently achieves state-of-the-art results, in terms of both detection accuracy and speed. In particular, with a light-weight backbone, the performance improvements by DB are significant so that we can look for an ideal tradeoff between detection accuracy and efficiency. Specifically, with a backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at 62 FPS, on the MSRA-TD500 dataset. Code is available at: this https URL
12.Deep Learning based HEp-2 Image Classification: A Comprehensive Review ⬇️
Classification of HEp-2 cell patterns plays a significant role in the indirect immunofluorescence test for identifying autoimmune diseases in the human body. Many automatic HEp-2 cell classification methods have been proposed in recent years, amongst which deep learning based methods have shown impressive performance. This paper provides a comprehensive review of the existing deep learning based HEp-2 cell image classification methods. These methods perform HEp-2 image classification in two levels, namely, cell-level and specimen-level. Both levels are covered in this review. In each level, the methods are organized with a deep network usage based taxonomy. The core idea, notable achievements, and key advantages and weakness of each method are critically analyzed. Furthermore, a concise review of the existing HEp-2 datasets that are commonly used in the literature is given. The paper ends with an overview of the current state-of-the-arts and a discussion on novel opportunities and future research directions in this field. It is hoped that this paper would give readers a comprehensive reference of this novel, challenging, and thriving field.
13.Shift Convolution Network for Stereo Matching ⬇️
In this paper, we present Shift Convolution Network (ShiftConvNet) to provide matching capability between two feature maps for stereo estimation. The proposed method can speedily produce a highly accurate disparity map from stereo images. A module called shift convolution layer is proposed to replace the traditional correlation layer to perform patch comparisons between two feature maps. By using a novel architecture of convolutional network to learn the matching process, ShiftConvNet can produce better results than DispNet-C[1], also running faster with 5 fps. Moreover, with a proposed auto shift convolution refine part, further improvement is obtained. The proposed approach was evaluated on FlyingThings 3D. It achieves state-of-the-art results on the benchmark dataset. Codes will be made available at github.
14.Improving Semantic Segmentation of Aerial Images Using Patch-based Attention ⬇️
The trade-off between feature representation power and spatial localization accuracy is crucial for the dense classification/semantic segmentation of aerial images. High-level features extracted from the late layers of a neural network are rich in semantic information, yet have blurred spatial details; low-level features extracted from the early layers of a network contain more pixel-level information, but are isolated and noisy. It is therefore difficult to bridge the gap between high and low-level features due to their difference in terms of physical information content and spatial distribution. In this work, we contribute to solve this problem by enhancing the feature representation in two ways. On the one hand, a patch attention module (PAM) is proposed to enhance the embedding of context information based on a patch-wise calculation of local attention. On the other hand, an attention embedding module (AEM) is proposed to enrich the semantic information of low-level features by embedding local focus from high-level features. Both of the proposed modules are light-weight and can be applied to process the extracted features of convolutional neural networks (CNNs). Experiments show that, by integrating the proposed modules into the baseline Fully Convolutional Network (FCN), the resulting local attention network (LANet) greatly improves the performance over the baseline and outperforms other attention based methods on two aerial image datasets.
15.D3S -- A Discriminative Single Shot Segmentation Tracker ⬇️
Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but are restricted to bounding box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker - D3S, which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations, including non-rigid deformations, the other assuming a rigid object to simultaneously achieve high robustness and online target segmentation. Without per-dataset finetuning and trained only for segmentation as the primary output, D3S outperforms all trackers on VOT2016, VOT2018 and GOT-10k benchmarks and performs close to the state-of-the-art trackers on the TrackingNet. D3S outperforms the leading segmentation tracker SiamMask on video object segmentation benchmark and performs on par with top video object segmentation algorithms, while running an order of magnitude faster, close to real-time.
16.Efficient Derivative Computation for Cumulative B-Splines on Lie Groups ⬇️
Continuous-time trajectory representation has recently gained popularity for tasks where the fusion of high-frame-rate sensors and multiple unsynchronized devices is required. Lie group cumulative B-splines are a popular way of representing continuous trajectories without singularities. They have been used in near real-time SLAM and odometry systems with IMU, LiDAR, regular, RGB-D and event cameras, as well as for offline calibration. These applications require efficient computation of time derivatives (velocity, acceleration), but all prior works rely on a computationally suboptimal formulation. In this work we present an alternative derivation of time derivatives based on recurrence relations that needs
$\mathcal{O}(k)$ instead of$\mathcal{O}(k^2)$ matrix operations (for a spline of order$k$ ) and results in simple and elegant expressions. While producing the same result, the proposed approach significantly speeds up the trajectory optimization and allows for computing simple analytic derivatives with respect to spline knots. The results presented in this paper pave the way for incorporating continuous-time trajectory representations into more applications where real-time performance is required.
17.RefineDetLite: A Lightweight One-stage Object Detection Framework for CPU-only Devices ⬇️
Previous state-of-the-art real-time object detectors have been reported on GPUs which are extremely expensive for processing massive data and in resource-restricted scenarios. Therefore, high efficiency object detectors on CPU-only devices are urgently-needed in industry. The floating-point operations (FLOPs) of networks are not strictly proportional to the running speed on CPU devices, which inspires the design of an exactly "fast" and "accurate" object detector. After investigating the concern gaps between classification networks and detection backbones, and following the design principles of efficient networks, we propose a lightweight residual-like backbone with large receptive fields and wide dimensions for low-level features, which are crucial for detection tasks. Correspondingly, we also design a light-head detection part to match the backbone capability. Furthermore, by analyzing the drawbacks of current one-stage detector training strategies, we also propose three orthogonal training strategies---IOU-guided loss, classes-aware weighting method and balanced multi-task training approach. Without bells and whistles, our proposed RefineDetLite achieves 26.8 mAP on the MSCOCO benchmark at a speed of 130 ms/pic on a single-thread CPU. The detection accuracy can be further increased to 29.6 mAP by integrating all the proposed training strategies, without apparent speed drop.
18.The dynamics of the stomatognathic system from 4D multimodal data ⬇️
The purpose of this chapter is to discuss methods of acquisition, visualization and analysis of the dynamics of a complex biomedical system, illustrated by the human stomatognathic system. The stomatognathic system consists of the teeth and the skull bones with the maxilla and the mandible. Its dynamics can be described by the change of mutual position of the lower/mandibular part versus the upper/maxillary one due to the physiological motion of opening, chewing and swallowing. In order to analyse the dynamics of the stomatognathic system its morphology and motion has to be digitized, which is done using static and dynamic multimodal imagery like CBCT and 3D scans data and temporal measurements of motion. The integration of multimodal data incorporates different direct and indirect methods of registration - aligning of all the data in the same coordinate system. The integrated sets of data form 4D multimodal data which can be further visualized, modeled, and subjected to multivariate time series analysis. Example results are shown. Although there is no direct method of imaging the TMJ motion, the integration of multimodal data forms an adequate tool. As medical imaging becomes ever more diverse and ever more accessible, organizing the imagery and measurements into unified, comprehensive records can deliver to the doctor the most information in the most accessible form, creating a new quality in data simulation, analysis and interpretation.
19.Self-supervised Learning of 3D Objects from Natural Images ⬇️
We present a method to learn single-view reconstruction of the 3D shape, pose, and texture of objects from categorized natural images in a self-supervised manner. Since this is a severely ill-posed problem, carefully designing a training method and introducing constraints are essential. To avoid the difficulty of training all elements at the same time, we propose training category-specific base shapes with fixed pose distribution and simple textures first, and subsequently training poses and textures using the obtained shapes. Another difficulty is that shapes and backgrounds sometimes become excessively complicated to mistakenly reconstruct textures on object surfaces. To suppress it, we propose using strong regularization and constraints on object surfaces and background images. With these two techniques, we demonstrate that we can use natural image collections such as CIFAR-10 and PASCAL objects for training, which indicates the possibility to realize 3D object reconstruction on diverse object categories beyond synthetic datasets.
20.You Are Here: Geolocation by Embedding Maps and Images ⬇️
We present a novel approach to geolocating images on a 2-D map based on learning a low dimensional embedded space, which allows a comparison between an image captured at a location and local neighbourhoods of the map. The representation is not sufficiently discriminatory to allow localisation from a single image but when concatenated along a route, localisation converges quickly, with over 90% accuracy being achieved for routes up to 200m in length when using Google Street View and Open Street Map data. The approach generalises a previous fixed semantic feature based approach and achieves faster convergence and higher accuracy without the need for including turn information.
21.Analysis of Deep Networks for Monocular Depth Estimation Through Adversarial Attacks with Proposal of a Defense Method ⬇️
In this paper, we consider adversarial attacks against a system of monocular depth estimation (MDE) based on convolutional neural networks (CNNs). The motivation is two-fold. One is to study the security of MDE systems, which has not been actively considered in the community. The other is to improve our understanding of the computational mechanism of CNNs performing MDE. Toward this end, we apply the method recently proposed for visualization of MDE to defending attacks. It trains another CNN to predict a saliency map from an input image, such that the CNN for MDE continues to accurately estimate the depth map from the image with its non-salient part masked out. We report the following findings. First, unsurprisingly, attacks by IFGSM (or equivalently PGD) succeed in making the CNNs yield inaccurate depth estimates. Second, the attacks can be defended by masking out non-salient pixels, indicating that the attacks function by perturbing mostly non-salient pixels. However, the prediction of saliency maps is itself vulnerable to the attacks, even though it is not the direct target of the attacks. We show that the attacks can be defended by using a saliency map predicted by a CNN trained to be robust to the attacks. These results provide an effective defense method as well as a clue to understanding the computational mechanism of CNNs for MDE.
22.Hierarchical Attention Networks for Medical Image Segmentation ⬇️
The medical image is characterized by the inter-class indistinction, high variability, and noise, where the recognition of pixels is challenging. Unlike previous self-attention based methods that capture context information from one level, we reformulate the self-attention mechanism from the view of the high-order graph and propose a novel method, namely Hierarchical Attention Network (HANet), to address the problem of medical image segmentation. Concretely, an HA module embedded in the HANet captures context information from neighbors of multiple levels, where these neighbors are extracted from the high-order graph. In the high-order graph, there will be an edge between two nodes only if the correlation between them is high enough, which naturally reduces the noisy attention information caused by the inter-class indistinction. The proposed HA module is robust to the variance of input and can be flexibly inserted into the existing convolution neural networks. We conduct experiments on three medical image segmentation tasks including optic disc/cup segmentation, blood vessel segmentation, and lung segmentation. Extensive results show our method is more effective and robust than the existing state-of-the-art methods.
23.Learning mappings onto regularized latent spaces for biometric authentication ⬇️
We propose a novel architecture for generic biometric authentication based on deep neural networks: RegNet. Differently from other methods, RegNet learns a mapping of the input biometric traits onto a target distribution in a well-behaved space in which users can be separated by means of simple and tunable boundaries. More specifically, authorized and unauthorized users are mapped onto two different and well behaved Gaussian distributions. The novel approach of learning the mapping instead of the boundaries further avoids the problem encountered in typical classifiers for which the learnt boundaries may be complex and difficult to analyze. RegNet achieves high performance in terms of security metrics such as Equal Error Rate (EER), False Acceptance Rate (FAR) and Genuine Acceptance Rate (GAR). The experiments we conducted on publicly available datasets of face and fingerprint confirm the effectiveness of the proposed system.
24.Vision: A Deep Learning Approach to provide walking assistance to the visually impaired ⬇️
Blind people face a lot of problems in their daily routines. They have to struggle a lot just to do their day-to-day chores. In this paper, we have proposed a system with the objective to help the visually impaired by providing audio aid guiding them to avoid obstacles, which will assist them to move in their surroundings. Object Detection using YOLO will help them detect the nearby objects and Depth Estimation using monocular vision will tell the approximate distance of the detected objects from the user. Despite a higher accuracy, stereo vision has many hardware constraints, which makes monocular vision the preferred choice for this application.
25.Event-based Object Detection and Tracking for Space Situational Awareness ⬇️
In this work, we present optical space imaging using an unconventional yet promising class of imaging devices known as neuromorphic event-based sensors. These devices, which are modeled on the human retina, do not operate with frames, but rather generate asynchronous streams of events in response to changes in log-illumination at each pixel. These devices are therefore extremely fast, do not have fixed exposure times, allow for imaging whilst the device is moving and enable low power space imaging during daytime as well as night without modification of the sensors. Recorded at multiple remote sites, we present the first event-based space imaging dataset including recordings from multiple event-based sensors from multiple providers, greatly lowering the barrier to entry for other researchers given the scarcity of such sensors and the expertise required to operate them. The dataset contains 236 separate recordings and 572 labeled resident space objects. The event-based imaging paradigm presents unique opportunities and challenges motivating the development of specialized event-based algorithms that can perform tasks such as detection and tracking in an event-based manner. Here we examine a range of such event-based algorithms for detection and tracking. The presented methods are designed specifically for space situational awareness applications and are evaluated in terms of accuracy and speed and suitability for implementation in neuromorphic hardware on remote or space-based imaging platforms.
26.Fast and Flexible Image Blind Denoising via Competition of Experts ⬇️
Fast and flexible processing are two essential requirements for a number of practical applications of image denoising. Current state-of-the-art methods, however, still require either high computational cost or limited scopes of the target. We introduce an efficient ensemble network trained via a competition of expert networks, as an application for image blind denoising. We realize automatic division of unlabeled noisy datasets into clusters respectively optimized to enhance denoising performance. The architecture is scalable, can be extended to deal with diverse noise sources/levels without increasing the computation time. Taking advantage of this method, we save up to approximately 90% of computational cost without sacrifice of the denoising performance compared to single network models with identical architectures. We also compare the proposed method with several existing algorithms and observe significant outperformance over prior arts in terms of computational efficiency.
27.Towards Ghost-free Shadow Removal via Dual Hierarchical Aggregation Network and Shadow Matting GAN ⬇️
Shadow removal is an essential task for scene understanding. Many studies consider only matching the image contents, which often causes two types of ghosts: color in-consistencies in shadow regions or artifacts on shadow boundaries. In this paper, we tackle these issues in two ways. First, to carefully learn the border artifacts-free image, we propose a novel network structure named the dual hierarchically aggregation network~(DHAN). It contains a series of growth dilated convolutions as the backbone without any down-samplings, and we hierarchically aggregate multi-context features for attention and prediction, respectively. Second, we argue that training on a limited dataset restricts the textural understanding of the network, which leads to the shadow region color in-consistencies. Currently, the largest dataset contains 2k+ shadow/shadow-free image pairs. However, it has only 0.1k+ unique scenes since many samples share exactly the same background with different shadow positions. Thus, we design a shadow matting generative adversarial network~(SMGAN) to synthesize realistic shadow mattings from a given shadow mask and shadow-free image. With the help of novel masks or scenes, we enhance the current datasets using synthesized shadow images. Experiments show that our DHAN can erase the shadows and produce high-quality ghost-free images. After training on the synthesized and real datasets, our network outperforms other state-of-the-art methods by a large margin. The code is available: this http URL
28.DermGAN: Synthetic Generation of Clinical Skin Images with Pathology ⬇️
Despite the recent success in applying supervised deep learning to medical imaging tasks, the problem of obtaining large and diverse expert-annotated datasets required for the development of high performant models remains particularly challenging. In this work, we explore the possibility of using Generative Adverserial Networks (GAN) to synthesize clinical images with skin condition. We propose DermGAN, an adaptation of the popular Pix2Pix architecture, to create synthetic images for a pre-specified skin condition while being able to vary its size, location and the underlying skin color. We demonstrate that the generated images are of high fidelity using objective GAN evaluation metrics. In a Human Turing test, we note that the synthetic images are not only visually similar to real images, but also embody the respective skin condition in dermatologists' eyes. Finally, when using the synthetic images as a data augmentation technique for training a skin condition classifier, we observe that the model performs comparably to the baseline model overall while improving on rare but malignant conditions.
29.Instance-Invariant Adaptive Object Detection via Progressive Disentanglement ⬇️
Most state-of-the-art methods of object detection suffer from poor generalization ability when the training and test data are from different domains, e.g., with different styles. To address this problem, previous methods mainly use holistic representations to align feature-level and pixel-level distributions of different domains, which may neglect the instance-level characteristics of objects in images. Besides, when transferring detection ability across different domains, it is important to obtain the instance-level features that are domain-invariant, instead of the styles that are domain-specific. Therefore, in order to extract instance-invariant features, we should disentangle the domain-invariant features from the domain-specific features. To this end, a progressive disentangled framework is first proposed to solve domain adaptive object detection. Particularly, base on disentangled learning used for feature decomposition, we devise two disentangled layers to decompose domain-invariant and domain-specific features. And the instance-invariant features are extracted based on the domain-invariant features. Finally, to enhance the disentanglement, a three-stage training mechanism including multiple loss functions is devised to optimize our model. In the experiment, we verify the effectiveness of our method on three domain-shift scenes. Our method is separately 2.3%, 3.6%, and 4.0% higher than the baseline method \cite{saito2019strong}.
30.Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping ⬇️
We present an autoencoder-based semi-supervised approach to classify perceived human emotions from walking styles obtained from videos or from motion-captured data and represented as sequences of 3D poses. Given the motion on each joint in the pose at each time step extracted from 3D pose sequences, we hierarchically pool these joint motions in a bottom-up manner in the encoder, following the kinematic chains in the human body. We also constrain the latent embeddings of the encoder to contain the space of psychologically-motivated affective features underlying the gaits. We train the decoder to reconstruct the motions per joint per time step in a top-down manner from the latent embeddings. For the annotated data, we also train a classifier to map the latent embeddings to emotion labels. Our semi-supervised approach achieves a mean average precision of 0.84 on the Emotion-Gait benchmark dataset, which contains gaits collected from multiple sources. We outperform current state-of-art algorithms for both emotion recognition and action recognition from 3D gaits by 7% -- 23% on the absolute.
31.DRNet: Dissect and Reconstruct the Convolutional Neural Network via Interpretable Manners ⬇️
This paper proposes to use an interpretable method to dissect the channels of a large-scale convolutional neural networks (CNNs) into class-wise parts, and reconstruct a CNN using some of these parts. The dissection and reconstruction process can be done in very short time on state-of-the-art networks such as VGG and MobileNetV2. This method allows users to run parts of a CNN according to specific application scenarios, instead of running the whole network or retraining a new one for every task. Experiments on Cifar and ILSVRC 2012 show that the reconstructed CNN runs more efficiently than the original one and achieve a better accuracy. Interpretability analyses show that our method is a new way of applying CNNs on tasks with given knowledge.
32.SSAH: Semi-supervised Adversarial Deep Hashing with Self-paced Hard Sample Generation ⬇️
Deep hashing methods have been proved to be effective and efficient for large-scale Web media search. The success of these data-driven methods largely depends on collecting sufficient labeled data, which is usually a crucial limitation in practical cases. The current solutions to this issue utilize Generative Adversarial Network (GAN) to augment data in semi-supervised learning. However, existing GAN-based methods treat image generations and hashing learning as two isolated processes, leading to generation ineffectiveness. Besides, most works fail to exploit the semantic information in unlabeled data. In this paper, we propose a novel Semi-supervised Self-pace Adversarial Hashing method, named SSAH to solve the above problems in a unified framework. The SSAH method consists of an adversarial network (A-Net) and a hashing network (H-Net). To improve the quality of generative images, first, the A-Net learns hard samples with multi-scale occlusions and multi-angle rotated deformations which compete against the learning of accurate hashing codes. Second, we design a novel self-paced hard generation policy to gradually increase the hashing difficulty of generated samples. To make use of the semantic information in unlabeled ones, we propose a semi-supervised consistent loss. The experimental results show that our method can significantly improve state-of-the-art models on both the widely-used hashing datasets and fine-grained datasets.
33.Discriminative Local Sparse Representation by Robust Adaptive Dictionary Pair Learning ⬇️
In this paper, we propose a structured Robust Adaptive Dic-tionary Pair Learning (RA-DPL) framework for the discrim-inative sparse representation learning. To achieve powerful representation ability of the available samples, the setting of RA-DPL seamlessly integrates the robust projective dictionary pair learning, locality-adaptive sparse representations and discriminative coding coefficients learning into a unified learning framework. Specifically, RA-DPL improves existing projective dictionary pair learning in four perspectives. First, it applies a sparse l2,1-norm based metric to encode the recon-struction error to deliver the robust projective dictionary pairs, and the l2,1-norm has the potential to minimize the error. Sec-ond, it imposes the robust l2,1-norm clearly on the analysis dictionary to ensure the sparse property of the coding coeffi-cients rather than using the costly l0/l1-norm. As such, the robustness of the data representation and the efficiency of the learning process are jointly considered to guarantee the effi-cacy of our RA-DPL. Third, RA-DPL conceives a structured reconstruction weight learning paradigm to preserve the local structures of the coding coefficients within each class clearly in an adaptive manner, which encourages to produce the locality preserving representations. Fourth, it also considers improving the discriminating ability of coding coefficients and dictionary by incorporating a discriminating function, which can ensure high intra-class compactness and inter-class separation in the code space. Extensive experiments show that our RA-DPL can obtain superior performance over other state-of-the-arts.
34.MMTM: Multimodal Transfer Module for CNN Fusion ⬇️
In late fusion, each modality is processed in a separate unimodal Convolutional Neural Network (CNN) stream and the scores of each modality are fused at the end. Due to its simplicity late fusion is still the predominant approach in many state-of-the-art multimodal applications. In this paper, we present a simple neural network module for leveraging the knowledge from multiple modalities in convolutional neural networks. The propose unit, named Multimodal Transfer Module (MMTM), can be added at different levels of the feature hierarchy, enabling slow modality fusion. Using squeeze and excitation operations, MMTM utilizes the knowledge of multiple modalities to recalibrate the channel-wise features in each CNN stream. Despite other intermediate fusion methods, the proposed module could be used for feature modality fusion in convolution layers with different spatial dimensions. Another advantage of the proposed method is that it could be added among unimodal branches with minimum changes in the their network architectures, allowing each branch to be initialized with existing pretrained weights. Experimental results show that our framework improves the recognition accuracy of well-known multimodal networks. We demonstrate state-of-the-art or competitive performance on four datasets that span the task domains of dynamic hand gesture recognition, speech enhancement, and action recognition with RGB and body joints.
35.Unified Multifaceted Feature Learning for Person Re-Identification ⬇️
Person re-identification (ReID) aims at re-identifying persons from different viewpoints across multiple cameras, of which it is of great importance to learn multifaceted features expressed in different parts of a person, e.g., clothes, bags, and other accessories in the main body, appearance in the head, and shoes in the foot. To learn such features, existing methods are focused on the striping-based approach that builds multi-branch neural networks to learn local features in each part of the identities, with one-branch network dedicated to one part. This results in complex models with a large number of parameters. To address this issue, this paper proposes to learn the multifaceted features in a simple unified single-branch neural network. The Unified Multifaceted Feature Learning (UMFL) framework is introduced to fulfill this goal, which consists of two key collaborative modules: compound batch image erasing (including batch constant erasing and random erasing) and hierarchical structured loss. The loss structures the augmented images resulted by the two types of image erasing in a two-level hierarchy and enforces multifaceted attention to different parts. As we show in the extensive experimental results on four benchmark person ReID datasets, despite the use of significantly simplified network structure, our method performs substantially better than state-of-the-art competing methods. Our method can also effectively generalize to vehicle ReID, achieving similar improvement on two vehicle ReID datasets.
36.CUP: Cluster Pruning for Compressing Deep Neural Networks ⬇️
We propose Cluster Pruning (CUP) for compressing and accelerating deep neural networks. Our approach prunes similar filters by clustering them based on features derived from both the incoming and outgoing weight connections. With CUP, we overcome two limitations of prior work-(1) non-uniform pruning: CUP can efficiently determine the ideal number of filters to prune in each layer of a neural network. This is in contrast to prior methods that either prune all layers uniformly or otherwise use resource-intensive methods such as manual sensitivity analysis or reinforcement learning to determine the ideal number. (2) Single-shot operation: We extend CUP to CUP-SS (for CUP single shot) whereby pruning is integrated into the initial training phase itself. This leads to large savings in training time compared to traditional pruning pipelines. Through extensive evaluation on multiple datasets (MNIST, CIFAR-10, and Imagenet) and models(VGG-16, Resnets-18/34/56) we show that CUP outperforms recent state of the art. Specifically, CUP-SS achieves 2.2x flops reduction for a Resnet-50 model trained on Imagenet while staying within 0.9% top-5 accuracy. It saves over 14 hours in training time with respect to the original Resnet-50. The code to reproduce results is available.
37.Open Cross-Domain Visual Search ⬇️
This paper introduces open cross-domain visual search, where categories in any target domain are retrieved based on queries from any source domain. Current works usually tackle cross-domain visual search as a domain adaptation problem. This limits the search to a closed setting, with one fixed source domain and one fixed target domain. To make the step towards an open setting where multiple visual domains are available, we introduce a simple yet effective approach. We formulate the search as one of mapping examples from every visual domain to a common semantic space, where categories are represented by hyperspherical prototypes. Cross-domain search is then performed by searching in the common space, regardless of which domains are used as source or target. Having separate mappings for every domain allows us to search in an open setting, and to incrementally add new domains over time without retraining existing mapping functions. Experimentally, we show our capability to perform open cross-domain visual search. Our approach is competitive with respect to traditional closed settings, where we obtain state-of-the-art results on six benchmarks for three sketch-based search tasks.
38.Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA ⬇️
In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations as obtained through class activation mappings (specifically Grad-CAM) that are meant to explain the performance of various networks could form a means of supervision. However, as the distributions of attention maps and that of Grad-CAMs differ, it would not be suitable to directly use these as a form of supervision. Rather, we propose the use of a discriminator that aims to distinguish samples of visual explanation and attention maps. The use of adversarial training of the attention regions as a two-player game between attention and explanation serves to bring the distributions of attention maps and visual explanations closer. Significantly, we observe that providing such a means of supervision also results in attention maps that are more closely related to human attention resulting in a substantial improvement over baseline stacked attention network (SAN) models. It also results in a good improvement in rank correlation metric on the VQA task. This method can also be combined with recent MCB based methods and results in consistent improvement. We also provide comparisons with other means for learning distributions such as based on Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD) and Mean Square Error (MSE) losses and observe that the adversarial loss outperforms the other forms of learning the attention maps. Visualization of the results also confirms our hypothesis that attention maps improve using this form of supervision.
39.Attention Guided Anomaly Detection and Localization in Images ⬇️
Anomaly detection and localization is a popular computer vision problem involving detecting anomalous images and localizing anomalies within them. However, this task is challenging due to the small sample size and pixel coverage of the anomaly in real-world scenarios. Prior works need to use anomalous training images to compute a threshold to detect and localize anomalies. To remove this need, we propose Convolutional Adversarial Variational autoencoder with Guided Attention (CAVGA), which localizes the anomaly with a convolutional latent variable to preserve the spatial information. In the unsupervised setting, we propose an attention expansion loss, where we encourage CAVGA to focus on all normal regions in the image without using any anomalous training image. Furthermore, using only 2% anomalous images in the weakly supervised setting we propose a complementary guided attention loss, where we encourage the normal attention to focus on all normal regions while minimizing the regions covered by the anomalous attention in the normal image. CAVGA outperforms the state-of-the-art (SOTA) anomaly detection methods on the MNIST, CIFAR-10, Fashion-MNIST, MVTec Anomaly Detection (MVTAD), and modified ShanghaiTech Campus (mSTC) datasets. CAVGA also outperforms the SOTA anomaly localization methods on the MVTAD and mSTC datasets.
40.Hybrid Composition with IdleBlock: More Efficient Networks for Image Recognition ⬇️
We propose a new building block, IdleBlock, which naturally prunes connections within the block. To fully utilize the IdleBlock we break the tradition of monotonic design in state-of-the-art networks, and introducing hybrid composition with IdleBlock. We study hybrid composition on MobileNet v3 and EfficientNet-B0, two of the most efficient networks. Without any neural architecture search, the deeper "MobileNet v3" with hybrid composition design surpasses possibly all state-of-the-art image recognition network designed by human experts or neural architecture search algorithms. Similarly, the hybridized EfficientNet-B0 networks are more efficient than previous state-of-the-art networks with similar computation budgets. These results suggest a new simpler and more efficient direction for network design and neural architecture search.
41.CoopNet: Cooperative Convolutional Neural Network for Low-Power MCUs ⬇️
Fixed-point quantization and binarization are two reduction methods adopted to deploy Convolutional Neural Networks (CNN) on end-nodes powered by low-power micro-controller units (MCUs). While most of the existing works use them as stand-alone optimizations, this work aims at demonstrating there is margin for a joint cooperation that leads to inferential engines with lower latency and higher accuracy. Called
$CoopNet$ , the proposed heterogeneous model is conceived, implemented and tested on off-the-shelf MCUs with small on-chip memory and few computational resources. Experimental results conducted on three different CNNs using as test-bench the low-power RISC core of the Cortex-M family by ARM validate the CoopNet proposal by showing substantial improvements w.r.t. designs where quantization and binarization are applied separately.
42.Learning Stylized Character Expressions from Humans ⬇️
We present DeepExpr, a novel expression transfer system from humans to multiple stylized characters via deep learning. We developed : 1) a data-driven perceptual model of facial expressions, 2) a novel stylized character data set with cardinal expression annotations : FERG (Facial Expression Research Group) - DB (added two new characters), and 3) . We evaluated our method on a set of retrieval tasks on our collected stylized character dataset of expressions. We have also shown that the ranking order predicted by the proposed features is highly correlated with the ranking order provided by a facial expression expert and Mechanical Turk (MT) experiments.
43.Mini Lesions Detection on Diabetic Retinopathy Images via Large Scale CNN Features ⬇️
Diabetic retinopathy (DR) is a diabetes complication that affects eyes. DR is a primary cause of blindness in working-age people and it is estimated that 3 to 4 million people with diabetes are blinded by DR every year worldwide. Early diagnosis have been considered an effective way to mitigate such problem. The ultimate goal of our research is to develop novel machine learning techniques to analyze the DR images generated by the fundus camera for automatically DR diagnosis. In this paper, we focus on identifying small lesions on DR fundus images. The results from our analysis, which include the lesion category and their exact locations in the image, can be used to facilitate the determination of DR severity (indicated by DR stages). Different from traditional object detection for natural images, lesion detection for fundus images have unique challenges. Specifically, the size of a lesion instance is usually very small, compared with the original resolution of the fundus images, making them diffcult to be detected. We analyze the lesion-vs-image scale carefully and propose a large-size feature pyramid network (LFPN) to preserve more image details for mini lesion instance detection. Our method includes an effective region proposal strategy to increase the sensitivity. The experimental results show that our proposed method is superior to the original feature pyramid network (FPN) method and Faster RCNN.
44.Localizing Occluders with Compositional Convolutional Networks ⬇️
Compositional convolutional networks are generative compositional models of neural network features, that achieve state of the art results when classifying partially occluded objects, even when they have not been exposed to occluded objects during training. In this work, we study the performance of CompositionalNets at localizing occluders in images. We show that the original model is not able to localize occluders well. We propose to overcome this limitation by modeling the feature activations as a mixture of von-Mises-Fisher distributions, which also allows for an end-to-end training of CompositionalNets. Our experimental results demonstrate that the proposed extensions increase the model's performance at localizing occluders as well as at classifying partially occluded objects.
45.Accurate Trajectory Prediction for Autonomous Vehicles ⬇️
Predicting vehicle trajectories, angle and speed is important for safe and comfortable driving. We demonstrate the best predicted angle, speed, and best performance overall winning the top three places of the ICCV 2019 Learning to Drive challenge. Our key contributions are (i) a general neural network system architecture which embeds and fuses together multiple inputs by encoding, and decodes multiple outputs using neural networks, (ii) using pre-trained neural networks for augmenting the given input data with segmentation maps and semantic information, and (iii) leveraging the form and distribution of the expected output in the model.
46.Joint Super-Resolution and Alignment of Tiny Faces ⬇️
Super-resolution (SR) and landmark localization of tiny faces are highly correlated tasks. On the one hand, landmark localization could obtain higher accuracy with faces of high-resolution (HR). On the other hand, face SR would benefit from prior knowledge of facial attributes such as landmarks. Thus, we propose a joint alignment and SR network to simultaneously detect facial landmarks and super-resolve tiny faces. More specifically, a shared deep encoder is applied to extract features for both tasks by leveraging complementary information. To exploit the representative power of the hierarchical encoder, intermediate layers of a shared feature extraction module are fused to form efficient feature representations. The fused features are then fed to task-specific modules to detect landmarks and super-resolve face images in parallel. Extensive experiments demonstrate that the proposed model significantly outperforms the state-of-the-art in both landmark localization and SR of faces. We show a large improvement for landmark localization of tiny faces (i.e., 1616). Furthermore, the proposed framework yields comparable results for landmark localization on low-resolution (LR) faces (i.e., 6464) to existing methods on HR (i.e., 256*256). As for SR, the proposed method recovers sharper edges and more details from LR face images than other state-of-the-art methods, which we demonstrate qualitatively and quantitatively.
47.Enhancing Generic Segmentation with Learned Region Representations ⬇️
Current successful approaches for generic (non-semantic) segmentation rely mostly on edge detection and have leveraged the strengths of deep learning mainly by improving the edge detection stage in the algorithmic pipeline. This is in contrast to semantic and instance segmentation, where deep learning has made a dramatic affect and DNNs are applied directly to generate pixel-wise segment representations. We propose a new method for learning a pixelwise representation that reflects segment relatedness. This representation is combined with an edge map to yield a new segmentation algorithm. We show that the representations themselves achieve state-of-the-art segment similarity scores. Moreover, the proposed, combined segmentation algorithm provides results that are either the state of the art or improve it, for most quality measures.
48.Cross-Class Relevance Learning for Temporal Concept Localization ⬇️
We present a novel Cross-Class Relevance Learning approach for the task of temporal concept localization. Most localization architectures rely on feature extraction layers followed by a classification layer which outputs class probabilities for each segment. However, in many real-world applications classes can exhibit complex relationships that are difficult to model with this architecture. In contrast, we propose to incorporate target class and class-related features as input, and learn a pairwise binary model to predict general segment to class relevance. This facilitates learning of shared information between classes, and allows for arbitrary class-specific feature engineering. We apply this approach to the 3rd YouTube-8M Video Understanding Challenge together with other leading models, and achieve first place out of over 280 teams. In this paper we describe our approach and show some empirical results.
49.Deep Motion Blur Removal Using Noisy/Blurry Image Pairs ⬇️
Removing spatially variant motion blur from a blurry image is a challenging problem as blur sources are complicated and difficult to model accurately. Recent progress in deep neural networks suggests that kernel free single image deblurring can be efficiently performed, but questions about deblurring performance persist. Thus, we propose to restore a sharp image by fusing a pair of noisy/blurry images captured in a burst. Two neural network structures, DeblurRNN and DeblurMerger, are presented to exploit the pair of images in a sequential manner or parallel manner. To boost the training, gradient loss, adversarial loss and spectral normalization are leveraged. The training dataset that consists of pairs of noisy/blurry images and the corresponding ground truth sharp image is synthesized based on the benchmark dataset GOPRO. We evaluated the trained networks on a variety of synthetic datasets and real image pairs. The results demonstrate that the proposed approach outperforms the state-of-the-art both qualitatively and quantitatively.
50.Action Recognition Using Volumetric Motion Representations ⬇️
Traditional action recognition models are constructed around the paradigm of 2D perspective imagery. Though sophisticated time-series models have pushed the field forward, much of the information is still not exploited by confining the domain to 2D. In this work, we introduce a novel representation of motion as a voxelized 3D vector field and demonstrate how it can be used to improve performance of action recognition networks. This volumetric representation is a natural fit for 3D CNNs, and allows out-of-plane data augmentation techniques during training of these networks. Both the construction of this representation from RGB-D video and inference can be run in real time. We demonstrate superior results using this representation with our network design on the open-source NTU RGB+D dataset where it outperforms state-of-the-art on both of the defined evaluation metrics. Furthermore, we experimentally show how the out-of-plane augmentation techniques create viewpoint invariance and allow the model trained using this representation to generalize to unseen camera angles. Code is available here: this https URL.
51.Modal-aware Features for Multimodal Hashing ⬇️
Many retrieval applications can benefit from multiple modalities, e.g., text that contains images on Wikipedia, for which how to represent multimodal data is the critical component. Most deep multimodal learning methods typically involve two steps to construct the joint representations: 1) learning of multiple intermediate features, with each intermediate feature corresponding to a modality, using separate and independent deep models; 2) merging the intermediate features into a joint representation using a fusion strategy. However, in the first step, these intermediate features do not have previous knowledge of each other and cannot fully exploit the information contained in the other modalities. In this paper, we present a modal-aware operation as a generic building block to capture the non-linear dependences among the heterogeneous intermediate features that can learn the underlying correlation structures in other multimodal data as soon as possible. The modal-aware operation consists of a kernel network and an attention network. The kernel network is utilized to learn the non-linear relationships with other modalities. Then, to learn better representations for binary hash codes, we present an attention network that finds the informative regions of these modal-aware features that are favorable for retrieval. Experiments conducted on three public benchmark datasets demonstrate significant improvements in the performance of our method relative to state-of-the-art methods.
52.Sibling Neural Estimators: Improving Iterative Image Decoding with Gradient Communication ⬇️
For lossy image compression, we develop a neural-based system which learns a nonlinear estimator for decoding from quantized representations. The system links two recurrent networks that \help" each other reconstruct same target image patches using complementary portions of spatial context that communicate via gradient signals. This dual agent system builds upon prior work that proposed the iterative refinement algorithm for recurrent neural network (RNN)based decoding which improved image reconstruction compared to standard decoding techniques. Our approach, which works with any encoder, neural or non-neural, This system progressively reduces image patch reconstruction error over a fixed number of steps. Experiment with variants of RNN memory cells, with and without future information, find that our model consistently creates lower distortion images of higher perceptual quality compared to other approaches. Specifically, on the Kodak Lossless True Color Image Suite, we observe as much as a 1:64 decibel (dB) gain over JPEG, a 1:46 dB gain over JPEG 2000, a 1:34 dB gain over the GOOG neural baseline, 0:36 over E2E (a modern competitive neural compression model), and 0:37 over a single iterative neural decoder.
53.Utility Analysis of Network Architectures for 3D Point Cloud Processing ⬇️
In this paper, we diagnose deep neural networks for 3D point cloud processing to explore utilities of different network architectures. We propose a number of hypotheses on the effects of specific network architectures on the representation capacity of DNNs. In order to prove the hypotheses, we design five metrics to diagnose various types of DNNs from the following perspectives, information discarding, information concentration, rotation robustness, adversarial robustness, and neighborhood inconsistency. We conduct comparative studies based on such metrics to verify the hypotheses. We further use the verified hypotheses to revise architectures of existing DNNs to improve their utilities. Experiments demonstrate the effectiveness of our method.
54.Heterogeneous Graph-based Knowledge Transfer for Generalized Zero-shot Learning ⬇️
Generalized zero-shot learning (GZSL) tackles the problem of learning to classify instances involving both seen classes and unseen ones. The key issue is how to effectively transfer the model learned from seen classes to unseen classes. Existing works in GZSL usually assume that some prior information about unseen classes are available. However, such an assumption is unrealistic when new unseen classes appear dynamically. To this end, we propose a novel heterogeneous graph-based knowledge transfer method (HGKT) for GZSL, agnostic to unseen classes and instances, by leveraging graph neural network. Specifically, a structured heterogeneous graph is constructed with high-level representative nodes for seen classes, which are chosen through Wasserstein barycenter in order to simultaneously capture inter-class and intra-class relationship. The aggregation and embedding functions can be learned through graph neural network, which can be used to compute the embeddings of unseen classes by transferring the knowledge from their neighbors. Extensive experiments on public benchmark datasets show that our method achieves state-of-the-art results.
55.3D-Rotation-Equivariant Quaternion Neural Networks ⬇️
This paper proposes a set of rules to revise various neural networks for 3D point cloud processing to rotation-equivariant quaternion neural networks (REQNNs). We find that when a neural network uses quaternion features under certain conditions, the network feature naturally has the rotation-equivariance property. Rotation equivariance means that applying a specific rotation transformation to the input point cloud is equivalent to applying the same rotation transformation to all intermediate-layer quaternion features. Besides, the REQNN also ensures that the intermediate-layer features are invariant to the permutation of input points. Compared with the original neural network, the REQNN exhibits higher rotation robustness.
56.Exploiting Spatial Invariance for Scalable Unsupervised Object Tracking ⬇️
The ability to detect and track objects in the visual world is a crucial skill for any intelligent agent, as it is a necessary precursor to any object-level reasoning process. Moreover, it is important that agents learn to track objects without supervision (i.e. without access to annotated training videos) since this will allow agents to begin operating in new environments with minimal human assistance. The task of learning to discover and track objects in videos, which we call \textit{unsupervised object tracking}, has grown in prominence in recent years; however, most architectures that address it still struggle to deal with large scenes containing many objects. In the current work, we propose an architecture that scales well to the large-scene, many-object setting by employing spatially invariant computations (convolutions and spatial attention) and representations (a spatially local object specification scheme). In a series of experiments, we demonstrate a number of attractive features of our architecture; most notably, that it outperforms competing methods at tracking objects in cluttered scenes with many objects, and that it can generalize well to videos that are larger and/or contain more objects than videos encountered during training.
57.Towards a Unified Evaluation of Explanation Methods without Ground Truth ⬇️
This paper proposes a set of criteria to evaluate the objectiveness of explanation methods of neural networks, which is crucial for the development of explainable AI, but it also presents significant challenges. The core challenge is that people usually cannot obtain ground-truth explanations of the neural network. To this end, we design four metrics to evaluate explanation results without ground-truth explanations. Our metrics can be broadly applied to nine benchmark methods of interpreting neural networks, which provides new insights of explanation methods.
58.Segmentation of Defective Skulls from CT Data for Tissue Modelling ⬇️
In this work we present a method of automatic segmentation of defective skulls for custom cranial implant design and 3D printing purposes. Since such tissue models are usually required in patient cases with complex anatomical defects and variety of external objects present in the acquired data, most deep learning-based approaches fall short because it is not possible to create a sufficient training dataset that would encompass the spectrum of all possible structures. Because CNN segmentation experiments in this application domain have been so far limited to simple patch-based CNN architectures, we first show how the usage of the encoder-decoder architecture can substantially improve the segmentation accuracy. Then, we show how the number of segmentation artifacts, which usually require manual corrections, can be further reduced by adding a boundary term to CNN training and by globally optimizing the segmentation with graph-cut. Finally, we show that using the proposed method, 3D segmentation accurate enough for clinical application can be achieved with 2D CNN architectures as well as their 3D counterparts.
59.Inspect Transfer Learning Architecture with Dilated Convolution ⬇️
There are many award-winning pre-trained Convolutional Neural Network (CNN), which have a common phenomenon of increasing depth in convolutional layers. However, I inspect on VGG network, which is one of the famous model submitted to ILSVRC-2014, to show that slight modification in the basic architecture can enhance the accuracy result of the image classification task. In this paper, We present two improve architectures of pre-trained VGG-16 and VGG-19 networks that apply transfer learning when trained on a different dataset. I report a series of experimental result on various modification of the primary VGG networks and achieved significant out-performance on image classification task by: (1) freezing the first two blocks of the convolutional layers to prevent over-fitting and (2) applying different combination of dilation rate in the last three blocks of convolutional layer to reduce image resolution for feature extraction. Both the proposed architecture achieves a competitive result on CIFAR-10 and CIFAR-100 dataset.
60.Yottixel -- An Image Search Engine for Large Archives of Histopathology Whole Slide Images ⬇️
With the emergence of digital pathology, searching for similar images in large archives has gained considerable attention. Image retrieval can provide pathologists with unprecedented access to the evidence embodied in already diagnosed and treated cases from the past. This paper proposes a search engine specialized for digital pathology, called Yottixel, a portmanteau for "one yotta pixel," alluding to the big-data nature of histopathology images. The most impressive characteristic of Yottixel is its ability to represent whole slide images (WSIs) in a compact manner. Yottixel can perform millions of searches in real-time with a high search accuracy and low storage profile. Yottixel uses an intelligent indexing algorithm capable of representing WSIs with a mosaic of patches by converting them into a small number of methodically extracted barcodes, called "Bunch of Barcodes" (BoB), the most prominent performance enabler of Yottixel. The performance of the prototype platform is qualitatively tested using 300 WSIs from the University of Pittsburgh Medical Center (UPMC) and 2,020 WSIs from The Cancer Genome Atlas Program (TCGA) provided by the National Cancer Institute. Both datasets amount to more than 4,000,000 patches of 1000x1000 pixels. We report three sets of experiments that show that Yottixel can accurately retrieve organs and malignancies, and its semantic ordering shows good agreement with the subjective evaluation of human observers.
61.Pan-Cancer Diagnostic Consensus Through Searching Archival Histopathology Images Using Artificial Intelligence ⬇️
The emergence of digital pathology has opened new horizons for histopathology and cytology. Artificial-intelligence algorithms are able to operate on digitized slides to assist pathologists with diagnostic tasks. Whereas machine learning involving classification and segmentation methods have obvious benefits for image analysis in pathology, image search represents a fundamental shift in computational pathology. Matching the pathology of new patients with already diagnosed and curated cases offers pathologist a novel approach to improve diagnostic accuracy through visual inspection of similar cases and computational majority vote for consensus building. In this study, we report the results from searching the largest public repository (The Cancer Genome Atlas [TCGA] program by National Cancer Institute, USA) of whole slide images from almost 11,000 patients depicting different types of malignancies. For the first time, we successfully indexed and searched almost 30,000 high-resolution digitized slides constituting 16 terabytes of data comprised of 20 million 1000x1000 pixels image patches. The TCGA image database covers 25 anatomic sites and contains 32 cancer subtypes. High-performance storage and GPU power were employed for experimentation. The results were assessed with conservative "majority voting" to build consensus for subtype diagnosis through vertical search and demonstrated high accuracy values for both frozen sections slides (e.g., bladder urothelial carcinoma 93%, kidney renal clear cell carcinoma 97%, and ovarian serous cystadenocarcinoma 99%) and permanent histopathology slides (e.g., prostate adenocarcinoma 98%, skin cutaneous melanoma 99%, and thymoma 100%). The key finding of this validation study was that computational consensus appears to be possible for rendering diagnoses if a sufficiently large number of searchable cases are available for each cancer subtype.
62.An Inception Inspired Deep Network to Analyse Fundus Images ⬇️
A fundus image usually contains the optic disc, pathologies and other structures in addition to vessels to be segmented. This study proposes a deep network for vessel segmentation, whose architecture is inspired by inception modules. The network contains three sub-networks, each with a different filter size, which are connected in the last layer of the proposed network. According to experiments conducted in the DRIVE and IOSTAR, the performance of our network is found to be better than or comparable to that of the previous methods. We also observe that the sub-networks pay attention to different parts of an input image when producing an output map in the last layer of the proposed network; though, training of the proposed network is not constrained for this purpose.
63.Dual Reconstruction with Densely Connected Residual Network for Single Image Super-Resolution ⬇️
Deep learning-based single image super-resolution enables very fast and high-visual-quality reconstruction. Recently, an enhanced super-resolution based on generative adversarial network (ESRGAN) has achieved excellent performance in terms of both qualitative and quantitative quality of the reconstructed high-resolution image. In this paper, we propose to add one more shortcut between two dense-blocks, as well as add shortcut between two convolution layers inside a dense-block. With this simple strategy of adding more shortcuts in the proposed network, it enables a faster learning process as the gradient information can be back-propagated more easily. Based on the improved ESRGAN, the dual reconstruction is proposed to learn different aspects of the super-resolved image for judiciously enhancing the quality of the reconstructed image. In practice, the super-resolution model is pre-trained solely based on pixel distance, followed by fine-tuning the parameters in the model based on adversarial loss and perceptual loss. Finally, we fuse two different models by weighted-summing their parameters to obtain the final super-resolution model. Experimental results demonstrated that the proposed method achieves excellent performance in the real-world image super-resolution challenge. We have also verified that the proposed dual reconstruction does further improve the quality of the reconstructed image in terms of both PSNR and SSIM.
64.Computer-Aided Clinical Skin Disease Diagnosis Using CNN and Object Detection Models ⬇️
Skin disease is one of the most common types of human diseases, which may happen to everyone regardless of age, gender or race. Due to the high visual diversity, human diagnosis highly relies on personal experience; and there is a serious shortage of experienced dermatologists in many countries. To alleviate this problem, computer-aided diagnosis with state-of-the-art (SOTA) machine learning techniques would be a promising solution. In this paper, we aim at understanding the performance of convolutional neural network (CNN) based approaches. We first build two versions of skin disease datasets from Internet images: (a) Skin-10, which contains 10 common classes of skin disease with a total of 10,218 images; (b) Skin-100, which is a larger dataset that consists of 19,807 images of 100 skin disease classes. Based on these datasets, we benchmark several SOTA CNN models and show that the accuracy of skin-100 is much lower than the accuracy of skin-10. We then implement an ensemble method based on several CNN models and achieve the best accuracy of 79.01% for Skin-10 and 53.54% for Skin-100. We also present an object detection based approach by introducing bounding boxes into the Skin-10 dataset. Our results show that object detection can help improve the accuracy of some skin disease classes.
65.W-Net: Two-stage U-Net with misaligned data for raw-to-RGB mapping ⬇️
Recent research on a learning mapping between raw Bayer images and RGB images has progressed with the development of deep convolutional neural networks. A challenging data set namely the Zurich Raw-to-RGB data set~(ZRR) has been released in the AIM 2019 raw-to-RGB mapping challenge. In ZRR, input raw and target RGB images are captured by two different cameras and thus not perfectly aligned. Moreover, camera metadata such as white balance gains and color correction matrix are not provided, which makes the challenge more difficult. In this paper, we explore an effective network structure and a loss function to address these issues. We exploit a two-stage U-Net architecture and also introduce a loss function that is less variant to alignment and more sensitive to color differences. In addition, we show an ensemble of networks trained with different loss functions can bring a significant performance gain. We demonstrate the superiority of our method by achieving the highest score in terms of both the peak signal-to-noise ratio and the structural similarity and obtaining the second-best mean-opinion-score in the challenge.
66.End to end collision avoidance based on optical flow and neural networks ⬇️
Optical flow is believed to play an important role in the agile flight of birds and insects. Even though it is a very simple concept, it is rarely used in computer vision for collision avoidance. This work implements a neural network based collision avoidance which was deployed and evaluated on a solely for this purpose refitted car.
67.Automatic Brain Tumour Segmentation and Biophysics-Guided Survival Prediction ⬇️
Gliomas are the most common malignant brain tumourswith intrinsic heterogeneity. Accurate segmentation of gliomas and theirsub-regions on multi-parametric magnetic resonance images (mpMRI)is of great clinical importance, which defines tumour size, shape andappearance and provides abundant information for preoperative diag-nosis, treatment planning and survival prediction. Recent developmentson deep learning have significantly improved the performance of auto-mated medical image segmentation. In this paper, we compare severalstate-of-the-art convolutional neural network models for brain tumourimage segmentation. Based on the ensembled segmentation, we presenta biophysics-guided prognostic model for patient overall survival predic-tion which outperforms a data-driven radiomics approach. Our methodwon the second place of the MICCAI 2019 BraTS Challenge for theoverall survival prediction.