
ArXiv cs.CV --Tue, 26 Nov 2019

1.A Benchmark for Anomaly Segmentation ⬇️

Detecting out-of-distribution examples is important for safety-critical machine learning applications such as self-driving vehicles. However, existing research mainly focuses on small-scale images where the whole image is considered anomalous. We propose to segment only the anomalous regions within an image, and hence we introduce the Combined Anomalous Object Segmentation benchmark for the more realistic task of large-scale anomaly segmentation. Our benchmark combines two novel datasets for anomaly segmentation that incorporate both realism and anomaly diversity. Using both real images and those from a simulated driving environment, we ensure that the background context and a wide variety of anomalous objects are naturally integrated, unlike in prior benchmarks. Additionally, we improve out-of-distribution detectors on large-scale multi-class datasets and introduce detectors for the previously unexplored setting of multi-label out-of-distribution detection. These novel baselines along with our anomaly segmentation benchmark open the door to further research in large-scale out-of-distribution detection and segmentation.

2.Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild ⬇️

We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision. The method is based on an autoencoder that factors each input image into depth, albedo, viewpoint and illumination. In order to disentangle these components without supervision, we use the fact that many object categories have, at least in principle, a symmetric structure. We show that reasoning about illumination allows us to exploit the underlying object symmetry even if the appearance is not symmetric due to shading. Furthermore, we model objects that are probably, but not certainly, symmetric by predicting a symmetry probability map, learned end-to-end with the other components of the model. Our experiments show that this method can very accurately recover the 3D shape of human faces, cat faces and cars from single-view images, without any supervision or a prior shape model. On benchmarks, we demonstrate superior accuracy compared to another method that uses supervision at the level of 2D image correspondences.
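To make the symmetry idea concrete, the snippet below is a minimal, hypothetical PyTorch-style loss that weights the reconstruction error of a horizontally flipped prediction by a predicted per-pixel symmetry probability map. It only illustrates the soft-symmetry trick; the paper's full model additionally factors depth, albedo, viewpoint and illumination, which is omitted here.

```python
import torch

def symmetric_recon_loss(recon, recon_flipped, target, sym_prob, eps=1e-6):
    """Illustrative loss: the flipped reconstruction is only required to match
    the target where the predicted symmetry probability is high.

    recon, recon_flipped, target: (B, C, H, W) images
    sym_prob: (B, 1, H, W) values in [0, 1], probability that a pixel is symmetric
    """
    l1 = (recon - target).abs().mean()
    # per-pixel error of the mirrored reconstruction, down-weighted where the
    # object is predicted to be asymmetric
    l1_flip = (sym_prob * (recon_flipped - target).abs()).mean()
    # crude regularizer pushing the map away from the trivial all-zero solution
    reg = -torch.log(sym_prob + eps).mean()
    return l1 + l1_flip + 0.01 * reg

# toy usage
B, C, H, W = 2, 3, 16, 16
target = torch.rand(B, C, H, W)
recon = torch.rand(B, C, H, W, requires_grad=True)
recon_flipped = torch.flip(recon, dims=[3])        # horizontal mirror
sym_prob = torch.sigmoid(torch.randn(B, 1, H, W))
loss = symmetric_recon_loss(recon, recon_flipped, target, sym_prob)
loss.backward()
```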

3.Phase Contrast Microscopy Cell Population Segmentation: A Survey ⬇️

Phase contrast microscopy (PCM) has been widely used in biomedicine research, as it allows users to observe objects without staining or killing them. One important related line of research is to employ PCM to monitor live cells. How to segment cell populations in the obtained PCM images is gaining more and more attention, as it is a critical step for downstream applications such as cell tracking, cell classification and others. Many papers have been published to deal with this problem from different perspectives. In this paper, we aim to present a comprehensive review of the development of PCM cell population segmentation.

4.StructEdit: Learning Structural Shape Variations ⬇️

Learning to encode differences in the geometry and (topological) structure of the shapes of ordinary objects is key to generating semantically plausible variations of a given shape, transferring edits from one shape to another, and many other applications in 3D content creation. The common approach of encoding shapes as points in a high-dimensional latent feature space suggests treating shape differences as vectors in that space. Instead, we treat shape differences as primary objects in their own right and propose to encode them in their own latent space. In a setting where the shapes themselves are encoded in terms of fine-grained part hierarchies, we demonstrate that a separate encoding of shape deltas or differences provides a principled way to deal with inhomogeneities in the shape space due to different combinatorial part structures, while also allowing for compactness in the representation, as well as edit abstraction and transfer. Our approach is based on a conditional variational autoencoder for encoding and decoding shape deltas, conditioned on a source shape. We demonstrate the effectiveness and robustness of our approach in multiple shape modification and generation tasks, and provide comparison and ablation studies on the PartNet dataset, one of the largest publicly available 3D datasets.

5.Explaining Neural Networks via Perturbing Important Learned Features ⬇️

Attributing the output of a neural network to the contribution of given input elements is one way of shedding light on the black-box nature of neural networks. We propose a novel input feature attribution method that finds an input perturbation that maximally changes the output neuron by exclusively perturbing important hidden neurons (i.e., learned features) on the path to the output neuron. Given an input, this is achieved by 1) pruning unimportant neurons, and subsequently 2) finding a local input perturbation that maximizes the output in the pruned network. Since our method considers the importance of hidden neurons (high-level features), it inherently considers interdependencies between multiple input elements, which is vital for input feature attribution. We propose PruneGrad, an efficient gradient-based solution for the pruning and perturbation steps of our method. The efficacy of our method is evaluated by quantitatively benchmarking against other attribution methods using 1) sanity checks, 2) pixel perturbation, and 3) Remove and Retrain (ROAR). Our results show that while most of the existing attribution methods are prone to fail or achieve mediocre results in at least one benchmark, our proposed method achieves state-of-the-art results in all three benchmarks. The results are further supported by a comparative visual evaluation.

6.Radius Adaptive Convolutional Neural Network ⬇️

Convolutional neural networks (CNNs) are widely used in computer vision applications. In networks that deal with images, convolution layers are typically the most time-consuming part of the network. The usual way to address the computational cost is to decrease the number of trainable parameters, which typically comes at the cost of reduced accuracy. Another problem with this technique is that the cost of memory access is usually not taken into account, which results in an insignificant speedup. The number of operations and memory accesses in a standard convolution layer is independent of the input content, which limits certain accelerations. We propose a simple modification to a standard convolution to bridge this gap. We propose an adaptive convolution that adopts different kernel sizes (or radii) based on the content. The network can learn and select the proper radius based on the input content in a soft-decision manner. Our proposed radius-adaptive convolutional neural network (RACNN) has a similar number of weights to a standard one, yet results show it can reach higher speeds. The code has been made available at: this https URL.
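A minimal sketch of the underlying idea — blending convolutions of several kernel sizes with content-dependent soft weights — is given below in PyTorch. The module name, the tiny gating branch and the radii are our own illustrative choices, not the RACNN architecture itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftRadiusConv(nn.Module):
    """Illustrative content-adaptive convolution: run convolutions with several
    radii and blend their outputs with per-pixel soft weights predicted from the
    input. A sketch of the general idea, not the RACNN formulation."""

    def __init__(self, in_ch, out_ch, radii=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=2 * r + 1, padding=r)
            for r in radii
        )
        # small gating branch: one weight per radius per pixel
        self.gate = nn.Conv2d(in_ch, len(radii), kernel_size=3, padding=1)

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=1)                   # (B, R, H, W)
        outs = torch.stack([b(x) for b in self.branches], dim=1)   # (B, R, C, H, W)
        return (weights.unsqueeze(2) * outs).sum(dim=1)            # (B, C, H, W)

x = torch.randn(1, 16, 32, 32)
y = SoftRadiusConv(16, 32)(x)
print(y.shape)  # torch.Size([1, 32, 32, 32])
```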

7.Gating Revisited: Deep Multi-layer RNNs That Can Be Trained ⬇️

We propose a new stackable recurrent cell (STAR) for recurrent neural networks (RNNs) that has significantly fewer parameters than the widely used LSTM and GRU cells while being more robust against vanishing or exploding gradients. Stacking multiple layers of recurrent units has two major drawbacks: i) many recurrent cells (e.g., LSTM cells) are extremely demanding in terms of parameters and computational resources, ii) deep RNNs are prone to vanishing or exploding gradients during training. We investigate the training of multi-layer RNNs and examine the magnitude of the gradients as they propagate through the network in the "vertical" direction. We show that, depending on the structure of the basic recurrent unit, the gradients are systematically attenuated or amplified. Based on our analysis we design a new type of gated cell that better preserves gradient magnitude. We validate our design on a large number of sequence modelling tasks and demonstrate that the proposed STAR cell makes it possible to build and train deeper recurrent architectures, ultimately leading to improved performance while being computationally efficient.
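For intuition, here is a minimal single-gate recurrent cell in the spirit described above: one candidate state and one gate deciding how much of it replaces the previous state, giving substantially fewer parameters than an LSTM of the same size. This is a generic sketch under our own naming, not a verified reproduction of the STAR cell.

```python
import torch
import torch.nn as nn

class SingleGateCell(nn.Module):
    """Sketch of a lightweight gated recurrent cell: one candidate, one gate
    (substantially fewer parameters than an LSTM cell of the same size)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cand = nn.Linear(input_size, hidden_size)                # candidate z_t
        self.gate = nn.Linear(input_size + hidden_size, hidden_size)  # gate k_t

    def forward(self, x, h):
        z = torch.tanh(self.cand(x))
        k = torch.sigmoid(self.gate(torch.cat([x, h], dim=-1)))
        return torch.tanh(h + k * (z - h))   # gated, convex-style state update

# unrolling a toy sequence
cell = SingleGateCell(8, 16)
h = torch.zeros(4, 16)
for t in range(10):
    h = cell(torch.randn(4, 8), h)
print(h.shape)  # torch.Size([4, 16])
```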

8.Deep Decomposition Learning for Inverse Imaging Problems ⬇️

Deep learning is emerging as a new paradigm for solving inverse imaging problems. However, deep learning methods often lack the guarantees of traditional physics-based methods, because physical information is not taken into account when neural networks are trained and deployed. Appropriate supervision and explicit calibration with information from the physics model can enhance neural network learning and its practical performance. In this paper, inspired by the fact that a signal can be decomposed into two components, one in the null space of the forward operator and one in the range space of its pseudo-inverse, we train neural networks to learn the two components and therefore learn the decomposition, i.e., we explicitly reformulate the neural network layers as learning range-nullspace decomposition functions with reference to the layer inputs, instead of learning unreferenced functions. We show that the decomposition networks not only produce superior results, but also enjoy good interpretability and generalization. We demonstrate the advantages of decomposition learning on different inverse problems, including compressive sensing and image super-resolution as examples.
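The decomposition the paper builds on can be written as x = A⁺Ax + (I − A⁺A)x: the first term is fixed by the measurements (range of the pseudo-inverse), the second is invisible to the forward operator (null space). A small NumPy illustration, with a random matrix standing in for a real imaging operator:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))     # underdetermined forward operator (toy stand-in)
x = rng.standard_normal(50)           # ground-truth signal
y = A @ x                             # measurements

A_pinv = np.linalg.pinv(A)
x_range = A_pinv @ (A @ x)            # component determined by the measurements
x_null = x - x_range                  # component in the null space of A

assert np.allclose(A @ x_null, 0, atol=1e-6)   # the null-space part is invisible to A
assert np.allclose(x_range + x_null, x)        # exact decomposition
# In decomposition learning, a network predicts the null-space component (the lost
# information), while the range component comes directly from A_pinv @ y.
print(np.linalg.norm(A @ x_range - y))  # ~0
```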

9.Event Recognition with Automatic Album Detection based on Sequential Processing, Neural Attention and Image Captioning ⬇️

In this paper, a new formulation of the event recognition task is examined: it is required to predict event categories in a gallery of images for which albums (groups of photos corresponding to a single event) are unknown. We propose a novel two-stage approach. First, features are extracted from each photo using a pre-trained convolutional neural network. These features are classified individually. The scores of the classifier are used to group sequential photos into several clusters. Finally, the features of the photos in each group are aggregated into a single descriptor using a neural attention mechanism. This algorithm is optionally extended to improve the accuracy of classification of each image in an album. In contrast to conventional fine-tuning of convolutional neural networks (CNNs), we propose to use image captioning, i.e., a generative model that converts images to textual descriptions. These are one-hot encoded and summarized into a sparse feature vector suitable for learning an arbitrary classifier. An experimental study with the Photo Event Collection and the Multi-Label Curation of Flickr Events Dataset demonstrates that our approach is 9-20% more accurate than event recognition on single photos. Moreover, the proposed method has a 13-16% lower error rate than classification of groups of photos obtained with hierarchical clustering. It is experimentally shown that the image captions trained on the Conceptual Captions dataset can be classified more accurately than the features from an object detector, though both are obviously not as rich as the CNN-based features. However, it is possible to combine our approach with conventional CNNs in an ensemble to provide state-of-the-art results for several event datasets.

10.Microscopy Image Restoration with Deep Wiener-Kolmogorov filters ⬇️

Microscopy is a powerful visualization tool in biology, enabling the study of cells, tissues, and fundamental biological processes. Yet, the observed images of objects at the micro-scale suffer from two major inherent distortions: the blur caused by the diffraction of light, and the background noise caused by the imperfections of the imaging detectors. The latter is especially severe in fluorescence and confocal microscopes, which are known for operating at low photon counts with Poisson noise statistics. Restoration of such images is usually accomplished by image deconvolution, with the nature of the noise statistics taken into account, and by solving an optimization problem given some prior information about the underlying data (i.e., regularization). In this work, we propose a unifying framework of algorithms for Poisson image deblurring and denoising. The algorithms are based on deep learning techniques for the design of learnable regularizers paired with an appropriate optimization scheme. Our extensive experiments show that the proposed approach achieves superior quality of image reconstruction and outperforms solutions that rely on deep learning or on optimization schemes alone. Moreover, several implementations of the proposed framework demonstrate competitive performance at a low computational complexity, which is of high importance for real-time imaging applications.

11.Forecasting Human Object Interaction: Joint Prediction of Motor Attention and Egocentric Activity ⬇️

We address the challenging task of anticipating human-object interaction in first-person videos. Most existing methods ignore how the camera wearer interacts with objects, or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this, we adopt intentional hand movement as a future representation and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots and future action. Specifically, we consider the future hand motion as the motor attention, and model this attention using latent variables in our deep model. The predicted motor attention is further used to characterise the discriminative spatio-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. At the time of submission, our method is ranked first on the unseen test set of the EPIC-Kitchens Action Anticipation Challenge (Phase 2).

12.PQ-NET: A Generative Part Seq2Seq Network for 3D Shapes ⬇️

We introduce PQ-NET, a deep neural network which represents and generates 3D shapes via sequential part assembly. The input to our network is a 3D shape segmented into parts, where each part is first encoded into a feature representation using a part autoencoder. The core component of PQ-NET is a sequence-to-sequence or Seq2Seq autoencoder which encodes a sequence of part features into a latent vector of fixed size, and the decoder reconstructs the 3D shape, one part at a time, resulting in a sequential assembly. The latent space formed by the Seq2Seq encoder encodes both part structure and fine part geometry. The decoder can be adapted to perform several generative tasks including shape autoencoding, interpolation, novel shape generation, and single-view 3D reconstruction, where the generated shapes are all composed of meaningful parts.

13.Sub-frame Appearance and 6D Pose Estimation of Fast Moving Objects ⬇️

We propose a novel method that tracks fast moving objects, mainly non-uniform spherical ones, in full 6 degrees of freedom, estimating simultaneously their 3D motion trajectory, 3D pose and object appearance changes with a time step that is a fraction of the video frame exposure time. The sub-frame object localization and appearance estimation allow realistic temporal super-resolution and precise shape estimation. The method, called TbD-3D (Tracking by Deblatting in 3D), relies on a novel reconstruction algorithm which solves a piece-wise deblurring and matting problem. The 3D rotation is estimated by minimizing the reprojection error. As a second contribution, we present a new challenging dataset with fast moving objects that change their appearance and distance to the camera. High-speed camera recordings with zero lag between frame exposures were used to generate videos with different frame rates annotated with ground-truth trajectory and pose.

14.ColorFool: Semantic Adversarial Colorization ⬇️

Adversarial attacks that generate small L_p-norm perturbations to mislead classifiers have limited success in black-box settings and with unseen classifiers. These attacks are also fragile against defenses that use denoising filters and against adversarial training procedures. Instead, adversarial attacks that generate unrestricted perturbations are more robust to defenses, are generally more successful in black-box settings and are more transferable to unseen classifiers. However, unrestricted perturbations may be noticeable to humans. In this paper, we propose a content-based black-box adversarial attack that generates unrestricted perturbations by exploiting image semantics to selectively modify colors within chosen ranges that are perceived as natural by humans. We show that the proposed approach, ColorFool, outperforms five state-of-the-art adversarial attacks in terms of success rate, robustness to defense frameworks and transferability, on two different tasks, scene and object classification, when attacking three state-of-the-art deep neural networks using three standard datasets. We will make the code of the proposed approach and the whole evaluation framework publicly available.

15.Learning New Tricks from Old Dogs -- Inter-Species, Inter-Tissue Domain Adaptation for Mitotic Figure Assessment ⬇️

For histopathological tumor assessment, the count of mitotic figures per area is an important part of prognostication. Algorithmic approaches - such as for mitotic figure identification - have significantly improved in recent times, potentially allowing for computer-augmented or fully automatic screening systems in the future. This trend is further supported by whole slide scanning microscopes becoming available in many pathology labs, where they could soon become a standard imaging tool.
For the application of such algorithms in broader fields, the availability of mitotic figure data sets of sufficient size for the respective tissue type and species is an important precondition that is, however, rarely met. While algorithmic performance has climbed steadily for, e.g., human mammary carcinoma thanks to several challenges held in the field, for most tumor types data sets are not available.
In this work, we assess domain transfer of mitotic figure recognition using domain adversarial training on four data sets, two from dogs and two from humans. We show that domain adversarial training considerably improves accuracy when mitotic figure classification learned on the canine data sets is applied to the human data sets (up to +12.8% in accuracy) and is thus a helpful method for transferring knowledge from existing data sets to new tissue types and species.

16.Binarized Neural Architecture Search ⬇️

Neural architecture search (NAS) can have a significant impact in computer vision by automatically designing optimal neural network architectures for various tasks. A variant, binarized neural architecture search (BNAS), with a search space of binarized convolutions, can produce extremely compressed models. Unfortunately, this area remains largely unexplored. BNAS is more challenging than NAS due to the learning inefficiency caused by optimization requirements and the huge architecture space. To address these issues, we introduce channel sampling and operation space reduction into a differentiable NAS to significantly reduce the cost of searching. This is accomplished through a performance-based strategy used to abandon less promising operations. Two optimization methods for binarized neural networks are used to validate the effectiveness of our BNAS. Extensive experiments demonstrate that the proposed BNAS achieves performance comparable to NAS on both the CIFAR and ImageNet databases. An accuracy of 96.53% vs. 97.22% is achieved on the CIFAR-10 dataset, but with a significantly compressed model, and a 40% faster search than the state-of-the-art PC-DARTS.

17.Deep Image Deraining Via Intrinsic Rainy Image Priors and Multi-scale Auxiliary Decoding ⬇️

Different rain models and novel network structures have been proposed to remove rain streaks from single rainy images. In this work, we bring attention to the intrinsic priors and multi-scale features of rainy images, and develop several intrinsic loss functions to train a CNN deraining network. We first study the sparse priors of rainy images, which have been verified to preserve unbroken edges in image decomposition. However, since their mathematical formulation usually leads to an intractable solution, we propose quasi-sparsity priors to decrease complexity, so that our network can be trained under the supervision of the sparse properties of rainy images. Quasi-sparsity supervises network training in the gradient domain, where decomposing a rainy image into a rain layer and a background layer is still ill-posed. We develop another $L_1$ loss based on the intrinsic low-value property of the rain layer to restore image contents, together with the commonly used $L_1$ similarity loss. Multi-scale features are further explored via a multi-scale auxiliary decoding structure to show which kinds of features contribute the most to the deraining task, and the corresponding multi-scale auxiliary loss improves the deraining performance further. In our network, more efficient group convolution and feature sharing are utilized to obtain a one-order-of-magnitude improvement in network running speed. The proposed deraining method performs favorably against state-of-the-art deraining approaches.

18.Fast and Generalized Adaptation for Few-Shot Learning ⬇️

The ability to generalize quickly to novel tasks from a few examples is critical to few-shot learning. However, deep learning models severely suffer from overfitting in the extreme low-data regime. In this paper, we propose the Adaptable Cosine Classifier (ACC) and Amphibian to achieve fast and generalized adaptation for few-shot learning. The ACC enables flexible retraining of a deep network on small data without overfitting. Amphibian learns a good weight initialization in the parameter space where optimal solutions for tasks of the same class cluster tightly, enabling rapid adaptation to novel tasks with few gradient updates. We conduct comprehensive experiments on four few-shot datasets and achieve state-of-the-art performance in all cases. Notably, we achieve an accuracy of 87.75% on 5-shot miniImageNet, outperforming existing methods by approximately 10%. We also conduct experiments on cross-domain few-shot tasks and achieve the best results.

19.Estimating People Flows to Better Count them in Crowded Scenes ⬇️

State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate people densities in individual images. As such, only very few take advantage of temporal consistency in video sequences, and those that do only impose weak smoothness constraints across consecutive frames.
In this paper, we show that estimating people flows across image locations between consecutive images and inferring the people densities from these flows, instead of directly regressing them, makes it possible to impose much stronger constraints encoding the conservation of the number of people. This significantly boosts performance without requiring a more complex architecture. Furthermore, it also enables us to exploit the correlation between people flow and optical flow to further improve the results.
We will demonstrate that we consistently outperform state-of-the-art methods on five benchmark datasets.
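The conservation constraint can be illustrated with a toy sketch: if the model predicts, for each grid cell, the fraction of people moving to each neighbouring cell (including a "stay" option), the density in the next frame must equal the sum of flows arriving at each cell. The grid and neighbourhood bookkeeping below are our own simplification, purely for illustration.

```python
import numpy as np

H, W = 4, 4
rng = np.random.default_rng(0)
# flow[k, i, j]: fraction of the people at cell (i, j) in frame t that move to
# the k-th neighbouring cell (including "stay") by frame t+1
offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
flow = rng.random((len(offsets), H, W))
flow /= flow.sum(axis=0, keepdims=True)          # flows out of each cell sum to 1

density_t = rng.random((H, W))                   # people density in frame t

# reconstruct frame t+1 by summing the flows arriving at each cell
density_t1 = np.zeros((H, W))
for k, (di, dj) in enumerate(offsets):
    for i in range(H):
        for j in range(W):
            ni, nj = i + di, j + dj
            if 0 <= ni < H and 0 <= nj < W:
                density_t1[ni, nj] += flow[k, i, j] * density_t[i, j]

# people are conserved up to what leaves through the image border
print(density_t.sum(), density_t1.sum())
# a training loss can then penalise the difference between this flow-implied
# density and the density regressed directly from frame t+1
```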

20.Fine-grained Attention and Feature-sharing Generative Adversarial Networks for Single Image Super-Resolution ⬇️

Traditional super-resolution methods that aim to minimize the mean square error usually produce images with over-smoothed and blurry edges, due to the loss of high-frequency details. In this paper, we propose two novel techniques within the generative adversarial network framework to produce photo-realistic images for image super-resolution. Firstly, instead of producing a single score to discriminate between real and fake images, we propose a variant, called Fine-grained Attention Generative Adversarial Network for image super-resolution (FASRGAN), to discriminate each pixel between real and fake. FASRGAN adopts a U-Net-like network as the discriminator with two outputs: an image score and an image score map. The score map has the same spatial size as the HR/SR images, serving as fine-grained attention that represents the degree of reconstruction difficulty for each pixel. Secondly, instead of using different networks for the generator and the discriminator in the SR problem, we use a feature-sharing network (Fs-SRGAN) for both the generator and the discriminator. By network sharing, certain information is shared between the generator and the discriminator, which in turn can improve the ability to produce high-quality images. Quantitative and visual comparisons with state-of-the-art methods on benchmark datasets demonstrate the superiority of our methods. The application of the super-resolution images to object recognition further demonstrates the reconstruction capability and super-resolution quality of the proposed methods.

21.Regularized Fine-grained Meta Face Anti-spoofing ⬇️

Face presentation attacks have become an increasingly critical concern as face recognition is more widely applied. Many face anti-spoofing methods have been proposed, but most of them ignore the ability to generalize to unseen attacks. To overcome this limitation, this work casts face anti-spoofing as a domain generalization (DG) problem and attempts to address it by developing a new meta-learning framework called Regularized Fine-grained Meta-learning. To let our face anti-spoofing model generalize well to unseen attacks, the proposed framework trains the model to perform well in simulated domain-shift scenarios, which is achieved by finding generalized learning directions in the meta-learning process. Specifically, the proposed framework incorporates the domain knowledge of face anti-spoofing as regularization, so that meta-learning is conducted in a feature space regularized by the supervision of domain knowledge. This makes our model more likely to find generalized learning directions with regularized meta-learning for the face anti-spoofing task. Besides, to further enhance the generalization ability of our model, the proposed framework adopts a fine-grained learning strategy that simultaneously conducts meta-learning in a variety of domain-shift scenarios in each iteration. Extensive experiments on four public datasets validate the effectiveness of the proposed method.

22.Deep Image-to-Video Adaptation and Fusion Networks for Action Recognition ⬇️

Existing deep learning methods for action recognition in videos require a large number of labeled videos for training, which is labor-intensive and time-consuming. For the same action, the knowledge learned from different media types, e.g., videos and images, may be related and complementary. However, due to the domain shifts and heterogeneous feature representations between videos and images, the performance of classifiers trained on images may be dramatically degraded when directly deployed to videos. In this paper, we propose a novel method, named Deep Image-to-Video Adaptation and Fusion Networks (DIVAFN), to enhance action recognition in videos by transferring knowledge from images using video keyframes as a bridge. The DIVAFN is a unified deep learning model, which integrates domain-invariant representation learning and cross-modal feature fusion into a unified optimization framework. Specifically, we design an efficient cross-modal similarity metric to reduce the modality shift among images, keyframes and videos. Then, we adopt an autoencoder architecture, whose hidden layer is constrained to be the semantic representations of the action class names. In this way, when the autoencoder is adopted to project the learned features from different domains to the same space, more compact, informative and discriminative representations can be obtained. Finally, the concatenation of the learned semantic feature representations from these three autoencoders is used to train the classifier for action recognition in videos. Comprehensive experiments on four real-world datasets show that our method outperforms several state-of-the-art domain adaptation and action recognition methods.

23.Empirical Study of Easy and Hard Examples in CNN Training ⬇️

Deep Neural Networks (DNNs) generalize well despite their massive size and capability of memorizing all examples. There is a hypothesis that DNNs start learning from simple patterns, based on the existence of examples that are consistently well-classified at the early training stage (i.e., easy examples) and examples that are consistently misclassified (i.e., hard examples). Easy examples are evidence that DNNs start learning from specific patterns and that there is a consistent learning process. It is important to know how DNNs learn patterns and obtain generalization ability; however, the properties of easy and hard examples have not been thoroughly investigated (e.g., their contributions to generalization and their visual appearance). In this work, we study the similarities of easy and hard examples respectively for different Convolutional Neural Network (CNN) architectures, assessing how those examples contribute to generalization. Our results show that easy examples are visually similar to each other and hard examples are visually diverse, and both are largely shared across different CNN architectures. Moreover, while hard examples tend to contribute more to generalization than easy examples, removing a large number of easy examples leads to poor generalization. From these results, we hypothesize that biases in the dataset and Stochastic Gradient Descent (SGD) are the reasons why CNNs have consistent easy and hard examples. Furthermore, we show that large-scale classification datasets can be efficiently compressed by using the easiness measure proposed in this work.

24.Nearest Neighbor Sampling of Point Sets using Random Rays ⬇️

We propose a new framework for the sampling, compression, and analysis of distributions of point sets and other geometric objects embedded in Euclidean spaces. A set of randomly selected rays are projected onto their closest points in the data set, forming the ray signature. From the signature, statistical information about the data set, as well as certain geometrical information, can be extracted, independent of the ray set. We present promising results from "RayNN", a neural network for the classification of point clouds based on ray signatures.
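A minimal NumPy sketch of one way to form such a ray signature: sample random rays, find the data point closest to each ray, and record a per-ray statistic (here simply the ray-to-point distance, our simplified choice; the paper's exact signature may record different quantities, such as the closest point itself).

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((500, 3))                 # toy point cloud

def ray_signature(points, n_rays=32, rng=rng):
    origins = rng.standard_normal((n_rays, 3))
    dirs = rng.standard_normal((n_rays, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    sig = np.empty(n_rays)
    for r in range(n_rays):
        v = points - origins[r]                        # vectors from ray origin to points
        t = np.clip(v @ dirs[r], 0.0, None)            # projection parameter along the ray (t >= 0)
        closest = origins[r] + t[:, None] * dirs[r]    # closest point on the ray to each data point
        dists = np.linalg.norm(points - closest, axis=1)
        sig[r] = dists.min()                           # distance from the ray to its nearest data point
    return sig

print(ray_signature(points).shape)  # (32,)
```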

25.Point Cloud Processing via Recurrent Set Encoding ⬇️

We present a new permutation-invariant network for 3D point cloud processing. Our network is composed of a recurrent set encoder and a convolutional feature aggregator. Given an unordered point set, the encoder first partitions its ambient space into parallel beams. Points within each beam are then modeled as a sequence and encoded into subregional geometric features by a shared recurrent neural network (RNN). The spatial layout of the beams is regular, and this allows the beam features to be further fed into an efficient 2D convolutional neural network (CNN) for hierarchical feature aggregation. Our network is effective at spatial feature learning, and competes favorably with state-of-the-art methods (SOTAs) on a number of benchmarks. Meanwhile, it is significantly more efficient than the SOTAs.

26.AOP: An Anti-overfitting Pretreatment for Practical Image-based Plant Diagnosis ⬇️

In image-based plant diagnosis, clues related to the diagnosis are often unclear, and other factors such as image backgrounds often have a significant impact on the final decision. As a result, overfitting due to latent similarities in the dataset often occurs, and the diagnostic performance on real unseen data (e.g., images from other farms) usually drops significantly. However, this problem has not been sufficiently explored, since many systems have shown excellent diagnostic performance due to the bias caused by the similarities in the dataset. In this study, we investigate this problem with experiments using more than 50,000 images of cucumber leaves, and propose an anti-overfitting pretreatment (AOP) for realizing practical image-based plant diagnosis systems. The AOP detects the area of interest (leaf, fruit, etc.) and performs brightness calibration as a preprocessing step. The experimental results demonstrate that our AOP can improve the accuracy of diagnosis for unknown test images from different farms by 12.2% in a practical setting.

27.Cascaded Detail-Preserving Networks for Super-Resolution of Document Images ⬇️

The accuracy of OCR is usually affected by the quality of the input document image, and various kinds of degraded document images hamper OCR results. Among these scenarios, low-resolution images are a common and challenging case. In this paper, we propose cascaded networks for document image super-resolution. Our model is composed of Detail-Preserving Networks with small magnification factors. The loss function with perceptual terms is designed to simultaneously preserve the original patterns and enhance the edges of the characters. These networks are trained with the same architecture and different parameters and then assembled into a pipeline model with a larger magnification. Low-resolution images are upscaled gradually by passing through each Detail-Preserving Network until the final high-resolution images are obtained. Through extensive experiments on two scanned document image datasets, we demonstrate that the proposed approach outperforms recent state-of-the-art image super-resolution methods, and that combining it with a standard OCR system leads to significant improvements in the recognition results.

28.Prototype Rectification for Few-Shot Learning ⬇️

Few-shot learning is a challenging problem that requires a model to recognize novel classes with few labeled data. In this paper, we aim to find the expected prototypes of the novel classes, which have the maximum cosine similarity with the samples of the same class. Firstly, we propose a cosine similarity based prototypical network to compute basic prototypes of the novel classes from the few samples. A bias diminishing module is further proposed for prototype rectification, since the basic prototypes computed in the low-data regime are biased relative to the expected prototypes. In our method, the intra-class bias and the cross-class bias are diminished to modify the prototypes. Then we give a theoretical analysis of the impact of the bias diminishing module on the expected performance of our method. We conduct extensive experiments on four few-shot benchmarks and further analyze the advantage of the bias diminishing module. The bias diminishing module brings a significant improvement of 3% to 9% in general. Notably, our approach achieves state-of-the-art performance on miniImageNet (70.31% in 1-shot and 81.89% in 5-shot) and tieredImageNet (78.74% in 1-shot and 86.92% in 5-shot), which demonstrates the superiority of the proposed method.
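The basic-prototype step can be sketched in a few lines: prototypes are the (normalized) means of the support features of each class, and queries are classified by cosine similarity. The rectification shown below, which folds the most confident pseudo-labelled queries back into each prototype, is only an illustrative stand-in for the paper's bias diminishing module.

```python
import numpy as np

def normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def cosine_prototypes(support, labels, n_classes):
    """Basic prototypes: mean of L2-normalized support features per class."""
    feats = normalize(support)
    return normalize(np.stack([feats[labels == c].mean(0) for c in range(n_classes)]))

def rectify(prototypes, query, top_k=5):
    """Crude intra-class bias reduction: pull each prototype toward its most
    confident (pseudo-labelled) query features."""
    qn = normalize(query)
    sims = qn @ prototypes.T                       # cosine similarities
    new_protos = []
    for c in range(prototypes.shape[0]):
        idx = np.argsort(-sims[:, c])[:top_k]      # most confident queries for class c
        new_protos.append(normalize(prototypes[c] + qn[idx].mean(0)))
    return np.stack(new_protos)

# toy 5-way 1-shot episode with 64-d features
rng = np.random.default_rng(0)
support, labels = rng.standard_normal((5, 64)), np.arange(5)
query = rng.standard_normal((75, 64))
protos = rectify(cosine_prototypes(support, labels, 5), query)
pred = (normalize(query) @ protos.T).argmax(1)
print(pred.shape)  # (75,)
```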

29.Mitigate Bias in Face Recognition using Skewness-Aware Reinforcement Learning ⬇️

Racial equality is an important theme of international human rights law, but it has been largely overlooked when overall face recognition accuracy is pursued blindly. Growing evidence indicates that racial bias degrades the fairness of recognition systems, and the error rates on non-Caucasians are usually much higher than on Caucasians. To encourage fairness, we introduce the idea of an adaptive margin to learn balanced performance for different races based on large-margin losses. A reinforcement learning based race balance network (RL-RBN) is proposed. We formulate the process of finding the optimal margins for non-Caucasians as a Markov decision process and employ deep Q-learning to learn policies for an agent to select the appropriate margin by approximating the Q-value function. Guided by the agent, the skewness of feature scatter between races can be reduced. Besides, we provide two ethnicity-aware training datasets, BUPT-Globalface and BUPT-Balancedface, which can be utilized to study racial bias from both the data and algorithm aspects. Extensive experiments on the RFW database show that RL-RBN successfully mitigates racial bias and learns more balanced performance for different races.

30.Image-based table recognition: data, model, and evaluation ⬇️

Important information that relates to a specific topic in a document is often organized in tabular format to assist readers with information retrieval and comparison, which may be difficult to provide in natural language. However, tabular data in unstructured digital documents, e.g., Portable Document Format (PDF) and images, are difficult to parse into structured machine-readable format, due to complexity and diversity in their structure and style. To facilitate image-based table recognition with deep learning, we develop the largest publicly available table recognition dataset PubTabNet (this https URL), containing 568k table images with corresponding structured HTML representation. PubTabNet is automatically generated by matching the XML and PDF representations of the scientific articles in PubMed Central Open Access Subset (PMCOA). We also propose a novel attention-based encoder-dual-decoder (EDD) architecture that converts images of tables into HTML code. The model has a structure decoder which reconstructs the table structure and helps the cell decoder to recognize cell content. In addition, we propose a new Tree-Edit-Distance-based Similarity (TEDS) metric for table recognition. The experiments demonstrate that the EDD model can accurately recognize complex tables solely relying on the image representation, outperforming the state-of-the-art by 7.7% absolute TEDS score.

31.Inverse-Transform AutoEncoder for Anomaly Detection ⬇️

Reconstruction-based methods have recently shown great promise for anomaly detection. We here propose a new transform-based framework for anomaly detection. A selected set of transformations based on human priors is used to erase certain targeted information from the input data. An inverse-transform autoencoder is trained on normal data only to embed the corresponding erased information during restoration of the original data. The normal and anomalous data are thus expected to be distinguishable based on restoration errors. Extensive experiments demonstrate that the proposed method significantly outperforms several state-of-the-art methods on multiple benchmark datasets, especially on ImageNet, increasing the AUROC of the top-performing baseline by 10.1%. We also evaluate our method on the real-world anomaly detection dataset MVTec AD and the video anomaly detection dataset ShanghaiTech to validate its effectiveness in real-world environments.
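The overall recipe can be illustrated with a tiny PyTorch sketch: apply an information-erasing transform (here simply dropping color, one of many possible human-prior transforms), train an autoencoder on normal data to restore the original, and score anomalies by restoration error. The transform, the architecture, and the scoring are our own minimal assumptions, not the paper's specific choices.

```python
import torch
import torch.nn as nn

def erase(x):
    """Toy information-erasing transform: keep only the per-pixel channel mean."""
    return x.mean(dim=1, keepdim=True).repeat(1, 3, 1, 1)

class TinyAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 3, 4, 2, 1), nn.Sigmoid())
    def forward(self, x):
        return self.dec(self.enc(x))

model = TinyAE()
normal = torch.rand(8, 3, 32, 32)
# training step: restore the original image from its erased version
loss = ((model(erase(normal)) - normal) ** 2).mean()
loss.backward()

# at test time, a larger restoration error suggests an anomaly
with torch.no_grad():
    score = ((model(erase(normal)) - normal) ** 2).flatten(1).mean(1)
print(score.shape)  # torch.Size([8])
```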

32.Appearance Composing GAN: A General Method for Appearance-Controllable Human Video Motion Transfer ⬇️

Due to the rapid development of GANs, there has been significant progress in the field of human video motion transfer, which has a wide range of applications in computer vision and graphics. However, existing works only support motion-controllable video synthesis, while the appearances of different video components are bound together and uncontrollable, which means a person can only appear with the same clothing and background. Besides, most of these works are person-specific and require training an individual model for each person, which is inflexible and inefficient. Therefore, we propose the appearance composing GAN: a general method enabling control over not only human motions but also video appearances for arbitrary human subjects within a single model. The key idea is to exert layout-level appearance control on different video components and fuse them to compose the desired full video scene. Specifically, we achieve such appearance control by providing our model with optimal appearance conditioning inputs obtained separately for each component, allowing controllable component appearance synthesis for different people by changing the input appearance conditions accordingly. For synthesis, a two-stage GAN framework is proposed to sequentially generate the desired body semantic layouts and component appearances, both of which are consistent with the input human motions and appearance conditions. Coupled with our ACGAN loss and background modulation block, the proposed method achieves general and appearance-controllable human video motion transfer. Moreover, we build a dataset containing a large number of dance videos for training and evaluation. Experimental results show that, when applied to motion transfer tasks involving a variety of human subjects, our proposed method achieves appearance-controllable synthesis with higher video quality than the state of the art, based on only one-time training.

33.Reducing the Human Effort in Developing PET-CT Registration ⬇️

We aim to reduce the tedious nature of developing and evaluating methods for aligning PET-CT scans from multiple patient visits. Current methods for registration rely on correspondences that are created manually by medical experts with 3D manipulation, or assisted alignments done by utilizing mutual information across CT scans that may not be consistent when transferred to the PET images. Instead, we propose to label multiple key points across several 2D slices, which we then fit a key curve to. This removes the need for creating manual alignments in 3D and makes the labelling process easier. We use these key curves to define an error metric for the alignments that can be computed efficiently. While our metric is non-differentiable, we further show that we can utilize it during the training of our deep model via a novel method. Specifically, instead of relying on detailed geometric labels -- e.g., manual 3D alignments -- we use synthetically generated deformations of real data. To incorporate robustness to changes that occur between visits other than geometric changes, we enforce consistency across visits in the deep network's internal representations. We demonstrate the potential of our method via qualitative and quantitative experiments.

34.Pyramid Vector Quantization and Bit Level Sparsity in Weights for Efficient Neural Networks Inference ⬇️

This paper discusses three basic blocks for the inference of convolutional neural networks (CNNs). Pyramid Vector Quantization (PVQ) is discussed as an effective quantizer for CNNs weights resulting in highly sparse and compressible networks. Properties of PVQ are exploited for the elimination of multipliers during inference while maintaining high performance. The result is then extended to any other quantized weights. The Tiny Yolo v3 CNN is used to compare such basic blocks.
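Pyramid Vector Quantization maps a vector onto the integer "pyramid" {k ∈ Zⁿ : Σ|kᵢ| = K}. A simple (not rate-optimal) way to do the projection step is to scale the vector to L1 norm K, round, and then repair the budget by adjusting the entries with the largest rounding error. The NumPy sketch below shows only this projection; the index enumeration and gain coding of full PVQ are omitted.

```python
import numpy as np

def pvq_project(x, K):
    """Greedy projection of x onto the PVQ pyramid {k integer : sum|k_i| = K}.
    Illustrative only; not the rate-optimal encoder."""
    x = np.asarray(x, dtype=float)
    y = x * K / (np.abs(x).sum() + 1e-12)             # scale to L1 norm K
    k = np.round(y).astype(int)
    while np.abs(k).sum() != K:
        if np.abs(k).sum() < K:
            i = np.argmax(np.abs(y) - np.abs(k))      # biggest undershoot: add one unit
            k[i] += 1 if y[i] >= 0 else -1
        else:
            nz = np.abs(k) > 0                        # biggest overshoot among nonzero entries
            i = np.argmax(np.where(nz, np.abs(k) - np.abs(y), -np.inf))
            k[i] -= 1 if k[i] > 0 else -1             # move one unit toward zero
    return k

w = np.random.default_rng(0).standard_normal(16)
q = pvq_project(w, K=8)
print(q, np.abs(q).sum())   # integer vector with L1 norm exactly 8
```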

35.Deep Multivariate Mixture of Gaussians for Object Detection under Occlusion ⬇️

In this paper, we consider the problem of detecting objects under occlusion. Most object detectors formulate bounding box regression as a unimodal task (i.e., regressing a single set of bounding box coordinates independently). However, we observe that the bounding box borders of an occluded object can have multiple plausible configurations. Also, the occluded bounding box borders have correlations with visible ones. Motivated by these two observations, we propose a deep multivariate mixture of Gaussians model for bounding box regression under occlusion. The mixture components potentially learn different configurations of an occluded part, and the covariances between variates help to learn the relationship between the occluded parts and the visible ones. Quantitatively, our model improves the AP of the baselines by 3.9% and 1.2% on CrowdHuman and MS-COCO respectively, with almost no computational or memory overhead. Qualitatively, our model enjoys explainability since we can interpret the resulting bounding boxes via the covariance matrices and the mixture components.
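A hedged PyTorch sketch of the loss idea: train the box head by the negative log-likelihood of the ground-truth box under a predicted mixture of multivariate Gaussians. The head layout (logits, means, log-stds) is our own assumption, and a diagonal covariance is used for brevity; the paper's point about correlations between coordinates would require predicting a full Cholesky factor instead.

```python
import torch
from torch.distributions import Categorical, MultivariateNormal, MixtureSameFamily

def mixture_bbox_nll(logits, means, log_stds, target):
    """NLL of target boxes under a mixture of Gaussians over box coordinates.

    logits:   (N, M)      mixture weights per box
    means:    (N, M, 4)   component means (x1, y1, x2, y2)
    log_stds: (N, M, 4)   per-coordinate log standard deviations
    target:   (N, 4)      ground-truth boxes
    """
    scale_tril = torch.diag_embed(log_stds.exp())          # (N, M, 4, 4), diagonal
    comp = MultivariateNormal(means, scale_tril=scale_tril)
    mix = MixtureSameFamily(Categorical(logits=logits), comp)
    return -mix.log_prob(target).mean()

N, M = 8, 4
loss = mixture_bbox_nll(torch.randn(N, M), torch.randn(N, M, 4),
                        torch.randn(N, M, 4) * 0.1, torch.randn(N, 4))
print(loss)
```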

36.Invenio: Discovering Hidden Relationships Between Tasks/Domains Using Structured Meta Learning ⬇️

Exploiting known semantic relationships between fine-grained tasks is critical to the success of recent model agnostic approaches. These approaches often rely on meta-optimization to make a model robust to systematic task or domain shifts. However, in practice, the performance of these methods can suffer, when there are no coherent semantic relationships between the tasks (or domains). We present Invenio, a structured meta-learning algorithm to infer semantic similarities between a given set of tasks and to provide insights into the complexity of transferring knowledge between different tasks. In contrast to existing techniques such as Task2Vec and Taskonomy, which measure similarities between pre-trained models, our approach employs a novel self-supervised learning strategy to discover these relationships in the training loop and at the same time utilizes them to update task-specific models in the meta-update step. Using challenging task and domain databases, under few-shot learning settings, we show that Invenio can discover intricate dependencies between tasks or domains, and can provide significant gains over existing approaches in terms of generalization performance. The learned semantic structure between tasks/domains from Invenio is interpretable and can be used to construct meaningful priors for tasks or domains.

37.Pixel Adaptive Filtering Units ⬇️

State-of-the-art methods for computer vision rely heavily on the translation equivariance and spatial sharing properties of convolutional layers without explicitly taking into consideration the input content. Modern techniques employ deep sophisticated architectures in order to circumvent this issue. In this work, we propose a Pixel Adaptive Filtering Unit (PAFU) which introduces a differentiable kernel selection mechanism paired with a discrete, learnable and decorrelated group of kernels to allow for content-based spatial adaptation. First, we demonstrate the applicability of the technique in applications where runtime is of importance. Next, we employ PAFU in deep neural networks as a replacement of standard convolutional layers to enhance the original architectures with spatially varying computations to achieve considerable performance improvements. Finally, diverse and extensive experimentation provides strong empirical evidence in favor of the proposed content-adaptive processing scheme across different image processing and high-level computer vision tasks.

38.Facial Landmark Correlation Analysis ⬇️

We present a facial landmark position correlation analysis as well as its applications. Although numerous facial landmark detection methods have been presented in the literature, few of them concern the intrinsic relationship among the landmarks. In order to reveal and interpret this relationship, we propose to analyze the facial landmark correlations using Canonical Correlation Analysis (CCA). We experimentally show that dense facial landmark annotations in current benchmarks are strongly correlated, and we propose several applications based on this analysis. First, we give insights into the predictions of different facial landmark detection models (including cascaded random forests, cascaded Convolutional Neural Networks (CNNs), and heatmap regression models) and interpret how CNNs progressively learn to predict facial landmarks. Second, we propose a few-shot learning method that considerably reduces the manual effort required for dense landmark annotation. To this end, we select a portion of landmarks from the dense annotation format to form a sparse format that is most correlated with the rest of the landmarks. Thanks to the strong correlation among the landmarks, the entire set of dense facial landmarks can then be inferred from the annotation in the sparse format by transfer learning. Unlike previous methods, we mainly focus on how to find the most efficient sparse format to annotate. Overall, our correlation analysis provides new perspectives for research on facial landmark detection.
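Measuring how predictable one group of landmarks is from another is exactly what CCA provides. The scikit-learn sketch below uses synthetic coordinates for a hypothetical "sparse" and "dense" landmark split; a real analysis would load annotated landmark positions from a benchmark instead.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_faces = 500
# synthetic stand-in: a "sparse" subset of 17 landmarks (34 coordinates) and a
# "dense" remainder of 51 landmarks (102 coordinates), driven by a shared latent
latent = rng.standard_normal((n_faces, 10))
sparse = latent @ rng.standard_normal((10, 34)) + 0.1 * rng.standard_normal((n_faces, 34))
dense = latent @ rng.standard_normal((10, 102)) + 0.1 * rng.standard_normal((n_faces, 102))

cca = CCA(n_components=5).fit(sparse, dense)
u, v = cca.transform(sparse, dense)
corrs = [np.corrcoef(u[:, i], v[:, i])[0, 1] for i in range(5)]
print(corrs)   # high canonical correlations indicate the sparse subset carries most of the information
```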

39.Unsupervised Neural Sensor Models for Synthetic LiDAR Data Augmentation ⬇️

Data scarcity is a bottleneck for machine learning-based perception modules and is usually tackled by augmenting real data with synthetic data from simulators. Realistic models of the vehicle perception sensors are hard to formulate in closed form, and at the same time, they require paired data to be learned. In this work, we propose two unsupervised neural sensor models based on unpaired domain translation with CycleGANs and Neural Style Transfer techniques. We employ CARLA as the simulation environment to obtain simulated LiDAR point clouds, together with their annotations for data augmentation, and we use the KITTI dataset as the real LiDAR dataset from which we learn the realistic sensor model mapping. Moreover, we provide a framework for data augmentation and evaluation of the developed sensor models, through extrinsic object detection task evaluation using a YOLO network adapted to provide oriented bounding boxes for LiDAR bird's-eye-view projected point clouds. Evaluation is performed on unseen real LiDAR frames from the KITTI dataset, with different amounts of simulated data augmentation using the two proposed approaches, showing an improvement of 6% mAP for the object detection task in favor of augmenting with LiDAR point clouds adapted by the proposed neural sensor models over raw simulated LiDAR.

40.2D Wasserstein Loss for Robust Facial Landmark Detection ⬇️

Facial landmark detection is an important preprocessing task for most applications related to face analysis. In recent years, the performance of facial landmark detection has been significantly improved by using deep Convolutional Neural Networks (CNNs), especially Heatmap Regression Models (HRMs). Although their performance on common benchmark datasets has reached a high level, the robustness of these models remains a challenging problem in practical use under the noisier conditions of realistic environments. Contrary to most existing work focusing on the design of new models, we argue that improving robustness requires rethinking many other aspects, including the use of datasets, the format of landmark annotation, the evaluation metric, as well as the training and detection algorithm itself. In this paper, we propose a novel method for robust facial landmark detection using a loss function based on the 2D Wasserstein distance, combined with a new landmark coordinate sampling relying on the barycenter of the individual probability distributions. An intriguing property of our method is that it can be used plug-and-play with most state-of-the-art HRMs with neither additional complexity nor structural modifications of the models. Further, with the large performance increase of state-of-the-art deep CNN models, we found that current evaluation metrics can no longer fully reflect the robustness of these models. Therefore, we propose several improvements to the standard evaluation protocol. Extensive experimental results on both traditional evaluation metrics and our evaluation metrics demonstrate that our approach significantly improves the robustness of state-of-the-art facial landmark detection models.
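The barycenter-based coordinate sampling mentioned above amounts to taking the expected coordinate under each normalized heatmap rather than its argmax. A small PyTorch sketch of that decoding step follows; the 2D Wasserstein loss itself, which compares predicted and target heatmaps as probability distributions, is not reproduced here.

```python
import torch

def barycenter_decode(heatmaps):
    """Decode landmark coordinates as the barycenter (expected position) of each
    normalized heatmap instead of its argmax.

    heatmaps: (B, K, H, W) non-negative responses, one map per landmark
    returns:  (B, K, 2) (x, y) coordinates in pixel units
    """
    B, K, H, W = heatmaps.shape
    p = heatmaps.clamp(min=0).flatten(2)
    p = (p / (p.sum(dim=2, keepdim=True) + 1e-8)).view(B, K, H, W)
    xs = torch.arange(W, dtype=p.dtype)
    ys = torch.arange(H, dtype=p.dtype)
    x = (p.sum(dim=2) * xs).sum(dim=2)    # marginal over rows, then expectation over columns
    y = (p.sum(dim=3) * ys).sum(dim=2)    # marginal over columns, then expectation over rows
    return torch.stack([x, y], dim=-1)

print(barycenter_decode(torch.rand(2, 68, 64, 64)).shape)  # torch.Size([2, 68, 2])
```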

41.Controllable List-wise Ranking for Universal No-reference Image Quality Assessment ⬇️

No-reference image quality assessment (NR-IQA) has received increasing attention in the IQA community since a reference image is not always available. Real-world images generally suffer from various types of distortion. Unfortunately, existing NR-IQA methods do not work with all types of distortion. It is a challenging task to develop a universal NR-IQA method that is able to evaluate all types of distorted images. In this paper, we propose a universal NR-IQA method based on controllable list-wise ranking (CLRIQA). First, to extend the authentically distorted image dataset, we present an imaging-heuristic approach, in which over- and underexposure are formulated as an inverse of the Weber-Fechner law, and a fusion strategy and probabilistic compression are adopted, to generate degraded real-world images. These degraded images are label-free yet associated with quality ranking information. We then design a controllable list-wise ranking function by limiting the rank range and introducing an adaptive margin to tune the rank interval. Finally, the extended dataset and controllable list-wise ranking function are used to pre-train a CNN. Moreover, in order to obtain an accurate prediction model, we take advantage of the original dataset to further fine-tune the pre-trained network. Experiments on four benchmark datasets (i.e., LIVE, CSIQ, TID2013, and LIVE-C) show that the proposed CLRIQA improves the state of the art by over 8% in terms of overall performance. The code and model are publicly available at this https URL.

42.AttKGCN: Attribute Knowledge Graph Convolutional Network for Person Re-identification ⬇️

Discriminative feature representation of person images is important for the person re-identification (Re-ID) task. Recently, attributes have been demonstrated to be beneficial in guiding the learning of more discriminative feature representations for Re-ID. As attributes normally co-occur in person images, it is desirable to model the attribute dependencies to improve attribute prediction and thus Re-ID results. In this paper, we propose to model these attribute dependencies via a novel attribute knowledge graph (AttKG), and propose a novel Attribute Knowledge Graph Convolutional Network (AttKGCN) to solve the Re-ID problem. AttKGCN integrates attribute prediction and Re-ID learning together in a unified end-to-end framework, which can boost both of their performances. AttKGCN first builds a directed attribute KG whose nodes denote attributes and whose edges encode the co-occurrence relationships of different attributes. Then, AttKGCN learns a set of inter-dependent attribute classifiers which are combined with person visual descriptors for attribute prediction. Finally, AttKGCN integrates attribute descriptions and deep visual representations together to construct a more discriminative feature representation for the Re-ID task. Extensive experiments on several benchmark datasets demonstrate the effectiveness of AttKGCN on attribute prediction and Re-ID tasks.
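The graph-convolution ingredient can be made concrete with a small NumPy sketch: nodes are attributes, the adjacency matrix encodes co-occurrence, each layer propagates node features as Â X W, and the resulting node vectors act as attribute classifiers applied to a visual descriptor. Layer sizes, the initial node features, and the two-layer design are our own illustrative assumptions, not the AttKGCN specification.

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def gcn_layer(A_hat, X, W):
    return np.maximum(A_hat @ X @ W, 0.0)     # propagate over the graph + ReLU

rng = np.random.default_rng(0)
n_attr, feat_dim = 12, 2048
cooc = (rng.random((n_attr, n_attr)) > 0.7).astype(float)   # toy co-occurrence graph
A_hat = normalize_adj(np.maximum(cooc, cooc.T))

X = rng.standard_normal((n_attr, 300))                       # initial attribute embeddings
W1 = rng.standard_normal((300, 512)) * 0.05
W2 = rng.standard_normal((512, feat_dim)) * 0.05
h = gcn_layer(A_hat, X, W1)
classifiers = A_hat @ h @ W2                                 # (n_attr, feat_dim), no final ReLU

visual = rng.standard_normal(feat_dim)                       # person descriptor from a CNN backbone
attr_scores = classifiers @ visual                           # one logit per attribute
print(attr_scores.shape)  # (12,)
```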

43.Breaking the cycle -- Colleagues are all you need ⬇️

This paper proposes a novel approach to performing image-to-image translation between unpaired domains. Rather than relying on a cycle constraint, our method takes advantage of collaboration between various GANs. This results in a multi-modal method, in which multiple optional and diverse images are produced for a given image. Our model addresses some of the shortcomings of classical GANs: (1) It is able to remove large objects, such as glasses. (2) Since it does not need to support the cycle constraint, no irrelevant traces of the input are left on the generated image. (3) It manages to translate between domains that require large shape modifications. Our results are shown to outperform those generated by state-of-the-art methods for several challenging applications on commonly-used datasets, both qualitatively and quantitatively.

44.Using panoramic videos for multi-person localization and tracking in a 3D panoramic coordinate ⬇️

This work proposes a new human-related video processing task named 3D panoramic multi-person localization and tracking. With the first benchmark dataset and a simple yet effective solution, it establishes a new paradigm for multi-person tracking systems and related applications. Unlike existing methods that can only work on a 2D coordinate or a narrow-angle-view 3D coordinate, our proposal can maximally explore the 3D trajectory information of tracking targets. This is approached by applying camera geometry to transform human locations from 2D panoramic image coordinates to a 3D panoramic camera coordinate, and then by applying a tracking algorithm that associates human appearance and 3D trajectory together.

45.A Proposal-based Approach for Activity Image-to-Video Retrieval ⬇️

The activity image-to-video retrieval task aims to retrieve videos containing a similar activity to the query image, which is challenging because videos generally have many background segments irrelevant to the activity. In this paper, we utilize the R-C3D model to represent a video by a bag of activity proposals, which can filter out background segments to some extent. However, there are still noisy proposals in each bag. Thus, we propose an Activity Proposal-based Image-to-Video Retrieval (APIVR) approach, which incorporates multi-instance learning into a cross-modal retrieval framework to address the proposal noise issue. Specifically, we propose a Graph Multi-Instance Learning (GMIL) module with a graph convolutional layer, and integrate this module with classification loss, adversarial loss, and triplet loss in our cross-modal retrieval framework. Moreover, we propose a geometry-aware triplet loss based on point-to-subspace distance to preserve the structural information of activity proposals. Extensive experiments on three widely-used datasets verify the effectiveness of our approach.

46.Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation ⬇️

We rethink a well-known bottom-up approach for multi-person pose estimation and propose an improved one. The improved approach surpasses the baseline significantly thanks to (1) an intuitive yet more sensible representation, which we refer to as body parts, to encode the connection information between keypoints, (2) an improved stacked hourglass network with attention mechanisms, (3) a novel focal L2 loss dedicated to mining hard keypoints and keypoint associations (body parts), and (4) a robust greedy keypoint assignment algorithm for grouping the detected keypoints into individual poses. Our approach not only works straightforwardly but also outperforms the baseline by about 15% in average precision and is comparable to the state of the art on the MS-COCO test-dev dataset. The code and pre-trained models are publicly available online.

47.EDIT: Exemplar-Domain Aware Image-to-Image Translation ⬇️

Image-to-image translation converts an image of a certain style to another of the target style with the content preserved. A desired translator should be capable of generating diverse results in a controllable (many-to-many) fashion. To this end, we design a novel generative adversarial network, namely the exemplar-domain aware image-to-image translator (EDIT for short). The principle behind it is that, for images from multiple domains, the content features can be obtained by a uniform extractor, while (re-)stylization is achieved by mapping the extracted features to different purposes (domains and exemplars). The generator of our EDIT comprises a part of blocks configured by shared parameters, and the rest by varied parameters exported by an exemplar-domain aware parameter network. In addition, a discriminator is employed during the training phase to guarantee that the output satisfies the distribution of the target domain. Our EDIT can flexibly and effectively work on multiple domains and arbitrary exemplars in a unified, neat model. We conduct experiments to show the efficacy of our design, and reveal its advantages over other state-of-the-art methods both quantitatively and qualitatively.

48.Exploiting Operation Importance for Differentiable Neural Architecture Search ⬇️

Recently, differentiable neural architecture search methods have significantly reduced the search cost by constructing a super network and relaxing the architecture representation by assigning architecture weights to the candidate operations. All existing methods determine the importance of each operation directly from the architecture weights. However, architecture weights cannot accurately reflect the importance of each operation; that is, the operation with the highest weight might not be related to the best performance. To alleviate this deficiency, we propose a simple yet effective solution to neural architecture search, termed exploiting operation importance for effective neural architecture search (EoiNAS), in which a new indicator is proposed to fully exploit the operation importance and guide the model search. Based on this new indicator, we propose a gradual operation pruning strategy to further improve the search efficiency and accuracy. Experimental results have demonstrated the effectiveness of the proposed method. Specifically, we achieve an error rate of 2.50% on CIFAR-10, which significantly outperforms state-of-the-art methods. When transferred to ImageNet, it achieves a top-1 error of 25.6%, comparable to the state-of-the-art performance under the mobile setting.

49.Deep Visual Waterline Detection within Inland Marine Environment ⬇️

The waterline usually serves as an important visual cue for maritime applications. However, the visual complexity of inland waterlines presents a significant challenge for developing highly efficient computer vision algorithms tailored for waterline detection in complicated inland water environments. This paper attempts to find a solution that guarantees the effectiveness of waterline detection for inland maritime applications with a general digital camera sensor. To this end, we propose a general deep-learning-based paradigm applicable in variable inland waters, named DeepWL, which also addresses the efficiency of waterline detection. Specifically, two novel deep network models, named WLdetectNet and WLgenerateNet respectively, cooperate within the paradigm to afford continuous waterline image-map estimation from a single captured video stream. Experimental results demonstrate the effectiveness and superiority of the proposed approach via qualitative and quantitative assessment of the concerned performances. Moreover, due to its generality, the proposed approach has the potential to be applied to waterline detection tasks in other water areas such as coastal waters.

50.Two Causal Principles for Improving Visual Dialog ⬇️

This paper is a winner report from team MReaL-BDAI for the Visual Dialog Challenge 2019. We present two causal principles for improving Visual Dialog (VisDial). By "improving", we mean that they can promote almost every existing VisDial model to state-of-the-art performance on the Visual Dialog 2019 Challenge leader-board. Such a major improvement is due solely to our careful inspection of the causality behind the model and data, finding that the community has overlooked two causalities in VisDial. Intuitively, Principle 1 suggests that we should remove the direct input of the dialog history to the answer model, otherwise a harmful shortcut bias will be introduced; Principle 2 says that there is an unobserved confounder for history, question, and answer, leading to spurious correlations in the training data. In particular, to remove the confounder suggested in Principle 2, we propose several causal intervention algorithms, which make the training fundamentally different from traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable to any VisDial model.

51.Image Cropping with Composition and Saliency Aware Aesthetic Score Map ⬇️

Aesthetic image cropping is a practical but challenging task which aims at finding the best crops with the highest aesthetic quality in an image. Recently, many deep learning methods have been proposed to address this problem, but they did not reveal the intrinsic mechanism of aesthetic evaluation. In this paper, we propose an interpretable image cropping model to unveil the mystery. For each image, we use a fully convolutional network to produce an aesthetic score map, which is shared among all candidate crops during crop-level aesthetic evaluation. Then, we require the aesthetic score map to be both composition-aware and saliency-aware. In particular, the same region is assigned with different aesthetic scores based on its relative positions in different crops. Moreover, a visually salient region is supposed to have more sensitive aesthetic scores so that our network can learn to place salient objects at more proper positions. Such an aesthetic score map can be used to localize aesthetically important regions in an image, which sheds light on the composition rules learned by our model. We show the competitive performance of our model in the image cropping task on several benchmark datasets, and also demonstrate its generality in real-world applications.
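
The abstract does not specify how crop-level scores are aggregated from the shared map, so the following is only a minimal sketch, assuming the crop score is the mean of the per-pixel aesthetic score map inside each candidate window; the integral image merely makes this O(1) per crop.

```python
import numpy as np

def crop_scores(score_map, crops):
    """Score candidate crops by averaging a shared aesthetic score map.

    score_map: (H, W) array of per-pixel aesthetic scores.
    crops: list of (top, left, height, width) candidate windows.
    Uses an integral image so each crop is scored in O(1).
    """
    integral = np.pad(score_map, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    scores = []
    for t, l, h, w in crops:
        total = (integral[t + h, l + w] - integral[t, l + w]
                 - integral[t + h, l] + integral[t, l])
        scores.append(total / (h * w))
    return np.array(scores)
```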

52.Reinventing 2D Convolutions for 3D Medical Images ⬇️

There has been considerable debate over 2D and 3D representation learning on 3D medical images. 2D approaches can benefit from large-scale 2D pretraining, whereas they are generally weak in capturing large 3D contexts. 3D approaches are natively strong in 3D contexts; however, few publicly available 3D medical datasets are large and diverse enough for universal 3D pretraining. Even for hybrid (2D + 3D) approaches, the intrinsic disadvantages within the 2D / 3D parts still exist. In this study, we bridge the gap between 2D and 3D convolutions by reinventing the 2D convolutions. We propose ACS (axial-coronal-sagittal) convolutions to perform natively 3D representation learning while utilizing the pretrained weights from 2D counterparts. In ACS convolutions, 2D convolution kernels are split by channel into three parts and convolved separately on the three views (axial, coronal and sagittal) of 3D representations. Theoretically, ANY 2D CNN (ResNet, DenseNet, or DeepLab) can be converted into a 3D ACS CNN, with pretrained weights of the same parameter size. Extensive experiments on a proof-of-concept dataset and several medical benchmarks validate the consistent superiority of the pretrained ACS CNNs over the 2D / 3D CNN counterparts with / without pretraining. Even without pretraining, the ACS convolution can be used as a plug-and-play replacement for standard 3D convolution, with a smaller model size.
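
The channel-splitting idea is concrete enough to sketch. The PyTorch module below splits one 2D kernel bank along its output channels and applies each part as a pseudo-3D convolution on one of the three views; the kernel size, padding, and random initialization (standing in for copied pretrained 2D weights) are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACSConv3d(nn.Module):
    """Sketch of an axial-coronal-sagittal (ACS) convolution.

    A 2D kernel bank (out_ch, in_ch, k, k) is split along the output
    channels into three groups; each group is applied as a pseudo-3D
    convolution along one view of the (N, C, D, H, W) volume.
    """
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight2d = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.splits = [out_ch // 3, out_ch // 3, out_ch - 2 * (out_ch // 3)]
        self.pad = k // 2

    def forward(self, x):                       # x: (N, C, D, H, W)
        wa, wc, ws = torch.split(self.weight2d, self.splits, dim=0)
        p = self.pad
        # axial: kernel lives in the (H, W) plane, size 1 along depth
        ya = F.conv3d(x, wa.unsqueeze(2), padding=(0, p, p))
        # coronal: kernel lives in the (D, W) plane
        yc = F.conv3d(x, wc.unsqueeze(3), padding=(p, 0, p))
        # sagittal: kernel lives in the (D, H) plane
        ys = F.conv3d(x, ws.unsqueeze(4), padding=(p, p, 0))
        return torch.cat([ya, yc, ys], dim=1)
```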

53."Looking at the right stuff" -- Guided semantic-gaze for autonomous driving ⬇️

In recent years, predicting driver's focus of attention has been a very active area of research in the autonomous driving community. Unfortunately, existing state-of-the-art techniques achieve this by relying only on human gaze information, thereby ignoring scene semantics. We propose a novel Semantics Augmented GazE (SAGE) detection approach that captures driving specific contextual information, in addition to the raw gaze. Such a combined attention mechanism serves as a powerful tool to focus on the relevant regions in an image frame in order to make driving both safe and efficient. Using this, we design a complete saliency prediction framework -- SAGE-Net, which modifies the initial prediction from SAGE by taking into account vital aspects such as distance to objects (depth), ego vehicle speed, and pedestrian crossing intent. Exhaustive experiments conducted through four popular saliency algorithms show that on 49/56 (87.5%) cases -- considering both the overall dataset and crucial driving scenarios, SAGE outperforms existing techniques without any additional computational overhead during the training process. The final paper will be accompanied by the release of our dataset and relevant code.

54.3FabRec: Fast Few-shot Face alignment by Reconstruction ⬇️

Current supervised frameworks for facial landmark detection require a large amount of training data and, due to the massive number of parameters, may suffer from overfitting to specific datasets. We introduce a semi-supervised method in which the crucial idea is to first generate implicit knowledge about face appearance and shape from the large amounts of unlabeled face images available today. In a first, unsupervised stage, we train an adversarial autoencoder to reconstruct faces via a low-dimensional latent face-representation vector. In a second, supervised stage, we augment the generator-decoder pipeline with interleaved transfer layers in order to reconstruct both the face and a probabilistic landmark heatmap. We show that this framework (3FabRec) achieves state-of-the-art performance on popular benchmarks, such as 300-W, AFLW, and WLFW. Importantly, due to the power of the implicit face representation, our framework achieves impressive landmark localization accuracy from only a few percent of the training data, down to as few as 10 images. As the interleaved layers only add a small number of parameters to the encoder, inference runs at several hundred FPS on a GPU.

55.Normal Assisted Stereo Depth Estimation ⬇️

Accurate stereo depth estimation plays a critical role in various 3D tasks in both indoor and outdoor environments. Recently, learning-based multi-view stereo methods have demonstrated competitive performance with a limited number of views. However, in challenging scenarios, especially when building cross-view correspondences is hard, these methods still cannot produce satisfying results. In this paper, we study how to enforce the consistency between surface normals and depth at training time to improve performance. We couple the learning of a multi-view normal estimation module and a multi-view depth estimation module. In addition, we propose a novel consistency loss to train an independent consistency module that refines the depths from depth/normal pairs. We find that joint learning can improve both the prediction of normals and depth, and that accuracy and smoothness can be further improved by enforcing the consistency. Experiments on MVS, SUN3D, RGBD and Scenes11 demonstrate the effectiveness of our method and state-of-the-art performance.
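
One plausible way to couple depth and normals is to derive normals from the predicted depth via finite differences of back-projected points and penalize disagreement with the predicted normals, as sketched below; the pinhole intrinsics and the cosine penalty are assumptions for illustration, not necessarily the paper's exact consistency loss.

```python
import torch
import torch.nn.functional as F

def depth_to_normals(depth, fx, fy, cx, cy):
    """Back-project a depth map (N, 1, H, W) with pinhole intrinsics and
    estimate per-pixel normals from finite differences of the 3D points."""
    _, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(depth)
    ys = ys.to(depth)
    X = (xs - cx) / fx * depth
    Y = (ys - cy) / fy * depth
    pts = torch.cat([X, Y, depth], dim=1)        # (N, 3, H, W) 3D points
    dx = pts[:, :, :, 1:] - pts[:, :, :, :-1]    # horizontal differences
    dy = pts[:, :, 1:, :] - pts[:, :, :-1, :]    # vertical differences
    dx = F.pad(dx, (0, 1, 0, 0))
    dy = F.pad(dy, (0, 0, 0, 1))
    normals = torch.cross(dx, dy, dim=1)
    return F.normalize(normals, dim=1)

def normal_consistency_loss(pred_depth, pred_normals, fx, fy, cx, cy):
    """One minus cosine similarity between normals derived from the
    predicted depth and the normals from the normal-estimation branch."""
    derived = depth_to_normals(pred_depth, fx, fy, cx, cy)
    return (1.0 - F.cosine_similarity(derived, pred_normals, dim=1)).mean()
```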

56.Robustness Metrics for Real-World Adversarial Examples ⬇️

We explore metrics to evaluate the robustness of real-world adversarial attacks, in particular adversarial patches, to changes in environmental conditions. We demonstrate how these metrics can be used to establish model baseline performance and uncover model biases to then compare against real-world adversarial attacks. We establish a custom score for an adversarial condition that is adjusted for different environmental conditions and explore how the score changes with respect to specific environmental factors. Lastly, we propose two use cases for confidence distributions in each environmental condition.

57.Constrained Linear Data-feature Mapping for Image Classification ⬇️

In this paper, we propose a constrained linear data-feature mapping model as an interpretable mathematical model for image classification using convolutional neural networks (CNNs) such as ResNet. From this viewpoint, we establish detailed connections, at a technical level, between traditional iterative schemes for constrained linear systems and the architecture of the basic ResNet block. Based on these connections, we propose some natural modifications of ResNet-type models which have fewer parameters but can keep almost the same accuracy as the original models. Some numerical experiments are shown to demonstrate the validity of this constrained learning data-feature mapping assumption.
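
A toy version of the analogy can be written directly: one residual block is read as a single step of an iterative scheme for a constrained linear system. The sketch below is purely illustrative; the update form and the choice of convolutional operators A and B are assumptions, not the paper's derivation.

```python
import torch
import torch.nn as nn

class IterativeResBlock(nn.Module):
    """Toy residual block viewed as one step of an iterative scheme
    u_{k+1} = u_k + B * sigma(f - A u_k) for a linear system A u = f."""
    def __init__(self, channels):
        super().__init__()
        self.A = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.B = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, u, f):
        residual = f - self.A(u)            # residual of the linear system
        return u + self.B(self.act(residual))
```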

58.Atlas Based Segmentations via Semi-Supervised Diffeomorphic Registrations ⬇️

Purpose: Segmentation of organs-at-risk (OARs) is a bottleneck in current radiation oncology pipelines and is often time consuming and labor intensive. In this paper, we propose an atlas-based semi-supervised registration algorithm to generate accurate segmentations of OARs for which there are ground truth contours, and rough segmentations of all other OARs in the atlas. To the best of our knowledge, this is the first study to use learning-based registration methods for the segmentation of head and neck patients and to demonstrate its utility in clinical applications. Methods: Our algorithm cascades rigid and deformable deformation blocks, and takes an atlas image (M), a set of atlas-space segmentations (S_A), and a patient image (F) as inputs, while outputting patient-space segmentations of all OARs defined on the atlas. We train our model on 475 CT images taken from public archives and the Stanford RadOnc Clinic (SROC), validate on 5 CT images from SROC, and test our model on 20 CT images from SROC. Results: Our method outperforms current state-of-the-art learning-based registration algorithms and achieves an overall dice score of 0.789 on our test set. Moreover, our method yields performance comparable to manual segmentation and supervised segmentation, while solving a much more complex registration problem. Whereas supervised segmentation methods only automate the segmentation process for a select few OARs, we demonstrate that our method can achieve similar performance for OARs of interest, while also providing segmentations for every other OAR on the provided atlas. Conclusions: Our proposed algorithm has significant clinical applications and could help reduce the bottleneck for segmentation of head and neck OARs. Further, our results demonstrate that semi-supervised diffeomorphic registration can be accurately applied to both registration and segmentation problems.

59.Visualizing Point Cloud Classifiers by Curvature Smoothing ⬇️

Recently, several networks that operate directly on point clouds have been proposed. There is significant utility in understanding them better, so that humans can understand more about the mechanisms by which these networks classify point clouds, potentially helping to diagnose them and to design better architectures and data augmentation pipelines. In this paper, we propose a novel approach to visualize important features used in classification decisions of point cloud networks. Following ideas for visualizing 2-D convolutional networks, our approach is based on gradually smoothing parts of the point cloud. However, different from the 2-D case, we smooth the curvature of the point cloud to remove sharp shape features. The resulting point cloud is then evaluated on the original point cloud network to see whether the performance has dropped or remained the same, from which parts that are important to the point cloud classification are identified. A technical contribution of the paper is an approximate curvature smoothing algorithm, which can smoothly transition from the original point cloud to one of constant curvature, such as a uniform sphere. With this smoothing algorithm, we propose PCI-GOS, a 3-D extension of the Integrated-Gradients Optimized Saliency (I-GOS) algorithm, as a perturbation-based visualization technique realized on 3-D shapes. Experimental results reveal insights into these classifiers.

60.SAL: Sign Agnostic Learning of Shapes from Raw Data ⬇️

Recently, neural networks have been used as implicit representations for surface reconstruction, modelling, learning, and generation. So far, training neural networks to be implicit representations of surfaces has required training data sampled from ground-truth signed implicit functions, such as signed distance or occupancy functions, which are notoriously hard to compute.
In this paper we introduce Sign Agnostic Learning (SAL), a deep learning approach for learning implicit shape representations directly from raw, unsigned geometric data, such as point clouds and triangle soups.
We have tested SAL on the challenging problem of surface reconstruction from an un-oriented point cloud, as well as end-to-end human shape space learning directly from a raw scan dataset, and achieved state-of-the-art reconstructions compared to current approaches. We believe SAL opens the door to many geometric deep learning applications with real-world data, alleviating the usual painstaking, often manual pre-processing.
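
A minimal sketch of one sign-agnostic objective is shown below: the network output is compared in absolute value to an unsigned distance computed directly from the raw point cloud, so no ground-truth inside/outside sign is needed. The specific L1 form and nearest-point distance are assumptions about one plausible member of such a loss family.

```python
import torch

def unsigned_distance(queries, point_cloud):
    """Unsigned distance from query points (M, 3) to a raw point cloud
    (N, 3): distance to the nearest cloud point, no sign required."""
    d = torch.cdist(queries, point_cloud)       # (M, N) pairwise distances
    return d.min(dim=1).values

def sign_agnostic_l1(pred_sdf, queries, point_cloud):
    """Sign-agnostic L1 regression: |f(x)| is compared to the unsigned
    distance, so ground-truth signed values are never needed."""
    target = unsigned_distance(queries, point_cloud)
    return (pred_sdf.abs() - target).abs().mean()
```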

61.Region Normalization for Image Inpainting ⬇️

Feature Normalization (FN) is an important technique to help neural network training, which typically normalizes features across spatial dimensions. Most previous image inpainting methods apply FN in their networks without considering the impact of the corrupted regions of the input image on normalization, e.g. mean and variance shifts. In this work, we show that the mean and variance shifts caused by full-spatial FN limit the image inpainting network training and we propose a spatial region-wise normalization named Region Normalization (RN) to overcome the limitation. RN divides spatial pixels into different regions according to the input mask, and computes the mean and variance in each region for normalization. We develop two kinds of RN for our image inpainting network: (1) Basic RN (RN-B), which normalizes pixels from the corrupted and uncorrupted regions separately based on the original inpainting mask to solve the mean and variance shift problem; (2) Learnable RN (RN-L), which automatically detects potentially corrupted and uncorrupted regions for separate normalization, and performs global affine transformation to enhance their fusion. We apply RN-B in the early layers and RN-L in the later layers of the network respectively. Experiments show that our method outperforms current state-of-the-art methods quantitatively and qualitatively. We further generalize RN to other inpainting networks and achieve consistent performance improvements.
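
The region-wise statistics of RN-B can be sketched in a few lines of PyTorch: means and variances are computed separately for the masked and unmasked regions before an affine transform. The tensor shapes and the single shared affine below are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class BasicRegionNorm(nn.Module):
    """Sketch of basic Region Normalization (RN-B): pixels inside and
    outside the inpainting mask are normalized with separate means and
    variances, then a shared affine transform is applied."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def _region_norm(self, x, region):
        # region: (N, 1, H, W) binary map selecting one region
        cnt = region.sum(dim=(2, 3), keepdim=True).clamp(min=1)
        mean = (x * region).sum(dim=(2, 3), keepdim=True) / cnt
        var = ((x - mean) ** 2 * region).sum(dim=(2, 3), keepdim=True) / cnt
        return (x - mean) / torch.sqrt(var + self.eps) * region

    def forward(self, x, mask):
        # mask: (N, 1, H, W), 1 for uncorrupted pixels, 0 for corrupted ones
        normed = self._region_norm(x, mask) + self._region_norm(x, 1 - mask)
        return normed * self.gamma + self.beta
```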

62.Differentiable Meta-learning Model for Few-shot Semantic Segmentation ⬇️

To address the annotation scarcity issue in some cases of semantic segmentation, there have been a few attempts to develop segmentation models in the few-shot learning paradigm. However, most existing methods only focus on the traditional 1-way segmentation setting (i.e., one image only contains a single object). This is far from practical semantic segmentation tasks, where the K-way setting (K>1) is usually required to perform accurate multi-object segmentation. To deal with this issue, we formulate the few-shot semantic segmentation task as a learning-based pixel classification problem and propose a novel framework called MetaSegNet based on meta-learning. In MetaSegNet, an embedding module consisting of global and local feature branches is developed to extract the appropriate meta-knowledge for few-shot segmentation. Moreover, we incorporate a linear model into MetaSegNet as a base learner to directly predict the label of each pixel for multi-object segmentation. Furthermore, MetaSegNet can be trained by the episodic training mechanism in an end-to-end manner from scratch. Experiments on two popular semantic segmentation datasets, i.e., PASCAL VOC and COCO, reveal the effectiveness of the proposed MetaSegNet in the K-way few-shot semantic segmentation task.

63.Universal Adversarial Perturbations to Understand Robustness of Texture vs. Shape-biased Training ⬇️

Convolutional Neural Networks (CNNs) used on image classification tasks such as ImageNet have been shown to be biased towards recognizing textures rather than shapes. Recent work has attempted to alleviate this by augmenting the training dataset with shape-based examples to create Stylized-ImageNet. However, in this paper we show that models trained on this dataset remain vulnerable to Universal Adversarial Perturbations (UAPs). We use UAPs to evaluate and compare the robustness of CNN models with varying degrees of shape-based training. We also find that a posteriori fine-tuning on ImageNet negates features learned from training on Stylized-ImageNet. This study reveals an important limitation and reiterates the need for further research into understanding the robustness of CNNs for visual recognition.

64.Globally Guided Progressive Fusion Network for 3D Pancreas Segmentation ⬇️

Recently 3D volumetric organ segmentation attracts much research interest in medical image analysis due to its significance in computer aided diagnosis. This paper aims to address the pancreas segmentation task in 3D computed tomography volumes. We propose a novel end-to-end network, Globally Guided Progressive Fusion Network, as an effective and efficient solution to volumetric segmentation, which involves both global features and complicated 3D geometric information. A progressive fusion network is devised to extract 3D information from a moderate number of neighboring slices and predict a probability map for the segmentation of each slice. An independent branch for excavating global features from downsampled slices is further integrated into the network. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on two pancreas datasets.

65.Unsupervised Keyword Extraction for Full-sentence VQA ⬇️

In existing studies on Visual Question Answering (VQA), which aims to train an intelligent system to answer questions about images, the answers corresponding to the questions consist of short phrases, often single words. However, in natural conversation with humans, the answers are more likely to be sentences rather than single words. In such a situation, the system needs to focus on a keyword, i.e., the most important word in the sentence, to answer the question. Therefore, we propose a novel keyword extraction method for VQA. Because collecting keyword and full-sentence annotations for VQA can be highly costly, we perform the keyword extraction in an unsupervised manner. Our key insight is that a full-sentence answer can be decomposed into two parts: the part that contains new information for the question and the part that only contains information already included in the question. Since the keyword is considered to be the part which contains new information in the answer, we need to identify which words in the full-sentence answer carry new information and which do not. To ensure such decomposition, we extract two features from the full-sentence answers, and design discriminative decoders to make each feature include the information of the question and the answer respectively. We conducted experiments on existing VQA datasets which contain full-sentence annotations, and show that our proposed model can correctly extract the keyword without any keyword annotations.

66.Simple and Lightweight Human Pose Estimation ⬇️

Recent research on human pose estimation has achieved significant improvement. However, most existing methods tend to pursue higher scores using complex architectures or computationally expensive models on benchmark datasets, ignoring deployment costs in practice. In this paper, we investigate the problem of simple and lightweight human pose estimation. We first redesign a lightweight bottleneck block with two non-novel concepts: depthwise convolution and an attention mechanism. Then, based on the lightweight block, we present a Lightweight Pose Network (LPN) following the architecture design principles of SimpleBaseline. The model size (#Params) of our small network LPN-50 is only 9% of SimpleBaseline (ResNet50), and the computational complexity (FLOPs) is only 11%. To give full play to the potential of our LPN and obtain more accurate predictions, we also propose an iterative training strategy and a model-agnostic post-processing function, Beta-Soft-Argmax. We empirically demonstrate the effectiveness and efficiency of our methods on the benchmark dataset: the COCO keypoint detection dataset. Besides, we show the speed superiority of our lightweight network at inference time on a non-GPU platform. Specifically, our LPN-50 achieves 68.7 AP on the COCO test-dev set, with only 2.7M parameters and 1.0 GFLOPs, while the inference speed is 33 FPS on an Intel i7-8700K CPU machine.
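
Beta-Soft-Argmax is not detailed in the abstract; the sketch below shows the common temperature-scaled soft-argmax construction that such a post-processing function would plausibly follow, with the temperature value chosen arbitrarily for illustration.

```python
import torch

def beta_soft_argmax(heatmaps, beta=160.0):
    """Temperature-scaled soft-argmax: softmax(beta * H) turns each heatmap
    into a probability map whose expected coordinate gives a sub-pixel
    keypoint location. (The paper's exact Beta-Soft-Argmax may differ.)"""
    n, k, h, w = heatmaps.shape
    probs = torch.softmax(beta * heatmaps.reshape(n, k, -1), dim=-1)
    probs = probs.reshape(n, k, h, w)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)    # expectation over columns
    y = (probs.sum(dim=3) * ys).sum(dim=-1)    # expectation over rows
    return torch.stack([x, y], dim=-1)         # (N, K, 2) coordinates
```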

67.Attention Deep Model with Multi-Scale Deep Supervision for Person Re-Identification ⬇️

In recent years, person re-identification (PReID) has become a hot topic in computer vision due to its important role in intelligent surveillance. Many state-of-the-art PReID methods are attention-based or multi-scale feature learning deep models. However, introducing an attention mechanism may lead to the loss of some important feature information. Besides, most multi-scale models embed the multi-scale feature learning block into the feature extraction deep network, which reduces the efficiency of the inference network. To address these issues, in this study, we introduce an attention deep architecture with multi-scale deep supervision for PReID. Technically, we contribute a reverse attention block to complement the attention block, and a novel multi-scale layer with a deep supervision operator for training the backbone network. The proposed block and operator are only used during training, and are discarded in the test phase. Experiments have been performed on the Market-1501, DukeMTMC-reID and CUHK03 datasets. All the experimental results show that the proposed model significantly outperforms the other competitive state-of-the-art methods.

68.Iteratively-Refined Interactive 3D Medical Image Segmentation with Multi-Agent Reinforcement Learning ⬇️

Existing automatic 3D image segmentation methods usually fail to meet the requirements of clinical use. Many studies have explored an interactive strategy to improve image segmentation performance by iteratively incorporating user hints. However, the dynamic process of successive interactions is largely ignored. We propose to model the dynamic process of iterative interactive image segmentation as a Markov decision process (MDP) and solve it with reinforcement learning (RL). Unfortunately, it is intractable to use single-agent RL for voxel-wise prediction due to the large exploration space. To reduce the exploration space to a tractable size, we treat each voxel as an agent with a shared voxel-level behavior strategy so that it can be solved with multi-agent reinforcement learning. An additional advantage of this multi-agent model is that it captures the dependency among voxels for the segmentation task. Meanwhile, to enrich the information of previous segmentations, we reserve the prediction uncertainty in the state space of the MDP and derive an adjustment action space leading to a more precise and finer segmentation. In addition, to improve the efficiency of exploration, we design a relative cross-entropy gain-based reward to update the policy in a constrained direction. Experimental results on various medical datasets have shown that our method significantly outperforms existing state-of-the-art methods, with the advantage of fewer interactions and faster convergence.

69.PlantDoc: A Dataset for Visual Plant Disease Detection ⬇️

India loses 35% of its annual crop yield due to plant diseases. Early detection of plant diseases remains difficult due to the lack of lab infrastructure and expertise. In this paper, we explore the possibility of computer vision approaches for scalable and early plant disease detection. The lack of availability of sufficiently large-scale non-lab datasets remains a major challenge for enabling vision-based plant disease detection. Against this background, we present PlantDoc: a dataset for visual plant disease detection. Our dataset contains 2,598 data points in total across 13 plant species and up to 17 classes of diseases, involving approximately 300 human hours of effort in annotating internet-scraped images. To show the efficacy of our dataset, we learn 3 models for the task of plant disease classification. Our results show that modelling using our dataset can increase the classification accuracy by up to 31%. We believe that our dataset can help reduce the entry barrier of computer vision techniques in plant disease detection.

70.Learning a Representation with the Block-Diagonal Structure for Pattern Classification ⬇️

Sparse-representation-based classification (SRC) has been widely studied and developed for various practical signal classification applications. However, the performance of an SRC-based method degrades when both the training and test data are corrupted. To counteract this problem, we propose an approach that learns a Representation with Block-Diagonal Structure (RBDS) for robust image recognition. To be more specific, we first introduce a regularization term that captures the block-diagonal structure of the target representation matrix of the training data. The resulting problem is then solved by an optimizer. Last, based on the learned representation, a simple yet effective linear classifier is used for the classification task. The experimental results obtained on several benchmark datasets demonstrate the efficacy of the proposed RBDS method.

71.Invert and Defend: Model-based Approximate Inversion of Generative Adversarial Networks for Secure Inference ⬇️

Inferring the latent variable generating a given test sample is a challenging problem in Generative Adversarial Networks (GANs). In this paper, we propose InvGAN - a novel framework for solving the inference problem in GANs, which involves training an encoder network capable of inverting a pre-trained generator network without access to any training data. Under mild assumptions, we theoretically show that using InvGAN, we can approximately invert the generations of any latent code of a trained GAN model. Furthermore, we empirically demonstrate the superiority of our inference scheme by quantitative and qualitative comparisons with other methods that perform a similar task. We also show the effectiveness of our framework in the problem of adversarial defenses, where InvGAN can successfully be used as a projection-based defense mechanism. Additionally, we show how InvGAN can be used to implement reparameterization white-box attacks on projection-based defense mechanisms. Experimental validation on several benchmark datasets demonstrates the efficacy of our method in achieving improved performance on several white-box and black-box attacks. Our code is available at this https URL.

72.Real-Time 3D Model Tracking in Color and Depth on a Single CPU Core ⬇️

We present a novel method to track 3D models in color and depth data. To this end, we introduce approximations that accelerate the state-of-the-art in region-based tracking by an order of magnitude while retaining similar accuracy. Furthermore, we show how the method can be made more robust in the presence of depth data and consequently formulate a new joint contour and ICP tracking energy. We present better results than the state-of-the-art while being much faster than most other methods and achieving all of the above on a single CPU core.

73.ViewSynth: Learning Local Features from Depth using View Synthesis ⬇️

We address the problem of jointly detecting keypoints and learning descriptors in depth data with challenging viewpoint changes. Despite great improvements in recent RGB-based local feature learning methods, we show that these methods cannot be directly transferred to the depth image modality. These methods also do not utilize the 2.5D information present in depth images. We propose a framework, ViewSynth, designed to jointly learn a 3D-structure-aware depth image representation and local features from that representation. ViewSynth consists of a View Synthesis Network (VSN), trained to synthesize depth image views given a depth image representation and query viewpoints. The ViewSynth framework includes joint learning of keypoints and feature descriptors, paired with our view synthesis loss, which guides the model to propose keypoints robust to viewpoint changes. We demonstrate the effectiveness of our formulation on several depth image datasets, where local features learned with our proposed ViewSynth framework outperform the state-of-the-art methods in keypoint matching and camera localization tasks.

74.Rigging the Lottery: Making All Tickets Winners ⬇️

Sparse neural networks have been shown to be more parameter and compute efficient compared to dense networks and in some cases are used to decrease wall clock inference times. There is a large body of work on training dense networks to yield sparse networks for inference. This limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. Importantly, by adjusting the topology it can start from any initialization - not just "lucky" ones. We demonstrate state-of-the-art sparse training results with ResNet-50, MobileNet v1 and MobileNet v2 on the ImageNet-2012 dataset, WideResNets on the CIFAR-10 dataset and RNNs on the WikiText-103 dataset. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static.
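
A minimal sketch of a drop-and-grow topology update of this kind is given below: low-magnitude active weights are pruned and the same number of currently inactive connections with the largest gradient magnitudes are regrown, keeping the total parameter count fixed. The per-layer update fraction and zero-initialization of regrown weights are assumptions for illustration, not the paper's exact schedule.

```python
import torch

def drop_and_grow(weight, mask, grad, update_fraction=0.3):
    """One sparse-topology update step: prune the active connections with
    the smallest magnitudes, then regrow the same number of inactive
    connections with the largest gradient magnitudes."""
    n_active = int(mask.sum().item())
    n_swap = int(update_fraction * n_active)

    # drop: among active weights, remove the smallest magnitudes
    active_scores = torch.where(mask.bool(), weight.abs(),
                                torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(active_scores.flatten(), n_swap,
                          largest=False).indices

    # grow: among inactive positions, enable the largest gradient magnitudes
    inactive_scores = torch.where(mask.bool(),
                                  torch.full_like(grad, float("-inf")),
                                  grad.abs())
    grow_idx = torch.topk(inactive_scores.flatten(), n_swap).indices

    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0
    new_mask[grow_idx] = 1.0
    new_weight = weight.clone().flatten()
    new_weight[grow_idx] = 0.0          # newly grown weights start at zero
    return new_weight.view_as(weight), new_mask.view_as(mask)
```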

75.A Simple yet Effective Way for Improving the Performance of GANs ⬇️

This paper presents a simple but effective way that improves the performance of generative adversarial networks (GANs) without imposing the training overhead or modifying the network architectures of existing methods. The proposed method employs a novel cascading rejection (CR) module for discriminator, which extracts multiple non-overlapped features in an iterative manner. The CR module supports the discriminator to effectively distinguish between real and generated images, which results in a strong penalization to the generator. In order to deceive the robust discriminator containing the CR module, the generator produces the images that are more similar to the real images. Since the proposed CR module requires only a few simple vector operations, it can be readily applied to existing frameworks with marginal training overheads. Quantitative evaluations on various datasets including CIFAR-10, Celeb-HQ, LSUN, and tiny-ImageNet confirm that the proposed method significantly improves the performance of GANs and conditional GANs in terms of Frechet inception distance (FID) indicating the diversity and visual appearance of the generated images.

76.End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances ⬇️

Reinforcement Learning (RL) aims at learning an optimal behavior policy from an agent's own experience rather than via rule-based control methods. However, no RL algorithm is yet capable of handling a task as difficult as urban driving. We present a novel technique, coined implicit affordances, to effectively leverage RL for urban driving, including lane keeping, pedestrian and vehicle avoidance, and traffic light detection. To our knowledge, we are the first to present a successful RL agent handling such a complex task, especially regarding traffic light detection. We demonstrate the effectiveness of our method by being one of the top teams of the camera-only track of the CARLA challenge.

77.Discriminative training of conditional random fields with probably submodular constraints ⬇️

Problems of segmentation, denoising, registration and 3D reconstruction are often addressed with the graph cut algorithm. However, solving an unconstrained graph cut problem is NP-hard. For tractable optimization, pairwise potentials have to fulfill the submodularity inequality. In our learning paradigm, pairwise potentials are created as the dot product of a learned vector w with positive feature vectors. In order to constrain such a model to remain tractable, previous approaches have enforced the weight vector to be positive for pairwise potentials in which the labels differ, and set pairwise potentials to zero in the case that the label remains the same. Such constraints are sufficient to guarantee that the resulting pairwise potentials satisfy the submodularity inequality. However, we show that such an approach unnecessarily restricts the capacity of the learned models. Guaranteeing submodularity for all possible inputs, no matter how improbable, reduces inference error to effectively zero, but increases model error. In contrast, we relax the requirement of guaranteed submodularity to solutions that are probably approximately submodular. We show that the conceptually simple strategy of enforcing submodularity on the training examples guarantees with low sample complexity that test images will also yield submodular pairwise potentials. Results are presented in the binary and multiclass settings, showing substantial improvement from the resulting increased model capacity.
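
For binary labels, the submodularity inequality that the pairwise potentials must satisfy is theta(0,1) + theta(1,0) >= theta(0,0) + theta(1,1). The sketch below checks this constraint on a batch of training edges when each potential is the dot product of a learned vector with an edge feature vector; it only illustrates the "enforce on training examples" idea as a simple violation count, not the paper's full learning procedure.

```python
import numpy as np

def submodularity_violations(w, phi_00, phi_01, phi_10, phi_11):
    """Check the binary submodularity inequality
        theta(0,1) + theta(1,0) >= theta(0,0) + theta(1,1)
    on a batch of training edges, where theta(a,b) = w . phi_ab and each
    phi_ab has shape (num_edges, num_features).
    Returns the fraction of edges that violate the inequality."""
    theta = lambda phi: phi @ w                  # (num_edges,) potentials
    margin = theta(phi_01) + theta(phi_10) - theta(phi_00) - theta(phi_11)
    return float(np.mean(margin < 0))
```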

78.Matrix Normal PCA for Interpretable Dimension Reduction and Graphical Noise Modeling ⬇️

Principal component analysis (PCA) is one of the most widely used dimension reduction and multivariate statistical techniques. From a probabilistic perspective, PCA seeks a low-dimensional representation of data in the presence of independent, identically distributed Gaussian noise. Probabilistic PCA (PPCA) and its variants have been extensively studied for decades. Most of them assume the underlying noise follows a certain independent, identical distribution. However, the noise in the real world is usually complicated and structured. To address this challenge, some non-linear variants of PPCA have been proposed. But those methods are generally difficult to interpret. To this end, we propose a powerful and intuitive PCA method (MN-PCA) that models the graphical noise with the matrix normal distribution, which enables us to explore the structure of noise in both the feature space and the sample space. MN-PCA obtains a low-rank representation of data and the structure of noise simultaneously. And it can be explained as approximating data over the generalized Mahalanobis distance. We develop two algorithms to solve this model: one maximizes the regularized likelihood, the other exploits the Wasserstein distance, which is more robust. Extensive experiments on various data demonstrate their effectiveness.

79.Fast and Incremental Loop Closure Detection Using Proximity Graphs ⬇️

Visual loop closure detection, which can be considered as an image retrieval task, is an important problem in SLAM (Simultaneous Localization and Mapping) systems. The frequently used bag-of-words (BoW) models can achieve high precision and moderate recall. However, the requirement for lower time costs and fewer memory costs for mobile robot applications is not well satisfied. In this paper, we propose a novel loop closure detection framework titled `FILD' (Fast and Incremental Loop closure Detection), which focuses on an on-line and incremental graph vocabulary construction for fast loop closure detection. The global and local features of frames are extracted using the Convolutional Neural Networks (CNN) and SURF on the GPU, which guarantee extremely fast extraction speeds. The graph vocabulary construction is based on one type of proximity graph, named Hierarchical Navigable Small World (HNSW) graphs, which is modified to adapt to this specific application. In addition, this process is coupled with a novel strategy for real-time geometrical verification, which only keeps binary hash codes and significantly saves on memory usage. Extensive experiments on several publicly available datasets show that the proposed approach can achieve fairly good recall at 100% precision compared to other state-of-the-art methods. The source code can be downloaded at this https URL for further studies.
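
The incremental insert-then-query pattern behind such a graph vocabulary can be sketched with a stock HNSW library (hnswlib is assumed available here); the paper's modified HNSW graph and real-time geometrical verification are not reproduced, and the feature dimension, distance space, and temporal-gap filter below are illustrative assumptions.

```python
import numpy as np
import hnswlib   # stock HNSW library, assumed available

dim = 128                                 # dimensionality of the global CNN feature
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=20000, ef_construction=200, M=16)
index.set_ef(50)

def process_frame(frame_id, feature, min_gap=30, k=5):
    """Query previously inserted frames for loop-closure candidates,
    then insert the current frame so the vocabulary grows incrementally."""
    candidates = []
    if index.get_current_count() > 0:
        k_eff = min(k, index.get_current_count())
        labels, dists = index.knn_query(feature[None, :], k=k_eff)
        # ignore temporally adjacent frames, which are trivially similar
        candidates = [(int(l), float(d)) for l, d in zip(labels[0], dists[0])
                      if frame_id - int(l) > min_gap]
    index.add_items(feature[None, :], np.array([frame_id]))
    return candidates
```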

80.When NAS Meets Robustness: In Search of Robust Architectures against Adversarial Attacks ⬇️

Recent advances in adversarial attacks uncover the intrinsic vulnerability of modern deep neural networks. Since then, extensive efforts have been devoted to enhancing the robustness of deep networks via specialized learning algorithms and loss functions. In this work, we take an architectural perspective and investigate the patterns of network architectures that are resilient to adversarial attacks. To obtain the large number of networks needed for this study, we adopt one-shot neural architecture search, training a large network once and then finetuning the sub-networks sampled therefrom. The sampled architectures together with the accuracies they achieve provide a rich basis for our study. Our "robust architecture Odyssey" reveals several valuable observations: 1) densely connected patterns result in improved robustness; 2) under computational budget, adding convolution operations to direct connection edges is effective; 3) the flow of solution procedure (FSP) matrix is a good indicator of network robustness. Based on these observations, we discover a family of robust architectures (RobNets). On various datasets, including CIFAR, SVHN, and Tiny-ImageNet, RobNets exhibit superior robustness performance to other widely used architectures. Notably, RobNets substantially improve the robust accuracy (~5% absolute gains) under both white-box and black-box attacks, even with fewer parameters.

81.Rethinking Softmax with Cross-Entropy: Neural Network Classifier as Mutual Information Estimator ⬇️

Mutual information is widely applied to learn latent representations of observations, whilst its implication in classification neural networks remains to be better explained. In this paper, we show that optimising the parameters of classification neural networks with softmax cross-entropy is equivalent to maximising the mutual information between inputs and labels under the balanced data assumption. Through experiments on synthetic and real datasets, we show that softmax cross-entropy can estimate mutual information approximately. When applied to image classification, this relation helps approximate the point-wise mutual information between an input image and a label without modifying the network structure. To this end, we propose infoCAM, an informative class activation map, which highlights regions of the input image that are the most relevant to a given label based on differences in information. The activation map helps localise the target object in an image. Through experiments on the semi-supervised object localisation task with two real-world datasets, we evaluate the effectiveness of the information-theoretic approach.
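
Under the balanced-label assumption, the stated equivalence can be made concrete: I(X; Y) = H(Y) - H(Y|X) is approximately log K minus the softmax cross-entropy, as in the small sketch below; the paper's exact estimator may include additional refinements.

```python
import math
import torch.nn.functional as F

def mi_estimate_from_ce(logits, labels, num_classes):
    """Under balanced labels, I(X; Y) = H(Y) - H(Y|X) can be estimated as
    log(K) minus the softmax cross-entropy, so minimising cross-entropy
    maximises this mutual-information estimate."""
    cond_entropy = F.cross_entropy(logits, labels)   # estimates H(Y|X)
    return math.log(num_classes) - cond_entropy
```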

82.Robot Learning and Execution of Collaborative Manipulation Plans from YouTube Videos ⬇️

People often watch videos on the web to learn how to cook new recipes, assemble furniture or repair a computer. We wish to enable robots with the very same capability. This is challenging; there is a large variation in manipulation actions and some videos even involve multiple persons, who collaborate by sharing and exchanging objects and tools. Furthermore, the learned representations need to be general enough to be transferable to robotic systems. Previous systems have enabled generation of semantic and human-interpretable robot commands in the form of visual sentences. However, they require manual selection of short action clips, which are then individually processed. We propose a framework for executing demonstrated action sequences from full-length, unconstrained videos on the web. The framework takes as input a video annotated with object labels and bounding boxes, and outputs a collaborative manipulation action plan for one or more robotic arms. We demonstrate the performance of the system in three full-length collaborative cooking videos on the web and propose an open-source platform for executing the learned plans in a simulation environment.

83.DeepSmartFuzzer: Reward Guided Test Generation For Deep Learning ⬇️

Testing Deep Neural Network (DNN) models has become more important than ever with the increasing usage of DNN models in safety-critical domains such as autonomous cars. The traditional approach of testing DNNs is to create a test set, which is a random subset of the dataset about the problem of interest. This kind of approach is not enough for testing most of the real-world scenarios since these traditional test sets do not include corner cases, while a corner case input is generally considered to introduce erroneous behaviors. Recent works on adversarial input generation, data augmentation, and coverage-guided fuzzing (CGF) have provided new ways to extend traditional test sets. Among those, CGF aims to produce new test inputs by fuzzing existing ones to achieve high coverage on a test adequacy criterion (i.e. coverage criterion). Given that the subject test adequacy criterion is a well-established one, CGF can potentially find error inducing inputs for different underlying reasons. In this paper, we propose a novel CGF solution for structural testing of DNNs. The proposed fuzzer employs Monte Carlo Tree Search to drive the coverage-guided search in the pursuit of achieving high coverage. Our evaluation shows that the inputs generated by our method result in higher coverage than the inputs produced by the previously introduced coverage-guided fuzzing techniques.

84.Biological sex classification with structural MRI data shows increased misclassification in transgender women ⬇️

Transgender individuals show brain structural alterations that differ from their biological sex as well as their perceived gender. To substantiate evidence that the brain structure of transgender individuals differs from male and female, we use a combined multivariate and univariate approach. Gray matter segments resulting from voxel-based morphometry preprocessing of N = 1753 cisgender (CG) healthy participants were used to train (N = 1402) and validate (20% hold-out N = 351) a support vector machine classifying the biological sex. As a second validation, we classified N = 1104 patients with depression. A third validation was performed using the matched CG sample of the transgender women (TW) application sample. Subsequently, the classifier was applied to N = 25 TW. Finally, we compared brain volumes of CG-men, women and TW pre/post treatment (CHT) in a univariate analysis controlling for sexual orientation, age and total brain volume. The application of our biological sex classifier to the transgender sample resulted in a significantly lower true positive rate (TPR-male = 56.0%). The TPR did not differ between CG-individuals with (TPR-male = 86.9%) and without depression (TPR-male = 88.5%). The univariate analysis of the transgender application sample revealed that TW pre/post treatment show brain structural differences from CG-women and CG-men in the putamen and insula, as well as the whole-brain analysis. Our results support the hypothesis that brain structure in TW differs from brain structure of their biological sex (male) as well as their perceived gender (female). This finding substantiates evidence that transgender individuals show specific brain structural alterations leading to a different pattern of brain structure than CG individuals.

85.AnoNet: Weakly Supervised Anomaly Detection in Textured Surfaces ⬇️

Humans can easily detect a defect (anomaly) because it is different or salient when compared to the surface it resides on. Today, manual human visual inspection is still the norm because it is difficult to automate anomaly detection. Neural networks are a useful tool that can teach a machine to find defects. However, they require a lot of training examples to learn what a defect is and it is tedious and expensive to get these samples. We tackle the problem of teaching a network with a low number of training samples with a system we call AnoNet. AnoNet's architecture is similar to CompactCNN with the exceptions that (1) it is a fully convolutional network and does not use strided convolution; (2) it is shallow and compact which minimizes over-fitting by design; (3) the compact design constrains the size of intermediate features which allows training to be done without image downsizing; (4) the model footprint is low making it suitable for edge computation; and (5) the anomaly can be detected and localized despite the weak labelling. AnoNet learns to detect the underlying shape of the anomalies despite the weak annotation as well as preserves the spatial localization of the anomaly. Pre-seeding AnoNet with an engineered filter bank initialization technique reduces the total samples required for training and also achieves state-of-the-art performance. Compared to the CompactCNN, AnoNet achieved a massive 94% reduction of network parameters from 1.13 million to 64 thousand parameters. Experiments were conducted on four data-sets and results were compared against CompactCNN and DeepLabv3. AnoNet improved the performance on an average across all data-sets by 106% to an F1 score of 0.98 and by 13% to an AUROC value of 0.942. AnoNet can learn from a limited number of images. For one of the data-sets, AnoNet learnt to detect anomalies after a single pass through just 53 training images.

86.Towards a Hypothesis on Visual Transformation based Self-Supervision ⬇️

We propose the first qualitative hypothesis characterizing the behavior of visual transformation based self-supervision, called the VTSS hypothesis. Given a dataset upon which a self-supervised task is performed while predicting instantiations of a transformation, the hypothesis states that if the predicted instantiations of the transformations are already present in the dataset, then the representation learned will be less useful. The hypothesis was derived by observing a key constraint in the application of self-supervision using a particular transformation. This constraint, which we term the transformation conflict in this paper, forces a network to learn degenerate features, thereby reducing the usefulness of the representation. The VTSS hypothesis helps us identify transformations that have the potential to be effective as a self-supervision task. Further, it helps to generally predict whether a particular transformation based self-supervision technique would be effective or not for a particular dataset. We provide extensive evaluations on CIFAR 10, CIFAR 100, SVHN and FMNIST confirming the hypothesis and the trends it predicts. We also propose novel cost-effective self-supervision techniques based on translation and scale, which when combined with rotation outperform all transformations applied individually. Overall, this paper aims to shed light on the phenomenon of visual transformation based self-supervision.

87.dpVAEs: Fixing Sample Generation for Regularized VAEs ⬇️

Unsupervised representation learning via generative modeling is a staple to many computer vision applications in the absence of labeled data. Variational Autoencoders (VAEs) are powerful generative models that learn representations useful for data generation. However, due to inherent challenges in the training objective, VAEs fail to learn useful representations amenable for downstream tasks. Regularization-based methods that attempt to improve the representation learning aspect of VAEs come at a price: poor sample generation. In this paper, we explore this representation-generation trade-off for regularized VAEs and introduce a new family of priors, namely decoupled priors, or dpVAEs, that decouple the representation space from the generation space. This decoupling enables the use of VAE regularizers on the representation space without impacting the distribution used for sample generation, and thereby reaping the representation learning benefits of the regularizations without sacrificing the sample generation. dpVAE leverages invertible networks to learn a bijective mapping from an arbitrarily complex representation distribution to a simple, tractable, generative distribution. Decoupled priors can be adapted to the state-of-the-art VAE regularizers without additional hyperparameter tuning. We showcase the use of dpVAEs with different regularizers. Experiments on MNIST, SVHN, and CelebA demonstrate, quantitatively and qualitatively, that dpVAE fixes sample generation for regularized VAEs.

88.Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences ⬇️

A storyboard is a sequence of images to illustrate a story containing multiple sentences, which has been a key process to create different story products. In this paper, we tackle a new multimedia task of automatic storyboard creation to facilitate this process and inspire human artists. Inspired by the fact that our understanding of languages is based on our past experience, we propose a novel inspire-and-create framework with a story-to-image retriever that selects relevant cinematic images for inspiration and a storyboard creator that further refines and renders images to improve the relevancy and visual consistency. The proposed retriever dynamically employs contextual information in the story with hierarchical attentions and applies dense visual-semantic matching to accurately retrieve and ground images. The creator then employs three rendering steps to increase the flexibility of retrieved images, which include erasing irrelevant regions, unifying styles of images and substituting consistent characters. We carry out extensive experiments on both in-domain and out-of-domain visual story datasets. The proposed model achieves better quantitative performance than the state-of-the-art baselines for storyboard creation. Qualitative visualizations and user studies further verify that our approach can create high-quality storyboards even for stories in the wild.

89.Ground Truth Simulation for Deep Learning Classification of Mid-Resolution Venus Images Via Unmixing of High-Resolution Hyperspectral Fenix Data ⬇️

Training a deep neural network for classification constitutes a major problem in remote sensing due to the lack of adequate field data. Acquiring high-resolution ground truth (GT) by human interpretation is both cost-ineffective and inconsistent. We propose, instead, to utilize high-resolution, hyperspectral images for solving this problem, by unmixing these images to obtain reliable GT for training a deep network. Specifically, we simulate GT from high-resolution, hyperspectral FENIX images, and use it for training a convolutional neural network (CNN) for pixel-based classification. We show how the model can be transferred successfully to classify new mid-resolution VENuS imagery.

90.GRASPEL: Graph Spectral Learning at Scale ⬇️

Learning meaningful graphs from data plays important roles in many data mining and machine learning tasks, such as data representation and analysis, dimension reduction, data clustering, and visualization. In this work, for the first time, we present a highly-scalable spectral approach (GRASPEL) for learning large graphs from data. By limiting the precision matrix to be a graph Laplacian, our approach aims to estimate ultra-sparse (tree-like) weighted undirected graphs and shows a clear connection with the prior graphical Lasso method. By interleaving the latest high-performance nearly-linear time spectral methods for graph sparsification, coarsening and embedding, ultra-sparse yet spectrally-robust graphs can be learned by identifying and including the most spectrally-critical edges into the graph. Compared with prior state-of-the-art graph learning approaches, GRASPEL is more scalable and allows substantially improving the computing efficiency and solution quality of a variety of data mining and machine learning applications, such as spectral clustering (SC) and t-Distributed Stochastic Neighbor Embedding (t-SNE). For example, when compared with graphs constructed using existing methods, GRASPEL achieved the best spectral clustering efficiency and accuracy.

91.Shape Detection of Liver From 2D Ultrasound Images ⬇️

Applications of ultrasound imaging have expanded from fetal imaging to abdominal and cardiac diagnosis. The liver, being the largest gland in the body and responsible for metabolic activities, is highly subject to injury and therefore requires diagnosis. Although ultrasound imaging has developed into three and four dimensions, providing a higher amount of information, it requires highly trained medical staff due to the complexity and dimensionality of the images. Since 2D ultrasound images are still considered to be the basis of clinical treatments, computer-aided automated liver diagnosis is essential. Due to the limitations of ultrasound images, such as loss of resolution leading to speckle noise, it is difficult to detect the shape of the liver. In this project, we propose a shape detection method for the liver in 2D ultrasound images. We then compare the accuracies of the method both with noise and after noise removal.