1. A Quantum Computational Approach to Correspondence Problems on Point Sets
Modern adiabatic quantum computers (AQC) are already used to solve difficult combinatorial optimisation problems in various domains of science. Currently, only a few applications of AQC in computer vision have been demonstrated. We review modern AQC and derive the first algorithm for transformation estimation and point set alignment suitable for AQC. Our algorithm has a subquadratic computational complexity of state preparation. We perform a systematic experimental analysis of the proposed approach and show several examples of successful point set alignment by simulated sampling. With this paper, we hope to boost the research on AQC for computer vision.
2. Seeing without Looking: Contextual Rescoring of Object Detections for AP Maximization
The majority of current object detectors lack context: class predictions are made independently from other detections. We propose to incorporate context in object detection by post-processing the output of an arbitrary detector to rescore the confidences of its detections. Rescoring is done by conditioning on contextual information from the entire set of detections: their confidences, predicted classes, and positions. We show that AP can be improved by simply reassigning the detection confidence values such that true positives that survive longer (i.e., those with the correct class and large IoU) are scored higher than false positives or detections with small IoU. In this setting, we use a bidirectional RNN with attention for contextual rescoring and introduce a training target that uses the IoU with ground truth to maximize AP for the given set of detections. The fact that our approach does not require access to visual features makes it computationally inexpensive and agnostic to the detection architecture. In spite of this simplicity, our model consistently improves AP over strong pre-trained baselines (Cascade R-CNN and Faster R-CNN with several backbones), particularly by reducing the confidence of duplicate detections (a learned form of non-maximum suppression) and removing out-of-context objects by conditioning on the confidences, classes, positions, and sizes of the co-occurrent detections (e.g., a high-confidence detection of bird makes a detection of sports ball less likely).
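The claim that AP improves when true positives are scored above false positives can be checked on a toy example. The sketch below is only an illustration: it uses a simplified, non-interpolated AP for a single class at a single IoU threshold (not the full COCO-style metric), and the detections and the "rescored" confidences are made-up values, not outputs of the paper's model.

```python
def average_precision(detections, num_gt):
    """Non-interpolated AP for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects (assumed > 0).
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp_so_far = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp_so_far += 1
            # Precision measured at the rank of each true positive.
            precision_sum += tp_so_far / rank
    return precision_sum / num_gt


# Toy detector output: a false positive (e.g. a duplicate) outranks a true positive.
original = [(0.9, True), (0.8, False), (0.7, True), (0.3, False)]
# The same detections after a hypothetical rescoring that demotes the duplicate.
rescored = [(0.9, True), (0.2, False), (0.7, True), (0.3, False)]

ap_before = average_precision(original, num_gt=2)
ap_after = average_precision(rescored, num_gt=2)
```

Only the confidences change between the two lists. The boxes and labels stay the same, which is why such a rescoring step needs no access to visual features.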
3. Combining Deep Learning and Verification for Precise Object Instance Detection
Deep learning object detectors often return false positives with very high confidence. Although they optimize generic detection performance, such as mean average precision (mAP), they are not designed for reliability. For a reliable detection system, if a high confidence detection is made, we would want high certainty that the object has indeed been detected. To achieve this, we have developed a set of verification tests which a proposed detection must pass to be accepted. We develop a theoretical framework which proves that, under certain assumptions, our verification tests will not accept any false positives. Based on an approximation to this framework, we present a practical detection system that can verify, with high precision, whether each detection of a machine-learning based object detector is correct. We show that these tests can improve the overall accuracy of a base detector and that accepted examples are highly likely to be correct. This allows the detector to operate in a high precision regime and can thus be used for robotic perception systems as a reliable instance detection method.
4. Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation
In this paper, we address the task of semantic-guided scene generation. One open challenge in scene generation is the difficulty of the generation of small objects and detailed local texture, which has been widely observed in global image-level generation methods. To tackle this issue, in this work we consider learning the scene generation in a local context, and correspondingly design a local class-specific generative network with semantic maps as a guidance, which separately constructs and learns sub-generators concentrating on the generation of different classes, and is able to provide more scene details. To learn more discriminative class-specific feature representations for the local generation, a novel classification module is also proposed. To combine the advantage of both the global image-level and the local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Extensive experiments on two scene image generation tasks show superior generation performance of the proposed model. The state-of-the-art results are established by large margins on both tasks and on challenging public benchmarks. The source code and trained models are available at this https URL.
5. Explain Your Move: Understanding Agent Actions Using Focused Feature Saliency
As deep reinforcement learning (RL) is applied to more tasks, there is a need to visualize and understand the behavior of learned agents. Saliency maps explain agent behavior by highlighting the features of the input state that are most relevant for the agent in taking an action. Existing perturbation-based approaches to compute saliency often highlight regions of the input that are not relevant to the action taken by the agent. Our approach generates more focused saliency maps by balancing two aspects (specificity and relevance) that capture different desiderata of saliency. The first captures the impact of perturbation on the relative expected reward of the action to be explained. The second downweights irrelevant features that alter the relative expected rewards of actions other than the action to be explained. We compare our approach with existing approaches on agents trained to play board games (Chess and Go) and Atari games (Breakout, Pong and Space Invaders). We show through illustrative examples (Chess, Atari, Go), human studies (Chess), and automated evaluation methods (Chess) that our approach generates saliency maps that are more interpretable for humans than existing approaches.
6. Unsupervised Representation Learning by Predicting Random Distances
Deep neural networks have gained tremendous success in a broad range of machine learning tasks due to their remarkable capability to learn semantic-rich features from high-dimensional data. However, they often require large-scale labelled data to successfully learn such features, which significantly hinders their adaptation to unsupervised learning tasks, such as anomaly detection and clustering, and limits their application to critical domains where obtaining massive labelled data is prohibitively expensive. To enable downstream unsupervised learning in those domains, in this work we propose to learn features without using any labelled data by training neural networks to predict data distances in a randomly projected space. Random mapping is a theoretically proven approach to obtain approximately preserved distances. To predict these random distances well, the representation learner is optimised to learn genuine class structures that are implicitly embedded in the randomly projected space. Experimental results on 19 real-world datasets show that our learned representations substantially outperform state-of-the-art competing methods in both anomaly detection and clustering tasks.
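To make "distances in a randomly projected space" concrete, the sketch below builds such distance targets for a handful of points. It is an illustration under assumptions the abstract does not state: a fixed Gaussian projection matrix, Euclidean distance, and the hypothetical helper names `random_projection` and `random_distance_targets`. The network and loss that would be trained to predict these targets are not shown.

```python
import math
import random


def random_projection(points, out_dim, seed=0):
    """Project points with a fixed Gaussian matrix scaled by 1/sqrt(out_dim)."""
    rng = random.Random(seed)
    in_dim = len(points[0])
    scale = 1.0 / math.sqrt(out_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
              for _ in range(out_dim)]
    return [[sum(w * x for w, x in zip(row, p)) for row in matrix]
            for p in points]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_distance_targets(points, out_dim, seed=0):
    """Return {(i, j): distance in the randomly projected space} for i < j."""
    projected = random_projection(points, out_dim, seed)
    return {(i, j): euclidean(projected[i], projected[j])
            for i in range(len(points))
            for j in range(i + 1, len(points))}


data = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 3.0, 4.0, 0.0]]
targets = random_distance_targets(data, out_dim=8)
```

A representation learner in this setting would receive pairs of inputs and be trained to reproduce the values in `targets`, so no labels are required at any point.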
7. Detecting Deepfake-Forged Contents with Separable Convolutional Neural Network and Image Segmentation
Recent advances in AI technology have made the forgery of digital images and videos easier, and it has become significantly more difficult to identify such forgeries. These forgeries, if disseminated with malicious intent, can negatively impact social and political stability, and pose significant ethical and legal challenges as well. Deepfake is a variant of auto-encoders that uses deep learning techniques to identify and exchange images of a person's face in a picture or film. Deepfake can result in an erosion of public trust in digital images and videos, which has far-reaching effects on political and social stability. This study therefore proposes a solution for facial forgery detection to determine whether a picture or film has been processed by Deepfake. The proposed solution achieves efficient detection by using the recently proposed separable convolutional neural network (CNN) and image segmentation. In addition, this study examines how different image segmentation methods affect detection results. Finally, an ensemble model is used to improve detection capabilities. Experimental results demonstrate the excellent performance of the proposed solution.
8. Axial Attention in Multidimensional Transformers
We propose Axial Transformers, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors. Existing autoregressive models either suffer from excessively large computational resource requirements for high dimensional data, or make compromises in terms of distribution expressiveness or ease of implementation in order to decrease resource requirements. Our architecture, by contrast, maintains both full expressiveness over joint distributions over data and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation and achieving state-of-the-art results on standard generative modeling benchmarks. Our models are based on axial attention, a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. Notably the proposed structure of the layers allows for the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. This semi-parallel structure goes a long way to making decoding from even a very large Axial Transformer broadly applicable. We demonstrate state-of-the-art results for the Axial Transformer on the ImageNet-32 and ImageNet-64 image benchmarks as well as on the BAIR Robotic Pushing video benchmark. We open source the implementation of Axial Transformers.
9. Locality and compositionality in zero-shot learning
In this work we study locality and compositionality in the context of learning representations for Zero Shot Learning (ZSL). To isolate the importance of these properties in learned representations, we impose the additional constraint that, unlike most recent work in ZSL, no pre-training on different datasets (e.g. ImageNet) is performed. The results of our experiments show how locality, in terms of small parts of the input, and compositionality, i.e. how well the learned representations can be expressed as a function of a smaller vocabulary, are both deeply related to generalization and motivate the focus on more local-aware models in future research directions for representation learning.
10. Unsupervised Few-shot Learning via Self-supervised Training
Learning from limited exemplars (few-shot learning) is a fundamental, unsolved problem that has been laboriously explored in the machine learning community. However, current few-shot learners are mostly supervised and rely heavily on a large amount of labeled examples. Unsupervised learning is a more natural procedure for cognitive mammals and has produced promising results in many machine learning tasks. In the current study, we develop a method to learn an unsupervised few-shot learner via self-supervised training (UFLST), which can effectively generalize to novel but related classes. The proposed model consists of two alternating processes, progressive clustering and episodic training. The former generates pseudo-labeled training examples for constructing episodic tasks; the latter trains the few-shot learner using the generated episodic tasks, which further optimizes the feature representations of the data. The two processes facilitate each other and eventually produce a high-quality few-shot learner. Using the benchmark datasets Omniglot and Mini-ImageNet, we show that our model outperforms other unsupervised few-shot learning methods. Using the benchmark dataset Market1501, we further demonstrate the feasibility of applying our model to a real-world application, person re-identification.
11. So2Sat LCZ42: A Benchmark Dataset for Global Local Climate Zones Classification
Access to labeled reference data is one of the grand challenges in supervised machine learning endeavors. This is especially true for an automated analysis of remote sensing images on a global scale, which enables us to address global challenges such as urbanization and climate change using state-of-the-art machine learning techniques. To meet these pressing needs, especially in urban research, we provide open access to a valuable benchmark dataset named "So2Sat LCZ42," which consists of local climate zone (LCZ) labels of about half a million Sentinel-1 and Sentinel-2 image patches in 42 urban agglomerations (plus 10 additional smaller areas) across the globe. This dataset was labeled by 15 domain experts following a carefully designed labeling workflow and evaluation process over a period of six months. As is rarely done for other labeled remote sensing datasets, we conducted a rigorous quality assessment by domain experts. The dataset achieved an overall confidence of 85%. We believe this LCZ dataset is a first step towards an unbiased, globally distributed dataset for urban growth monitoring using machine learning methods, because LCZs provide a more objective measure than many other semantic land use and land cover classifications. They provide measures of the morphology, compactness, and height of urban areas, which are less dependent on human and cultural factors. This dataset can be accessed from this http URL.
12. Mitigating large adversarial perturbations on X-MAS (X minus Moving Averaged Samples)
We propose a scheme that mitigates an adversarial perturbation $\epsilon$ on the adversarial example $X_{adv}$ ($= X \pm \epsilon$) by subtracting the estimated perturbation $\hat{\epsilon}$ from $X + \epsilon$ and adding $\hat{\epsilon}$ to $X - \epsilon$. The estimated perturbation $\hat{\epsilon}$ comes from the difference between $X_{adv}$ and its moving-averaged outcome $W_{avg} * X_{adv}$, where $W_{avg}$ is an $N \times N$ moving average kernel whose coefficients are all one. Usually, the adjacent samples of an image are close to each other, so that we can let $X \approx W_{avg} * X$ (we name this relation X-MAS [X minus Moving Averaged Samples]). Since the X-MAS relation is approximately zero, the estimated perturbation can be less than the adversarial perturbation. The scheme is also extended to multi-level mitigation by treating the mitigated adversarial example $X_{adv} \pm \hat{\epsilon}$ as a new adversarial example to be mitigated. Multi-level mitigation brings $X_{adv}$ closer to $X$ with a smaller (i.e. mitigated) perturbation than the original unmitigated perturbation by setting $W_{avg} * X_{adv}$ ($< X + W_{avg} * \epsilon$ if $X \approx W_{avg} * X$) as the boundary condition that the multi-level mitigation cannot cross (i.e. a decreasing $\epsilon$ cannot go below it and an increasing $\epsilon$ cannot go beyond it). With multi-level mitigation, we can obtain high prediction accuracy even on adversarial examples with a large perturbation (i.e. $\epsilon \geq 16$). The proposed scheme is evaluated with adversarial examples crafted by the Iterative FGSM (Fast Gradient Sign Method) on ResNet-50 trained on the ImageNet dataset.
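The estimation step, taking the difference between the adversarial example and its moving average, can be sketched in a few lines. This is an illustration only and covers the estimate, not the mitigation or its multi-level extension. It makes three assumptions that the abstract does not state: the all-ones kernel is normalised by N * N so that it acts as an average, N is odd, and image borders use replicate padding. The function names are hypothetical.

```python
def moving_average(image, n=3):
    """N x N box filter (n odd) with replicate padding at the borders."""
    height, width = len(image), len(image[0])
    half = n // 2
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    # Clamp indices so border pixels reuse the nearest valid sample.
                    rr = min(max(r + dr, 0), height - 1)
                    cc = min(max(c + dc, 0), width - 1)
                    total += image[rr][cc]
            out[r][c] = total / (n * n)
    return out


def estimate_perturbation(x_adv, n=3):
    """Signed estimate: the adversarial example minus its moving average."""
    smoothed = moving_average(x_adv, n)
    return [[x_adv[r][c] - smoothed[r][c] for c in range(len(x_adv[0]))]
            for r in range(len(x_adv))]


# Flat 5 x 5 image in which a single pixel is perturbed by +16.
x_adv = [[100.0] * 5 for _ in range(5)]
x_adv[2][2] += 16.0
eps_hat = estimate_perturbation(x_adv)
```

In this toy case the estimate at the perturbed pixel is 16 - 16/9, slightly less than the true perturbation of 16, because the perturbation itself leaks into the moving average. This matches the abstract's remark that the estimated perturbation can be less than the adversarial one.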
13. Image Analytics for Legal Document Review: A Transfer Learning Approach
Although technology-assisted review in electronic discovery has focused on text data, the need for advanced analytics to facilitate the review of multimedia content is on the rise. In this paper, we present several applications of deep learning in computer vision to technology-assisted review of image data in the legal industry. These applications include image classification, image clustering, and object detection. We use transfer learning techniques to leverage established pretrained models for feature extraction and fine-tuning. These applications are the first of their kind in the legal industry for image document review. We demonstrate the effectiveness of these applications by solving real-world business challenges.
14. Intra-Variable Handwriting Inspection Reinforced with Idiosyncrasy Analysis
In this paper, we work on intra-variable handwriting, where the writing samples of an individual can vary significantly. Such within-writer variation poses a challenge for automatic writer inspection, where the state-of-the-art methods do not perform well. To deal with intra-variability, we analyze the idiosyncrasy in individual handwriting. We identify/verify the writer from highly idiosyncratic text patches. Such patches are detected using a deep recurrent reinforcement learning-based architecture. An idiosyncratic score is assigned to every patch, which is predicted by employing deep regression analysis. For writer identification, we propose a deep neural architecture, which makes the final decision by the idiosyncratic score-induced weighted sum of patch-based decisions. For writer verification, we propose two algorithms for deep feature aggregation, which assist in authentication using a triplet network. The experiments were performed on two databases, where we obtained encouraging results.
15. Design Considerations for Efficient Deep Neural Networks on Processing-in-Memory Accelerators
This paper describes various design considerations for deep neural networks that enable them to operate efficiently and accurately on processing-in-memory accelerators. We highlight important properties of these accelerators and the resulting design considerations using experiments conducted on various state-of-the-art deep neural networks with the large-scale ImageNet dataset.
16. ResNetX: a more disordered and deeper network architecture
Designing efficient network structures has always been at the core of neural network research. ResNet and its variants have proved to be efficient architectures. However, how to theoretically characterize the influence of network structure on performance remains unclear. With the help of techniques from complex networks, we here provide a natural yet efficient extension to ResNet by folding its backbone chain. Our architecture has two structural features when mapped to directed acyclic graphs. The first is a higher degree of disorder compared with ResNet, which lets ResNetX explore a larger number of feature maps with different sizes of receptive fields. The second is a larger proportion of shorter paths compared to ResNet, which improves the direct flow of information through the entire network. Our architecture exposes a new dimension, namely "fold depth", in addition to the existing dimensions of depth, width, and cardinality. Our architecture is a natural extension to ResNet and can be integrated with existing state-of-the-art methods with little effort. Image classification results on the CIFAR-10 and CIFAR-100 benchmarks suggest that our new network architecture performs better than ResNet.
17. Unsupervised Adversarial Image Inpainting
We consider inpainting in an unsupervised setting where there is access to neither paired nor unpaired training data. The only available information is provided by the incomplete observations and the inpainting process statistics. In this context, an observation should give rise to several plausible reconstructions, which amounts to learning a distribution over the space of reconstructed images. We model the reconstruction process by using a conditional GAN with constraints on the stochastic component that introduce an explicit dependency between this component and the generated output. This allows us to sample from the latent component in order to generate a distribution of images associated with an observation. We demonstrate the capacity of our model on several image datasets: faces (CelebA), food images (Recipe-1M) and bedrooms (LSUN Bedrooms) with different types of imputation masks. The approach yields comparable performance to model variants trained with additional supervision.
18. Metamorphic Testing for Object Detection Systems
Recent advances in deep neural networks (DNNs) have led to object detectors that can rapidly process pictures or videos, and recognize the objects that they contain. Despite the promising progress by industrial manufacturers such as Amazon and Google in commercializing deep learning-based object detection as a standard computer vision service, object detection systems - similar to traditional software - may still produce incorrect results. These errors, in turn, can lead to severe negative outcomes for the users of these object detection systems. For instance, an autonomous driving system that fails to detect pedestrians can cause accidents or even fatalities. However, principled, systematic methods for testing object detection systems do not yet exist, despite their importance.
To fill this critical gap, we introduce the design and realization of MetaOD, the first metamorphic testing system for object detectors to effectively reveal erroneous detection results by commercial object detectors. To this end, we (1) synthesize natural-looking images by inserting extra object instances into background images, and (2) design metamorphic conditions asserting the equivalence of object detection results between the original and synthetic images after excluding the prediction results on the inserted objects. MetaOD is designed as a streamlined workflow that performs object extraction, selection, and insertion. Evaluated on four commercial object detection services and four pretrained models provided by the TensorFlow API, MetaOD found tens of thousands of detection defects in these object detectors. To further demonstrate the practical usage of MetaOD, we use the synthetic images that cause erroneous detection results to retrain the model. Our results show that the model performance is increased significantly, from an mAP score of 9.3 to an mAP score of 10.5.
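The metamorphic condition described above, that detections should be equivalent between the original and synthetic image once the inserted object is set aside, can be expressed as a small check. The sketch below is a simplified illustration, not MetaOD's implementation. The detection format `(label, box)`, the IoU threshold of 0.5, the greedy matching, and the function name `violates_relation` are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def violates_relation(original, synthetic, inserted_box, match_iou=0.5):
    """True if detections differ once the inserted object's detection is excluded.

    original, synthetic: lists of (label, box) detections.
    inserted_box: region where the extra object was pasted.
    """
    # Drop detections that correspond to the inserted object.
    remaining = [d for d in synthetic if iou(d[1], inserted_box) < match_iou]
    if len(remaining) != len(original):
        return True
    unmatched = list(remaining)
    for label, box in original:
        hit = next((d for d in unmatched
                    if d[0] == label and iou(d[1], box) >= match_iou), None)
        if hit is None:
            return True
        unmatched.remove(hit)
    return False


before = [("person", (10, 10, 50, 90)), ("dog", (60, 40, 100, 80))]
inserted = (120, 20, 160, 60)
after_ok = before + [("cup", inserted)]
after_bad = [("person", (10, 10, 50, 90)), ("cup", inserted)]  # the dog vanished

consistent = violates_relation(before, after_ok, inserted)
defect = violates_relation(before, after_bad, inserted)
```

Here `consistent` is `False` and `defect` is `True`: inserting the cup should not make the detector lose the dog, so the second case would be reported as a detection defect.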
19. DADA: A Large-scale Benchmark and Model for Driver Attention Prediction in Accidental Scenarios
Driver attention prediction has recently attracted increasing attention in traffic scene understanding and is likely to be an essential problem in vision-centered and human-like driving systems. This work, unlike other attempts, aims to predict driver attention in accidental scenarios containing normal, critical and accidental situations simultaneously. However, this raises challenges because of the dynamic traffic scene and the intricate and imbalanced accident categories. With the hypothesis that driver attention can play a selective role in identifying the crash object to assist driving accident detection or prediction, this paper designs a multi-path semantic-guided attentive fusion network (MSAFNet) that learns the spatio-temporal semantic and scene variation in prediction. To this end, a large-scale benchmark with 2000 video sequences (named DADA-2000) is contributed, with laborious annotation of driver attention (fixation, saccade, focusing time), accident objects/intervals, and accident categories, and thorough evaluations show superior performance to state-of-the-art methods. As far as we know, this is the first comprehensive and quantitative study of human-eye sensing in accidental scenarios. DADA-2000 is available at this https URL.
20. Cooperative Perception for 3D Object Detection in Driving Scenarios using Infrastructure Sensors
The perception system of an autonomous vehicle is responsible for mapping sensor observations into a semantic description of the vehicle's environment. 3D object detection is a common function within this system and outputs a list of 3D bounding boxes around objects of interest. Various 3D object detection methods have relied on fusion of different sensor modalities to overcome limitations of individual sensors. However, occlusion, limited field-of-view and low point density of the sensor data cannot be reliably and cost-effectively addressed by multi-modal sensing from a single point of view. Alternatively, cooperative perception incorporates information from spatially diverse sensors distributed around the environment as a way to mitigate these limitations. This paper proposes two schemes for cooperative 3D object detection. The early fusion scheme combines point clouds from multiple spatially diverse sensing points of view before detection. In contrast, the late fusion scheme fuses the independently estimated bounding boxes from multiple spatially diverse sensors. We evaluate the performance of both schemes using a synthetic cooperative dataset created in two complex driving scenarios, a T-junction and a roundabout. The evaluation shows that the early fusion approach outperforms late fusion by a significant margin at the cost of higher communication bandwidth. The results demonstrate that cooperative perception can recall more than 95% of the objects as opposed to 30% for single-point sensing in the most challenging scenario. To provide practical insights into the deployment of such a system, we report how the number of sensors and their configuration impact the detection performance of the system.
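Early fusion, as described above, amounts to bringing every sensor's point cloud into a shared frame and concatenating the results before a single detector runs. The sketch below illustrates only that merging step. It assumes each sensor's pose is known and given as a yaw angle about the vertical axis plus a translation, which is a simplification of a full 6-DoF pose. The function names are hypothetical and no detector is included.

```python
import math


def to_world(points, yaw, translation):
    """Rotate sensor-frame (x, y, z) points about z by yaw, then translate."""
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(cos_y * x - sin_y * y + tx, sin_y * x + cos_y * y + ty, z + tz)
            for x, y, z in points]


def early_fusion(sensor_clouds):
    """Merge clouds from several sensors into one world-frame cloud.

    sensor_clouds: list of (points, yaw, translation), one entry per sensor.
    """
    fused = []
    for points, yaw, translation in sensor_clouds:
        fused.extend(to_world(points, yaw, translation))
    return fused


# Two sensors that each see one point 1 m ahead of them.
cloud_a = ([(1.0, 0.0, 0.0)], 0.0, (0.0, 0.0, 2.0))
cloud_b = ([(1.0, 0.0, 0.0)], math.pi / 2, (10.0, 0.0, 2.0))
fused = early_fusion([cloud_a, cloud_b])
```

Every raw point has to be transmitted to the fusion node in this scheme, which is where the higher communication bandwidth comes from. Late fusion would instead exchange only the bounding boxes estimated at each sensor.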
21. Evaluating Usage of Images for App Classification
App classification is useful in a number of applications such as adding apps to an app store or building a user model based on the installed apps. Presently there are a number of existing methods to classify apps based on a given taxonomy on the basis of their text metadata. However, text based methods for app classification may not work in all cases, such as when the text descriptions are in a different language, or missing, or inadequate to classify the app. One solution in such cases is to utilize the app images to supplement the text description. In this paper, we evaluate a number of approaches in which app images can be used to classify the apps. In one approach, we use Optical character recognition (OCR) to extract text from images, which is then used to supplement the text description of the app. In another, we use pic2vec to convert the app images into vectors, then train an SVM to classify the vectors to the correct app label. In another, we use the this http URL tool to generate natural language descriptions from the app images. Finally, we use a method to detect and label objects in the app images and use a voting technique to determine the category of the app based on all the images. We compare the performance of our image-based techniques to classify a number of apps in our dataset. We use a text based SVM app classifier as our base and obtained an improved classification accuracy of 96% for some classes when app images are added.
22. Crack Detection Using Enhanced Hierarchical Convolutional Neural Networks
Unmanned aerial vehicles (UAVs) are expected to replace humans in hazardous surface inspection tasks due to their flexibility in operating space and capability of collecting high-quality visual data. In this study, we propose enhanced hierarchical convolutional neural networks (HCNN) to detect cracks from image data collected by UAVs. Unlike traditional HCNN, here a set of branch networks is utilised to reduce the obscuration in the down-sampling process. Moreover, the feature preserving blocks combine the current and previous terms from the convolutional blocks to provide input to the loss functions. As a result, the weights of resized images can be reduced to minimise the information loss. Experiments on images from different crack datasets have been carried out to demonstrate the effectiveness of the proposed HCNN.
23. Convolutional Dictionary Pair Learning Network for Image Representation Learning
Both Convolutional Neural Networks (CNN) and Dictionary Learning (DL) are powerful image representation learning systems based on different mechanisms and principles, so whether we can integrate them to improve performance is worth exploring. To address this issue, we propose a novel generalized end-to-end representation learning architecture, dubbed Convolutional Dictionary Pair Learning Network (CDPL-Net), which seamlessly integrates the learning schemes of CNN and dictionary pair learning into a unified framework. Generally, the architecture of CDPL-Net includes two convolutional/pooling layers and two dictionary pair learning (DPL) layers in the representation learning module. Besides, it uses two fully-connected layers as the multi-layer perceptron layer in the nonlinear classification module. In particular, the DPL layer can jointly formulate the discriminative synthesis and analysis representations driven by minimizing the batch-based reconstruction error over the flattened feature maps from the convolution/pooling layer. Moreover, the DPL layer uses the l1-norm on the analysis dictionary so that a sparse representation can be delivered, and the embedding process will also be robust to noise. To speed up the training of the DPL layer, efficient stochastic gradient descent is used. Extensive simulations on public databases show that our CDPL-Net can deliver enhanced performance over other state-of-the-art methods.
24. Symmetric block-low-rank layers for fully reversible multilevel neural networks
Factors that limit the size of the input and output of a neural network include memory requirements for the network states/activations to compute gradients, as well as memory for the convolutional kernels or other weights. The memory restriction is especially limiting for applications where we want to learn how to map volumetric data to the desired output, such as video-to-video. Recently developed fully reversible neural networks enable gradient computations using storage of the network states for a couple of layers only. While this saves a tremendous amount of memory, it is the convolutional kernels that take up most memory if fully reversible networks contain multiple invertible pooling/coarsening layers. Invertible coarsening operators such as the orthogonal wavelet transform cause the number of channels to grow explosively. We address this issue by combining fully reversible networks with layers that contain the convolutional kernels in a compressed form directly. Specifically, we introduce a layer that has a symmetric block-low-rank structure. In spirit, this layer is similar to bottleneck and squeeze-and-expand structures. We contribute symmetry by construction, and a combination of notation and flattening of tensors allows us to interpret these network structures in linear algebraic fashion as a block-low-rank matrix in factorized form and observe various properties. A video segmentation example shows that we can train a network to segment the entire video in one go, which would not be possible, in terms of memory requirements, using non-reversible networks and previously proposed reversible networks.
25. Deep-learning-based classification and retrieval of components of a process plant from segmented point clouds
Technology to recognize the type of component represented by a point cloud is required in the reconstruction process of an as-built model of a process plant based on laser scanning. The reconstruction process of a process plant through laser scanning is divided into point cloud registration, point cloud segmentation, and component type recognition and placement. Loss of shape data or imbalance of point cloud density problems generally occur in the point cloud data collected from large-scale facilities. In this study, we experimented with the possibility of applying object recognition technology based on 3D deep learning networks, which have been showing high performance recently, and analyzed the results. For training data, we used a segmented point cloud repository about components that we constructed by scanning a process plant. For networks, we selected the multi-view convolutional neural network (MVCNN), which is a view-based method, and PointNet, which is designed to allow the direct input of point cloud data. In the case of the MVCNN, we also performed an experiment on the generation method for two types of multi-view images that can complement the shape occlusion of the segmented point cloud. In this experiment, the MVCNN showed the highest retrieval accuracy of approximately 87%, whereas PointNet showed the highest retrieval mean average precision of approximately 84%. Furthermore, both networks showed high recognition performance for the segmented point cloud of plant components when there was sufficient training data.
26.Large-scale Multi-modal Person Identification in Real Unconstrained Environments ⬇️
Person identification (P-ID) in real, unconstrained, noisy environments is a huge challenge. In multiple-feature learning with Deep Convolutional Neural Networks (DCNNs) or other machine learning methods for large-scale person identification in the wild, the key is to design an appropriate strategy for decision-layer or feature-layer fusion that enhances discriminative power. It is necessary to extract different types of valid features and establish a reasonable framework to fuse different types of information. Traditional methods identify persons from single-modal features, such as face, audio, or head features. These methods cannot achieve highly accurate person identification in real unconstrained environments. This study proposes a fusion module that fuses multi-modal features for person identification in real unconstrained environments.
27.Machine Learning for Precipitation Nowcasting from Radar Images ⬇️
High-resolution nowcasting is an essential tool needed for effective adaptation to climate change, particularly for extreme weather. As Deep Learning (DL) techniques have shown dramatic promise in many domains, including the geosciences, we present an application of DL to the problem of precipitation nowcasting, i.e., high-resolution (1 km x 1 km) short-term (1 hour) predictions of precipitation. We treat forecasting as an image-to-image translation problem and leverage the power of the ubiquitous UNET convolutional neural network. We find this performs favorably when compared to three commonly used models: optical flow, persistence and NOAA's numerical one-hour HRRR nowcasting prediction.
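Of the baselines named above, persistence is the simplest: it predicts that the next radar frame equals the most recent one. The following is a minimal sketch of that baseline and of a mean-squared-error comparison; the function names and the tiny 2x2 grids are hypothetical, and the abstract does not say which error metric the authors use.

```python
def persistence_forecast(frames):
    """Predict the next frame as a copy of the most recent observed frame."""
    return [row[:] for row in frames[-1]]


def mean_squared_error(pred, target):
    """Average squared per-cell difference between two equally sized grids."""
    n_cells = sum(len(row) for row in target)
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / n_cells


# Hypothetical precipitation grids, oldest first (values are made up).
history = [
    [[0.0, 1.0], [2.0, 0.0]],
    [[0.0, 2.0], [3.0, 0.0]],
]
observed_next = [[0.0, 3.0], [3.0, 1.0]]

prediction = persistence_forecast(history)
error = mean_squared_error(prediction, observed_next)
print(prediction, error)
```

A learned model is only useful for nowcasting if it beats this zero-cost reference at the same lead time.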
28.Discriminative Autoencoder for Feature Extraction: Application to Character Recognition ⬇️
Conventionally, autoencoders are unsupervised representation learning tools. In this work, we propose a novel discriminative autoencoder. Use of supervised discriminative learning ensures that the learned representation is robust to variations commonly encountered in image datasets. Using the basic discriminative autoencoder as a unit, we build a stacked architecture aimed at extracting relevant representations from the training data. The efficiency of our feature extraction algorithm ensures high classification accuracy even with simple classification schemes like KNN (K-nearest neighbor). We demonstrate the superiority of our model for representation learning by conducting experiments on standard datasets for character/image recognition and comparing with existing supervised deep architectures such as the class sparse stacked autoencoder and the discriminative deep belief network.
29.Kernel Transform Learning ⬇️
This work proposes kernel transform learning. The idea of dictionary learning is well known; it is a synthesis formulation where a basis is learnt along with the coefficients so as to generate or synthesize the data. Transform learning is its analysis equivalent; the transform operates on, or analyses, the data to generate the coefficients. The concept of kernel dictionary learning has been introduced in the recent past, where the dictionary is represented as a linear combination of non-linear versions of the data. Its success has been showcased in feature extraction. In this work we propose to kernelize transform learning along lines similar to kernel dictionary learning. An efficient solution for kernel transform learning is proposed, especially for problems where the number of samples is much larger than the dimensionality of the input samples, which makes the kernel matrix very high dimensional. Kernel transform learning is compared with other representation learning tools such as the autoencoder and the restricted Boltzmann machine, as well as with dictionary learning (and its kernelized version). The proposed kernel transform learning yields better results than all the aforesaid techniques in experiments carried out on benchmark databases.
30.Large-scale 6D Object Pose Estimation Dataset for Industrial Bin-Picking ⬇️
In this paper, we introduce a new public dataset for 6D object pose estimation and instance segmentation for industrial bin-picking. The dataset comprises both synthetic and real-world scenes. For both, point clouds, depth images, and annotations comprising the 6D pose (position and orientation), a visibility score, and a segmentation mask for each object are provided. Along with the raw data, a method for precisely annotating real-world scenes is proposed. To the best of our knowledge, this is the first public dataset for 6D object pose estimation and instance segmentation for bin-picking containing sufficiently annotated data for learning-based approaches. Furthermore, it is one of the largest public datasets for object pose estimation in general. The dataset is publicly available at this http URL.
31.Bias Remediation in Driver Drowsiness Detection systems using Generative Adversarial Networks ⬇️
Datasets are crucial when training a deep neural network. When datasets are unrepresentative, trained models are prone to bias because they are unable to generalise to real-world settings. This is particularly problematic for models trained in specific cultural contexts, which may not represent a wide range of races, and thus fail to generalise. This is a particular challenge for driver drowsiness detection, where many publicly available datasets are unrepresentative as they cover only certain ethnicity groups. Traditional augmentation methods are unable to improve a model's performance when tested on other groups with different facial attributes, and it is often challenging to build new, more representative datasets. In this paper, we introduce a novel framework that boosts the performance of drowsiness detection for different ethnicity groups. Our framework improves a Convolutional Neural Network (CNN) trained for prediction by using Generative Adversarial Networks (GANs) for targeted data augmentation, based on a population bias visualisation strategy that groups faces with similar facial attributes and highlights where the model is failing. A sampling method selects faces where the model is not performing well, which are used to fine-tune the CNN. Experiments show the efficacy of our approach in improving driver drowsiness detection for under-represented ethnicity groups. Here, models trained on publicly available datasets are compared with a model trained using the proposed data augmentation strategy. Although developed in the context of driver drowsiness detection, the proposed framework is not limited to this task and can be applied to other applications.
32.Approximating Human Judgment of Generated Image Quality ⬇️
Generative models have made immense progress in recent years, particularly in their ability to generate high quality images. However, that quality has been difficult to evaluate rigorously, with evaluation dominated by heuristic approaches that do not correlate well with human judgment, such as the Inception Score and Fréchet Inception Distance. Real human labels have also been used in evaluation, but are inefficient and expensive to collect for each image. Here, we present a novel method to automatically evaluate images based on their quality as perceived by humans. By not only generating image embeddings from Inception network activations and comparing them to the activations for real images, of which other methods perform a variant, but also regressing the activation statistics to match gold standard human labels, we demonstrate 66% accuracy in predicting human scores of image realism, matching the human inter-rater agreement rate. Our approach also generalizes across generative models, suggesting the potential for capturing a model-agnostic measure of image quality. We open source our dataset of human labels for the advancement of research and techniques in this area.
33.A Deep Learning Model for Chilean Bills Classification ⬇️
Automatic bill classification is an attractive task with many potential applications such as automated detection and counting in images or videos. To this end, we present a deep learning model to classify Chilean banknotes, motivated by the successful results of deep learning in image processing applications. Because the number of image samples is limited, data augmentation techniques are introduced to improve the performance of the proposed model. Positive results were achieved in this work, suggesting that it could be a starting point for extension to more complex applications.
34.White Noise Analysis of Neural Networks ⬇️
A white noise analysis of modern deep neural networks is presented to unveil their biases at the whole-network level or the single-neuron level. Our analysis is based on two popular and related methods in psychophysics and neurophysiology, namely classification images and spike triggered analysis. These methods have been widely used to understand the underlying mechanisms of sensory systems in humans and monkeys. We leverage them to investigate the inherent biases of deep neural networks and to obtain a first-order approximation of their functionality. We focus on CNNs since they are currently the state-of-the-art methods in computer vision and are a decent model of human visual processing. In addition, we study multi-layer perceptrons, logistic regression, and recurrent neural networks. Experiments over four classic datasets, MNIST, Fashion-MNIST, CIFAR-10, and ImageNet, show that the computed bias maps resemble the target classes and, when used for classification, lead to more than twice the chance-level performance. Further, we show that classification images can be used to attack a black-box classifier and to detect adversarial patch attacks. Finally, we utilize spike triggered averaging to derive the filters of CNNs and explore how the behavior of a network changes when neurons in different layers are modulated. Our effort illustrates a successful example of borrowing from the neurosciences to study ANNs and highlights the importance of cross-fertilization and synergy across machine learning, deep learning, and computational neuroscience.
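The classification-image idea can be illustrated with a small sketch: feed white noise to a classifier, then average the noise samples by predicted class. The toy linear classifier, its weights, and all function names below are hypothetical stand-ins for a real network, and this is not the authors' experimental pipeline.

```python
import random


def classify(x, w):
    """Toy linear classifier: class 1 if the dot product w.x is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def classification_images(w, n_samples=2000, seed=0):
    """Average the Gaussian white-noise inputs assigned to each class."""
    rng = random.Random(seed)
    dim = len(w)
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        label = classify(noise, w)
        counts[label] += 1
        sums[label] = [s + n for s, n in zip(sums[label], noise)]
    # max(..., 1) guards against division by zero if a class never fires.
    return {c: [s / max(counts[c], 1) for s in sums[c]] for c in sums}


weights = [1.0, -1.0, 0.5, 0.0]  # hypothetical classifier weights
images = classification_images(weights)
# The class-1 average should point in the same direction as the weights.
alignment = sum(wi * ci for wi, ci in zip(weights, images[1]))
print(images[1], alignment)
```

For a linear classifier the per-class average recovers the direction of the weight vector, which is the sense in which such maps give a first-order picture of what the model responds to.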
35.A 3D-Deep-Learning-based Augmented Reality Calibration Method for Robotic Environments using Depth Sensor Data ⬇️
Augmented Reality and mobile robots are gaining much attention within industries due to their high potential to make processes cost and time efficient. To facilitate augmented reality, a calibration between the Augmented Reality device and the environment is necessary. This is a challenge when dealing with mobile robots, because the mobility of all entities makes the environment dynamic. On this account, we propose a novel approach to calibrate the Augmented Reality device using 3D depth sensor data. We use the depth camera of a cutting-edge Augmented Reality device - the Microsoft Hololens - for deep-learning-based calibration. To this end, we modified a neural network based on the recently published VoteNet architecture, which works directly on the point cloud input observed by the Hololens. We achieve satisfying results and eliminate external tools like markers, thus enabling a more intuitive and flexible workflow for Augmented Reality integration. The results are adaptable to work with all depth cameras and are promising for further research. Furthermore, we introduce an open source 3D point cloud labeling tool, which is to our knowledge the first open source tool for labeling raw point cloud data.
36.One Point, One Object: Simultaneous 3D Object Segmentation and 6-DOF Pose Estimation ⬇️
We propose a single-shot method for simultaneous 3D object segmentation and 6-DOF pose estimation in pure 3D point cloud scenes, based on the consensus that "one point only belongs to one object", i.e., each point has the potential to predict the 6-DOF pose of its corresponding object. Recently proposed methods for similar tasks rely on 2D detectors to predict the projection of the 3D corners of the 3D bounding boxes, after which the 6-DOF pose must be estimated by a PnP-like spatial transformation method. In contrast, our method is concise enough not to require additional spatial transformation between different dimensions. Due to the lack of training data for many objects, recently proposed 2D detection methods generate training data with a rendering engine and achieve good results. However, rendering in 3D space along with 6-DOF is relatively difficult. Therefore, we propose an augmented reality technique to generate the training data in a semi-virtual-reality 3D space. The key component of our method is a multi-task CNN architecture that simultaneously predicts the 3D object segmentation and the 6-DOF pose in pure 3D point clouds.
For experimental evaluation, we generate expanded training data for two state-of-the-art 3D object datasets \cite{PLCHF}\cite{TLINEMOD} by using Augmented Reality (AR) technology. We evaluate our proposed method on the two datasets. The results show that our method generalizes well to multiple scenarios and provides performance comparable to or better than the state of the art.
37.Pointwise Attention-Based Atrous Convolutional Neural Networks ⬇️
With the rapid progress of deep convolutional neural networks, in almost all robotic applications, the availability of 3D point clouds improves the accuracy of 3D semantic segmentation methods. Rendering of these irregular, unstructured, and unordered 3D points to 2D images from multiple viewpoints imposes some issues such as loss of information due to 3D to 2D projection, discretizing artifacts, and high computational costs. To efficiently deal with a large number of points and incorporate more context of each point, a pointwise attention-based atrous convolutional neural network architecture is proposed. It focuses on salient 3D feature points among all feature maps while considering outstanding contextual information via spatial channel-wise attention modules. The proposed model has been evaluated on the two most important 3D point cloud datasets for the 3D semantic segmentation task. It achieves a reasonable performance compared to state-of-the-art models in terms of accuracy, with a much smaller number of parameters.
38.A sparsity augmented probabilistic collaborative representation based classification method ⬇️
In order to enhance the performance of image recognition, a sparsity augmented probabilistic collaborative representation based classification (SA-ProCRC) method is presented. The proposed method obtains the dense coefficient through ProCRC, then augments the dense coefficient with a sparse one, and the sparse coefficient is attained by the orthogonal matching pursuit (OMP) algorithm. In contrast to conventional methods which require explicit computation of the reconstruction residuals for each class, the proposed method employs the augmented coefficient and the label matrix of the training samples to classify the test sample. Experimental results indicate that the proposed method can achieve promising results for face and scene images. The source code of our proposed SA-ProCRC is accessible at this https URL.
39.Deep Learning for 3D Point Clouds: A Survey ⬇️
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has begun to thrive, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks: 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.
40.A General Framework for Saliency Detection Methods ⬇️
Saliency detection is one of the most challenging problems in the fields of image analysis and computer vision. Many approaches propose different architectures based on the psychological and biological properties of the human visual attention system. However, there is still no abstract framework that summarizes the existing methods. In this paper, we offer a general framework for saliency models, which consists of five main steps: pre-processing, feature extraction, saliency map generation, saliency map combination, and post-processing. We also study different saliency models that implement each step and compare their performance. This framework helps researchers to gain a comprehensive view when studying new methods.
41.An Abstraction Model for Semantic Segmentation Algorithms ⬇️
Semantic segmentation is the process of classifying each pixel in an image. Due to its advantages, semantic segmentation is used in many tasks such as cancer detection, robot-assisted surgery, satellite image analysis, self-driving car control, etc. Accuracy and efficiency are the two crucial goals in this process, and several state-of-the-art neural networks address them. Each method employs different techniques to increase efficiency and accuracy and to reduce costs. The diversity of the approaches implemented for semantic segmentation makes it difficult for researchers to achieve a comprehensive view of the field. To offer such a view, this paper proposes an abstraction model for the task of semantic segmentation. The proposed framework consists of four general blocks that cover the majority of the methods that have been proposed for semantic segmentation. We also compare different approaches and consider the importance of each part in the overall performance of a method.
42.Necessary and Sufficient Polynomial Constraints on Compatible Triplets of Essential Matrices ⬇️
The essential matrix incorporates the relative rotation and translation parameters of two calibrated cameras. The well-known algebraic characterization of essential matrices, i.e. necessary and sufficient conditions under which an arbitrary matrix (of rank two) becomes essential, consists of a unique matrix equation of degree three. Based on this equation, a number of efficient algorithmic solutions to different relative pose estimation problems have been proposed. In three views, a possible way to describe the geometry of three calibrated cameras comes from considering compatible triplets of essential matrices. By compatibility we mean that the triplet corresponds to a certain configuration of calibrated cameras. The main goal of this paper is to give an algebraic characterization of compatible triplets of essential matrices. Specifically, we propose necessary and sufficient polynomial constraints on a triplet of real rank-two essential matrices that ensure its compatibility. The constraints are given in the form of six cubic matrix equations, one quartic scalar equation, and one sextic scalar equation. An important advantage of the proposed constraints is their sufficiency even in the case of cameras with collinear centers. Applications of the constraints may include relative camera pose estimation in three or more views, averaging of essential matrices for incremental structure from motion, multiview camera auto-calibration, etc.
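The degree-three equation for a single essential matrix is commonly written as 2 E Eᵀ E − tr(E Eᵀ) E = 0. The following is a minimal numerical sketch of that two-view constraint only, not of the triplet constraints proposed in the paper. The helper names, the translation vector, and the rotation angle are hypothetical, and E is built as [t]ₓ R.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    return [list(col) for col in zip(*a)]


def skew(t):
    """Cross-product matrix [t]_x such that [t]_x v = t x v."""
    x, y, z = t
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def rotation_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def cubic_residual(e):
    """Entries of 2 E E^T E - tr(E E^T) E, which vanish for an essential matrix."""
    eet = matmul(e, transpose(e))
    trace = sum(eet[i][i] for i in range(3))
    eete = matmul(eet, e)
    return [[2.0 * eete[i][j] - trace * e[i][j] for j in range(3)] for i in range(3)]


# Hypothetical relative pose: translation t and a rotation about the z axis.
E = matmul(skew([0.3, -1.2, 0.7]), rotation_z(0.4))
max_residual = max(abs(v) for row in cubic_residual(E) for v in row)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity_residual = max(abs(v) for row in cubic_residual(identity) for v in row)
print(max_residual, identity_residual)
```

The residual is zero up to rounding for the constructed E, while the identity matrix, which is not essential, leaves a residual of magnitude one.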
43.Spotting Macro- and Micro-expression Intervals in Long Video Sequences ⬇️
This paper presents baseline results for the Third Facial Micro-Expression Grand Challenge (MEGC 2020). Both macro- and micro-expression intervals in CAS(ME)$^2$ and SAMM Long Videos are spotted by employing the method of Main Directional Maximal Difference Analysis (MDMD). The MDMD method uses the magnitude maximal difference in the main direction of optical flow features to spot facial movements. The single-frame prediction results of the original MDMD method are post-processed into reasonable video intervals. The baseline results are evaluated with the F1-score: for CAS(ME)$^2$, the F1-scores are 0.1196 and 0.0082 for macro- and micro-expressions respectively, and the overall F1-score is 0.0376; for SAMM Long Videos, the F1-scores are 0.0629 and 0.0364 for macro- and micro-expressions respectively, and the overall F1-score is 0.0445. The baseline project code is publicly available at this https URL.
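For reference, the F1-score is computed from the counts of true positives, false positives, and false negatives. The sketch below uses made-up counts and assumes the overall score is obtained by pooling the macro- and micro-expression counts; the abstract does not state this, although it would explain why an overall score need not be the mean of the two per-type scores.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no counts at all."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 0.0 if denominator == 0 else 2 * true_positives / denominator


# Hypothetical spotting counts (not taken from the paper).
macro_counts = {"tp": 3, "fp": 5, "fn": 9}
micro_counts = {"tp": 1, "fp": 20, "fn": 4}

macro_f1 = f1_score(macro_counts["tp"], macro_counts["fp"], macro_counts["fn"])
micro_f1 = f1_score(micro_counts["tp"], micro_counts["fp"], micro_counts["fn"])

# Assumed pooling: add the counts of both expression types, then apply the formula once.
overall_f1 = f1_score(
    macro_counts["tp"] + micro_counts["tp"],
    macro_counts["fp"] + micro_counts["fp"],
    macro_counts["fn"] + micro_counts["fn"],
)
print(macro_f1, micro_f1, overall_f1)
```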
44.HoMM: Higher-order Moment Matching for Unsupervised Domain Adaptation ⬇️
Minimizing the discrepancy of feature distributions between different domains is one of the most promising directions in unsupervised domain adaptation. From the perspective of distribution matching, most existing discrepancy-based methods are designed to match second-order or lower statistics, which, however, have limited ability to express the statistical characteristics of non-Gaussian distributions. In this work, we explore the benefits of using higher-order statistics (mainly third-order and fourth-order statistics) for domain matching. We propose a Higher-order Moment Matching (HoMM) method, and further extend HoMM into reproducing kernel Hilbert spaces (RKHS). In particular, our proposed HoMM can perform arbitrary-order moment tensor matching; we show that first-order HoMM is equivalent to Maximum Mean Discrepancy (MMD) and second-order HoMM is equivalent to Correlation Alignment (CORAL). Moreover, third-order and fourth-order moment tensor matching are expected to perform comprehensive domain alignment, as higher-order statistics can approximate more complex, non-Gaussian distributions. We also exploit pseudo-labeled target samples to learn discriminative representations in the target domain, which further improves the transfer performance. Extensive experiments show that our proposed HoMM consistently outperforms existing moment matching methods by a large margin. Code is available at this https URL.
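To make moment tensor matching concrete, the sketch below computes an empirical moment tensor of a chosen order for two small feature sets and takes the squared difference between them. It is an illustration under stated assumptions, not the authors' loss: it uses raw (uncentred) moments without normalisation, and the function names and sample values are hypothetical.

```python
from itertools import product


def moment_tensor(samples, order):
    """Empirical raw moment tensor of the given order, keyed by index tuples."""
    dim = len(samples[0])
    n = len(samples)
    tensor = {}
    for idx in product(range(dim), repeat=order):
        total = 0.0
        for x in samples:
            term = 1.0
            for i in idx:
                term *= x[i]
            total += term
        tensor[idx] = total / n
    return tensor


def moment_discrepancy(source, target, order):
    """Sum of squared entry-wise differences between the two moment tensors."""
    source_moments = moment_tensor(source, order)
    target_moments = moment_tensor(target, order)
    return sum((source_moments[k] - target_moments[k]) ** 2 for k in source_moments)


# Hypothetical 2-D features from a source and a target domain.
source_features = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
target_features = [[0.0, 2.0], [1.0, 0.0], [1.0, 1.0]]

same_domain = moment_discrepancy(source_features, source_features, order=3)
cross_domain = moment_discrepancy(source_features, target_features, order=3)
print(same_domain, cross_domain)
```

The tensor has dim^order entries, so this explicit form is only practical for small feature dimensions and low orders.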
45.A single target tracking algorithm based on Generative Adversarial Networks ⬇️
In single target tracking, loss of the tracked target due to occlusion is a ubiquitous and arduous problem. To solve this problem, we propose a single target tracking algorithm with anti-occlusion capability. The core of our algorithm is to use a Region Proposal Network to obtain the tracked target and potential interfering objects, and to use an occlusion awareness module to judge whether an interfering object occludes the target. If no occlusion occurs, tracking continues. If occlusion occurs, the prediction module is started, and the motion trajectory of the target in subsequent frames is predicted from its motion trajectory before occlusion. The result obtained by the prediction module replaces the target position feature obtained by the original tracking algorithm. This solves the problem of the tracking algorithm losing the target under occlusion. In practice, our algorithm can successfully track the target in the occluded dataset. On the VOT2018 dataset, our algorithm has an EAO of 0.421, an Accuracy of 0.67, and a Robustness of 0.186. Compared with SiamRPN++, these increased by 1.69%, 11.67% and 9.3%, respectively.
46.Non-Cooperative Game Theory Based Rate Adaptation for Dynamic Video Streaming over HTTP ⬇️
Dynamic Adaptive Streaming over HTTP (DASH) has proven to be an emerging and promising multimedia streaming technique, owing to its capability of dealing with the variability of networks. The rate adaptation mechanism, a challenging and open issue, plays an important role in DASH-based systems since it affects the Quality of Experience (QoE) of users, network utilization, etc. In this paper, based on non-cooperative game theory, we propose a novel algorithm to optimally allocate the limited export bandwidth of the server to multiple users so as to maximize their QoE with fairness guaranteed. The proposed algorithm is proxy-free. Specifically, a novel user QoE model is derived by taking a variety of factors into account, such as the received video quality, the reference buffer length, and the users' accumulated buffer lengths. Then, the bandwidth competition problem is formulated as a non-cooperative game for which the existence of a Nash Equilibrium is theoretically proven. Finally, a distributed iterative algorithm with stability analysis is proposed to find the Nash Equilibrium. Extensive experimental results in both simulated and realistic networking scenarios demonstrate that, compared with state-of-the-art methods, the proposed algorithm produces higher QoE, and the actual buffer lengths of all users remain in nearly optimal states, i.e., staying around the reference buffer length all the time. In addition, the proposed algorithm produces no playback interruption.
47.Apricot variety classification using image processing and machine learning approaches ⬇️
Apricot, which is a cultivated type of Zerdali (wild apricot), has an important place in human nutrition, and its medical properties are essential for human health. The objective of this research was to obtain a model for apricot mass and to separate apricot varieties with image processing technology using external features of the fruit. In this study, five varieties of apricot were used. To determine the size of the fruits, three mutually perpendicular axes were defined: length, width, and thickness. Measurements show that the effect of variety on all properties was statistically significant at the 1% probability level. Furthermore, there is no significant difference between the dimensions estimated by the image processing approach and the actual dimensions. The developed system consists of a digital camera, a light diffusion chamber, a distance adjustment pedestal, and a personal computer. Images taken by the digital camera were stored as RGB for further analysis. The images were taken for 49 samples of each cultivar in three directions. A linear equation is recommended to calculate the apricot mass based on the length and the width, with R² = 0.97. In addition, an ANFIS model with C-means was the best model for classifying the apricot varieties based on the physical features, including length, width, thickness, mass, and projected area of three perpendicular surfaces. The accuracy of the model was 87.7.
48.Category-Level Articulated Object Pose Estimation ⬇️
This paper addresses the task of category-level pose estimation for articulated objects from a single depth image. We present a novel category-level approach that correctly accommodates object instances not previously seen during training. A key aspect of the work is the new Articulation-Aware Normalized Coordinate Space Hierarchy (A-NCSH), which represents the different articulated objects for a given object category. This approach not only provides the canonical representation of each rigid part, but also normalizes the joint parameters and joint states. We developed a deep network based on PointNet++ that is capable of predicting an A-NCSH representation for unseen object instances from single depth input. The predicted A-NCSH representation is then used for global pose optimization using kinematic constraints. We demonstrate that constraints associated with joints in the kinematic chain lead to improved performance in estimating pose and relative scale for each part of the object. We also demonstrate that the approach can tolerate cases of severe occlusion in the observed data. Project webpage this https URL
49.A simple baseline for domain adaptation using rotation prediction ⬇️
Recently, domain adaptation has become a hot research area with lots of applications. The goal is to adapt a model trained in one domain to another domain with scarce annotated data. We propose a simple yet effective method based on self-supervised learning that outperforms or is on par with most state-of-the-art algorithms, e.g. adversarial domain adaptation. Our method involves two phases: predicting random rotations (self-supervised) on the target domain along with correct labels for the source domain (supervised), and then using self-distillation on the target domain. Our simple method achieves state-of-the-art results on semi-supervised domain adaptation on DomainNet dataset.
Further, we observe that the unlabeled target datasets of popular domain adaptation benchmarks do not contain any categories apart from testing categories. We believe this introduces a bias that does not exist in many real applications. We show that removing this bias from the unlabeled data results in a large drop in performance of state-of-the-art methods, while our simple method is relatively robust.
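The self-supervised half of the first phase can be sketched as follows: each unlabeled target image is rotated and the rotation index becomes its label. This sketch assumes rotations in multiples of 90 degrees, which is the usual choice for rotation prediction but is not stated in the abstract; the function names and the 2x2 image are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D list-of-lists image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def rotation_examples(image):
    """Return (rotated_image, label) pairs for 0, 90, 180 and 270 degrees."""
    examples = []
    current = [row[:] for row in image]
    for label in range(4):
        examples.append((current, label))
        current = rotate90(current)
    return examples


# Hypothetical unlabeled target-domain image.
target_image = [[1, 2], [3, 4]]
examples = rotation_examples(target_image)
for rotated, label in examples:
    print(label, rotated)
```

A training loop would sample one of these pairs at random per image and train a rotation classifier on it, alongside the ordinary supervised loss on labeled source images.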
50.3DFR: A Swift 3D Feature Reductionist Framework for Scene Independent Change Detection ⬇️
In this paper we propose an end-to-end swift 3D feature reductionist framework (3DFR) for scene independent change detection. The 3DFR framework consists of three feature streams: a swift 3D feature reductionist stream (AvFeat), a contemporary feature stream (ConFeat) and a temporal median feature map. These multilateral foreground/background features are further refined through an encoder-decoder network. As a result, the proposed framework not only detects temporal changes but also learns high-level appearance features. Thus, it incorporates the object semantics for effective change detection. Furthermore, the proposed framework is validated through a scene independent evaluation scheme in order to demonstrate the robustness and generalization capability of the network. The performance of the proposed method is evaluated on the benchmark CDnet 2014 dataset. The experimental results show that the proposed 3DFR network outperforms the state-of-the-art approaches.
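One of the three streams, the temporal median feature map, is easy to illustrate: taking the per-pixel median over a stack of frames suppresses objects that appear only briefly. The sketch below is a hypothetical illustration; the thresholding step is added here only to show the effect, whereas the paper feeds the streams into an encoder-decoder network.

```python
from statistics import median


def temporal_median(frames):
    """Per-pixel median over a list of equally sized 2D frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[r][c] for frame in frames) for c in range(width)]
        for r in range(height)
    ]


def change_mask(frame, background, threshold):
    """Mark pixels that differ from the background by more than the threshold."""
    return [
        [abs(p - b) > threshold for p, b in zip(frame_row, background_row)]
        for frame_row, background_row in zip(frame, background)
    ]


# Hypothetical 1x3 grayscale frames; a bright object appears in the middle frame.
frames = [
    [[10, 10, 10]],
    [[10, 200, 10]],
    [[12, 10, 10]],
]

background = temporal_median(frames)
mask = change_mask(frames[1], background, threshold=50)
print(background, mask)
```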
51.W-PoseNet: Dense Correspondence Regularized Pixel Pair Pose Regression ⬇️
6D pose estimation is non-trivial because it must cope with intrinsic appearance and shape variation and severe inter-object occlusion, and it is made more challenging by large extrinsic illumination changes and the low quality of data acquired in uncontrolled environments. This paper introduces a novel pose estimation algorithm, W-PoseNet, which densely regresses from input data to 6D pose and also to 3D coordinates in model space. In other words, local features learned for pose regression in our deep network are regularized by explicitly learning a pixel-wise correspondence mapping onto 3D pose-sensitive coordinates as an auxiliary task. Moreover, a sparse pair combination of pixel-wise features and soft voting on pixel-pair pose predictions are designed to improve robustness to inconsistent and sparse local features. Experimental results on the popular YCB-Video and LineMOD benchmarks show that the proposed W-PoseNet consistently achieves superior performance to state-of-the-art algorithms.
52.Vision and Language: from Visual Perception to Content Creation ⬇️
Vision and language are two fundamental capabilities of human intelligence. Humans routinely perform tasks through the interactions between vision and language, supporting the uniquely human capacity to talk about what they see or to hallucinate a picture from a natural-language description. The question of how language interacts with vision motivates researchers to expand the horizons of computer vision. In particular, "vision to language" has probably been one of the most popular topics of the past five years, with significant growth in both the volume of publications and the range of applications, e.g., captioning, visual question answering, visual dialog, language navigation, etc. Such tasks boost visual perception with more comprehensive understanding and diverse linguistic representations. Going beyond the progress made in "vision to language," language can also contribute to vision understanding and offer new possibilities of visual content creation, i.e., "language to vision." This process acts as a prism through which visual content is created conditioned on the language inputs. This paper reviews the recent advances along these two dimensions: "vision to language" and "language to vision." More concretely, the former mainly focuses on the development of image/video captioning, as well as typical encoder-decoder structures and benchmarks, while the latter summarizes the technologies of visual content creation. The real-world deployment and services of vision and language are elaborated as well.
53.Hyperspectral and multispectral image fusion under spectrally varying spatial blurs -- Application to high dimensional infrared astronomical imaging ⬇️
Hyperspectral imaging has become a significant source of valuable data for astronomers over the past decades. Current instrumental and observing time constraints allow direct acquisition of multispectral images, with high spatial but low spectral resolution, and hyperspectral images, with low spatial but high spectral resolution. To enhance scientific interpretation of the data, we propose a data fusion method which combines the benefits of each image to recover a high spatio-spectral resolution datacube. The proposed inverse problem accounts for the specificities of astronomical instruments, such as spectrally variant blurs. We provide a fast implementation by solving the problem in the frequency domain and in a low-dimensional subspace to efficiently handle the convolution operators as well as the high dimensionality of the data. We conduct experiments on a realistic synthetic dataset of simulated observation of the upcoming James Webb Space Telescope, and we show that our fusion algorithm outperforms state-of-the-art methods commonly used in remote sensing for Earth observation.
54.A Review on Intelligent Object Perception Methods Combining Knowledge-based Reasoning and Machine Learning ⬇️
Object perception is a fundamental sub-field of Computer Vision, covering a multitude of individual areas and having contributed high-impact results. While Machine Learning has been traditionally applied to address related problems, recent works also seek ways to integrate knowledge engineering in order to expand the level of intelligence of the visual interpretation of objects, their properties and their relations with their environment. In this paper, we attempt a systematic investigation of how knowledge-based methods contribute to diverse object perception tasks. We review the latest achievements and identify prominent research directions.
55.Domain Adaptation Regularization for Spectral Pruning ⬇️
Deep Neural Networks (DNNs) have recently been achieving state-of-the-art performance on a variety of computer vision related tasks. However, their computational cost limits their ability to be implemented in embedded systems with restricted resources or strict latency constraints. Model compression has therefore been an active field of research to overcome this issue. On the other hand, DNNs typically require massive amounts of labeled data to be trained. This represents a second limitation to their deployment. Domain Adaptation (DA) addresses this issue by transferring knowledge learned on one labeled source distribution to a target distribution, possibly unlabeled. In this paper, we investigate possible improvements of compression methods in the DA setting. We focus on a compression method that was previously developed in the context of a single data distribution and show that, with a careful choice of data to use during compression and additional regularization terms directly related to DA objectives, it is possible to improve compression results. We also show that our method outperforms an existing compression method studied in the DA setting by a large margin for high compression rates. Although our work is based on one specific compression method, we also outline some general guidelines for improving compression in the DA setting.
56.Benchmarking Adversarial Robustness ⬇️
Deep neural networks are vulnerable to adversarial examples, which has become one of the most important research problems in the development of deep learning. While many efforts have been made in recent years, it is of great significance to perform correct and complete evaluations of adversarial attack and defense algorithms. In this paper, we establish a comprehensive, rigorous, and coherent benchmark to evaluate adversarial robustness on image classification tasks. After briefly reviewing a range of representative attack and defense methods, we perform large-scale experiments with two robustness curves as the fair-minded evaluation criteria to fully understand the performance of these methods. Based on the evaluation results, we draw several important findings and provide insights for future research.
57.Graph Embedded Pose Clustering for Anomaly Detection ⬇️
We propose a new method for anomaly detection of human actions. Our method works directly on human pose graphs that can be computed from an input video sequence. This makes the analysis independent of nuisance parameters such as viewpoint or illumination. We map these graphs to a latent space and cluster them. Each action is then represented by its soft-assignment to each of the clusters. This gives a kind of "bag of words" representation to the data, where every action is represented by its similarity to a group of base action-words. Then, we use a Dirichlet process based mixture, that is useful for handling proportional data such as our soft-assignment vectors, to determine if an action is normal or not.
We evaluate our method on two types of data sets. The first is a fine-grained anomaly detection data set (e.g., ShanghaiTech) where we wish to detect unusual variations of some action. The second is a coarse-grained anomaly detection data set (e.g., a Kinetics-based data set) where few actions are considered normal, and every other action should be considered abnormal.
Extensive experiments on the benchmarks show that our method performs considerably better than other state of the art methods.
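The soft-assignment representation described in this abstract can be illustrated with a minimal sketch. It assumes a softmax over negative squared distances to cluster centroids; the abstract does not state which kernel the authors use, and the function name `soft_assign`, the `temperature` parameter, and the toy centroids are hypothetical.

```python
import math

def soft_assign(z, centroids, temperature=1.0):
    """Return a probability vector over clusters for one embedding z.

    Assumption: softmax over negative squared Euclidean distances.
    The abstract only says that each action is represented by its
    soft-assignment to each cluster, not which kernel is used.
    """
    logits = [-sum((a - b) ** 2 for a, b in zip(z, c)) / temperature
              for c in centroids]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical cluster centres ("action-words") in a 2D latent space.
centroids = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
p = soft_assign([0.5, 0.2], centroids)  # embedding closest to the first centre
```

The resulting vector `p` sums to one, which is why a model for proportional data is a natural fit for the subsequent normal/abnormal decision.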
58.Efficient Video Semantic Segmentation with Labels Propagation and Refinement ⬇️
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach. We propose an Efficient Video Segmentation (EVS) pipeline that combines:
(i) On the CPU, a very fast optical flow method, that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next. It runs in parallel with the GPU.
(ii) On the GPU, two Convolutional Neural Networks: A main segmentation network that is used to predict dense semantic labels from scratch, and a Refiner that is designed to improve predictions from previous frames with the help of a fast Inconsistencies Attention Module (IAM). The latter can identify regions that cannot be propagated accurately.
We suggest several operating points depending on the desired frame rate and accuracy. Our pipeline achieves accuracy levels competitive with the existing real-time methods for semantic image segmentation (mIoU above 60%), while achieving much higher frame rates. On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
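The propagation step in (i) can be sketched as a nearest-neighbour warp of the previous frame's labels along an optical flow field. This is only an illustration under stated assumptions: the flow is taken to be a backward flow (current pixel to previous pixel), and pixels whose source falls outside the frame are marked `unknown` as a crude stand-in for regions that a refiner would need to handle. The function name and data layout are hypothetical, not the authors' implementation.

```python
def propagate_labels(prev_labels, backward_flow, unknown=-1):
    """Warp a label map from the previous frame into the current frame.

    prev_labels:   2D list of integer class ids for the previous frame.
    backward_flow: 2D list of (dx, dy) pairs; for each pixel of the current
                   frame, the offset to its source pixel in the previous frame
                   (an assumed convention).
    """
    h, w = len(prev_labels), len(prev_labels[0])
    out = [[unknown] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = backward_flow[y][x]
            sx, sy = int(round(x + dx)), int(round(y + dy))
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = prev_labels[sy][sx]  # copy the label along the flow
    return out

prev = [[0, 0, 1],
        [0, 1, 1]]
# Every pixel of the current frame originates one pixel to its left.
flow = [[(-1.0, 0.0)] * 3 for _ in range(2)]
cur = propagate_labels(prev, flow)
```

In this toy case the leftmost column has no valid source and stays `unknown`, which is the kind of region where a fresh prediction or a refinement would be required.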
59.History-based Anomaly Detector: an Adversarial Approach to Anomaly Detection ⬇️
Anomaly detection is a difficult problem in many areas and has recently been subject to a lot of attention. Classifying unseen data as anomalous is a challenging matter. Latest proposed methods rely on Generative Adversarial Networks (GANs) to estimate the normal data distribution, and produce an anomaly score prediction for any given data. In this article, we propose a simple yet new adversarial method to tackle this problem, denoted as History-based anomaly detector (HistoryAD). It consists of a self-supervised model, trained to recognize 'normal' samples by comparing them to samples based on the training history of a previously trained GAN. Quantitative and qualitative results are presented evaluating its performance. We also present a comparison to several state-of-the-art methods for anomaly detection showing that our proposal achieves top-tier results on several datasets.
60.An Ensemble Rate Adaptation Framework for Dynamic Adaptive Streaming Over HTTP ⬇️
Rate adaptation is one of the most important issues in dynamic adaptive streaming over HTTP (DASH). Due to the frequent fluctuations of the network bandwidth and complex variations of video content, it is difficult to deal with the varying network conditions and video content perfectly by using a single rate adaptation method. In this paper, we propose an ensemble rate adaptation framework for DASH, which aims to leverage the advantages of multiple methods involved in the framework to improve the quality of experience (QoE) of users. The proposed framework is simple yet very effective. Specifically, the proposed framework is composed of two modules, i.e., the method pool and the method controller. In the method pool, several rate adaptation methods are integrated. At each decision time, only the method that can achieve the best QoE is chosen to determine the bitrate of the requested video segment. Besides, we also propose two strategies for switching methods, i.e., InstAnt Method Switching and InterMittent Method Switching, for the method controller to determine which method can provide the best QoE. Simulation results demonstrate that the proposed framework always achieves the highest QoE under changes of channel environment and video complexity, compared with state-of-the-art rate adaptation methods.
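The method pool and method controller can be sketched as follows. Everything concrete here is an assumption made for illustration: the bitrate ladder, the two toy rules in the pool, and the linear QoE proxy with its weights are hypothetical and are not taken from the paper. The sketch re-selects a method at every decision, which corresponds loosely to switching at each decision time.

```python
BITRATES = [300, 750, 1200, 2850]  # kbps; hypothetical bitrate ladder

def throughput_rule(bandwidth_kbps, buffer_s):
    """Pick the highest bitrate that does not exceed the bandwidth estimate."""
    ok = [b for b in BITRATES if b <= bandwidth_kbps]
    return ok[-1] if ok else BITRATES[0]

def buffer_rule(bandwidth_kbps, buffer_s):
    """Pick a bitrate from buffer occupancy alone: fuller buffer, higher rate."""
    idx = min(int(buffer_s // 5), len(BITRATES) - 1)
    return BITRATES[idx]

def qoe(bitrate, last_bitrate, bandwidth_kbps, buffer_s, segment_s=4.0):
    """Hypothetical QoE proxy: quality minus rebuffering and switching penalties."""
    download_s = bitrate * segment_s / bandwidth_kbps
    rebuffer_s = max(0.0, download_s - buffer_s)
    switch = abs(bitrate - last_bitrate) / 1000.0
    return bitrate / 1000.0 - 4.3 * rebuffer_s - switch

def choose(pool, last_bitrate, bandwidth_kbps, buffer_s):
    """Method controller: run every method, keep the one with the best QoE."""
    candidates = [(m.__name__, m(bandwidth_kbps, buffer_s)) for m in pool]
    return max(candidates,
               key=lambda c: qoe(c[1], last_bitrate, bandwidth_kbps, buffer_s))

pool = [throughput_rule, buffer_rule]
decision = choose(pool, last_bitrate=750, bandwidth_kbps=1000.0, buffer_s=2.0)
```

With a nearly empty buffer the throughput rule's choice would stall playback under this proxy, so the controller falls back to the more conservative buffer rule.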
61.SESS: Self-Ensembling Semi-Supervised 3D Object Detection ⬇️
The performance of existing point cloud-based 3D object detection methods heavily relies on large-scale high-quality 3D annotations. However, such annotations are often tedious and expensive to collect. Semi-supervised learning is a good alternative to mitigate the data annotation issue, but has remained largely unexplored in 3D object detection. Inspired by the recent success of self-ensembling technique in semi-supervised image classification task, we propose SESS, a self-ensembling semi-supervised 3D object detection framework. Specifically, we design a thorough perturbation scheme to enhance generalization of the network on unlabeled and new unseen data. Furthermore, we propose three consistency losses to enforce the consistency between two sets of predicted 3D object proposals, to facilitate the learning of structure and semantic invariances of objects. Extensive experiments conducted on SUN RGB-D and ScanNet datasets demonstrate the effectiveness of SESS in both inductive and transductive semi-supervised 3D object detection. Our SESS achieves competitive performance compared to the state-of-the-art fully-supervised method by using only 50% labeled data.
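As a rough illustration of self-ensembling with a consistency loss, the sketch below assumes a teacher whose weights are an exponential moving average (EMA) of the student's, and a consistency term that matches each student proposal centre to its nearest teacher centre. Both choices are assumptions in the spirit of common self-ensembling setups; the abstract does not spell out its three losses, and all names and the `decay` value are hypothetical.

```python
def ema_update(teacher, student, decay=0.99):
    """Move teacher weights towards student weights (assumed EMA teacher)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def center_consistency(student_centers, teacher_centers):
    """Mean squared distance from each student centre to its nearest teacher centre."""
    total = 0.0
    for s in student_centers:
        total += min(sum((a - b) ** 2 for a, b in zip(s, t))
                     for t in teacher_centers)
    return total / len(student_centers)

teacher_w = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
loss = center_consistency([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
                          [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
```

Because the consistency term needs no ground-truth boxes, it can be evaluated on unlabeled scenes, which is what makes this family of methods semi-supervised.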
62.Learning Inverse Depth Regression for Multi-View Stereo with Correlation Cost Volume ⬇️
Deep learning has shown to be effective for depth inference in multi-view stereo (MVS). However, the scalability and accuracy still remain an open problem in this domain. This can be attributed to the memory-consuming cost volume representation and inappropriate depth inference. Inspired by the group-wise correlation in stereo matching, we propose an average group-wise correlation similarity measure to construct a lightweight cost volume. This can not only reduce the memory consumption but also reduce the computational burden in the cost volume filtering. Based on our effective cost volume representation, we propose a cascade 3D U-Net module to regularize the cost volume to further boost the performance. Unlike the previous methods that treat multi-view depth inference as a depth regression problem or an inverse depth classification problem, we recast multi-view depth inference as an inverse depth regression task. This allows our network to achieve sub-pixel estimation and be applicable to large-scale scenes. Through extensive experiments on DTU dataset and Tanks and Temples dataset, we show that our proposed network with Correlation cost volume and Inverse DEpth Regression (CIDER), achieves state-of-the-art results, demonstrating its superior performance on scalability and accuracy.
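The idea of regressing inverse depth rather than classifying it can be sketched with a soft-argmin over uniformly spaced inverse-depth hypotheses. The soft-argmin formulation and the function names are assumptions for illustration; the abstract states only that depth inference is recast as inverse depth regression.

```python
import math

def inverse_depth_hypotheses(d_min, d_max, n):
    """Sample n hypotheses uniformly in inverse depth between 1/d_max and 1/d_min."""
    inv_near, inv_far = 1.0 / d_min, 1.0 / d_max
    step = (inv_near - inv_far) / (n - 1)
    return [inv_far + i * step for i in range(n)]

def regress_depth(costs, inv_depths):
    """Soft-argmin: expected inverse depth under softmax(-cost), then invert."""
    m = min(costs)  # shift costs for numerical stability
    weights = [math.exp(-(c - m)) for c in costs]
    total = sum(weights)
    expected_inv = sum(w * v for w, v in zip(weights, inv_depths)) / total
    return 1.0 / expected_inv

hyps = inverse_depth_hypotheses(1.0, 10.0, 4)         # inverse depths 0.1 .. 1.0
sharp = regress_depth([50.0, 0.0, 50.0, 50.0], hyps)  # one clear minimum
blend = regress_depth([50.0, 0.0, 0.0, 50.0], hyps)   # two equally good samples
```

When two neighbouring hypotheses are equally likely, the estimate falls between them, which is how a regression formulation can reach values between the sampled hypotheses. Uniform spacing in inverse depth also places more samples near the camera, which helps cover large depth ranges.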
63.Planar Prior Assisted PatchMatch Multi-View Stereo ⬇️
The completeness of 3D models is still a challenging problem in multi-view stereo (MVS) due to the unreliable photometric consistency in low-textured areas. Since low-textured areas usually exhibit strong planarity, planar models are advantageous to the depth estimation of low-textured areas. On the other hand, PatchMatch multi-view stereo is very efficient for its sampling and propagation scheme. By taking advantage of planar models and PatchMatch multi-view stereo, we propose a planar prior assisted PatchMatch multi-view stereo framework in this paper. In detail, we utilize a probabilistic graphical model to embed planar models into PatchMatch multi-view stereo and contribute a novel multi-view aggregated matching cost. This novel cost takes both photometric consistency and planar compatibility into consideration, making it suited for the depth estimation of both non-planar and planar regions. Experimental results demonstrate that our method can efficiently recover the depth information of extremely low-textured areas, thus obtaining highly complete 3D models and achieving state-of-the-art performance.
64.Controllable and Progressive Image Extrapolation ⬇️
Image extrapolation aims at expanding the narrow field of view of a given image patch. Existing models mainly deal with natural scene images of homogeneous regions and have no control of the content generation process. In this work, we study conditional image extrapolation to synthesize new images guided by the input structured text. The text is represented as a graph to specify the objects and their spatial relation to the unknown regions of the image. Inspired by drawing techniques, we propose a progressive generative model of three stages, i.e., generating a coarse bounding-boxes layout, refining it to a finer segmentation layout, and mapping the layout to a realistic output. Such a multi-stage design is shown to facilitate the training process and generate more controllable results. We validate the effectiveness of the proposed method on the face and human clothing dataset in terms of visual results, quantitative evaluations and flexible controls.
65.Extreme Relative Pose Network under Hybrid Representations ⬇️
In this paper, we introduce a novel RGB-D based relative pose estimation approach that is suitable for small-overlapping or non-overlapping scans and can output multiple relative poses. Our method performs scene completion and matches the completed scans. However, instead of using a fixed representation for completion, the key idea is to utilize hybrid representations that combine 360-image, 2D image-based layout, and planar patches. This approach offers adaptive feature representations for relative pose estimation. Besides, we introduce a global-2-local matching procedure, which utilizes initial relative poses obtained during the global phase to detect and then integrate geometric relations for pose refinement. Experimental results justify the potential of this approach across a wide range of benchmark datasets. For example, on ScanNet, the rotation/translation errors of the top-1/top-5 predictions of our approach are 34.9/0.69m and 19.6/0.57m, respectively. Our approach also considerably boosts the performance of multi-scan reconstruction in few-view reconstruction settings.
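For readers unfamiliar with how rotation and translation errors of a relative pose are usually measured, here is a minimal sketch. It assumes the common definitions (geodesic angle between rotation matrices, Euclidean distance between translations); the abstract does not define its metrics, so these may differ from the authors' evaluation protocol. The helper `rot_z` exists only to build test rotations.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    """Angle in degrees of the residual rotation R_pred^T R_gt."""
    # trace(R_pred^T R_gt) equals the sum of element-wise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_pred, t_gt)))

def rot_z(deg):
    """Rotation about the z-axis by deg degrees, as a 3x3 nested list."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]

r_err = rotation_error_deg(rot_z(30.0), rot_z(0.0))
t_err = translation_error([0.0, 0.0, 0.0], [0.3, 0.4, 0.0])
```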
66.Multi-Modal Attention-based Fusion Model for Semantic Segmentation of RGB-Depth Images ⬇️
3D scene understanding is a crucial requirement in computer vision and robotics applications. One of the high-level tasks in 3D scene understanding is semantic segmentation of RGB-Depth images. With the availability of RGB-D cameras, it is desirable to improve the accuracy of the scene understanding process by exploiting depth features along with appearance features. As depth images are independent of illumination, they can improve the quality of semantic labeling alongside RGB images. Considering both the common and the specific features of these two modalities improves the performance of semantic segmentation. One of the main problems in RGB-Depth semantic segmentation is how to fuse or combine these two modalities so as to gain the advantages of each while remaining computationally efficient. Recently, methods that employ deep convolutional neural networks have reached state-of-the-art results using early, late, and middle fusion strategies. In this paper, an efficient encoder-decoder model with an attention-based fusion block is proposed to integrate mutual influences between feature maps of these two modalities. This block explicitly extracts the interdependences among concatenated feature maps of these modalities to exploit more powerful feature maps from RGB-Depth images. Extensive experimental results on three main challenging datasets, NYU-V2, SUN RGB-D, and Stanford 2D-3D-Semantic, show that the proposed network outperforms the state-of-the-art models with respect to computational cost as well as model size. Experimental results also illustrate the effectiveness of the proposed lightweight attention-based fusion model in terms of accuracy.
67.Look, Listen, and Act: Towards Audio-Visual Embodied Navigation ⬇️
A crucial aspect of mobile intelligent agents is their ability to integrate the evidence from multiple sensory inputs in an environment and plan a sequence of actions to achieve their goals. In this paper, we attempt to address the problem of Audio-Visual Embodied Navigation, the task of planning the shortest path from a random starting location in a scene to the sound source in an indoor environment, given only raw egocentric visual and audio sensory data. To accomplish this task, the agent is required to learn from various modalities, i.e. relating the audio signal to the visual environment. Here we describe an approach to the audio-visual embodied navigation that can take advantage of both visual and audio pieces of evidence. Our solution is based on three key ideas: a visual perception mapper module that can construct its spatial memory of the environment, a sound perception module that infers the relative location of the sound source from the agent, and a dynamic path planner that plans a sequence of actions based on the visual-audio observations and the spatial memory of the environment, and then navigates towards the goal. Experimental results on a newly collected Visual-Audio-Room dataset using the simulated multi-modal environment demonstrate the effectiveness of our approach over several competitive baselines.
68.Neural ODEs for Image Segmentation with Level Sets ⬇️
We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method. Our approach parametrizes the evolution of an initial contour with a NODE that implicitly learns from data a speed function describing the evolution. In addition, for cases where an initial contour is not available and to alleviate the need for careful choice or design of contour embedding functions, we propose a NODE-based method that evolves an image embedding into a dense per-pixel semantic label space. We evaluate our methods on kidney segmentation (KiTS19) and on salient object detection (PASCAL-S, ECSSD and HKU-IS). In addition to improving initial contours provided by deep learning models while using a fraction of their number of parameters, our approach achieves F scores that are higher than several state-of-the-art deep learning algorithms.
69.Deep Learning-based Vehicle Behaviour Prediction For Autonomous Driving Applications: A Review ⬇️
Behaviour prediction function of an autonomous vehicle predicts the future states of the nearby vehicles based on the current and past observations of the surrounding environment. This helps enhance their awareness of the imminent hazards. However, conventional behaviour prediction solutions are applicable in simple driving scenarios that require short prediction horizons. Most recently, deep learning-based approaches have become popular due to their superior performance in more complex environments compared to the conventional approaches. Motivated by this increased popularity, we provide a comprehensive review of the state-of-the-art of deep learning-based approaches for vehicle behaviour prediction in this paper. We firstly give an overview of the generic problem of vehicle behaviour prediction and discuss its challenges, followed by classification and review of the most recent deep learning-based solutions based on three criteria: input representation, output type, and prediction method. The paper also discusses the performance of several well-known solutions, identifies the research gaps in the literature and outlines potential new research directions.
70.Asymmetric GAN for Unpaired Image-to-image Translation ⬇️
Unpaired image-to-image translation problem aims to model the mapping from one domain to another with unpaired training data. Current works like the well-acknowledged Cycle GAN provide a general solution for any two domains through modeling injective mappings with a symmetric structure. While in situations where two domains are asymmetric in complexity, i.e., the amount of information between two domains is different, these approaches pose problems of poor generation quality, mapping ambiguity, and model sensitivity. To address these issues, we propose Asymmetric GAN (AsymGAN) to adapt the asymmetric domains by introducing an auxiliary variable (aux) to learn the extra information for transferring from the information-poor domain to the information-rich domain, which improves the performance of state-of-the-art approaches in the following ways. First, aux better balances the information between two domains which benefits the quality of generation. Second, the imbalance of information commonly leads to mapping ambiguity, where we are able to model one-to-many mappings by tuning aux, and furthermore, our aux is controllable. Third, the training of Cycle GAN can easily make the generator pair sensitive to small disturbances and variations while our model decouples the ill-conditioned relevance of generators by injecting aux during training. We verify the effectiveness of our proposed method both qualitatively and quantitatively on asymmetric situation, label-photo task, on Cityscapes and Helen datasets, and show many applications of asymmetric image translations. In conclusion, our AsymGAN provides a better solution for unpaired image-to-image translation in asymmetric domains.
71.Improving Visual Recognition using Ambient Sound for Supervision ⬇️
Our brains combine vision and hearing to create a more elaborate interpretation of the world. When the visual input is insufficient, a rich panoply of sounds can be used to describe our surroundings. Since more than 1,000 hours of videos are uploaded to the internet every day, it is arduous, if not impossible, to manually annotate these videos. Therefore, incorporating audio along with visual data without annotations is crucial for leveraging this explosion of data for recognizing and understanding objects and scenes. Owens et al. suggest that a rich representation of the physical world can be learned by using a convolutional neural network to predict sound textures associated with a given video frame. We attempt to reproduce the claims from their experiments, of which the code is not publicly available. In addition, we propose improvements in the pretext task that result in better performance in other downstream computer vision tasks.
72.DDI-100: Dataset for Text Detection and Recognition ⬇️
Nowadays document analysis and recognition remain challenging tasks. However, only a few datasets designed for text detection (TD) and optical character recognition (OCR) problems exist. In this paper we present Distorted Document Images dataset (DDI-100) and demonstrate its usefulness in a wide range of document analysis problems. DDI-100 dataset is a synthetic dataset based on 7000 real unique document pages and consists of more than 100000 augmented images. Ground truth comprises text and stamp masks, text and characters bounding boxes with relevant annotations. Validation of DDI-100 dataset was conducted using several TD and OCR models that show high-quality performance on real data.
73.Extending Multi-Object Tracking systems to better exploit appearance and 3D information ⬇️
Tracking multiple objects in real time is essential for a variety of real-world applications, with the self-driving industry foremost among them. This work involves exploiting temporally varying appearance and motion information for tracking. Siamese networks have recently become highly successful at appearance-based single object tracking, and Recurrent Neural Networks have started dominating both motion- and appearance-based tracking. Our work focuses on combining Siamese networks and RNNs to exploit appearance and motion information, respectively, to build a joint system capable of real-time multi-object tracking. We further explore heuristics-based constraints for tracking in the Bird's Eye View space for efficiently exploiting 3D information as a constrained optimization problem for track prediction.
74.Competing Ratio Loss for Discriminative Multi-class Image Classification ⬇️
The development of deep convolutional neural network architecture is critical to the improvement of image classification task performance. Many image classification studies use deep convolutional neural network and focus on modifying the network structure to improve image classification performance. Conversely, our study focuses on loss function design. Cross-entropy Loss (CEL) has been widely used for training deep convolutional neural network for the task of multi-class classification. Although CEL has been successfully implemented in several image classification tasks, it only focuses on the posterior probability of the correct class. For this reason, a negative log likelihood ratio loss (NLLR) was proposed to better differentiate between the correct class and the competing incorrect ones. However, during the training of the deep convolutional neural network, the value of NLLR is not always positive or negative, which severely affects the convergence of NLLR. Our proposed competing ratio loss (CRL) calculates the posterior probability ratio between the correct class and the competing incorrect classes to further enlarge the probability difference between the correct and incorrect classes. We added hyperparameters to CRL, thereby ensuring its value to be positive and that the update size of backpropagation is suitable for the CRL's fast convergence. To demonstrate the performance of CRL, we conducted experiments on general image classification tasks (CIFAR10/100, SVHN, ImageNet), the fine-grained image classification tasks (CUB200-2011 and Stanford Car), and the challenging face age estimation task (using Adience). Experimental results show the effectiveness and robustness of the proposed loss function on different deep convolutional neural network architectures and different image classification tasks.
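The sign problem mentioned for NLLR, and the effect of adding hyperparameters, can be illustrated numerically. The NLLR form below follows the abstract's description (negative log of the ratio between the correct-class probability and the sum of competing probabilities). The second function is only one plausible form of a ratio loss with hyperparameters: the offset `alpha`, the scale `beta`, their default values, and the exact formula are assumptions, not the paper's definition.

```python
import math

def nllr(probs, y):
    """Negative log likelihood ratio: -log(p_y / sum of competing p_k)."""
    competing = sum(p for k, p in enumerate(probs) if k != y)
    return -math.log(probs[y]) + math.log(competing)

def ratio_loss_with_offset(probs, y, alpha=1.5, beta=1.0):
    """Assumed variant: -log p_y + beta * log(alpha + sum of competing p_k).

    With alpha >= 1 and beta >= 0 both terms are non-negative, so the
    loss cannot change sign during training.
    """
    competing = sum(p for k, p in enumerate(probs) if k != y)
    return -math.log(probs[y]) + beta * math.log(alpha + competing)

confident = [0.7, 0.2, 0.1]   # correct class 0 is already dominant
mistaken = [0.2, 0.7, 0.1]    # correct class 0 is losing to class 1

nllr_values = (nllr(confident, 0), nllr(mistaken, 0))
offset_values = (ratio_loss_with_offset(confident, 0),
                 ratio_loss_with_offset(mistaken, 0))
```

In this toy case the NLLR is negative for the confident prediction and positive for the mistaken one, while the offset variant stays positive in both cases and is still larger for the mistaken prediction.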
75.Ranking and Classification driven Feature Learning for Person Re-identification ⬇️
Person re-identification has attracted wide research attention because of its broad applications, but it remains a very challenging task because only part of the image information can be used for person matching. Most current methods use CNNs to learn embeddings that capture semantic similarity among data points. Many state-of-the-art methods use complex network structures with multiple branches that fuse multiple features during training or testing, using a classification loss, a triplet loss, or a combination of the two as the loss function. However, methods that use the triplet loss converge slowly, and methods that pull features of the same class as close together as possible in feature space lead to poor feature stability. Building on a ranking-motivated structured loss, this paper proposes a new metric learning loss function under which features of the same class are sparsely distributed within small hyperspheres and features of different classes are uniformly distributed at a clear angle. We also adopt a new single-branch network structure that achieves strong performance using only global features. The validity of our method is verified on the Market1501 and DukeMTMC-reID person re-identification datasets, where it achieves 90.9% rank-1 accuracy and 80.8% mAP on DukeMTMC-reID, and 95.3% rank-1 accuracy and 88.7% mAP on Market1501. Code and models are available on GitHub: this https URL.
76.Learn to Segment Retinal Lesions and Beyond ⬇️
Towards automated retinal screening, this paper makes an endeavor to simultaneously achieve pixel-level retinal lesion segmentation and image-level disease classification. Such a multi-task approach is crucial for accurate and clinically interpretable disease diagnosis. Prior art is insufficient due to three challenges, that is, lesions lacking objective boundaries, clinical importance of lesions irrelevant to their size, and the lack of one-to-one correspondence between lesion and disease classes. This paper attacks the three challenges in the context of diabetic retinopathy (DR) grading. We propose L-Net, a new variant of fully convolutional networks, with its expansive path re-designed to tackle the first challenge. A dual loss that leverages both semantic segmentation and image classification losses is devised to resolve the second challenge. We propose Side-Attention Net (SiAN) as our multi-task framework. Harnessing L-Net as a side-attention branch, SiAN simultaneously improves DR grading and interprets the decision with lesion maps. A set of 12K fundus images is manually segmented by 45 ophthalmologists for 8 DR-related lesions, resulting in 290K manual segments in total. Extensive experiments on this large-scale dataset show that our proposed approach surpasses the prior art for multiple tasks including lesion segmentation, lesion classification and DR grading.
77.Concise and Effective Network for 3D Human Modeling from Orthogonal Silhouettes ⬇️
In this paper, we revisit the problem of 3D human modeling from two orthogonal silhouettes of individuals (i.e., front and side views). Different from our prior work, a supervised learning approach based on a convolutional neural network (CNN) is investigated to solve the problem by establishing a mapping function that can effectively extract features from two silhouettes and fuse them into coefficients in the shape space of human bodies. A new CNN structure is proposed in our work to extract not only the discriminative features of the front and side views but also their mixed features for the mapping function. 3D human models with high accuracy are synthesized from coefficients generated by the mapping function. Existing CNN approaches for 3D human modeling usually learn a large number of parameters (from 8M to 350M) from two binary images. In contrast, we investigate a new network architecture and use samples taken on the silhouettes as input. As a consequence, more accurate models can be generated by our network with only 2.5M coefficients. The training of our network is conducted on samples obtained by augmenting a publicly accessible dataset. Transfer learning using datasets with a smaller number of scanned models is applied to our network to enable it to generate results with gender-oriented (or geographical) patterns.
78.InSphereNet: a Concise Representation and Classification Method for 3D Object ⬇️
In this paper, we present InSphereNet, a method for 3D object classification. Unlike previous methods that use points, voxels, or multi-view images as inputs to a deep neural network (DNN), the proposed method constructs a class of more representative features, named infilling spheres, from the signed distance field (SDF). Because of the strong spatial representation of infilling spheres, we can not only use far fewer spheres to accomplish the classification task, but also design a lightweight InSphereNet with fewer layers and parameters than previous methods. Experiments on ModelNet40 show that the proposed method achieves higher accuracy than PointNet. In particular, with only a few dozen sphere inputs or about 100000 DNN parameters, the accuracy of our method remains at a very high level. Keywords: 3D object classification, signed distance field, deep learning, infilling sphere
79.SketchTransfer: A Challenging New Task for Exploring Detail-Invariance and the Abstractions Learned by Deep Networks ⬇️
Deep networks have achieved excellent results in perceptual tasks, yet their ability to generalize to variations not seen during training has come under increasing scrutiny. In this work we focus on their ability to have invariance towards the presence or absence of details. For example, humans are able to watch cartoons, which are missing many visual details, without being explicitly trained to do so. As another example, 3D rendering software is a relatively recent development, yet people are able to understand such rendered scenes even though they are missing details (consider a film like Toy Story). The failure of machine learning algorithms to do this indicates a significant gap in generalization between human abilities and the abilities of deep networks. We propose a dataset that will make it easier to study the detail-invariance problem concretely. We produce a concrete task for this: SketchTransfer, and we show that state-of-the-art domain transfer algorithms still struggle with this task. The state-of-the-art technique which achieves over 95% on MNIST $\rightarrow$ SVHN transfer only achieves 59% accuracy on the SketchTransfer task, which is much better than random (11% accuracy) but falls short of the 87% accuracy of a classifier trained directly on labeled sketches. This indicates that this task is approachable with today's best methods but has substantial room for improvement.
80.Boundary Cues for 3D Object Shape Recovery ⬇️
Early work in computer vision considered a host of geometric cues for both shape reconstruction and recognition. However, since then, the vision community has focused heavily on shading cues for reconstruction, and moved towards data-driven approaches for recognition. In this paper, we reconsider these perhaps overlooked "boundary" cues (such as self occlusions and folds in a surface), as well as many other established constraints for shape reconstruction. In a variety of user studies and quantitative tasks, we evaluate how well these cues inform shape reconstruction (relative to each other) in terms of both shape quality and shape recognition. Our findings suggest many new directions for future research in shape reconstruction, such as automatic boundary cue detection and relaxing assumptions in shape from shading (e.g. orthographic projection, Lambertian surfaces).
81.Barycenters of Natural Images -- Constrained Wasserstein Barycenters for Image Morphing ⬇️
Image interpolation, or image morphing, refers to a visual transition between two (or more) input images. For such a transition to look visually appealing, its desirable properties are (i) to be smooth; (ii) to apply the minimal required change in the image; and (iii) to seem "real", avoiding unnatural artifacts in each image in the transition. To obtain a smooth and straightforward transition, one may adopt the well-known Wasserstein Barycenter Problem (WBP). While this approach guarantees minimal changes under the Wasserstein metric, the resulting images might seem unnatural. In this work, we propose a novel approach for image morphing that possesses all three desired properties. To this end, we define a constrained variant of the WBP that enforces the intermediate images to satisfy an image prior. We describe an algorithm that solves this problem and demonstrate it using the sparse prior and generative adversarial networks.
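As a rough illustration of the unconstrained barycenter that the abstract starts from, the sketch below handles only the one-dimensional case with equally sized sample sets, where the Wasserstein-2 barycenter of two empirical distributions is obtained by interpolating their sorted samples. The function name `barycenter_1d` and the parameter `t` are illustrative assumptions; the image prior and the constrained variant proposed in the paper are not modeled here.

```python
def barycenter_1d(samples_a, samples_b, t=0.5):
    """Wasserstein-2 barycenter of two 1D empirical distributions.

    Assumes both sample sets have the same size. In 1D the optimal
    transport plan matches sorted samples, so the barycenter with
    weights (1 - t, t) interpolates the matched pairs.
    """
    a, b = sorted(samples_a), sorted(samples_b)
    if len(a) != len(b):
        raise ValueError("sample sets must have the same size")
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


# Halfway between two shifted point sets.
midpoint = barycenter_1d([2, 0, 1], [10, 12, 11], t=0.5)
```

Varying `t` from 0 to 1 traces the whole transition between the two inputs, which is the smooth, minimal-change path the abstract describes before the image prior is added.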
82.Focusing and Diffusion: Bidirectional Attentive Graph Convolutional Networks for Skeleton-based Action Recognition ⬇️
A collection of approaches based on graph convolutional networks have proven successful in skeleton-based action recognition by exploring neighborhood information and dense dependencies between intra-frame joints. However, these approaches usually ignore the spatial-temporal global context as well as the local relation between inter-frame and intra-frame joints. In this paper, we propose a focusing and diffusion mechanism to enhance graph convolutional networks by paying attention to the kinematic dependence of articulated human pose in a frame and their implicit dependencies over frames. In the focusing process, we introduce an attention module to learn a latent node over the intra-frame joints to convey spatial contextual information. In this way, the sparse connections between joints in a frame can be well captured, while the global context over the entire sequence is further captured by these hidden nodes with a bidirectional LSTM. In the diffusing process, the learned spatial-temporal contextual information is passed back to the spatial joints, leading to a bidirectional attentive graph convolutional network (BAGCN) that can facilitate skeleton-based action recognition. Extensive experiments on the challenging NTU RGB+D and Skeleton-Kinetics benchmarks demonstrate the efficacy of our approach.
83.Learning by Cheating ⬇️
Vision-based urban driving is hard. The autonomous system needs to learn to perceive the world and act in it. We show that this challenging learning problem can be simplified by decomposing it into two stages. We first train an agent that has access to privileged information. This privileged agent cheats by observing the ground-truth layout of the environment and the positions of all traffic participants. In the second stage, the privileged agent acts as a teacher that trains a purely vision-based sensorimotor agent. The resulting sensorimotor agent does not have access to any privileged information and does not cheat. This two-stage training procedure is counter-intuitive at first, but has a number of important advantages that we analyze and empirically demonstrate. We use the presented approach to train a vision-based autonomous driving system that substantially outperforms the state of the art on the CARLA benchmark and the recent NoCrash benchmark. Our approach achieves, for the first time, 100% success rate on all tasks in the original CARLA benchmark, sets a new record on the NoCrash benchmark, and reduces the frequency of infractions by an order of magnitude compared to the prior state of the art. For the video that summarizes this work, see this https URL
84.An Unsupervised Deep Learning Method for Parallel Cardiac MRI via Time-Interleaved Sampling ⬇️
Deep learning has achieved good success in cardiac magnetic resonance imaging (MRI) reconstruction, in which convolutional neural networks (CNNs) learn the mapping from undersampled k-space to fully sampled images. Although these deep learning methods can improve reconstruction quality without complex parameter selection or a lengthy reconstruction time compared with iterative methods, the following issues still need to be addressed: 1) all of these methods are based on big data and require a large amount of fully sampled MRI data, which is always difficult for cardiac MRI; 2) All of these methods are only applicable for single-channel images without exploring coil correlation. In this paper, we propose an unsupervised deep learning method for parallel cardiac MRI via a time-interleaved sampling strategy. Specifically, a time-interleaved acquisition scheme is developed to build a set of fully encoded reference data by directly merging the k-space data of adjacent time frames. Then these fully encoded data can be used to train a parallel network for reconstructing images of each coil separately. Finally, the images from each coil are combined together via a CNN to implicitly explore the correlations between coils. The comparisons with classic k-t FOCUSS, k-t SLR and L+S methods on in vivo datasets show that our method can achieve improved reconstruction results in an extremely short amount of time.
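The merging step can be pictured with a small sketch. It assumes a simple interleaved pattern in which frame `f` acquires every `accel`-th phase-encoding line with offset `f % accel`; the abstract does not specify the actual sampling pattern, so this pattern and the helper names `sampled_lines` and `merge_adjacent` are hypothetical.

```python
def sampled_lines(frame, n_lines, accel):
    """Line indices acquired in `frame` under the assumed interleaved pattern."""
    return {k for k in range(n_lines) if k % accel == frame % accel}


def merge_adjacent(frames, n_lines, accel):
    """Union of the line indices acquired over a group of adjacent frames."""
    merged = set()
    for f in frames:
        merged |= sampled_lines(f, n_lines, accel)
    return merged


# Each frame alone is undersampled by a factor of 4 ...
single = sampled_lines(0, n_lines=16, accel=4)
# ... but any 4 consecutive frames together cover every line.
full = merge_adjacent(range(5, 9), n_lines=16, accel=4)
```

Under this assumption, merging `accel` consecutive frames produces a fully encoded (though temporally blurred) reference without acquiring any fully sampled data.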
85.Lung and Colon Cancer Histopathological Image Dataset (LC25000) ⬇️
The field of Machine Learning, a subset of Artificial Intelligence, has led to remarkable advancements in many areas, including medicine. Machine Learning algorithms require large datasets to train computer models successfully. Although there are medical image datasets available, more image datasets are needed from a variety of medical entities, especially cancer pathology. Even more scarce are ML-ready image datasets. To address this need, we created an image dataset (LC25000) with 25,000 color images in 5 classes. Each class contains 5,000 images of the following histologic entities: colon adenocarcinoma, benign colonic tissue, lung adenocarcinoma, lung squamous cell carcinoma, and benign lung tissue. All images are de-identified, HIPAA compliant, validated, and freely available for download to AI researchers.
86.Quaternion Equivariant Capsule Networks for 3D Point Clouds ⬇️
We present a 3D capsule architecture for processing of point clouds that is equivariant with respect to the $SO(3)$ rotation group, translation and permutation of the unordered input sets. The network operates on a sparse set of local reference frames, computed from an input point cloud and establishes end-to-end equivariance through a novel 3D quaternion group capsule layer, including an equivariant dynamic routing procedure. The capsule layer enables us to disentangle geometry from pose, paving the way for more informative descriptions and a structured latent space. In the process, we theoretically connect the process of dynamic routing between capsules to the well-known Weiszfeld algorithm, a scheme for solving \emph{iterative re-weighted least squares (IRLS)} problems with provable convergence properties, enabling robust pose estimation between capsule layers. Due to the sparse equivariant quaternion capsules, our architecture allows joint object classification and orientation estimation, which we validate empirically on common benchmark datasets.
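For readers unfamiliar with the Weiszfeld algorithm mentioned above, the following is a minimal sketch of its classical Euclidean form, which computes a geometric median by iteratively re-weighted averaging. It is not the quaternion routing procedure of the paper; the function name `weiszfeld`, the iteration cap, and the clamping constant `eps` are illustrative assumptions.

```python
import math


def weiszfeld(points, iters=200, eps=1e-9):
    """Geometric median of `points` (equal-length tuples) via Weiszfeld updates."""
    dim = len(points[0])
    # Start from the centroid.
    y = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            # IRLS weight: inverse distance to the current estimate,
            # clamped so that a coincident point does not divide by zero.
            w = 1.0 / max(math.dist(p, y), eps)
            den += w
            for d in range(dim):
                num[d] += w * p[d]
        y_new = [n / den for n in num]
        if math.dist(y, y_new) < eps:
            return y_new
        y = y_new
    return y


# The outlier at x = 10 pulls the mean to about 3.67 but barely moves the median.
median = weiszfeld([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)])
```

Each update is a weighted least-squares solve with weights set by the previous residuals, which is why the scheme is robust to outliers.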
87.Visual Agreement Regularized Training for Multi-Modal Machine Translation ⬇️
Multi-modal machine translation aims at translating the source sentence into a different language in the presence of the paired image. Previous work suggests that additional visual information only provides dispensable help to translation, which is needed in several very special cases such as translating ambiguous words. To make better use of visual information, this work presents visual agreement regularized training. The proposed approach jointly trains the source-to-target and target-to-source translation models and encourages them to share the same focus on the visual information when generating semantically equivalent visual words (e.g. "ball" in English and "ballon" in French). Besides, a simple yet effective multi-head co-attention model is also introduced to capture interactions between visual and textual features. The results show that our approaches can outperform competitive baselines by a large margin on the Multi30k dataset. Further analysis demonstrates that the proposed regularized training can effectively improve the agreement of attention on the image, leading to better use of visual information.
88.Equations Derivation of VINS-Mono ⬇️
VINS-Mono is a monocular visual-inertial 6-DOF state estimator proposed by the Aerial Robotics Group at HKUST in 2017, which can run on MAVs, smartphones and many other intelligent platforms. It is a state-of-the-art visual-inertial odometry algorithm that has gained extensive attention worldwide. The main equations, including IMU preintegration, visual/inertial co-initialization and tightly-coupled nonlinear optimization, are derived and analyzed in this manuscript.
89.Efficient Adversarial Training with Transferable Adversarial Examples ⬇️
Adversarial training is an effective defense method to protect classification models against adversarial attacks. However, one limitation of this approach is that it can require orders of magnitude additional training time due to high cost of generating strong adversarial examples during training. In this paper, we first show that there is high transferability between models from neighboring epochs in the same training process, i.e., adversarial examples from one epoch continue to be adversarial in subsequent epochs. Leveraging this property, we propose a novel method, Adversarial Training with Transferable Adversarial Examples (ATTA), that can enhance the robustness of trained models and greatly improve the training efficiency by accumulating adversarial perturbations through epochs. Compared to state-of-the-art adversarial training methods, ATTA enhances adversarial accuracy by up to 7.2% on CIFAR10 and requires 12~14x less training time on MNIST and CIFAR10 datasets with comparable model robustness.
90.Handling Missing MRI Input Data in Deep Learning Segmentation of Brain Metastases: A Multi-Center Study ⬇️
The purpose was to assess the clinical value of a novel DropOut model for detecting and segmenting brain metastases, in which a neural network is trained on four distinct MRI sequences using an input dropout layer, thus simulating the scenario of missing MRI data by training on the full set and all possible subsets of the input data. This retrospective, multi-center study evaluated 165 patients with brain metastases. A deep learning based segmentation model for automatic segmentation of brain metastases, named DropOut, was trained on multi-sequence MRI from 100 patients, and validated/tested on 10/55 patients. The segmentation results were compared with the performance of a state-of-the-art DeepLabV3 model. The MR sequences in the training set included pre- and post-gadolinium (Gd) T1-weighted 3D fast spin echo, post-Gd T1-weighted inversion recovery (IR) prepped fast spoiled gradient echo, and 3D fluid attenuated inversion recovery (FLAIR), whereas the test set did not include the IR prepped image-series. The ground truth was established by experienced neuroradiologists. The results were evaluated using precision, recall, Dice score, and receiver operating characteristics (ROC) curve statistics, while the Wilcoxon rank sum test was used to compare the performance of the two neural networks. The area under the ROC curve (AUC), averaged across all test cases, was 0.989 ± 0.029 for the DropOut model and 0.989 ± 0.023 for the DeepLabV3 model (p=0.62). The DropOut model showed a significantly higher Dice score compared to the DeepLabV3 model (0.795 ± 0.105 vs. 0.774 ± 0.104, p=0.017), and a significantly lower average false positive rate of 3.6/patient vs. 7.0/patient (p<0.001) using a 10 mm³ lesion-size limit. The DropOut model may facilitate accurate detection and segmentation of brain metastases on a multi-center basis, even when the test cohort is missing MRI input data.
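The two ingredients named in the abstract, training on every subset of the four input sequences and scoring with the Dice coefficient, can be sketched as follows. The channel names in `SEQUENCES` are hypothetical labels, and representing a missing sequence by an all-zero channel is an assumption; the abstract does not say how the input dropout layer encodes missing data.

```python
from itertools import combinations

# Hypothetical labels for the four MRI sequences.
SEQUENCES = ["T1_pre", "T1_post", "T1_IR_post", "FLAIR"]


def input_subsets(names):
    """All non-empty subsets of the input sequences (15 for four inputs)."""
    return [set(c) for r in range(1, len(names) + 1)
            for c in combinations(names, r)]


def mask_missing(channels, available):
    """Zero every channel not in `available` (assumed encoding of a missing sequence)."""
    return {name: (vals if name in available else [0.0] * len(vals))
            for name, vals in channels.items()}


def dice(pred, truth):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks of 0s and 1s."""
    inter = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * inter / total


subsets = input_subsets(SEQUENCES)
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```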
91.DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier ⬇️
In this era of digital information explosion, an abundance of data from numerous modalities is being generated as well as archived every day. However, most problems associated with training Deep Neural Networks still revolve around lack of data that is rich enough for a given task. Data is required not only for training an initial model, but also for future learning tasks such as Model Compression and Incremental Learning. A diverse dataset may be used for training an initial model, but it may not be feasible to store it throughout the product life cycle due to data privacy issues or memory constraints. We propose to bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a given trained network. We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples from a trained classifier, using a novel Data-enriching GAN (DeGAN) framework. We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance for the tasks of Data-free Knowledge Distillation and Incremental Learning on benchmark datasets. We further demonstrate that our proposed framework can enrich any data, even from unrelated domains, to make it more useful for the future learning tasks of a given network.
92.Colorectal Polyp Segmentation by U-Net with Dilation Convolution ⬇️
Colorectal cancer (CRC) is one of the most commonly diagnosed cancers and a leading cause of cancer deaths in the United States. Colorectal polyps that grow on the intima of the colon or rectum are an important precursor for CRC. Currently, the most common method for colorectal polyp detection and precancerous pathology is colonoscopy. Therefore, accurate colorectal polyp segmentation during the colonoscopy procedure has great clinical significance in CRC early detection and prevention. In this paper, we propose a novel end-to-end deep learning framework for colorectal polyp segmentation. The model we design consists of an encoder to extract multi-scale semantic features and a decoder to expand the feature maps to a polyp segmentation map. We improve the feature representation ability of the encoder by introducing dilated convolution to learn high-level semantic features without resolution reduction. We further design a simplified decoder which combines multi-scale semantic features with fewer parameters than the traditional architecture. Furthermore, we apply three post-processing techniques on the output segmentation map to improve colorectal polyp detection performance. Our method achieves state-of-the-art results on CVC-ClinicDB and ETIS-Larib Polyp DB.
93.Skeleton Extraction from 3D Point Clouds by Decomposing the Object into Parts ⬇️
Decomposing a point cloud into its components and extracting curve skeletons from point clouds are two related problems. Decomposition of a shape into its components is often obtained as a byproduct of skeleton extraction. In this work, we propose to extract curve skeletons from unorganized point clouds by decomposing the object into its parts, identifying part skeletons and then linking these part skeletons together to obtain the complete skeleton. We believe it is the most natural way to extract skeletons in the sense that this would be the way a human would approach the problem. Our parts are generalized cylinders (GCs). Since the axis of a GC is an integral part of its definition, the parts have natural skeletal representations. We use translational symmetry, the fundamental property of GCs, to extract parts from point clouds. We demonstrate how this method can handle a large variety of shapes. We compare our method with state-of-the-art methods and show how a part-based approach can deal with some of the limitations of other methods. We present an improved version of an existing point set registration algorithm and demonstrate its utility in extracting parts from point clouds. We also show how this method can be used to extract skeletons from and identify parts of noisy point clouds. A part-based approach also provides a natural and intuitive interface for user interaction. We demonstrate the ease with which mistakes, if any, can be fixed with minimal user interaction with the help of a graphical user interface.
94.A Comparative Study on Machine Learning Algorithms for the Control of a Wall Following Robot ⬇️
A comparison of the performance of various machine learning models to predict the direction of a wall-following robot is presented in this paper. The models were trained using an open-source dataset that contains 24 ultrasound sensor readings and the corresponding direction for each sample. This dataset was captured using a SCITOS G5 mobile robot by placing the sensors on the robot's waist. In addition to the full format with 24 sensors per record, the dataset has two simplified formats with 4 and 2 input sensor readings per record. Several control models were proposed previously for this dataset using all three dataset formats. In this paper, two primary research contributions are presented. First, we present machine learning models with accuracies higher than all previously proposed models for this dataset using all three formats. A perfect solution for the 4- and 2-input sensor formats is presented using a Decision Tree Classifier, which achieves a mean accuracy of 100%. On the other hand, a mean accuracy of 99.82% was achieved using the 24 sensor inputs by employing the Gradient Boost Classifier. Second, we present a comparative study on the performance of different machine learning and deep learning algorithms on this dataset, thereby providing an overall insight into the performance of these algorithms for similar sensor fusion problems. All the models in this paper were evaluated using Monte-Carlo cross-validation.
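Monte-Carlo cross-validation, the evaluation protocol named above, repeatedly draws a random train/test split and averages the test accuracy over the repetitions. The sketch below shows the protocol only; the nearest-neighbour stand-in classifier, the toy data, the split count and the test fraction are all assumptions rather than the paper's settings.

```python
import random


def nearest_neighbour_predict(train_X, train_y, test_X):
    """Stand-in classifier: label each test row by its closest training row."""
    preds = []
    for x in test_X:
        best = min(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
        preds.append(train_y[best])
    return preds


def monte_carlo_cv(X, y, predict_fn, n_splits=20, test_fraction=0.3, seed=0):
    """Mean test accuracy over `n_splits` independent random splits."""
    rng = random.Random(seed)
    n_test = max(1, int(len(X) * test_fraction))
    scores = []
    for _ in range(n_splits):
        idx = list(range(len(X)))
        rng.shuffle(idx)  # a fresh random partition on every repetition
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        preds = predict_fn([X[i] for i in train_idx],
                           [y[i] for i in train_idx],
                           [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / n_test)
    return sum(scores) / len(scores)


# Toy data: two well-separated clusters standing in for sensor readings.
X = [(0.1 * i, 0.0) for i in range(10)] + [(10 + 0.1 * i, 0.0) for i in range(10)]
y = [0] * 10 + [1] * 10
mean_accuracy = monte_carlo_cv(X, y, nearest_neighbour_predict)
```

Unlike k-fold cross-validation, the test sets of different repetitions may overlap, and the number of repetitions is chosen independently of the split ratio.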
95.Autonomous Removal of Perspective Distortion for Robotic Elevator Button Recognition ⬇️
Elevator button recognition is considered an indispensable function for enabling the autonomous elevator operation of mobile robots. However, due to unfavorable image conditions and various image distortions, the recognition accuracy remains to be improved. In this paper, we present a novel algorithm that can autonomously correct perspective distortions of elevator panel images. The algorithm first leverages the Gaussian Mixture Model (GMM) to conduct a grid fitting process based on button recognition results, then utilizes the estimated grid centers as reference features to estimate camera motions for correcting perspective distortions. The algorithm operates autonomously on a single image and does not need an explicit feature detection or feature matching procedure, which makes it much more robust to noise and outliers than traditional feature-based geometric approaches. To verify the effectiveness of the algorithm, we collect an elevator panel dataset of 50 images captured from different angles of view. Experimental results show that the proposed algorithm can accurately estimate camera motions and effectively remove perspective distortions.
96.A Closer Look at Mobile App Usage as a Persistent Biometric: A Small Case Study ⬇️
In this paper, we explore mobile app use as a behavioral biometric identifier. While several efforts have also taken on this challenge, many have alluded to the inconsistency in human behavior, which requires the biometric template to be updated frequently and periodically. Here, we represent app usage as simple images wherein each pixel value provides some information about the user's app usage. Then, we use these images to train a deep learning network (convolutional neural net) to classify the user's identity. Our contribution lies in the random order in which the images are fed to the classifier, thereby presenting novel evidence that there are some aspects of app usage that are indeed persistent. Our results yield a 96.8% $F$-score without any updates to the template data.
97.Multiple Pretext-Task for Self-Supervised Learning via Mixing Multiple Image Transformations ⬇️
Self-supervised learning is one of the most promising approaches to learn representations capturing semantic features in images without any manual annotation cost. To learn useful representations, a self-supervised model solves a pretext-task, which is defined by the data itself. Among a number of pretext-tasks, the rotation prediction task (Rotation) achieves better representations for solving various target tasks despite the simplicity of its implementation. However, we found that Rotation can fail to capture semantic features related to image textures and colors. To tackle this problem, we introduce a learning technique called multiple pretext-task for self-supervised learning (MP-SSL), which solves multiple pretext-tasks simultaneously in addition to Rotation. In order to capture features of textures and colors, we employ the transformations of image enhancements (e.g., sharpening and solarizing) as the additional pretext-tasks. MP-SSL efficiently trains a model by leveraging a Frank-Wolfe based multi-task training algorithm. Our experimental results show MP-SSL models outperform Rotation on multiple standard benchmarks and achieve state-of-the-art performance on Places-205.
98.Effective Data Augmentation with Multi-Domain Learning GANs ⬇️
For deep learning applications, large-scale data development (e.g., collecting, labeling), which is an essential process in building practical applications, still incurs very high costs. In this work, we propose an effective data augmentation method based on generative adversarial networks (GANs), called Domain Fusion. Our key idea is to import the knowledge contained in an outer dataset to a target model by using a multi-domain learning GAN. The multi-domain learning GAN simultaneously learns the outer and target dataset and generates new samples for the target tasks. The simultaneous learning process makes GANs generate the target samples with high fidelity and variety. As a result, we can obtain accurate models for the target tasks by using these generated samples even if we only have an extremely low volume target dataset. We experimentally evaluate the advantages of Domain Fusion in image classification tasks on 3 target datasets: CIFAR-100, FGVC-Aircraft, and Indoor Scene Recognition. When each target dataset is reduced to 5,000 images, Domain Fusion achieves better classification accuracy than data augmentation using fine-tuned GANs. Furthermore, we show that Domain Fusion improves the quality of generated samples, and the improvements can contribute to higher accuracy.
99.Fluid segmentation in Neutrosophic domain ⬇️
Optical coherence tomography (OCT) is a retinal imaging technology currently used by ophthalmologists as a non-invasive and non-contact method for the diagnosis of age-related macular degeneration (AMD) and diabetic macular edema (DME). Fluid regions in OCT images reveal the main signs of AMD and DME. In this paper, an efficient and fast clustering method in the neutrosophic (NS) domain, referred to as neutrosophic C-means (NCM), is adapted for fluid segmentation. For this task, an NCM cost function in the NS domain is adapted for fluid segmentation and then optimized by gradient descent methods, which leads to a binary segmentation of OCT B-scans into fluid and tissue regions. The proposed method is evaluated on OCT datasets of subjects with DME abnormalities. Results showed that the proposed method outperforms existing fluid segmentation methods by 6% in Dice coefficient and sensitivity criteria.
100.Parallel optimization of fiber bundle segmentation for massive tractography datasets ⬇️
We present an optimized algorithm that performs automatic classification of white matter fibers based on a multi-subject bundle atlas. We implemented a parallel algorithm that improves upon its previous version in both execution time and memory usage. Our new version uses the local memory of each processor, which leads to a reduction in execution time. Hence, it allows the analysis of bigger subject and/or atlas datasets. As a result, the segmentation of a subject of 4,145,000 fibers is reduced from about 14 minutes in the previous version to about 6 minutes, yielding an acceleration of 2.34. In addition, the new algorithm reduces the memory consumption of the previous version by a factor of 0.79.
101.Self-adaption grey DBSCAN clustering ⬇️
Clustering analysis, a classical issue in data mining, is widely used in various research areas. This article proposes a self-adaption grey DBSCAN clustering (SAG-DBSCAN) algorithm. First, the grey relational matrix is used to obtain the grey local density indicator. This indicator is then applied to self-adaptive noise identification, which yields a dense subset of the clustering dataset. Finally, a DBSCAN that automatically selects its parameters is used to cluster the dense subset. Several frequently used datasets were used to demonstrate the performance and effectiveness of the proposed clustering algorithm and to compare the results with those of other state-of-the-art algorithms. The comprehensive comparisons indicate that our method has advantages over the other compared methods.
102.Digital filters with vanishing moments for shape analysis ⬇️
Shape- and scale-selective digital-filters, with steerable finite/infinite impulse responses (FIR/IIRs) and non-recursive/recursive realizations, that are separable in both spatial dimensions and adequately isotropic, are derived. The filters are conveniently designed in the frequency domain via derivative constraints at dc, which guarantees orthogonality and monomial selectivity in the pixel domain (i.e. vanishing moments), unlike more commonly used FIR filters derived from Gaussian functions. A two-stage low-pass/high-pass architecture, for blur/derivative operations, is recommended. Expressions for the coefficients of a low-order IIR blur filter with repeated poles are provided, as a function of scale; discrete Butterworth (IIR), and colored Savitzky-Golay (FIR), blurs are also examined. Parallel software implementations on central processing units (CPUs) and graphics processing units (GPUs), for scale-selective blob-detection in aerial surveillance imagery, are analyzed. It is shown that recursive IIR filters are significantly faster than non-recursive FIR filters when detecting large objects at coarse scales, i.e. using filters with long impulse responses; however, the margin of outperformance decreases as the degree of parallelization increases.
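The link between derivative constraints at dc and vanishing moments can be checked numerically on a toy filter. For an impulse response h[k] with frequency response H(ω) = Σ h[k] e^{-iωk}, the m-th derivative of H at ω = 0 is proportional to the m-th moment Σ k^m h[k], so forcing derivatives to zero at dc forces the corresponding moments to vanish. The sketch below uses the textbook second-difference kernel purely as an illustration; it is not one of the filters derived in the paper, and the helper name `moment` is an assumption.

```python
def moment(h, order):
    """Discrete moment sum_k k**order * h[k], with k centred on the middle tap."""
    centre = (len(h) - 1) // 2
    return sum(((i - centre) ** order) * c for i, c in enumerate(h))


# Second-difference kernel: a simple high-pass filter.
second_difference = [1.0, -2.0, 1.0]

m0 = moment(second_difference, 0)  # response to a constant image
m1 = moment(second_difference, 1)  # response to a linear ramp
m2 = moment(second_difference, 2)  # response to a quadratic
```

Here the zeroth and first moments vanish while the second does not, so the kernel ignores constant and linearly varying intensity and responds only to curvature.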