1.Deep learning tools for the measurement of animal behavior in neuroscience ⬇️
Recent advances in computer vision have made accurate, fast and robust measurement of animal behavior a reality. In recent years, new tools specifically designed to aid the measurement of behavior in laboratories have come to fruition. Here we discuss how capturing the postures of animals over time is a key step in transforming videos into lower dimensional representations of behavior. We envision that the fast-paced development of new deep learning tools will rapidly change the landscape of realizable real-world neuroscience experiments.
2.XNOR-Net++: Improved Binary Neural Networks ⬇️
This paper proposes an improved training algorithm for binary neural networks in which both weights and activations are binary numbers. A key but fairly overlooked feature of the current state-of-the-art method of XNOR-Net is the use of analytically calculated real-valued scaling factors for re-weighting the output of binary convolutions. We argue that analytic calculation of these factors is sub-optimal. Instead, in this work, we make the following contributions: (a) we propose to fuse the activation and weight scaling factors into a single one that is learned discriminatively via backpropagation. (b) More importantly, we explore several ways of constructing the shape of the scale factors while keeping the computational budget fixed. (c) We empirically measure the accuracy of our approximations and show that they are significantly more accurate than the analytically calculated one. (d) We show that our approach significantly outperforms XNOR-Net within the same computational budget when tested on the challenging task of ImageNet classification, offering up to 6% accuracy gain.
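As a rough illustration of the core idea (learning one fused scaling factor by backpropagation rather than computing it analytically), here is a minimal PyTorch sketch; the module name, the scale's shape, and the straight-through binarization details are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedScaleBinaryConv(nn.Module):
    """Binary convolution re-weighted by a single fused, learned scale (XNOR-Net++-style sketch)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, out_hw=(32, 32)):
        super().__init__()
        self.pad = kernel_size // 2
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        # Fused activation/weight scale factor, learned by backpropagation instead of
        # computed analytically; one of several possible shapes (here: channel x spatial).
        self.gamma = nn.Parameter(torch.ones(out_ch, *out_hw))

    def forward(self, x):
        # Sign binarization with a straight-through estimator for gradients.
        bw = (torch.sign(self.weight) - self.weight).detach() + self.weight
        bx = (torch.sign(x) - x).detach() + x
        y = F.conv2d(bx, bw, padding=self.pad)
        return y * self.gamma  # re-weight the binary convolution output

# Hypothetical usage: the input spatial size must match the scale's spatial shape.
out = LearnedScaleBinaryConv(64, 128)(torch.randn(8, 64, 32, 32))
```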
3.Unsupervised Pose Flow Learning for Pose Guided Synthesis ⬇️
Pose guided synthesis aims to generate a new image in an arbitrary target pose while preserving the appearance details from the source image. Existing approaches rely on either hard-coded spatial transformations or 3D body modeling. They often overlook complex non-rigid pose deformation or unmatched occluded regions, and thus fail to effectively preserve appearance information. In this paper, we propose an unsupervised pose flow learning scheme that learns to transfer the appearance details from the source image. Based on such learned pose flow, we propose GarmentNet and SynthesisNet, both of which use multi-scale feature-domain alignment for coarse-to-fine synthesis. Experiments on the DeepFashion and MVC datasets and additional real-world datasets demonstrate that our approach compares favorably with the state-of-the-art methods and generalizes to unseen poses and clothing styles.
4.wMAN: Weakly-supervised Moment Alignment Network for Text-based Video Segment Retrieval ⬇️
Given a video and a sentence, the goal of weakly-supervised video moment retrieval is to locate the video segment which is described by the sentence without having access to temporal annotations during training. Instead, a model must learn how to identify the correct segment (i.e. moment) when only being provided with video-sentence pairs. Thus, an inherent challenge is automatically inferring the latent correspondence between visual and language representations. To facilitate this alignment, we propose our Weakly-supervised Moment Alignment Network (wMAN) which exploits a multi-level co-attention mechanism to learn richer multimodal representations. The aforementioned mechanism is comprised of a Frame-By-Word interaction module as well as a novel Word-Conditioned Visual Graph (WCVG). Our approach also incorporates a novel application of positional encodings, commonly used in Transformers, to learn visual-semantic representations that contain contextual information of their relative positions in the temporal sequence through iterative message-passing. Comprehensive experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our learned representations: our combined wMAN model not only outperforms the state-of-the-art weakly-supervised method by a significant margin but also does better than strongly-supervised state-of-the-art methods on some metrics.
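For readers unfamiliar with the positional encodings mentioned above, the sketch below shows the standard Transformer-style sinusoidal encoding added to per-frame features; exactly how wMAN injects them may differ, and all names and shapes here are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(num_frames, dim):
    """Standard Transformer sinusoidal encoding over temporal positions."""
    pos = np.arange(num_frames)[:, None]                      # (T, 1)
    i = np.arange(dim)[None, :]                               # (1, D)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)     # (T, D)
    enc = np.zeros((num_frames, dim))
    enc[:, 0::2] = np.sin(angle[:, 0::2])
    enc[:, 1::2] = np.cos(angle[:, 1::2])
    return enc

# Hypothetical usage: add temporal position information to per-frame visual features.
frame_feats = np.random.randn(128, 512)                       # T = 128 frames, D = 512
frame_feats = frame_feats + sinusoidal_positional_encoding(128, 512)
```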
5.MLSL: Multi-Level Self-Supervised Learning for Domain Adaptation with Spatially Independent and Semantically Consistent Labeling ⬇️
Most recent deep semantic segmentation algorithms suffer from large generalization errors, even when powerful hierarchical representation models based on convolutional neural networks have been employed. This could be attributed to limited training data and a large distribution gap between the train and test domain datasets. In this paper, we propose a multi-level self-supervised learning model for domain adaptation of semantic segmentation. Exploiting the idea that an object (and most of the stuff, given context) should be labeled consistently regardless of its location, we generate spatially independent and semantically consistent (SISC) pseudo-labels by segmenting multiple sub-images using the base model and designing an aggregation strategy. Image-level pseudo weak-labels (PWL) are computed to guide domain adaptation by capturing global context similarity between the source and target domains at the latent-space level, helping the latent space learn the representation even when very few pixels belong to the domain category (a small object, for example) compared to the rest of the image. Our multi-level self-supervised learning (MLSL) outperforms existing state-of-the-art (self or adversarial learning) algorithms. Specifically, keeping all settings similar and employing MLSL, we obtain an mIoU gain of 5.1% on GTA-V to Cityscapes adaptation and 4.3% on SYNTHIA to Cityscapes adaptation compared to the existing state-of-the-art method.
6.IPC-Net: 3D point-cloud segmentation using deep inter-point convolutional layers ⬇️
Over the last decade, the demand for better segmentation and classification algorithms in 3D spaces has significantly grown due to the popularity of new 3D sensor technologies and advancements in the field of robotics. Point-clouds are one of the most popular representations to store a digital description of 3D shapes. However, point-clouds are stored in irregular and unordered structures, which limits the direct use of segmentation algorithms such as Convolutional Neural Networks. The objective of our work is twofold: First, we aim to provide a full analysis of the PointNet architecture to illustrate which features are being extracted from the point-clouds. Second, to propose a new network architecture called IPC-Net to improve the state-of-the-art point cloud architectures. We show that IPC-Net extracts a larger set of unique features allowing the model to produce more accurate segmentations compared to the PointNet architecture. In general, our approach outperforms PointNet on every family of 3D geometries on which the models were tested. A high generalisation improvement was observed on every 3D shape, especially on the rockets dataset. Our experiments demonstrate that our main contribution, inter-point activation on the network's layers, is essential to accurately segment 3D point-clouds.
7.RandAugment: Practical data augmentation with no separate search ⬇️
Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models. Recently, learned augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images. One obstacle to a large-scale adoption of these methods is a separate search phase which significantly increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these learned augmentation approaches are unable to adjust the regularization strength based on model or dataset size. Learned augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models. In this work, we remove both of these obstacles. RandAugment may be trained on the model and dataset of interest with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes. RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous learned augmentation approaches on CIFAR-10, CIFAR-100, SVHN, and ImageNet. On the ImageNet dataset we achieve 85.0% accuracy, a 0.6% increase over the previous state-of-the-art and 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO. Finally, due to its interpretable hyperparameter, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size.
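A minimal sketch of the RandAugment recipe follows: sample N transforms per image and apply each at a single global magnitude M. The particular operation list and magnitude mapping below are illustrative assumptions, not the official implementation.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

def rand_augment(img, n=2, m=9, m_max=30):
    """Apply n transforms sampled uniformly, each at a shared global magnitude m."""
    level = m / m_max  # normalize magnitude to [0, 1]
    ops = [
        lambda im: ImageOps.autocontrast(im),
        lambda im: ImageOps.equalize(im),
        lambda im: im.rotate(30 * level),
        lambda im: ImageEnhance.Color(im).enhance(1 + level),
        lambda im: ImageEnhance.Contrast(im).enhance(1 + level),
        lambda im: ImageEnhance.Sharpness(im).enhance(1 + level),
        lambda im: ImageOps.solarize(im, int(256 * (1 - level))),
        lambda im: ImageOps.posterize(im, max(1, int(8 - 4 * level))),
    ]
    for op in random.choices(ops, k=n):   # sample with replacement, as in RandAugment
        img = op(img)
    return img

# Hypothetical usage: augmented = rand_augment(Image.open("img.jpg"), n=2, m=9)
```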
8.Depth Estimation in Nighttime using Stereo-Consistent Cyclic Translations ⬇️
Most existing methods of depth from stereo are designed for daytime scenes, where the lighting can be assumed to be sufficiently bright and more or less uniform. Unfortunately, this assumption does not hold for nighttime scenes, causing the existing methods to be erroneous when deployed at nighttime. Nighttime is not only about low light, but also about glow, glare, non-uniform distribution of light, etc. One possible solution is to train a network on nighttime images in a fully supervised manner. However, obtaining proper disparity ground-truths that are dense, independent of glare/glow, and cover sufficiently far depth ranges is extremely difficult. In this paper, to address the problem of depth from stereo at nighttime, we introduce a joint translation and stereo network that is robust to nighttime conditions. Our method uses no direct supervision and does not require ground-truth disparities of the nighttime training images. First, we utilize a translation network that can render realistic nighttime stereo images from given daytime stereo images. Second, we train a stereo network on the rendered nighttime images using the available disparity supervision from the daytime images, and simultaneously also train the translation network to gradually improve the rendered nighttime images. We introduce a stereo-consistency constraint into our translation network to ensure that the translated pairs are stereo-consistent. Our experiments show that our joint translation-stereo network outperforms the state-of-the-art methods.
9.Style Transfer by Rigid Alignment in Neural Net Feature Space ⬇️
Arbitrary style transfer is an important problem in computer vision that aims to transfer style patterns from an arbitrary style image to a given content image. However, current methods either rely on slow iterative optimization or fast pre-determined feature transformation, at the cost of compromised visual quality of the styled image, especially a distorted content structure. In this work, we present an effective and efficient approach for arbitrary style transfer that seamlessly transfers style patterns while keeping the content structure intact in the styled image. We achieve this by aligning style features to content features using rigid alignment, thus modifying the style features, unlike existing methods that do the opposite. We demonstrate the effectiveness of the proposed approach by generating high-quality stylized images and compare the results with the current state-of-the-art techniques for arbitrary style transfer.
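As a hedged sketch of what rigid alignment in feature space can look like, the snippet below aligns style feature vectors onto content feature vectors with an orthogonal Procrustes rotation plus translation; the paper's exact alignment procedure may differ.

```python
import numpy as np

def rigid_align(style_feats, content_feats):
    """Align style feature vectors onto content feature vectors with rotation + translation.

    Both inputs: (N, C) arrays of feature vectors sampled at N spatial locations.
    """
    mu_s, mu_c = style_feats.mean(axis=0), content_feats.mean(axis=0)
    S, C = style_feats - mu_s, content_feats - mu_c
    # Orthogonal Procrustes: rotation R minimizing ||S @ R - C||_F.
    U, _, Vt = np.linalg.svd(S.T @ C)
    R = U @ Vt
    return S @ R + mu_c   # rigidly moved style features, centered on the content statistics

aligned = rigid_align(np.random.randn(1024, 64), np.random.randn(1024, 64))
```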
10.Multi-view PointNet for 3D Scene Understanding ⬇️
Fusion of 2D images and 3D point clouds is important because information from dense images can enhance sparse point clouds. However, fusion is challenging because 2D and 3D data live in different spaces. In this work, we propose MVPNet (Multi-View PointNet), where we aggregate 2D multi-view image features into 3D point clouds, and then use a point based network to fuse the features in 3D canonical space to predict 3D semantic labels. To this end, we introduce view selection along with a 2D-3D feature aggregation module. Extensive experiments show the benefit of leveraging features from dense images and reveal superior robustness to varying point cloud density compared to 3D-only methods. On the ScanNetV2 benchmark, our MVPNet significantly outperforms prior point cloud based approaches on the task of 3D Semantic Segmentation. It is much faster to train than the large networks of the sparse voxel approach. We provide solid ablation studies to ease the future design of 2D-3D fusion methods and their extension to other tasks, as we showcase for 3D instance segmentation.
11.Domain Adaptation for Semantic Segmentation with Maximum Squares Loss ⬇️
Deep neural networks for semantic segmentation always require a large number of samples with pixel-level labels, which becomes the major difficulty in their real-world applications. To reduce the labeling cost, unsupervised domain adaptation (UDA) approaches are proposed to transfer knowledge from labeled synthesized datasets to unlabeled real-world datasets. Recently, some semi-supervised learning methods have been applied to UDA and achieved state-of-the-art performance. One of the most popular approaches in semi-supervised learning is the entropy minimization method. However, when applying entropy minimization to UDA for semantic segmentation, the gradient of the entropy is biased towards samples that are easy to transfer. To balance the gradient of well-classified target samples, we propose the maximum squares loss. Our maximum squares loss prevents the training process from being dominated by easy-to-transfer samples in the target domain. Besides, we introduce the image-wise weighting ratio to alleviate the class imbalance in the unlabeled target domain. Both synthetic-to-real and cross-city adaptation experiments demonstrate the effectiveness of our proposed approach. The code is released at https://github.com/ZJULearning/MaxSquareLoss.
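A minimal sketch of the maximum squares objective on unlabeled target predictions is given below; sign and normalization conventions are assumptions here, so consult the released code for the exact form.

```python
import torch
import torch.nn.functional as F

def maximum_squares_loss(logits):
    """logits: (B, C, H, W) predictions on target-domain images without labels."""
    prob = F.softmax(logits, dim=1)
    # Minimizing -sum(p^2)/2 pushes predictions toward confident one-hot outputs,
    # but its gradient grows only linearly in p, so already-confident (easy-to-transfer)
    # pixels do not dominate training the way they do under entropy minimization.
    return -torch.mean(torch.sum(prob ** 2, dim=1)) / 2

loss = maximum_squares_loss(torch.randn(2, 19, 64, 64))  # e.g. 19 Cityscapes classes
```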
12.Towards Good Practices for Video Object Segmentation ⬇️
Semi-supervised video object segmentation is an interesting yet challenging task in machine learning. In this work, we conduct a series of refinements with the propagation-based video object segmentation method and empirically evaluate their impact on the final model performance through an ablation study. By applying all the refinements, we improve the space-time memory networks to achieve an Overall score of 79.1 on the YouTube-VOS Challenge 2019.
13.Meta-learning algorithms for Few-Shot Computer Vision ⬇️
Few-Shot Learning is the challenge of training a model with only a small amount of data. Many solutions to this problem use meta-learning algorithms, i.e. algorithms that learn to learn. By sampling few-shot tasks from a larger dataset, we can teach these algorithms to solve new, unseen tasks. This document reports my work on meta-learning algorithms for Few-Shot Computer Vision. This work was done during my internship at Sicara, a French company building image recognition solutions for businesses. It contains: 1. an extensive review of the state-of-the-art in few-shot computer vision; 2. a benchmark of meta-learning algorithms for few-shot image classification; 3. the introduction of a novel meta-learning algorithm for few-shot object detection, which is still in development.
14.Enhancing Object Detection in Adverse Conditions using Thermal Imaging ⬇️
Autonomous driving relies on deriving understanding of objects and scenes through images. These images are often captured by sensors in the visible spectrum. For improved detection capabilities we propose the use of thermal sensors to augment the vision capabilities of an autonomous vehicle. In this paper, we present our investigations on the fusion of visible and thermal spectrum images using a publicly available dataset, and use it to analyze the performance of object recognition on other known driving datasets. We present a comparison of object detection in night time imagery and qualitatively demonstrate that thermal images significantly improve detection accuracy.
15.EdgeCNN: Convolutional Neural Network Classification Model with small inputs for Edge Computing ⬇️
With the development of the Internet of Things (IoT), data is increasingly appearing on the edge of the network. Processing tasks on the edge of the network can effectively solve the problems of personal privacy leaks and server overload. As a result, it has attracted a great deal of attention and made substantial progress. This progress includes efficient convolutional neural network (CNN) models such as MobileNet and ShuffleNet. However, these are general-purpose network models that usually need to identify multiple targets when applied, so their input size is very large. In some specific cases, only a single target needs to be classified, so a network with a small input can be designed to reduce computation. In addition, other efficient neural network models are primarily designed for mobile phones. Mobile phones have faster memory access, which allows them to use group convolution. In particular, this paper finds that the recently widely used group convolution is not suitable for devices with very slow memory access. Therefore, the EdgeCNN of this paper is designed for edge computing devices with low memory access speed and low computing resources. EdgeCNN has been run successfully on the Raspberry Pi 3B+ at a speed of 1.37 frames per second. The accuracy of facial expression classification for the FER-2013 and RAF-DB datasets outperforms other proposed networks that are compatible with the Raspberry Pi 3B+. The implementation of EdgeCNN is available at this https URL
16.CullNet: Calibrated and Pose Aware Confidence Scores for Object Pose Estimation ⬇️
We present a new approach for single-view, image-based object pose estimation. Specifically, this paper addresses the problem of culling false positives among several pose proposal estimates. Our approach targets the problem of inaccurate confidence values predicted by CNNs, which are used by many current methods to choose a final object pose prediction. We present a network, called CullNet, to solve this task. CullNet takes as input pairs of pose masks rendered from a 3D model and cropped regions in the original image, which are then used to calibrate the confidence scores of the pose proposals. This new set of confidence scores is found to be significantly more reliable for accurate object pose estimation, as shown by our results. Our experimental results on multiple challenging datasets (LINEMOD and Occlusion LINEMOD) reflect the utility of our proposed method. Our overall pose estimation pipeline outperforms state-of-the-art object pose estimation methods on these standard object pose estimation datasets. Our code is publicly available on this https URL.
17.Spatio-Temporal FAST 3D Convolutions for Human Action Recognition ⬇️
Effective processing of video input is essential for the recognition of temporally varying events such as human actions. Motivated by the often distinctive temporal characteristics of actions in either the horizontal or vertical direction, we introduce a novel convolution block for CNN architectures with video input. Our proposed Fractioned Adjacent Spatial and Temporal (FAST) 3D convolutions are a natural decomposition of a regular 3D convolution. Each convolution block consists of three sequential convolution operations: a 2D spatial convolution followed by spatio-temporal convolutions in the horizontal and vertical direction, respectively. Additionally, we introduce a FAST variant that treats horizontal and vertical motion in parallel. Experiments on the benchmark action recognition datasets UCF-101 and HMDB-51 with ResNet architectures demonstrate consistently increased performance of FAST 3D convolution blocks over traditional 3D convolutions. The lower validation loss indicates better generalization, especially for deeper networks. We also evaluate the performance of CNN architectures with similar memory requirements, based either on Two-stream networks or on 3D convolution blocks. DenseNet-121 with FAST 3D convolutions was shown to perform best, giving further evidence of the merits of the decoupled spatio-temporal convolutions.
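A sketch of the sequential FAST decomposition described above, assuming PyTorch Conv3d with input of shape (batch, channels, time, height, width); kernel sizes and channel counts are illustrative.

```python
import torch
import torch.nn as nn

class FASTBlock(nn.Module):
    """Sequential FAST decomposition of a k x k x k 3D convolution (sketch)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        p = k // 2
        # 2D spatial convolution (no temporal extent).
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k), padding=(0, p, p))
        # Spatio-temporal convolution in the horizontal direction (time x width).
        self.temporal_w = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, k), padding=(p, 0, p))
        # Spatio-temporal convolution in the vertical direction (time x height).
        self.temporal_h = nn.Conv3d(out_ch, out_ch, kernel_size=(k, k, 1), padding=(p, p, 0))

    def forward(self, x):                    # x: (B, C, T, H, W)
        return self.temporal_h(self.temporal_w(self.spatial(x)))

y = FASTBlock(3, 64)(torch.randn(2, 3, 16, 112, 112))   # e.g. a 16-frame clip
```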
18.On Incorporating Semantic Prior Knowledge in Deep Learning Through Embedding-Space Constraints ⬇️
The knowledge that humans hold about a problem often extends far beyond a set of training data and output labels. While the success of deep learning mostly relies on supervised training, important properties cannot be inferred efficiently from end-to-end annotations alone, for example causal relations or domain-specific invariances. We present a general technique to supplement supervised training with prior knowledge expressed as relations between training instances. We illustrate the method on the task of visual question answering to exploit various auxiliary annotations, including relations of equivalence and of logical entailment between questions. Existing methods to use these annotations, including auxiliary losses and data augmentation, cannot guarantee the strict inclusion of these relations into the model since they require a careful balancing against the end-to-end objective. Our method uses these relations to shape the embedding space of the model, and treats them as strict constraints on its learned representations. In the context of VQA, this approach brings significant improvements in accuracy and robustness, in particular over the common practice of incorporating the constraints as a soft regularizer. We also show that incorporating this type of prior knowledge with our method brings consistent improvements, independently from the amount of supervised data used. It demonstrates the value of an additional training signal that is otherwise difficult to extract from end-to-end annotations alone.
19.Residual Attention Graph Convolutional Network for Geometric 3D Scene Classification ⬇️
Geometric 3D scene classification is a very challenging task. Current methodologies extract the geometric information using only a depth channel provided by an RGB-D sensor. These kinds of methodologies introduce possible errors due to missing local geometric context in the depth channel. This work proposes a novel Residual Attention Graph Convolutional Network that exploits the intrinsic geometric context inside a 3D space without using any kind of point features, allowing the use of organized or unorganized 3D data. Experiments are done in NYU Depth v1 and SUN-RGBD datasets to study the different configurations and to demonstrate the effectiveness of the proposed method. Experimental results show that the proposed method outperforms current state-of-the-art in geometric 3D scene classification tasks.
20.Random Bias Initialization Improving Binary Neural Network Training ⬇️
Edge intelligence, especially binary neural networks (BNN), has attracted considerable attention from the artificial intelligence community recently. BNNs significantly reduce the computational cost, model size, and memory footprint. However, there is still a performance gap between the successful full-precision neural network with ReLU activation and BNNs. We argue that the accuracy drop of BNNs is due to their geometry. We analyze the behaviour of the full-precision neural network with ReLU activation and compare it with its binarized counterpart. This comparison suggests random bias initialization as a remedy to activation saturation in full-precision networks and leads us towards an improved BNN training. Our numerical experiments confirm our geometric intuition.
21.Single-Network Whole-Body Pose Estimation ⬇️
We present the first single-network approach for 2D whole-body pose estimation, which entails simultaneous localization of body, face, hands, and feet keypoints. Due to the bottom-up formulation, our method maintains constant real-time performance regardless of the number of people in the image. The network is trained in a single stage using multi-task learning, through an improved architecture which can handle scale differences between body/foot and face/hand keypoints. Our approach considerably improves upon OpenPose [Cao et al., 2018], the only work so far capable of whole-body pose estimation, both in terms of speed and global accuracy. Unlike OpenPose, our method does not need to run an additional network for each hand and face candidate, making it substantially faster for multi-person scenarios. This work directly results in a reduction of computational complexity for applications that require 2D whole-body information (e.g., VR/AR, re-targeting). In addition, it yields higher accuracy, especially for occluded, blurry, and low-resolution faces and hands. For code, trained models, and validation benchmarks, visit our project page: this https URL.
22.SymmetricNet: A mesoscale eddy detection method based on multivariate fusion data ⬇️
Mesoscale eddies play a significant role in marine energy transport, the marine biological environment, and the marine climate. Due to their huge impact on the ocean, mesoscale eddy detection has become a hot research area in recent years, attracting more and more researchers to the field. However, existing approaches, which are mainly based on traditional detection methods, typically use only Sea Surface Height (SSH) as the detection variable, resulting in inaccurate performance. In this paper, we propose a mesoscale eddy detection method based on multivariate fusion data to solve this problem. We use not only the SSH variable but also two additional variables, Sea Surface Temperature (SST) and velocity of flow, achieving a multivariate information fusion input. We design a novel symmetric network, which merges low-level feature maps from the downsampling pathway and high-level feature maps from the upsampling pathway by lateral connections. In addition, we apply dilated convolutions to the network structure to increase the receptive field and obtain more contextual information while keeping the number of parameters constant. Finally, we demonstrate the effectiveness of our method on a dataset provided by us, achieving a test-set performance of 97.06%, greatly improving on previous mesoscale eddy detection methods.
23.REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs ⬇️
Deep neural networks (DNNs), as the basis of object detection, will play a key role in the development of future autonomous systems with full autonomy. These autonomous systems have special requirements for real-time, energy-efficient implementations of DNNs on power-constrained systems. Two research thrusts are dedicated to performance and energy efficiency enhancement of the inference phase of DNNs. The first is model compression techniques, while the second is efficient hardware implementation. Recent works on extremely-low-bit CNNs such as the binary neural network (BNN) and XNOR-Net replace the traditional floating-point operations with binary bit operations, which significantly reduces the memory bandwidth and storage requirement. However, this suffers from non-negligible accuracy loss and underutilized digital signal processing (DSP) blocks of FPGAs. To overcome these limitations, this paper proposes REQ-YOLO, a resource-aware, systematic weight quantization framework for object detection, considering both algorithm and hardware resource aspects. We adopt the block-circulant matrix method and propose a heterogeneous weight quantization using the Alternating Direction Method of Multipliers (ADMM), an effective optimization technique for general, non-convex optimization problems. To achieve real-time, highly-efficient implementations on FPGA, we present the detailed hardware implementation of block circulant matrices on CONV layers and develop an efficient processing element (PE) structure supporting the heterogeneous weight quantization, CONV dataflow and pipelining techniques, design optimization, and a template-based automatic synthesis framework to optimally exploit hardware resources. Experimental results show that our proposed REQ-YOLO framework can significantly compress the YOLO model while introducing very small accuracy degradation.
24.SteReFo: Efficient Image Refocusing with Stereo Vision ⬇️
Whether to attract viewer attention to a particular object, give the impression of depth or simply reproduce human-like scene perception, shallow depth of field images are used extensively by professional and amateur photographers alike. To this end, high quality optical systems are used in DSLR cameras to focus on a specific depth plane while producing visually pleasing bokeh. We propose a physically motivated pipeline to mimic this effect from all-in-focus stereo images, typically retrieved by mobile cameras. It is capable of changing the focal plane a posteriori at 76 FPS on KITTI images to enable real-time applications. As our portmanteau suggests, SteReFo interrelates stereo-based depth estimation and refocusing efficiently. In contrast to other approaches, our pipeline is simultaneously fully differentiable, physically motivated, and agnostic to scene content. It also enables computational video focus tracking for moving objects in addition to refocusing of static images. We evaluate our approach on the publicly available datasets SceneFlow, KITTI, and CityScapes, and quantify the quality of architectural changes.
25.End-to-End Deep Convolutional Active Contours for Image Segmentation ⬇️
The Active Contour Model (ACM) is a standard image analysis technique whose numerous variants have attracted an enormous amount of research attention across multiple fields. However, the ACM's differential-equation-based formulation and prototypical dependence on user initialization have been incorrectly regarded as being largely incompatible with the recently popular deep learning approaches to image segmentation. This paper introduces the first tight unification of these two paradigms. In particular, we devise Deep Convolutional Active Contours (DCAC), a truly end-to-end trainable image segmentation framework comprising a Convolutional Neural Network (CNN) and an ACM with learnable parameters. The ACM's Eulerian energy functional includes per-pixel parameter maps predicted by the backbone CNN, which also initializes the ACM. Importantly, both the CNN and ACM components are fully implemented in TensorFlow, and the entire DCAC architecture is end-to-end automatically differentiable and backpropagation trainable without user intervention. As a challenging test case, we tackle the problem of building instance segmentation in aerial images and evaluate DCAC on two publicly available datasets, Vaihingen and Bing Huts. Our results demonstrate that, for building segmentation, DCAC establishes a new state-of-the-art performance by a wide margin.
26.Exploiting Geometric Constraints on Dense Trajectories for Motion Saliency ⬇️
The existing approaches for salient motion segmentation are unable to explicitly learn geometric cues and often give false detections on prominent static objects. We exploit multiview geometric constraints to avoid such mistakes. To handle nonrigid background like sea, we also propose a robust fusion mechanism between motion and appearance-based features. We find dense trajectories, covering every pixel in the video, and propose trajectory-based epipolar distances to distinguish between background and foreground regions. Trajectory epipolar distances are data-independent and can be readily computed given a few feature correspondences in the images. We show that by combining epipolar distances with optical flow, a powerful motion network can be learned. To enable the network to leverage both sources of information, we propose a simple mechanism we call input-dropout. We outperform the previous motion network on the DAVIS-2016 dataset by 5.2% in mean IoU score. By robustly fusing our motion network with an appearance network using the proposed input-dropout, we also outperform the previous methods on the DAVIS-2016, 2017 and SegTrack-v2 datasets.
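For intuition, the snippet below computes the distance of a point to the epipolar line induced by a fundamental matrix; the trajectory-based variant in the paper accumulates such distances along dense trajectories, and its exact formulation may differ from this sketch.

```python
import numpy as np

def epipolar_distance(F_mat, x1, x2):
    """Distance of point x2 (image 2) from the epipolar line of x1 (image 1) under F_mat."""
    x1h, x2h = np.append(x1, 1.0), np.append(x2, 1.0)
    line = F_mat @ x1h                            # epipolar line a*x + b*y + c = 0 in image 2
    return abs(x2h @ line) / np.hypot(line[0], line[1])

# Points on the static background satisfy the epipolar constraint (small distance),
# while independently moving foreground points tend to violate it (large distance).
F_mat = np.random.randn(3, 3)                     # placeholder fundamental matrix
d = epipolar_distance(F_mat, np.array([100.0, 50.0]), np.array([102.0, 51.0]))
```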
27.Learning to Align Multi-Camera Domain for Unsupervised Video Person Re-Identification ⬇️
Most video person re-identification (re-ID) methods are mainly based on supervised learning, which requires laborious cross-camera ID labeling. Due to this limit, it is difficult to increase the number of cameras for constructing a large camera network. In this paper, we address the person ID labeling issue by presenting novel deep representation learning without ID information across multiple cameras. Specifically, our method consists of both inter- and intra camera feature learning techniques. We maximize feature distances between people within a camera. At the same time, considering each camera as a different domain, we apply domain adversarial learning across multiple camera views for minimizing camera domain discrepancy. To further enhance our approach, we propose person part-level adaptation to effectively perform multi-camera domain invariant feature learning at different spatial regions. We carry out comprehensive experiments on four public re-ID datasets (i.e., PRID-2011, iLIDS-VID, MARS, and Market1501). Our method outperforms state-of-the-art methods by a large margin of about 20% in terms of rank-1 accuracy on the large-scale MARS dataset.
28.RPM-Net: Robust Pixel-Level Matching Networks for Self-Supervised Video Object Segmentation ⬇️
In this paper, we introduce a self-supervised approach for video object segmentation without human labeled data. Specifically, we present Robust Pixel-level Matching Networks (RPM-Net), a novel deep architecture that matches pixels between adjacent frames, using only color information from unlabeled videos for training. Technically, RPM-Net can be separated into two main modules. The embedding module first projects input images into a high dimensional embedding space. Then the matching module with deformable convolution layers matches pixels between reference and target frames based on the embedding features. Unlike previous methods using deformable convolution, our matching module adopts deformable convolution to focus on similar features in spatio-temporally neighboring pixels. Our experiments show that the selective feature sampling improves the robustness to challenging problems in video object segmentation such as camera shake, fast motion, deformation, and occlusion. Also, we carry out comprehensive experiments on three public datasets (i.e., DAVIS-2017, SegTrack-v2, and Youtube-Objects) and achieve state-of-the-art performance on self-supervised video object segmentation. Moreover, we significantly reduce the performance gap between self-supervised and fully-supervised video object segmentation (41.0% vs. 52.5% on the DAVIS-2017 validation set).
29.Spatiotemporal Co-attention Recurrent Neural Networks for Human-Skeleton Motion Prediction ⬇️
Human motion prediction aims to generate future motions based on the observed human motions. Witnessing the success of Recurrent Neural Networks (RNN) in modeling the sequential data, recent works utilize RNN to model human-skeleton motion on the observed motion sequence and predict future human motions. However, these methods did not consider the existence of the spatial coherence among joints and the temporal evolution among skeletons, which reflects the crucial characteristics of human motion in spatiotemporal space. To this end, we propose a novel Skeleton-joint Co-attention Recurrent Neural Networks (SC-RNN) to capture the spatial coherence among joints, and the temporal evolution among skeletons simultaneously on a skeleton-joint co-attention feature map in spatiotemporal space. First, a skeleton-joint feature map is constructed as the representation of the observed motion sequence. Second, we design a new Skeleton-joint Co-Attention (SCA) mechanism to dynamically learn a skeleton-joint co-attention feature map of this skeleton-joint feature map, which can refine the useful observed motion information to predict one future motion. Third, a variant of GRU embedded with SCA collaboratively models the human-skeleton motion and human-joint motion in spatiotemporal space by regarding the skeleton-joint co-attention feature map as the motion context. Experimental results on human motion prediction demonstrate the proposed method outperforms the related methods.
30.Salient Instance Segmentation via Subitizing and Clustering ⬇️
The goal of salient region detection is to identify the regions of an image that attract the most attention. Many methods have achieved state-of-the-art performance levels on this task. Recently, salient instance segmentation has become an even more challenging task than traditional salient region detection; however, few of the existing methods have concentrated on this underexplored problem. Unlike the existing methods, which usually employ object proposals to roughly count and locate object instances, our method applies salient object subitizing to predict an accurate number of instances for salient instance segmentation. In this paper, we propose a multitask densely connected neural network (MDNN) to segment salient instances in an image. In contrast to existing approaches, our framework is proposal-free and category-independent. The MDNN contains two parallel branches: the first is a densely connected subitizing network (DSN) used for subitizing prediction; the second is a densely connected fully convolutional network (DFCN) used for salient region detection. The MDNN simultaneously outputs saliency maps and salient object subitizing. Then, an adaptive deep feature-based spectral clustering operation segments the salient regions into instances based on the subitizing and saliency maps. The experimental results on both salient region detection and salient instance segmentation datasets demonstrate the satisfactory performance of our framework. Notably, its mAP@0.5 and mAP@0.7 reach 73.46% and 60.14% on the salient instance dataset, substantially higher than the results achieved by the state-of-the-art algorithm.
31.Learning Efficient Convolutional Networks through Irregular Convolutional Kernels ⬇️
As deep neural networks are increasingly used in applications suited for low-power devices, a fundamental dilemma becomes apparent: the trend is to grow models to absorb ever-increasing data, which makes them memory intensive; however, low-power devices are designed with very limited memory and cannot store large models. Parameter pruning is critical for deep model deployment on low-power devices. Existing efforts mainly focus on designing highly efficient structures or pruning redundant connections for networks. They are usually sensitive to the tasks or rely on dedicated and expensive hashing storage strategies. In this work, we introduce a novel approach for achieving a lightweight model from the viewpoints of reconstructing the structure of convolutional kernels and efficient storage. Our approach transforms a traditional square convolution kernel into line segments, and automatically learns a proper strategy for equipping these line segments to model diverse features. The experimental results indicate that our approach can massively reduce the number of parameters (pruned by 69% on DenseNet-40) and calculations (pruned by 59% on DenseNet-40) while maintaining acceptable performance (losing less than 2% accuracy).
32.PolarMask: Single Shot Instance Segmentation with Polar Representation ⬇️
In this paper, we introduce an anchor-box free and single shot instance segmentation method, which is conceptually simple, fully convolutional and can be used as a mask prediction module for instance segmentation, by easily embedding it into most off-the-shelf detection methods. Our method, termed PolarMask, formulates the instance segmentation problem as instance center classification and dense distance regression in a polar coordinate. Moreover, we propose two effective approaches to deal with sampling high-quality center examples and optimization for dense distance regression, respectively, which can significantly improve the performance and simplify the training process. Without any bells and whistles, PolarMask achieves 32.9% in mask mAP with single-model and single-scale training/testing on challenging COCO dataset. For the first time, we demonstrate a much simpler and flexible instance segmentation framework achieving competitive accuracy. We hope that the proposed PolarMask framework can serve as a fundamental and strong baseline for single shot instance segmentation tasks. Code is available at: this http URL.
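To make the polar representation concrete, the sketch below decodes a center point and a fixed set of ray lengths back into a contour polygon; the number of rays and the decoding details here are illustrative assumptions.

```python
import numpy as np

def decode_polar_mask(center, distances):
    """center: (cx, cy); distances: (K,) ray lengths at K evenly spaced angles."""
    k = len(distances)
    angles = np.linspace(0, 2 * np.pi, k, endpoint=False)
    xs = center[0] + distances * np.cos(angles)
    ys = center[1] + distances * np.sin(angles)
    return np.stack([xs, ys], axis=1)   # (K, 2) contour polygon, ready for rasterization

# A predicted instance = one center plus K regressed distances (36 rays shown here).
contour = decode_polar_mask(center=(120.0, 80.0), distances=np.full(36, 25.0))
```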
33.Self-Supervised Learning of Depth and Ego-motion with Differentiable Bundle Adjustment ⬇️
Learning to predict scene depth and camera motion from RGB inputs only is a challenging task. Most existing learning-based methods deal with this task in a supervised manner, which requires ground-truth data that is expensive to acquire. More recent approaches explore the possibility of estimating scene depth and camera pose in a self-supervised learning framework. Although encouraging results have been shown, current methods either learn from monocular videos for depth and pose, typically without enforcing multi-view geometry constraints between scene structure and camera motion, or require stereo sequences as input where the ground-truth between-frame motion parameters need to be known. In this paper we propose to jointly optimize the scene depth and camera motion by incorporating a differentiable Bundle Adjustment (BA) layer that minimizes the feature-metric error, and then form the photometric consistency loss with view synthesis as the final supervisory signal. The proposed approach only needs unlabeled monocular videos as input, and extensive experiments on the KITTI and Cityscapes datasets show that our method achieves state-of-the-art results among self-supervised approaches using monocular videos as input, and even gains an advantage over the line of methods that learn from calibrated stereo sequences (i.e. with pose supervision).
34.Weakly Supervised Energy-Based Learning for Action Segmentation ⬇️
This paper is about labeling video frames with action classes under weak supervision in training, where we have access to a temporal ordering of actions, but their start and end frames in training videos are unknown. Following prior work, we use an HMM grounded on a Gated Recurrent Unit (GRU) for frame labeling. Our key contribution is a new constrained discriminative forward loss (CDFL) that we use for training the HMM and GRU under weak supervision. While prior work typically estimates the loss on a single, inferred video segmentation, our CDFL discriminates between the energy of all valid and invalid frame labelings of a training video. A valid frame labeling satisfies the ground-truth temporal ordering of actions, whereas an invalid one violates the ground truth. We specify an efficient recursive algorithm for computing the CDFL in terms of the logadd function of the segmentation energy. Our evaluation on action segmentation and alignment gives superior results to those of the state of the art on the benchmark Breakfast Action, Hollywood Extended, and 50Salads datasets.
35.Feature Weighting and Boosting for Few-Shot Segmentation ⬇️
This paper is about few-shot segmentation of foreground objects in images. We train a CNN on small subsets of training images, each mimicking the few-shot setting. In each subset, one image serves as the query and the other(s) as support image(s) with ground-truth segmentation. The CNN first extracts feature maps from the query and support images. Then, a class feature vector is computed as an average of the support's feature maps over the known foreground. Finally, the target object is segmented in the query image by using a cosine similarity between the class feature vector and the query's feature map. We make two contributions by: (1) Improving discriminativeness of features so their activations are high on the foreground and low elsewhere; and (2) Boosting inference with an ensemble of experts guided with the gradient of loss incurred when segmenting the support images in testing. Our evaluations on the PASCAL-$5^i$ and COCO-$20^i$ datasets demonstrate that we significantly outperform existing approaches.
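A minimal sketch of the described pipeline: the class feature vector is obtained by masked average pooling over the support foreground, and the query is scored by cosine similarity at every location. Feature shapes and the final thresholding step are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_segmentation(support_feat, support_mask, query_feat):
    """support_feat, query_feat: (C, H, W); support_mask: (H, W) with 1 on the foreground."""
    fg = support_mask.unsqueeze(0)                                        # (1, H, W)
    # Class feature vector: masked average of support features over the known foreground.
    class_vec = (support_feat * fg).sum(dim=(1, 2)) / fg.sum().clamp(min=1)
    # Cosine similarity between the class vector and every query location.
    q = F.normalize(query_feat, dim=0)
    c = F.normalize(class_vec, dim=0)
    return (q * c[:, None, None]).sum(dim=0)     # (H, W) similarity map; threshold for the mask

sim_map = cosine_segmentation(torch.randn(256, 60, 60),
                              (torch.rand(60, 60) > 0.5).float(),
                              torch.randn(256, 60, 60))
```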
36.Facial Expression Recognition Using Disentangled Adversarial Learning ⬇️
The representations used for Facial Expression Recognition (FER) usually contain expression information along with other variations such as identity and illumination. In this paper, we propose a novel Disentangled Expression learning-Generative Adversarial Network (DE-GAN) to explicitly disentangle the facial expression representation from identity information. In this learning-by-reconstruction method, the facial expression representation is learned by reconstructing an expression image employing an encoder-decoder based generator. This expression representation is disentangled from the identity component by explicitly providing the identity code to the decoder part of DE-GAN. The process of expression image reconstruction and disentangled expression representation learning is improved by performing expression and identity classification in the discriminator of DE-GAN. The disentangled facial expression representation is then used for facial expression recognition employing simple classifiers like SVM or MLP. The experiments are performed on publicly available and widely used facial expression databases (CK+, MMI, Oulu-CASIA). The experimental results show that the proposed technique produces results comparable with state-of-the-art methods.
37.Grouped Spatial-Temporal Aggregation for Efficient Action Recognition ⬇️
Temporal reasoning is an important aspect of video analysis. 3D CNN shows good performance by exploring spatial-temporal features jointly in an unconstrained way, but it also increases the computational cost a lot. Previous works try to reduce the complexity by decoupling the spatial and temporal filters. In this paper, we propose a novel decomposition method that decomposes the feature channels into spatial and temporal groups in parallel. This decomposition can make two groups focus on static and dynamic cues separately. We call this grouped spatial-temporal aggregation (GST). This decomposition is more parameter-efficient and enables us to quantitatively analyze the contributions of spatial and temporal features in different layers. We verify our model on several action recognition tasks that require temporal reasoning and show its effectiveness.
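A rough sketch of the parallel GST decomposition, assuming PyTorch: channels are split into a spatial group (2D convolution) and a temporal group (3D convolution) that are processed in parallel and concatenated; the split ratio and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn

class GSTBlock(nn.Module):
    """Parallel grouped spatial-temporal aggregation (sketch)."""
    def __init__(self, channels, spatial_ratio=0.5, k=3):
        super().__init__()
        self.cs = int(channels * spatial_ratio)   # channels for the spatial group
        ct = channels - self.cs                   # channels for the temporal group
        p = k // 2
        self.spatial = nn.Conv3d(self.cs, self.cs, kernel_size=(1, k, k), padding=(0, p, p))
        self.temporal = nn.Conv3d(ct, ct, kernel_size=(k, k, k), padding=(p, p, p))

    def forward(self, x):                         # x: (B, C, T, H, W)
        xs, xt = x[:, :self.cs], x[:, self.cs:]
        # Spatial group focuses on static cues, temporal group on dynamic cues.
        return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)

y = GSTBlock(64)(torch.randn(2, 64, 8, 56, 56))
```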
38.Feature Level Fusion from Facial Attributes for Face Recognition ⬇️
We introduce a deep convolutional neural network (CNN) architecture to classify facial attributes and recognize face images simultaneously via a shared learning paradigm, improving both facial attribute prediction accuracy and face recognition performance. In this method, we use facial attributes as an auxiliary source of information to assist the CNN features extracted from the face images in improving face recognition performance. Specifically, we use a shared CNN architecture that jointly predicts facial attributes and recognizes face images via shared learning parameters, and then we use facial attribute features as an auxiliary source of information, concatenated with the face features, to increase the discrimination of the CNN for face recognition. This process assists the CNN classifier in better recognizing face images. The experimental results show that our model increases both face recognition and facial attribute prediction performance, especially for identity attributes such as gender and race. We evaluated our method on several standard datasets labeled with identities and face attributes, and the results show that the proposed method outperforms state-of-the-art face recognition models.
39.GLA-Net: An Attention Network with Guided Loss for Mismatch Removal ⬇️
Mismatch removal is a critical prerequisite in many feature-based tasks. Recent attempts cast the mismatch removal task as a binary classification problem and solve it through deep learning based methods. In these methods, the imbalance between positive and negative classes is important, which affects network performance, i.e., Fn-score. To establish the link between Fn-score and loss, we propose to guide the loss with the Fn-score directly. We theoretically demonstrate the direct link between our Guided Loss and Fn-score during training. Moreover, we discover that outliers often impair global context in mismatch removal networks. To address this issue, we introduce the attention mechanism to the mismatch removal task and propose a novel Inlier Attention Block (IA Block). To evaluate the effectiveness of our loss and IA Block, we design an end-to-end network for mismatch removal, called GLA-Net (our code will be made available on GitHub). Experiments have shown that our network achieves state-of-the-art performance on benchmark datasets.
40.On Generalizing Detection Models for Unconstrained Environments ⬇️
Object detection has seen tremendous progress in recent years. However, current algorithms don't generalize well when tested on diverse data distributions. We address the problem of incremental learning in object detection on the India Driving Dataset (IDD). Our approach involves using multiple domain-specific classifiers and effective transfer learning techniques focussed on avoiding catastrophic forgetting. We evaluate our approach on the IDD and BDD100K dataset. Results show the effectiveness of our domain adaptive approach in the case of domain shifts in environments.
41.Training convolutional neural networks with cheap convolutions and online distillation ⬇️
The large memory and computation consumption of convolutional neural networks (CNNs) has been one of the main barriers to deploying them on resource-limited systems. To this end, cheap convolutions (e.g., group convolution, depth-wise convolution, and shift convolution) have recently been used for memory and computation reduction, but they require specific architecture design. Furthermore, directly replacing the standard convolution with these cheap ones results in low discriminability of the compressed networks. In this paper, we propose to use knowledge distillation to improve the performance of compact student networks with cheap convolutions. In our case, the teacher is a network with the standard convolution, while the student is a simple transformation of the teacher architecture without complicated redesigning. In particular, we propose a novel online distillation method, which constructs the teacher network online without pre-training and conducts mutual learning between the teacher and student networks, to improve the performance of the student model. Extensive experiments demonstrate that the proposed approach achieves superior performance in simultaneously reducing the memory and computation overhead of cutting-edge CNNs on different datasets, including CIFAR-10/100 and ImageNet ILSVRC 2012, compared to the state-of-the-art CNN compression and acceleration methods. The codes are publicly available at this https URL.
42.Frame and Feature-Context Video Super-Resolution ⬇️
For video super-resolution, current state-of-the-art approaches either process multiple low-resolution (LR) frames to produce each output high-resolution (HR) frame separately in a sliding window fashion or recurrently exploit the previously estimated HR frames to super-resolve the following frame. The main weaknesses of these approaches are: 1) separately generating each output frame may obtain high-quality HR estimates while resulting in unsatisfactory flickering artifacts, and 2) combining previously generated HR frames can produce temporally consistent results in the case of short information flow, but it will cause significant jitter and jagged artifacts because the previous super-resolving errors are constantly accumulated to the subsequent frames. In this paper, we propose a fully end-to-end trainable frame and feature-context video super-resolution (FFCVSR) network that consists of two key sub-networks: local network and context network, where the first one explicitly utilizes a sequence of consecutive LR frames to generate local feature and local SR frame, and the other combines the outputs of local network and the previously estimated HR frames and features to super-resolve the subsequent frame. Our approach takes full advantage of the inter-frame information from multiple LR frames and the context information from previously predicted HR frames, producing temporally consistent high-quality results while maintaining real-time speed by directly reusing previous features and frames. Extensive evaluations and comparisons demonstrate that our approach produces state-of-the-art results on a standard benchmark dataset, with advantages in terms of accuracy, efficiency, and visual quality over the existing approaches.
43.DeepUSPS: Deep Robust Unsupervised Saliency Prediction With Self-Supervision ⬇️
Deep neural network (DNN) based salient object detection in images relying on high-quality labels is expensive. Alternative unsupervised approaches rely on careful selection of multiple handcrafted saliency methods to generate noisy pseudo-ground-truth labels. In this work, we propose a two-stage mechanism for robust unsupervised object saliency prediction, where the first stage involves refinement of the noisy pseudo-labels generated from different handcrafted methods. Each handcrafted method is substituted by a deep network that learns to generate the pseudo-labels. These labels are refined incrementally in multiple iterations via our proposed self-supervision technique. In the second stage, the refined labels produced from multiple networks representing multiple saliency methods are used to train the actual saliency detection network. We show that this self-learning procedure outperforms all the existing unsupervised methods over different datasets. Results are even comparable to those of fully-supervised state-of-the-art approaches.
44.Meta Learning with Differentiable Closed-form Solver for Fast Video Object Segmentation ⬇️
This paper tackles the problem of video object segmentation. We are specifically concerned with the task of segmenting all pixels of a target object in all frames, given the annotation mask in the first frame. Even when such annotation is available, this remains a challenging problem because of the changing appearance and shape of the object over time. In this paper, we tackle this task by formulating it as a meta-learning problem, where the base learner grasps the semantic scene understanding for a general type of object, and the meta learner quickly adapts to the appearance of the target object with a few examples. Our proposed meta-learning method uses a closed-form optimizer, the so-called "ridge regression", which has been shown to be conducive to fast and better training convergence. Moreover, we propose a mechanism, named "block splitting", to further speed up the training process as well as to reduce the number of learning parameters. In comparison with the state-of-the-art methods, our proposed framework achieves a significant boost in processing speed, while having very competitive performance compared to the best performing methods on the widely used datasets.
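The closed-form ridge regression at the core of such a meta-learner can be sketched as follows; the variable names, regularization value, and the usage example are assumptions rather than the paper's exact setup.

```python
import torch

def ridge_regression(X, Y, lam=1.0):
    """Closed-form solver W = (X^T X + lam*I)^-1 X^T Y, differentiable end to end.

    X: (N, D) per-pixel features from the annotated first frame
    Y: (N, K) one-hot (or soft) labels for those pixels
    """
    d = X.shape[1]
    A = X.t() @ X + lam * torch.eye(d, device=X.device)
    return torch.linalg.solve(A, X.t() @ Y)       # (D, K) adapted weights

# Hypothetical usage: adapt to a new target object, then score pixels of later frames.
W = ridge_regression(torch.randn(4096, 64), torch.randn(4096, 2).softmax(dim=1))
scores = torch.randn(4096, 64) @ W                # per-pixel class scores for a later frame
```

Because the solution is an explicit matrix expression, gradients can flow through the inner adaptation step, which is what makes this kind of solver attractive for meta-learning.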
45.Meta R-CNN : Towards General Solver for Instance-level Low-shot Learning ⬇️
Resembling the rapid learning capability of humans, low-shot learning empowers vision systems to understand new concepts by training with few samples. Leading approaches are derived from meta-learning on images with a single visual object. Obfuscated by complex backgrounds and multiple objects in one image, they struggle to promote research on low-shot object detection/segmentation. In this work, we present a flexible and general methodology to achieve these tasks. Our work extends Faster/Mask R-CNN by proposing meta-learning over RoI (Region-of-Interest) features instead of a full image feature. This simple spirit disentangles multi-object information merged with the background, without bells and whistles, enabling Faster/Mask R-CNN to turn into a meta-learner that achieves the tasks. Specifically, we introduce a Predictor-head Remodeling Network (PRN) that shares its main backbone with Faster/Mask R-CNN. PRN receives images containing low-shot objects with their bounding boxes or masks to infer their class-attentive vectors. The vectors take channel-wise soft-attention on RoI features, remodeling those R-CNN predictor heads to detect or segment the objects that are consistent with the classes these vectors represent. In our experiments, Meta R-CNN yields the state of the art in low-shot object detection and improves low-shot object segmentation with Mask R-CNN.
46.Semantic Example Guided Image-to-Image Translation ⬇️
Many image-to-image (I2I) translation problems are inherently highly diverse: a single input may have various counterparts. Prior works proposed multi-modal networks that can build a many-to-many mapping between two visual domains. However, most of them are guided by sampled noises. Some others encode the reference images into a latent vector, by which the semantic information of the reference image is washed away. In this work, we aim to provide a solution to control the output based on references semantically. Given a reference image and an input in another domain, a semantic matching is first performed between the two visual contents and generates the auxiliary image, which is explicitly encouraged to preserve semantic characteristics of the reference. A deep network is then used for I2I translation and the final outputs are expected to be semantically similar to both the input and the reference; however, no paired data can satisfy that dual-similarity in a supervised fashion, so we build up a self-supervised framework to serve the training purpose. We improve the quality and diversity of the outputs by employing non-local blocks and a multi-task architecture. We assess the proposed method through extensive qualitative and quantitative evaluations and also present comparisons with several state-of-the-art models.
47.Learning Category Correlations for Multi-label Image Recognition with Graph Networks ⬇️
Multi-label image recognition is a task that predicts a set of object labels in an image. As the objects co-occur in the physical world, it is desirable to model label dependencies. Existing methods resort to either recurrent networks or pre-defined label correlation graphs for this purpose. In this paper, instead of using a pre-defined graph which is inflexible and may be sub-optimal for multi-label classification, we propose the A-GCN, which leverages the popular Graph Convolutional Networks with an Adaptive label correlation graph to model label dependencies. Specifically, we introduce a plug-and-play Label Graph (LG) module to learn label correlations with word embeddings, and then utilize traditional GCN to map this graph into label-dependent object classifiers which are further applied to image features. The basic LG module incorporates two 1x1 convolutional layers and uses the dot product to generate label graphs. In addition, we propose a sparse correlation constraint to enhance the LG module and also explore different LG architectures. We validate our method on two diverse multi-label datasets: MS-COCO and Fashion550K. Experimental results show that our A-GCN significantly improves baseline methods and achieves performance superior or comparable to the state of the art.
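The sketch below shows one plausible reading of such an LG module: two 1x1 convolutions project label word embeddings, and their dot product yields a label-label adjacency matrix. The embedding size, hidden size, and the row-wise softmax normalization are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class LabelGraph(nn.Module):
    """Adaptive label-correlation graph in the spirit of the LG module above:
    two 1x1 convolutions over per-class word embeddings, then a dot product.
    Dimensions and normalization are illustrative assumptions.
    """

    def __init__(self, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.proj_a = nn.Conv1d(embed_dim, hidden_dim, kernel_size=1)
        self.proj_b = nn.Conv1d(embed_dim, hidden_dim, kernel_size=1)

    def forward(self, word_embeddings):
        # word_embeddings: (num_labels, embed_dim), e.g. one GloVe vector per class
        x = word_embeddings.t().unsqueeze(0)      # (1, embed_dim, num_labels)
        a = self.proj_a(x).squeeze(0)             # (hidden_dim, num_labels)
        b = self.proj_b(x).squeeze(0)
        graph = a.t() @ b                         # (num_labels, num_labels) correlations
        return torch.softmax(graph, dim=-1)       # row-normalized adjacency for the GCN


adj = LabelGraph()(torch.randn(80, 300))          # e.g. 80 MS-COCO labels
```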
48.Distributed Iterative Gating Networks for Semantic Segmentation ⬇️
In this paper, we present a canonical structure for controlling information flow in neural networks with an efficient feedback routing mechanism based on a strategy of Distributed Iterative Gating (DIGNet). The structure of this mechanism derives from a strong conceptual foundation and presents a light-weight mechanism for adaptive control of computation similar to recurrent convolutional neural networks by integrating feedback signals with a feed-forward architecture. In contrast to other RNN formulations, DIGNet generates feedback signals in a cascaded manner that implicitly carries information from all the layers above. This cascaded feedback propagation by means of the propagator gates is found to be more effective compared to other feedback mechanisms that use feedback from the output of either the corresponding stage or from the previous stage. Experiments reveal the high degree of capability that this recurrent approach with cascaded feedback presents over feed-forward baselines and other recurrent models for pixel-wise labeling problems on three challenging datasets, PASCAL VOC 2012, COCO-Stuff, and ADE20K.
49.A closer look at network resolution for efficient network design ⬇️
There is growing interest in designing lightweight neural networks for mobile and embedded vision applications. Previous works typically reduce computations from the structure level. For example, group convolution based methods reduce computations by factorizing a vanilla convolution into depth-wise and point-wise convolutions. Pruning based methods prune redundant connections in the network structure. In this paper, we explore the importance of network input for achieving optimal accuracy-efficiency trade-off. Reducing input scale is a simple yet effective way to reduce computational cost. It does not require careful network module design, specific hardware optimization and network retraining after pruning. Moreover, different input scales contain different representations to learn. We propose a framework to mutually learn from different input resolutions and network widths. With the shared knowledge, our framework is able to find better width-resolution balance and capture multi-scale representations. It achieves consistently better ImageNet top-1 accuracy over US-Net under different computation constraints, and outperforms the best compound scale model of EfficientNet by 1.5%. The superiority of our framework is also validated on COCO object detection and instance segmentation as well as transfer learning.
50.Visual Explanation for Deep Metric Learning ⬇️
This work explores visual explanation for deep metric learning and its applications. As an important problem for representation learning, metric learning has attracted much attention recently, while the interpretation of such models is not as well studied as that of classification models. To this end, we propose an intuitive idea that shows which regions contribute the most to the overall similarity of two input images by decomposing the final activation. Instead of only providing the overall activation map of each image, we propose to generate point-to-point activation intensity between two images so that the relationship between different regions is uncovered. We show that the proposed framework can be directly deployed to a large range of metric learning applications and provides valuable information for understanding the model. Furthermore, our experiments show its effectiveness on two potential applications, i.e. cross-view pattern discovery and interactive retrieval.
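A simplified take on such a decomposition is sketched below: when embeddings are global-average-pooled feature maps compared by a dot product, the overall similarity splits exactly into contributions from every spatial location pair. The GAP assumption and the lack of any normalization are simplifications, not the paper's exact formulation.

```python
import torch


def pointwise_similarity(feat_a, feat_b):
    """Decompose an embedding dot-product similarity into point-to-point
    contributions between two conv feature maps (a hedged sketch).

    feat_a, feat_b: (C, H, W) feature maps whose global-average-pooled vectors
    are compared by a dot product.
    """
    c, h, w = feat_a.shape
    a = feat_a.reshape(c, h * w)                  # (C, HW)
    b = feat_b.reshape(c, h * w)
    # With GAP embeddings: <mean_i a_i, mean_j b_j> = (1/HW^2) * sum_ij <a_i, b_j>,
    # so each spatial pair (i, j) owns one entry of this matrix.
    contrib = a.t() @ b / (h * w) ** 2            # (HW, HW) point-to-point intensities
    overall = contrib.sum()                       # equals the embedding similarity
    return contrib, overall
```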
51.Towards Object Detection from Motion ⬇️
We present a novel approach to weakly supervised object detection. Instead of annotated images, our method only requires two short videos to learn to detect a new object: 1) a video of a moving object and 2) one or more "negative" videos of the scene without the object. The key idea of our algorithm is to train the object detector to produce physically plausible object motion when applied to the first video and to not detect anything in the second video. With this approach, our method learns to locate objects without any object location annotations. Once the model is trained, it performs object detection on single images. We evaluate our method in three robotics settings that afford learning objects from motion: observing moving objects, watching demonstrations of object manipulation, and physically interacting with objects (see a video summary at this https URL).
52.Video Skimming: Taxonomy and Comprehensive Survey ⬇️
Video skimming, also known as dynamic video summarization, generates a temporally abridged version of a given video. Skimming can be achieved by identifying significant components either in uni-modal or multi-modal features extracted from the video. Being dynamic in nature, video skimming, through temporal connectivity, allows better understanding of the video from its summary. Having this obvious advantage, recently, video skimming has drawn the focus of many researchers benefiting from the easy availability of the required computing resources. In this paper, we provide a comprehensive survey on video skimming focusing on the substantial amount of literature from the past decade. We present a taxonomy of video skimming approaches, and discuss their evolution highlighting key advances. We also provide a study on the components required for the evaluation of a video skimming performance.
53.ViLiVO: Virtual LiDAR-Visual Odometry for an Autonomous Vehicle with a Multi-Camera System ⬇️
In this paper, we present a multi-camera visual odometry (VO) system for an autonomous vehicle. Our system mainly consists of a virtual LiDAR and a pose tracker. We use a perspective transformation method to synthesize a surround-view image from undistorted fisheye camera images. With a semantic segmentation model, the free space can be extracted. The scans of the virtual LiDAR are generated by discretizing the contours of the free space. As for the pose tracker, we propose a visual odometry system fusing both the feature matching and the virtual LiDAR scan matching results. Only those feature points located in the free space area are utilized to ensure the 2D-2D matching for pose estimation. Furthermore, bundle adjustment (BA) is performed to minimize the feature points reprojection error and scan matching error. We apply our system to an autonomous vehicle equipped with four fisheye cameras. The testing scenarios include an outdoor parking lot as well as an indoor garage. Experimental results demonstrate that our system achieves more robust and accurate performance compared with a fisheye camera based monocular visual odometry system.
54.EPOSIT: An Absolute Pose Estimation Method for Pinhole and Fish-Eye Cameras ⬇️
This paper presents a generic 6DOF camera pose estimation method, which can be used for both the pinhole camera and the fish-eye camera. Different from existing methods, relative positions of 3D points rather than absolute coordinates in the world coordinate system are employed in our method, and it has a unique solution. The application scope of POSIT (Pose from Orthography and Scaling with Iteration) algorithm is generalized to fish-eye cameras by combining with the radially symmetric projection model. The image point relationship between the pinhole camera and the fish-eye camera is derived based on their projection model. The general pose expression which fits for different cameras can be acquired by four noncoplanar object points and their corresponding image points. Accurate estimation results are calculated iteratively. Experimental results on synthetic and real data show that the pose estimation results of our method are more stable and accurate than state-of-the-art methods. The source code is available at this https URL.
55.Handwritten Amharic Character Recognition Using a Convolutional Neural Network ⬇️
Amharic is the official language of the Federal Democratic Republic of Ethiopia. There are many historic Amharic and Ethiopic handwritten documents addressing various relevant issues, including governance, science, religion, social rules, culture and art, which constitute very rich indigenous knowledge. The Amharic language has its own alphabet, derived from Ge'ez, which is currently the liturgical language in Ethiopia. Handwritten character recognition for non-Latin scripts like Amharic has hardly been addressed, especially with the advantages of state-of-the-art techniques. This research work designs, for the first time, a model for Amharic handwritten character recognition using a convolutional neural network. The dataset was organized from collected sample handwritten documents and data augmentation was applied for machine learning. The model was further enhanced using multi-task learning from the relationships of the characters. Promising results are observed from the latter model, which can further be applied to word prediction.
56.DashNet: A Hybrid Artificial and Spiking Neural Network for High-speed Object Tracking ⬇️
Computer-science-oriented artificial neural networks (ANNs) have achieved tremendous success in a variety of scenarios via powerful feature extraction and high-precision data operations. It is well known, however, that ANNs usually suffer from expensive processing resources and costs. In contrast, neuroscience-oriented spiking neural networks (SNNs) are promising for energy-efficient information processing, benefiting from event-driven spike activities; however, they have yet to be shown to achieve impressive effectiveness on real, complicated tasks. How to combine the advantages of these two model families is an open question of great interest. Two significant challenges need to be addressed: (1) the lack of benchmark datasets including both ANN-oriented (frames) and SNN-oriented (spikes) signal resources; (2) the difficulty in jointly processing the synchronous activations from ANNs and the event-driven spikes from SNNs. In this work, we propose a hybrid paradigm, named DashNet, to demonstrate the advantages of combining ANNs and SNNs in a single model. A simulator and benchmark dataset NFS-DAVIS are built, and a temporal complementary filter (TCF) and attention module are designed to address the two mentioned challenges, respectively. In this way, it is shown that DashNet achieves a record-breaking speed of 2083 FPS on neuromorphic chips and the best tracking performance on the NFS-DAVIS and PRED18 datasets. To the best of our knowledge, DashNet is the first framework that can integrate and process ANNs and SNNs in a hybrid paradigm, which provides a novel solution to achieve both effectiveness and efficiency for high-speed object tracking.
57.A weakly supervised adaptive triplet loss for deep metric learning ⬇️
We address the problem of distance metric learning in visual similarity search, defined as learning an image embedding model which projects images into Euclidean space where semantically and visually similar images are closer and dissimilar images are further from one another. We present a weakly supervised adaptive triplet loss (ATL) capable of capturing fine-grained semantic similarity that encourages the learned image embedding models to generalize well on cross-domain data. The method uses weakly labeled product description data to implicitly determine fine-grained semantic classes, avoiding the need to annotate large amounts of training data. We evaluate on the Amazon fashion retrieval benchmark and the DeepFashion in-shop retrieval data. The method boosts the performance of the triplet loss baseline by 10.6% on cross-domain data and outperforms the state-of-the-art model on all evaluation metrics.
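For readers unfamiliar with the objective being adapted, the sketch below is a standard triplet loss with a per-triplet margin hook. How the margin is derived from the weak product-description labels in the paper is not reproduced; the margin is simply assumed to be precomputed per triplet.

```python
import torch
import torch.nn.functional as F


def adaptive_triplet_loss(anchor, positive, negative, margin):
    """Triplet loss with a per-triplet margin (a hedged sketch of the general
    shape an "adaptive" triplet objective can take, not the paper's ATL).

    anchor / positive / negative: (B, D) embeddings; margin: (B,) tensor.
    """
    d_pos = F.pairwise_distance(anchor, positive)   # distance anchor-positive, (B,)
    d_neg = F.pairwise_distance(anchor, negative)   # distance anchor-negative, (B,)
    return F.relu(d_pos - d_neg + margin).mean()    # hinge: push negatives beyond margin
```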
58.Unsupervised Segmentation of Fire and Smoke from Infra-Red Videos ⬇️
This paper proposes a vision-based fire and smoke segmentation system which uses spatial, temporal and motion information to extract the desired regions from the video frames. The fusion of information is done using multiple features such as optical flow, divergence and intensity values. These features extracted from the images are used to segment the pixels into different classes in an unsupervised way. A comparative analysis is done by using multiple clustering algorithms for segmentation. Here the Markov Random Field performs more accurately than other segmentation algorithms since it characterizes the spatial interactions of pixels using a finite number of parameters. It builds a probabilistic image model that selects the most likely labeling using the maximum a posteriori (MAP) estimation. This unsupervised approach is tested on various images and achieves a frame-wise fire detection rate of 95.39%. Hence this method can be used for early detection of fire in real time and it can be incorporated into an indoor or outdoor surveillance system.
59.6D Pose Estimation with Correlation Fusion ⬇️
6D object pose estimation is widely applied in robotic tasks such as grasping and manipulation. Prior methods using RGB-only images are vulnerable to heavy occlusion and poor illumination, so it is important to complement them with depth information. However, existing methods using RGB-D data don't adequately exploit consistent and complementary information between two modalities. In this paper, we present a novel method to effectively consider the correlation within and across RGB and depth modalities with attention mechanism to learn discriminative multi-modal features. Then, effective fusion strategies for intra- and inter-correlation modules are explored to ensure efficient information flow between RGB and depth. To the best of our knowledge, this is the first work to explore effective intra- and inter-modality fusion in 6D pose estimation and experimental results show that our method can help achieve the state-of-the-art performance on LineMOD and YCB-Video datasets as well as benefit robot grasping task.
60.Responsible Facial Recognition and Beyond ⬇️
Facial recognition is changing the way we live in and interact with our society. Here we discuss the two sides of facial recognition, summarizing potential risks and current concerns. We introduce current policies and regulations in different countries. Very importantly, we point out that the risks and concerns are not only from facial recognition, but also realistically very similar to other biometric recognition technology, including but not limited to gait recognition, iris recognition, fingerprint recognition, voice recognition, etc. To create a responsible future, we discuss possible technological moves and efforts that should be made to keep facial recognition (and biometric recognition in general) developing for social good.
61.Subtractive Perceptrons for Learning Images: A Preliminary Report ⬇️
In recent years, artificial neural networks have achieved tremendous success for many vision-based tasks. However, this success remains within the paradigm of \emph{weak AI} where networks, among others, are specialized for just one given task. The path toward \emph{strong AI}, or Artificial General Intelligence, remains rather obscure. One factor, however, is clear, namely that the feed-forward structure of current networks is not a realistic abstraction of the human brain. In this preliminary work, some ideas are proposed to define a \textit{subtractive Perceptron} (s-Perceptron), a graph-based neural network that delivers a more compact topology to learn one specific task. In this preliminary study, we test the s-Perceptron with the MNIST dataset, a commonly used image archive for digit recognition. The proposed network achieves excellent results compared to the benchmark networks that rely on more complex topologies.
62.BUDA.ART: A Multimodal Content-Based Analysis and Retrieval System for Buddha Statues ⬇️
We introduce BUDA.ART, a system designed to assist researchers in Art History to explore and analyze an archive of pictures of Buddha statues. The system combines different CBIR and classical retrieval techniques to assemble 2D pictures, 3D statue scans and meta-data, focused on the Buddha facial characteristics. We build the system from an archive of 50,000 Buddhism pictures, identify unique Buddha statues, extract contextual information, and provide a specific facial embedding to first index the archive. The system allows for mobile, on-site search and exploration of similarities of statues in the archive. In addition, we provide search visualization and 3D analysis of the statues.
63.Self-Paced Video Data Augmentation with Dynamic Images Generated by Generative Adversarial Networks ⬇️
There is an urgent need for an effective video classification method by means of a small number of samples. The deficiency of samples could be effectively alleviated by generating samples through Generative Adversarial Networks (GAN), but the generation of videos of a typical category remains underexplored since the complex actions and the changeable viewpoints are difficult to simulate. In this paper, we propose a generative data augmentation method for the temporal stream of the Temporal Segment Networks with the dynamic image. The dynamic image compresses the motion information of a video into a still image, removing interference factors such as the background. Thus it is easier to generate images with categorical motion information using GAN. We use the generated dynamic images to enhance the features, with regularization achieved as well, thereby achieving the effect of video augmentation. In order to deal with the uneven quality of generated images, we propose a Self-Paced Selection (SPS) method, which automatically selects the high-quality generated samples to be added to the network training. Our method is verified on two benchmark datasets, HMDB51 and UCF101. The experimental results show that the method can improve the accuracy of video classification under the circumstances of sample insufficiency and sample imbalance.
64.Toward Robust Image Classification ⬇️
Neural networks are frequently used for image classification, but can be vulnerable to misclassification caused by adversarial images. Attempts to make neural network image classification more robust have included variations on preprocessing (cropping, applying noise, blurring), adversarial training, and dropout randomization. In this paper, we implemented a model for adversarial detection based on a combination of two of these techniques: dropout randomization with preprocessing applied to images within a given Bayesian uncertainty. We evaluated our model on the MNIST dataset, using adversarial images generated using Fast Gradient Sign Method (FGSM), Jacobian-based Saliency Map Attack (JSMA) and Basic Iterative Method (BIM) attacks. Our model achieved an average adversarial image detection accuracy of 97%, with an average image classification accuracy, after discarding images flagged as adversarial, of 99%. Our average detection accuracy exceeded that of recent papers using similar techniques.
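The dropout-randomization ingredient described above can be sketched as Monte-Carlo dropout: run the classifier several times with dropout kept active and treat prediction variance as uncertainty. The preprocessing step, the number of passes, and the flagging threshold are assumptions here, not the authors' exact pipeline.

```python
import torch


def mc_dropout_uncertainty(model, image, passes=30):
    """Monte-Carlo-dropout uncertainty estimate for a single image (a sketch).

    image: (C, H, W) tensor. Returns the mean class distribution and a scalar
    uncertainty; a caller would flag the input as adversarial when the
    uncertainty exceeds a chosen threshold (threshold not specified here).
    """
    model.train()  # keeps dropout sampling active (note: also affects batch norm;
                   # a real implementation would enable only the dropout layers)
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(image.unsqueeze(0)), dim=-1).squeeze(0)
            for _ in range(passes)
        ])                                       # (passes, num_classes)
    mean_prob = probs.mean(dim=0)
    uncertainty = probs.var(dim=0).sum().item()  # total predictive variance
    return mean_prob, uncertainty
```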
65.End-to-End Deep Residual Learning with Dilated Convolutions for Myocardial Infarction Detection and Localization ⬇️
In this report, I investigate the use of end-to-end deep residual learning with dilated convolutions for myocardial infarction (MI) detection and localization from electrocardiogram (ECG) signals. Although deep residual learning has already been applied to MI detection and localization, I propose a more accurate system that distinguishes among a higher number (i.e., six) of MI locations. Inspired by speech waveform processing with neural networks, I found a front-end more robust than directly arranging the multi-lead ECG signal into an input matrix: a single one-dimensional convolutional layer per ECG lead that extracts a pseudo-time-frequency representation and creates a compact and discriminative input feature volume. As a result, I end up with a system achieving an MI detection and localization accuracy of 99.99% on the well-known Physikalisch-Technische Bundesanstalt (PTB) database.
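A minimal sketch of such a per-lead front-end is given below: one 1-D convolution per lead, producing a filters-by-frames representation for each channel. The kernel size, stride, filter count and lead count are assumptions, and the dilated residual back-end of the paper is omitted.

```python
import torch
import torch.nn as nn


class PerLeadFrontEnd(nn.Module):
    """One 1-D convolutional layer per ECG lead (illustrative hyperparameters)."""

    def __init__(self, num_leads=12, filters=32, kernel=64, stride=8):
        super().__init__()
        self.leads = nn.ModuleList([
            nn.Conv1d(1, filters, kernel_size=kernel, stride=stride)
            for _ in range(num_leads)
        ])

    def forward(self, ecg):
        # ecg: (batch, num_leads, samples) raw multi-lead signal
        per_lead = [conv(ecg[:, i : i + 1, :]) for i, conv in enumerate(self.leads)]
        # stack into a (batch, leads, filters, frames) pseudo-time-frequency volume
        return torch.stack(per_lead, dim=1)


features = PerLeadFrontEnd()(torch.randn(2, 12, 5000))
```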
66.Historical and Modern Features for Buddha Statue Classification ⬇️
While Buddhism has spread along the Silk Roads, many pieces of art have been displaced. Only a few experts can identify these works, subject to their experience. The construction of Buddha statues was taught through the definition of canon rules, but the application of those rules varies greatly across time and space. Automatic art analysis aims at supporting these challenges. We propose to automatically recover the proportions induced by the construction guidelines, in order to use them and compare them with different deep learning features for several classification tasks, on a medium-sized but rich dataset of Buddha statues collected with experts of Buddhist art history.
67.HR-CAM: Precise Localization of Pathology Using Multi-level Learning in CNNs ⬇️
We propose a CNN based technique that aggregates feature maps from its multiple layers that can localize abnormalities in greater detail as well as predict the pathology under consideration. Existing class activation mapping (CAM) techniques extract feature maps from either the final layer or a single intermediate layer to create the discriminative maps and then interpolate to upsample to the original image resolution. In this case, the subject-specific localization is coarse and is unable to capture subtle abnormalities. To mitigate this, our method builds a novel CNN based discriminative localization model that we call high resolution CAM (HR-CAM), which accounts for layers from each resolution, therefore facilitating a comprehensive map that can delineate the pathology for each subject by combining low-level, intermediate as well as high-level features from the CNN. Moreover, our model directly provides the discriminative map in the resolution of the original image, facilitating finer delineation of abnormalities. We demonstrate the working of our model on simulated abnormalities data where we illustrate how the model captures finer details in the final discriminative maps as compared to current techniques. We then apply this technique: (1) to classify ependymomas from grade IV glioblastoma on T1-weighted contrast enhanced (T1-CE) MRI and (2) to predict Parkinson's disease from neuromelanin sensitive MRI. In all these cases we demonstrate that our model not only predicts pathologies with high accuracies, but also creates clinically interpretable subject specific high resolution discriminative localizations. Overall, the technique can be generalized to any CNN and carries high relevance in a clinical setting.
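The sketch below illustrates the general idea of combining class-activation maps from several depths into one full-resolution map. How the per-layer maps are weighted and combined in HR-CAM is an assumption here (a plain sum is used), as are the function and argument names.

```python
import torch
import torch.nn.functional as F


def multi_layer_cam(feature_maps, class_weights, image_size):
    """Aggregate CAMs from several CNN layers into one map at image resolution.

    feature_maps: list of (C_l, H_l, W_l) tensors from different depths.
    class_weights: list of (C_l,) weight vectors for the target class.
    image_size: (H, W) of the original image.
    """
    cam = torch.zeros(image_size)
    for fmap, w in zip(feature_maps, class_weights):
        layer_cam = torch.einsum('c,chw->hw', w, fmap)        # weighted channel sum
        layer_cam = F.interpolate(layer_cam[None, None], size=image_size,
                                  mode='bilinear', align_corners=False)[0, 0]
        cam = cam + layer_cam                                  # accumulate across depths
    return F.relu(cam)                                         # keep positive evidence only
```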
68.A Lightweight Deep Learning Model for Human Activity Recognition on Edge Devices ⬇️
Human Activity Recognition (HAR) using wearable and mobile sensors has gained momentum in the last few years in various fields such as healthcare, surveillance, education and entertainment. Nowadays, Edge Computing has emerged to reduce communication latency and network traffic. Edge devices are resource-constrained and cannot support heavy computation. In the literature, various models have been developed for HAR. In recent years, deep learning algorithms have shown high performance in HAR, but these algorithms require a lot of computation, making them inefficient to deploy on edge devices. This paper proposes a lightweight deep learning model for HAR requiring less computational power, making it suitable for deployment on edge devices. The performance of the proposed model is tested on participants' data covering six daily activities. Results show that the proposed model outperforms many of the existing machine learning and deep learning techniques.
69.Semantic and Visual Similarities for Efficient Knowledge Transfer in CNN Training ⬇️
In recent years, representation learning approaches have disrupted many multimedia computing tasks. Among those approaches, deep convolutional neural networks (CNNs) have notably reached human-level expertise on some constrained image classification tasks. Nonetheless, training CNNs from scratch for a new task or simply new data turns out to be complex and time-consuming. Recently, transfer learning has emerged as an effective methodology for adapting pre-trained CNNs to new data and classes, by only retraining the last classification layer. This paper focuses on improving this process, in order to better transfer knowledge between CNN architectures for faster training in the case of fine-tuning for image classification. This is achieved by combining and transferring supplementary weights, based on similarity considerations between source and target classes. The study includes a comparison between semantic and content-based similarities, and highlights increased initial performance and training speed, along with superior long-term performance when limited training samples are available.
70.Student Engagement Detection Using Emotion Analysis, Eye Tracking and Head Movement with Machine Learning ⬇️
With the increase of distance learning in general, and e-learning in particular, having a system capable of determining the engagement of students is of paramount importance, and one of the biggest challenges, for teachers, researchers and policy makers alike. Here, we present a system to detect the engagement level of the students. It uses only information provided by the typical built-in web-camera present in a laptop computer, and was designed to work in real time. We combine information about the movements of the eyes and head, and facial emotions to produce a concentration index with three classes of engagement: "very engaged", "nominally engaged" and "not engaged at all". The system was tested in a typical e-learning scenario, and the results show that it correctly identifies each period of time where students were "very engaged", "nominally engaged" and "not engaged at all". Additionally, the results also show that the students with the best scores also have higher concentration indexes.
71.Graph Neural Networks for Image Understanding Based on Multiple Cues: Group Emotion Recognition and Event Recognition as Use Cases ⬇️
A graph neural network (GNN) for image understanding based on multiple cues is proposed in this paper. Compared to traditional feature and decision fusion approaches that neglect the fact that features can interact and exchange information, the proposed GNN is able to pass information among features extracted from different models. Two image understanding tasks, namely group-level emotion recognition (GER) and event recognition, which are highly semantic and require the interaction of several deep models to synthesize multiple cues, were selected to validate the performance of the proposed method. It is shown through experiments that the proposed method achieves state-of-the-art performance on the selected image understanding tasks. In addition, a new group-level emotion recognition database is introduced and shared in this paper.
72.Deeply Matting-based Dual Generative Adversarial Network for Image and Document Label Supervision ⬇️
Although many methods have been proposed to deal with natural image super-resolution (SR) and achieve impressive performance, SR of text images remains poor because these methods ignore the characteristics of document images. In this paper, we propose a matting-based dual generative adversarial network (mdGAN) for document image SR. Firstly, the input image is decomposed into document text, foreground and background layers using deep image matting. Then two parallel branches are constructed to recover text boundary information and color information respectively. Furthermore, in order to improve the restoration accuracy of characters in the output image, we use the input image's corresponding ground-truth text label as extra supervision to refine the two-branch networks during training. Experiments on real text images demonstrate that our method outperforms several state-of-the-art methods quantitatively and qualitatively.
73.Beyond Top-Grasps Through Scene Completion ⬇️
Current end-to-end grasp planning methods propose grasps in the order of (milli)seconds that attain high grasp success rates on a diverse set of objects, but often by constraining the workspace to top-grasps. In this work, we present a method that allows end-to-end top grasp planning methods to generate full six-degree-of-freedom grasps using a single RGB-D view as input. This is achieved by estimating the complete shape of the object to be grasped, then simulating different viewpoints of the object, passing the simulated viewpoints to an end-to-end grasp generation method, and finally executing the overall best grasp. The method was experimentally validated on a Franka Emika Panda by comparing 429 grasps generated by the state-of-the-art Fully Convolutional Grasp Quality CNN, both on simulated and real camera viewpoints. The results show statistically significant improvements in terms of grasp success rate when using simulated viewpoints over real camera viewpoints, especially when the real camera viewpoint is angled.
74.A Quotient Space Formulation for Statistical Analysis of Graphical Data ⬇️
Complex analyses involving multiple, dependent random quantities often lead to graphical models: a set of nodes denoting variables of interest, and corresponding edges denoting statistical interactions between nodes. To develop statistical analyses for graphical data, one needs mathematical representations and metrics for matching and comparing graphs, and other geometrical tools, such as geodesics, means, and covariances, on representation spaces of graphs. This paper utilizes a quotient structure to develop efficient algorithms for computing these quantities, leading to useful statistical tools, including principal component analysis, linear dimension reduction, and analytical statistical modeling. The efficacy of this framework is demonstrated using datasets taken from several problem areas, including alphabets, video summaries, social networks, and biochemical structures.
75.Meta Reinforcement Learning for Sim-to-real Domain Adaptation ⬇️
Modern reinforcement learning methods suffer from low sample efficiency and unsafe exploration, making it infeasible to train robotic policies entirely on real hardware. In this work, we propose to address the problem of sim-to-real domain transfer by using meta learning to train a policy that can adapt to a variety of dynamic conditions, and using a task-specific trajectory generation model to provide an action space that facilitates quick exploration. We evaluate the method by performing domain adaptation in simulation and analyzing the structure of the latent space during adaptation. We then deploy this policy on a KUKA LBR 4+ robot and evaluate its performance on a task of hitting a hockey puck to a target. Our method shows more consistent and stable domain adaptation than the baseline, resulting in better overall performance.
76.Interpreting Distortions in Dimensionality Reduction by Superimposing Neighbourhood Graphs ⬇️
To perform visual data exploration, many dimensionality reduction methods have been developed. These tools allow data analysts to represent multidimensional data in a 2D or 3D space, while preserving as much relevant information as possible. Yet, they cannot preserve all structures simultaneously and they induce some unavoidable distortions. Hence, many criteria have been introduced to evaluate a map's overall quality, mostly based on the preservation of neighbourhoods. Such global indicators are currently used to compare several maps, which helps to choose the most appropriate mapping method and its hyperparameters. However, those aggregated indicators tend to hide the local repartition of distortions. Thereby, they need to be supplemented by local evaluation to ensure correct interpretation of maps. In this paper, we describe a new method, called MING, for `Map Interpretation using Neighbourhood Graphs'. It offers a graphical interpretation of pairs of map quality indicators, as well as local evaluation of the distortions. This is done by displaying on the map the nearest neighbours graphs computed in the data space and in the embedding. Shared and unshared edges exhibit reliable and unreliable neighbourhood information conveyed by the mapping. By this means, analysts may determine whether proximity (or remoteness) of points on the map faithfully represents similarity (or dissimilarity) of original data, within the meaning of a chosen map quality criterion. We apply this approach to two pairs of widespread indicators: precision/recall and trustworthiness/continuity, chosen for their wide use in the community, which will allow easy handling by users.
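The core construction, comparing neighbourhood graphs built in the data space and in the embedding, can be sketched as below. The choice of k, the use of Euclidean distances, and the omission of the plotting step are assumptions for illustration only.

```python
from sklearn.neighbors import NearestNeighbors


def neighbourhood_edges(data, embedding, k=10):
    """Build k-NN graphs in the original space and in the 2-D embedding and
    split their edges into shared (reliable) and unshared (distorted) sets.

    data: (n, d) array; embedding: (n, 2) array of map coordinates.
    """
    def knn_edges(points):
        idx = NearestNeighbors(n_neighbors=k + 1).fit(points).kneighbors(
            points, return_distance=False)
        # drop the first column (each point is its own nearest neighbour)
        return {(i, j) for i, row in enumerate(idx) for j in row[1:]}

    data_edges = knn_edges(data)
    map_edges = knn_edges(embedding)
    shared = data_edges & map_edges      # neighbourhoods the map preserves
    unshared = data_edges ^ map_edges    # missed or false neighbours, i.e. distortions
    return shared, unshared
```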
77.A Topological Nomenclature for 3D Shape Analysis in Connectomics ⬇️
An essential task in nano-scale connectomics is the morphology analysis of neurons and organelles like mitochondria to shed light on their biological properties. However, these biological objects often have tangled parts or complex branching patterns, which makes it hard to abstract, categorize, and manipulate their morphology. Here we propose a topological nomenclature to name these objects like chemical compounds for neuroscience analysis. To this end, we convert the volumetric representation into a topology-preserving reduced graph, develop nomenclature rules for pyramidal neurons and mitochondria from the reduced graph, and learn the feature embedding for shape manipulation. In ablation studies, we show that the proposed reduced graph extraction method yields graphs better in accord with the perception of experts. On 3D shape retrieval and decomposition tasks, we show that the encoded topological nomenclature features achieve better results than state-of-the-art shape descriptors. To advance neuroscience, we will release a 3D mesh dataset of mitochondria and pyramidal neurons reconstructed from a 100 µm cube electron microscopy (EM) volume. Code is publicly available at this https URL.
78.Efficient Bimanual Manipulation Using Learned Task Schemas ⬇️
We address the problem of effectively composing skills to solve sparse-reward tasks in the real world. Given a set of parameterized skills (such as exerting a force or doing a top grasp at a location), our goal is to learn policies that invoke these skills to efficiently solve such tasks. Our insight is that for many tasks, the learning process can be decomposed into learning a state-independent task schema (a sequence of skills to execute) and a policy to choose the parameterizations of the skills in a state-dependent manner. For such tasks, we show that explicitly modeling the schema's state-independence can yield significant improvements in sample efficiency for model-free reinforcement learning algorithms. Furthermore, these schemas can be transferred to solve related tasks, by simply re-learning the parameterizations with which the skills are invoked. We find that doing so enables learning to solve sparse-reward tasks on real-world robotic systems very efficiently. We validate our approach experimentally over a suite of robotic bimanual manipulation tasks, both in simulation and on real hardware. See videos at this http URL .
79.Geometric Brain Surface Network For Brain Cortical Parcellation ⬇️
A large number of surface-based analyses on brain imaging data adopt some specific brain atlases to better assess structural and functional changes in one or more brain regions. In these analyses, it is necessary to obtain an anatomically correct surface parcellation scheme in an individual brain by referring to the given atlas. Traditional ways to accomplish this goal are through a designed surface-based registration or hand-crafted surface features, although both of them are time-consuming. A recent deep learning approach depends on a regular spherical parameterization of the mesh, which is computationally prohibitive in some cases and may also demand further post-processing to refine the network output. Therefore, an accurate and fully-automatic cortical surface parcellation scheme directly working on the original brain surfaces would be highly advantageous. In this study, we propose an end-to-end deep brain cortical parcellation network, called DBPN. Through intrinsic and extrinsic graph convolution kernels, DBPN dynamically deciphers neighborhood graph topology around each vertex and encodes the deciphered knowledge into node features. Eventually, a non-linear mapping between the node features and parcellation labels is constructed. Our model is a two-stage deep network which contains a coarse parcellation network with a U-shape structure and a refinement network to fine-tune the coarse results. We evaluate our model on a large public dataset and our work achieves superior performance to state-of-the-art baseline methods in both accuracy and efficiency.
80.Coarse-to-Fine Registration of Airborne LiDAR Data and Optical Imagery on Urban Scenes ⬇️
Applications based on synergistic integration of optical imagery and LiDAR data are receiving a growing interest from the remote sensing community. However, a misaligned integration between these datasets may fail to fully exploit the potential of both sensors. In this regard, an optimal fusion of optical imagery and LiDAR data requires an accurate registration. This is a complex problem since a versatile solution is still missing, especially when considering the context where data are collected at different times, from different platforms, under different acquisition configurations. This paper presents a coarse-to-fine registration method of aerial/satellite optical imagery with airborne LiDAR data acquired in such a context. Firstly, a coarse registration involves extracting and matching of buildings from LiDAR data and optical imagery. Then, a Mutual Information-based fine registration is carried out. It involves a super-resolution approach applied to LiDAR data, and a local approach of transformation model estimation. The proposed method succeeds at overcoming the challenges associated with the aforementioned difficult context. Considering the experimented airborne LiDAR (2011) and orthorectified aerial imagery (2016) datasets, their spatial shift is reduced by 48.15% after the proposed coarse registration. Moreover, the incompatibility of size and spatial resolution is addressed by the mentioned super-resolution. Finally, a high accuracy of dataset alignment is also achieved, highlighted by a 40-cm error based on a check-point assessment and a 64-cm error based on a check-pair-line assessment. These promising results enable further research for a complete versatile fusion methodology between airborne LiDAR and optical imagery data in this challenging context.
81.X-ray and Visible Spectra Circular Motion Images Dataset ⬇️
We present collections of images of the same rotating plastic object made in the X-ray and visible spectra. Both parts of the dataset contain 400 images. The images are made every 0.5 degrees of the object's axial rotation. The collection of images is designed for evaluation of the performance of circular motion estimation algorithms, as well as for studying the influence of the X-ray modality on image analysis algorithms such as keypoint detection and description. The dataset is available at this https URL.
82.Towards Multimodal Understanding of Passenger-Vehicle Interactions in Autonomous Vehicles: Intent/Slot Recognition Utilizing Audio-Visual Data ⬇️
Understanding passenger intents from spoken interactions and car's vision (both inside and outside the vehicle) are important building blocks towards developing contextual dialog systems for natural interactions in autonomous vehicles (AV). In this study, we continued exploring AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling certain multimodal passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly considering available three modalities (language/text, audio, video) and trigger the appropriate functionality of the AV system. We had collected a multimodal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via realistic scavenger hunt game. In our previous explorations, we experimented with various RNN-based models to detect utterance-level intents (set destination, change route, go faster, go slower, stop, park, pull over, drop off, open door, and others) along with intent keywords and relevant slots (location, position/direction, object, gesture/gaze, time-guidance, person) associated with the action to be performed in our AV scenarios. In this recent work, we propose to discuss the benefits of multimodal understanding of in-cabin utterances by incorporating verbal/language input (text and speech embeddings) together with the non-verbal/acoustic and visual input from inside and outside the vehicle (i.e., passenger gestures and gaze from in-cabin video stream, referred objects outside of the vehicle from the road view camera stream). Our experimental results outperformed text-only baselines and with multimodality, we achieved improved performances for utterance-level intent detection and slot filling.
83.Using machine learning to construct velocity fields from OH-PLIF images ⬇️
This work utilizes data-driven methods to morph a series of time-resolved experimental OH-PLIF images into corresponding three-component planar PIV fields in the closed domain of a premixed swirl combustor. The task is carried out with a fully convolutional network, which is a type of convolutional neural network (CNN) used in many applications in machine learning, alongside an existing experimental dataset which consists of simultaneous OH-PLIF and PIV measurements in both attached and detached flame regimes. Two types of models are compared: 1) a global CNN which is trained using images from the entire domain, and 2) a set of local CNNs, which are trained only on individual sections of the domain. The locally trained models show improvement in creating mappings in the detached regime over the global models. A comparison between model performance in attached and detached regimes shows that the CNNs are much more accurate across the board in creating velocity fields for attached flames. Inclusion of time history in the PLIF input resulted in small noticeable improvement on average, which could imply a greater physical role of instantaneous spatial correlations in the decoding process over temporal dependencies from the perspective of the CNN. Additionally, the performance of local models trained to produce mappings in one section of the domain is tested on other, unexplored sections of the domain. Interestingly, local CNN performance on unseen domain regions revealed the models' ability to utilize symmetry and antisymmetry in the velocity field. Ultimately, this work shows the powerful ability of the CNN to decode the three-dimensional PIV fields from input OH-PLIF images, providing a potential groundwork for a very useful tool for experimental configurations in which accessibility of forms of simultaneous measurements are limited.
84.Interpretations are useful: penalizing explanations to align neural networks with prior knowledge ⬇️
For an explanation of a deep learning model to be effective, it must provide both insight into a model and suggest a corresponding action in order to achieve some objective. Too often, the litany of proposed explainable deep learning methods stop at the first step, providing practitioners with insight into a model, but no way to act on it. In this paper, we propose contextual decomposition explanation penalization (CDEP), a method which enables practitioners to leverage existing explanation methods in order to increase the predictive accuracy of deep learning models. In particular, when shown that a model has incorrectly assigned importance to some features, CDEP enables practitioners to correct these errors by directly regularizing the provided explanations. Using explanations provided by contextual decomposition (CD) (Murdoch et al., 2018), we demonstrate the ability of our method to increase performance on an array of toy and real datasets.
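The general recipe of penalizing explanations can be sketched as an augmented loss: the task loss plus a term that punishes importance assigned to features the practitioner marks as irrelevant. In the sketch below, plain input-gradient saliency stands in for the paper's contextual decomposition scores, and the mask and weighting are assumptions, so this is the general idea rather than CDEP itself.

```python
import torch
import torch.nn.functional as F


def penalized_loss(model, x, y, forbidden_mask, lam=1.0):
    """Task loss plus a penalty on attribution over features the model should ignore.

    forbidden_mask: same shape as x, with 1 where the model should NOT rely on the input.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)
    # input-gradient saliency as a stand-in attribution (not contextual decomposition)
    saliency = torch.autograd.grad(task_loss, x, create_graph=True)[0]
    penalty = (saliency.abs() * forbidden_mask).sum()
    return task_loss + lam * penalty   # backprop through both terms during training
```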
85.Imagine That! Leveraging Emergent Affordances for Tool Synthesis in Reaching Tasks ⬇️
In this paper we investigate an artificial agent's ability to perform task-focused tool synthesis via imagination. Our motivation is to explore the richness of information captured by the latent space of an object-centric generative model -- and how to exploit it. In particular, our approach employs activation maximisation of a task-based performance predictor to optimise the latent variable of a structured latent-space model in order to generate tool geometries appropriate for the task at hand. We evaluate our model using a novel dataset of synthetic reaching tasks inspired by the cognitive sciences and behavioural ecology. In doing so we examine the model's ability to imagine tools for increasingly complex scenario types, beyond those seen during training. Our experiments demonstrate that the synthesis process modifies emergent, task-relevant object affordances in a targeted and deliberate way: the agents often specifically modify aspects of the tools which relate to meaningful (yet implicitly learned) concepts such as a tool's length, width and configuration. Our results therefore suggest that task relevant object affordances are implicitly encoded as directions in a structured latent space shaped by experience.
86.DSRGAN: Explicitly Learning Disentangled Representation of Underlying Structure and Rendering for Image Generation without Tuple Supervision ⬇️
We focus on explicitly learning disentangled representation for natural image generation, where the underlying spatial structure and the rendering on the structure can be independently controlled respectively, yet using no tuple supervision. The setting is significant since tuple supervision is costly and sometimes even unavailable. However, the task is highly unconstrained and thus ill-posed. To address this problem, we propose to introduce an auxiliary domain which shares a common underlying-structure space with the target domain, and we make a partially shared latent space assumption. The key idea is to encourage the partially shared latent variable to represent the similar underlying spatial structures in both domains, while the two domain-specific latent variables will be unavoidably arranged to present renderings of two domains respectively. This is achieved by designing two parallel generative networks with a common Progressive Rendering Architecture (PRA), which constrains both generative networks' behaviors to model shared underlying structure and to model spatially dependent relation between rendering and underlying structure. Thus, we propose DSRGAN (GANs for Disentangling Underlying Structure and Rendering) to instantiate our method. We also propose a quantitative criterion (the Normalized Disentanglability) to quantify disentanglability. Comparison to the state-of-the-art methods shows that DSRGAN can significantly outperform them in disentanglability.
87.Robust Data Association for Object-level Semantic SLAM ⬇️
Simultaneous localization and mapping (SLAM) in a real indoor environment is still a challenging task. Traditional SLAM approaches rely heavily on low-level geometric constraints like corners or lines, which may lead to tracking failure in textureless surroundings or a cluttered world with dynamic objects. In this paper, a compact semantic SLAM framework is proposed; by jointly utilizing both geometric and object-level semantic constraints, a more consistent mapping result and more accurate pose estimation can be obtained. Two main contributions are presented in the paper: a) a robust and efficient SLAM data association and optimization framework is proposed that models both discrete semantic labeling and continuous pose; b) a compact map representation, combining a 2D LiDAR map with object detection, is presented. Experiments on public indoor datasets, TUM-RGBD and ICL-NUIM, and our own collected datasets demonstrate improved SLAM robustness and accuracy compared to other popular SLAM systems, while maintaining map maintenance efficiency.
88.Predicting Responses to a Robot's Future Motion using Generative Recurrent Neural Networks ⬇️
Robotic navigation through crowds or herds requires the ability to both predict the future motion of nearby individuals and understand how these predictions might change in response to a robot's future action. State of the art trajectory prediction models using Recurrent Neural Networks (RNNs) do not currently account for a planned future action of a robot, and so cannot predict how an individual will move in response to a robot's planned path. We propose an approach that adapts RNNs to use a robot's next planned action as an input alongside the current position of nearby individuals. This allows the model to learn the response of individuals with regards to a robot's motion from real world observations. By linking a robot's actions to the response of those around it in training, we show that we are able to not only improve prediction accuracy in close range interactions, but also to predict the likely response of surrounding individuals to simulated actions. This allows the use of the model to simulate state transitions, without requiring any assumptions on agent interaction. We apply this model to varied datasets, including crowds of pedestrians interacting with vehicles and bicycles, and livestock interacting with a robotic vehicle.
89.Re-learning of Child Model for Misclassified data by using KL Divergence in AffectNet: A Database for Facial Expression ⬇️
AffectNet contains more than 1,000,000 facial images which are manually annotated for the presence of eight discrete facial expressions and the intensity of valence and arousal. The adaptive structural learning method of DBN (Adaptive DBN) is positioned as a top deep learning model in classification capability on some large image benchmark databases. A Convolutional Neural Network and Adaptive DBN were trained on AffectNet and their classification capability was compared; Adaptive DBN showed a higher classification ratio. However, the model was not able to classify some test cases correctly because human emotions contain many ambiguous features or patterns leading to wrong answers, which can even become a source of adversarial examples, since two or more annotators give different subjective judgments for an image. In order to distinguish such cases, this paper investigated a re-learning model of Adaptive DBN with two or more child models, where the original trained model can be seen as a parent model and new child models are generated for some misclassified cases. In addition, an appropriate child model was generated according to the difference between two models by using KL divergence. The generated child models showed better performance in classifying the two emotion categories 'Disgust' and 'Anger'.
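The divergence measurement referred to above can be sketched as the KL divergence between the class distributions predicted by the parent and a candidate child model on the same batch of samples. How that value is then used to decide when to spawn a child model is not specified here and is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F


def model_output_kl(parent_logits, child_logits):
    """Average KL divergence between parent and child predicted class distributions.

    parent_logits, child_logits: (B, num_classes) logits on the same batch,
    e.g. the misclassified samples of the parent model.
    """
    log_p_child = F.log_softmax(child_logits, dim=-1)   # child distribution (log-probs)
    p_parent = F.softmax(parent_logits, dim=-1)         # parent distribution (probs)
    # F.kl_div expects log-probabilities first; this returns KL(parent || child)
    return F.kl_div(log_p_child, p_parent, reduction='batchmean')
```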
90.An Object Detection by using Adaptive Structural Learning of Deep Belief Network ⬇️
Deep learning forms a hierarchical network structure for representation of multiple input features. The adaptive structural learning method of Deep Belief Network (DBN) can realize a high classification capability while searching the optimal network structure during training. The method can find the optimal number of hidden neurons for given input data in a Restricted Boltzmann Machine (RBM) by a neuron generation-annihilation algorithm. Moreover, it can generate a new hidden layer in the DBN by the layer generation algorithm to actualize a deep data representation. The proposed method showed higher classification accuracy for image benchmark data sets than several deep learning methods, including well-known CNN methods. In this paper, a new object detection method for the DBN architecture is proposed for localization and categorization of objects. The task is to find semantic objects in images as Bounding Boxes (B-Boxes). To investigate the effectiveness of the proposed method, the adaptive structural learning of DBN and the object detection were evaluated on the Chest X-ray image benchmark data set (CXR8), which is one of the most commonly accessible radiological examinations for many lung diseases. The proposed method showed higher performance for both classification (more than 94.5% classification accuracy for test data) and localization (more than 90.4% detection for test data) than the other CNN methods.
91.Lane Attention: Predicting Vehicles' Moving Trajectories by Learning Their Attention over Lanes ⬇️
Accurately forecasting the future movements of surrounding vehicles is essential for safe and efficient operations of autonomous driving cars. This task is difficult because a vehicle's moving trajectory is greatly determined by its driver's intention, which is often hard to estimate. By leveraging attention mechanisms along with long short-term memory (LSTM) networks, this work learns the relation between a driver's intention and the vehicle's changing positions relative to road infrastructures, and uses it to guide the prediction. Different from other state-of-the-art solutions, our work treats the on-road lanes as non-Euclidean structures, unfolds the vehicle's moving history to form a spatio-temporal graph, and uses methods from Graph Neural Networks to solve the problem. Not only is our approach a pioneering attempt in using non-Euclidean methods to process static environmental features around a predicted object, our model also outperforms other state-of-the-art models in several metrics. The practicability and interpretability analysis of the model shows great potential for large-scale deployment in various autonomous driving systems in addition to our own.
92.Strong Baseline Defenses Against Clean-Label Poisoning Attacks ⬇️
Targeted clean-label poisoning is a type of adversarial attack on machine learning systems in which the adversary injects a few correctly-labeled, minimally-perturbed samples into the training data, causing the deployed model to misclassify a particular test sample during inference. Although defenses have been proposed for general poisoning attacks (those which aim to reduce overall test accuracy), no reliable defense for clean-label attacks has been demonstrated, despite the attacks' effectiveness and their realistic use cases. In this work, we propose a set of simple yet highly effective defenses against these attacks. We test our proposed approach against two recently published clean-label poisoning attacks, both of which use the CIFAR-10 dataset. After reproducing their experiments, we demonstrate that our defenses detect over 99% of the poisoning examples in both attacks and remove them without any compromise in model performance. Our simple defenses show that current clean-label poisoning attack strategies can be annulled, and they serve as a strong but simple-to-implement baseline defense against which to test future clean-label poisoning attacks.
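The abstract does not spell out the defenses, so the sketch below shows one simple defense in this spirit: filtering training points whose deep-feature nearest neighbors mostly disagree with their label. The feature extractor, the value of k, and the majority-vote rule are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_poison_filter(features, labels, k=10):
    """features: (N, D) penultimate-layer activations of the training set;
    labels: (N,) training labels. Returns a boolean mask of samples to keep."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nbrs.kneighbors(features)          # idx[:, 0] is the point itself
    keep = np.empty(len(labels), dtype=bool)
    for i, neigh in enumerate(idx[:, 1:]):
        # Keep the sample only if its label agrees with at least half of
        # its k nearest neighbors in feature space.
        keep[i] = np.mean(labels[neigh] == labels[i]) >= 0.5
    return keep

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 32))
labs = rng.integers(0, 10, size=100)
mask = knn_poison_filter(feats, labs)
print(mask.sum(), "of", len(mask), "samples kept")
```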
93.Pixel-Wise PolSAR Image Classification via a Novel Complex-Valued Deep Fully Convolutional Network ⬇️
Although complex-valued (CV) neural networks have shown better classification results than their real-valued (RV) counterparts for polarimetric synthetic aperture radar (PolSAR) classification, the extension of pixel-level RV networks to the complex domain has not yet been thoroughly examined. This paper presents a novel complex-valued deep fully convolutional neural network (CV-FCN) designed for PolSAR image classification. Specifically, CV-FCN uses PolSAR CV data, which includes the phase information, and utilizes the deep FCN architecture to perform pixel-level labeling, integrating the feature extraction module and the classification module in a unified framework. To account for the particularities of PolSAR data, a dedicated complex-valued weight initialization scheme, which considers the distribution of polarization data, is defined so that CV-FCN can be trained from scratch efficiently. CV-FCN employs a complex downsampling-then-upsampling scheme to extract dense features. To enrich the discriminative information, multi-level CV features that retain more polarization information are extracted via the complex downsampling scheme. A complex upsampling scheme is then proposed to predict dense CV labeling; it employs complex max-unpooling layers to capture more spatial information and improve robustness to speckle noise. In addition, to achieve faster convergence and more precise classification results, a novel average cross-entropy loss function is derived for CV-FCN optimization. Experiments on real PolSAR datasets demonstrate that CV-FCN achieves better classification performance than other state-of-the-art methods.
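A minimal sketch of the basic complex-valued convolution building block, implemented as two real convolutions via (a+ib)(c+id) = (ac-bd) + i(ad+bc); channel counts are illustrative and this is far from the full CV-FCN.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x_r, x_i):
        # Real and imaginary parts of the complex convolution output.
        y_r = self.conv_r(x_r) - self.conv_i(x_i)
        y_i = self.conv_r(x_i) + self.conv_i(x_r)
        return y_r, y_i

# A PolSAR patch with, e.g., 6 complex-valued channels.
layer = ComplexConv2d(6, 16)
real, imag = layer(torch.randn(1, 6, 32, 32), torch.randn(1, 6, 32, 32))
print(real.shape, imag.shape)  # (1, 16, 32, 32) each
```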
94.Test-Time Training for Out-of-Distribution Generalization ⬇️
We introduce a general approach, called test-time training, for improving the performance of predictive models when test and training data come from different distributions. Test-time training turns a single unlabeled test instance into a self-supervised learning problem, on which we update the model parameters before making a prediction on the test sample. We show that this simple idea leads to surprising improvements on diverse image classification benchmarks aimed at evaluating robustness to distribution shifts. Theoretical investigations on a convex model reveal helpful intuitions for when we can expect our approach to help.
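A minimal sketch of the test-time training loop, assuming a shared backbone with a main classification head and a rotation-prediction self-supervised head; the architecture and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
main_head = nn.Linear(16, 10)   # e.g. a CIFAR-10 classifier head
ssl_head = nn.Linear(16, 4)     # predicts rotation in {0, 90, 180, 270} degrees

def test_time_train(x, steps=5, lr=1e-3):
    """x: (1, 3, H, W) single unlabeled test image."""
    opt = torch.optim.SGD(list(backbone.parameters()) + list(ssl_head.parameters()), lr=lr)
    for _ in range(steps):
        # Build the 4 rotated copies and their rotation labels.
        rotated = torch.cat([torch.rot90(x, k, dims=(2, 3)) for k in range(4)])
        labels = torch.arange(4)
        loss = F.cross_entropy(ssl_head(backbone(rotated)), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Predict on the same test image with the adapted backbone.
    with torch.no_grad():
        return main_head(backbone(x)).argmax(dim=1)

print(test_time_train(torch.randn(1, 3, 32, 32)))
```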
95.Policy Message Passing: A New Algorithm for Probabilistic Graph Inference ⬇️
A general graph-structured neural network architecture operates on graphs through two core components: (1) complex enough message functions; (2) a fixed information aggregation process. In this paper, we present the Policy Message Passing algorithm, which takes a probabilistic perspective and reformulates the whole information aggregation as stochastic sequential processes. The algorithm works on a much larger search space, utilizes reasoning history to perform inference, and is robust to noisy edges. We apply our algorithm to multiple complex graph reasoning and prediction tasks and show that our algorithm consistently outperforms state-of-the-art graph-structured models by a significant margin.
96.Plasmodium Detection Using Simple CNN and Clustered GLCM Features ⬇️
Malaria is a serious disease caused by the Plasmodium parasite, which is transmitted through the bite of a female Anopheles mosquito and invades human erythrocytes. Malaria must be recognized precisely in order to treat the patient in time and to prevent further spread of infection. The standard diagnostic technique, microscopic examination, is inefficient: the quality of the diagnosis depends on the quality of the blood smears and on the experience of microscopists in classifying and counting infected and non-infected cells. Convolutional Neural Networks (CNNs) are a class of deep learning models that automate feature engineering and learn effective features, which can make them very effective in diagnosing malaria. This study proposes an intelligent system based on a simple CNN for detecting malaria parasites in images of thin blood smears. The CNN model obtained a high sensitivity of 97% and a relatively high PPV of 81%. This study also proposes a false-positive reduction method that clusters features extracted from the gray-level co-occurrence matrix (GLCM) of the Regions of Interest (ROIs). Adding the GLCM features can significantly reduce false positives; however, this technique requires manual setup of silhouette and Euclidean distance limits to ensure cluster quality so that sensitivity is not adversely affected.
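A minimal sketch of extracting GLCM texture features from a candidate ROI with scikit-image (graycomatrix/graycoprops; spelled greycomatrix/greycoprops in versions before 0.19); the chosen distances, angles, and properties are illustrative assumptions.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(roi_gray):
    """roi_gray: 2-D uint8 array (a grayscale ROI cropped around a candidate cell)."""
    glcm = graycomatrix(roi_gray, distances=[1, 2], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

roi = (np.random.rand(48, 48) * 255).astype(np.uint8)
print(glcm_features(roi).shape)  # 4 properties x 2 distances x 2 angles = 16 values
```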
97.Wasserstein-2 Generative Networks ⬇️
Modern generative learning is mainly associated with Generative Adversarial Networks (GANs). Training such networks is always hard due to the minimax nature of the optimization objective. In this paper we propose a novel algorithm for training generative models that does away with the minimax GAN objective, thus significantly simplifying model training. The proposed algorithm uses a variational approximation of Wasserstein-2 distances by Input Convex Neural Networks. We also provide the results of computational experiments, which confirm the efficiency of our algorithm when applied to latent-space optimal transport and image-to-image style transfer.
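A minimal sketch of an Input Convex Neural Network, the building block the abstract relies on: hidden-to-hidden weights are kept non-negative and activations are convex and non-decreasing, so the output is convex in the input. Layer sizes are illustrative, and the full Wasserstein-2 training procedure is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    def __init__(self, dim=2, hidden=64, n_layers=3):
        super().__init__()
        self.input_layers = nn.ModuleList(
            [nn.Linear(dim, hidden) for _ in range(n_layers)] + [nn.Linear(dim, 1)])
        self.hidden_layers = nn.ModuleList(
            [nn.Linear(hidden, hidden, bias=False) for _ in range(n_layers - 1)]
            + [nn.Linear(hidden, 1, bias=False)])

    def forward(self, x):
        z = F.softplus(self.input_layers[0](x))
        for inp, hid in zip(self.input_layers[1:], self.hidden_layers):
            # Clamp hidden-to-hidden weights to be non-negative so the
            # composition stays convex in x.
            w = hid.weight.clamp(min=0.0)
            z = F.softplus(inp(x) + F.linear(z, w))
        return z  # scalar convex potential per input point

potential = ICNN()
print(potential(torch.randn(5, 2)).shape)  # torch.Size([5, 1])
```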
98.Regression Planning Networks ⬇️
Recent learning-to-plan methods have shown promising results on planning directly from observation space. Yet, their ability to plan for long-horizon tasks is limited by the accuracy of the prediction model. On the other hand, classical symbolic planners show remarkable capabilities in solving long-horizon tasks, but they require predefined symbolic rules and symbolic states, restricting their real-world applicability. In this work, we combine the benefits of these two paradigms and propose a learning-to-plan method that can directly generate a long-term symbolic plan conditioned on high-dimensional observations. We borrow the idea of regression (backward) planning from the classical planning literature and introduce Regression Planning Networks (RPN), a neural network architecture that plans backward starting at a task goal and generates a sequence of intermediate goals that reaches the current observation. We show that our model not only inherits many favorable traits from symbolic planning, e.g., the ability to solve previously unseen tasks, but can also learn from visual inputs in an end-to-end manner. We evaluate the capabilities of RPN in a grid world environment and a simulated 3D kitchen environment featuring complex visual scenes and long task horizons, and show that it achieves near-optimal performance on completely new task instances.
99.Implicit Discriminator in Variational Autoencoder ⬇️
Recently, generative models have focused on combining the advantages of variational autoencoders (VAEs) and generative adversarial networks (GANs) to achieve good reconstruction and generative abilities. In this work we introduce a novel hybrid architecture, Implicit Discriminator in Variational Autoencoder (IDVAE), that combines a VAE and a GAN without needing an explicit discriminator network. The fundamental premise of the IDVAE architecture is that the encoder of a VAE and the discriminator of a GAN utilize common features and can therefore be trained as a shared network, while the decoder of the VAE and the generator of the GAN can be combined into a single network. This results in a simple two-tier architecture that has the properties of both a VAE and a GAN. Qualitative and quantitative experiments on real-world benchmark datasets demonstrate that IDVAE performs better than state-of-the-art hybrid approaches. We experimentally validate that IDVAE can easily be extended to work in a conditional setting and demonstrate its performance on complex datasets.
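A minimal sketch of the weight-sharing idea: one network provides both the VAE encoder and the GAN discriminator (via a small extra head), and one network serves as both decoder and generator. Layer sizes and the MLP layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedEncoderDiscriminator(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=32):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.real_fake = nn.Linear(h_dim, 1)   # discriminator head on shared features

    def forward(self, x):
        h = self.features(x)
        return self.mu(h), self.logvar(h), torch.sigmoid(self.real_fake(h))

class SharedDecoderGenerator(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)

enc, dec = SharedEncoderDiscriminator(), SharedDecoderGenerator()
x = torch.rand(8, 784)
mu, logvar, d_real = enc(x)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
x_rec = dec(z)
_, _, d_fake = enc(x_rec)                                  # same net scores the reconstruction
print(x_rec.shape, d_real.shape, d_fake.shape)
```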
100.A Dual Camera System for High Spatiotemporal Resolution Video Acquisition ⬇️
This paper presents a dual camera system for high spatiotemporal resolution (HSTR) video acquisition, where one camera shoots a video with high spatial resolution and low frame rate (HSR-LFR) and another one captures a low spatial resolution and high frame rate (LSR-HFR) video. Our main goal is to combine videos from LSR-HFR and HSR-LFR cameras to create an HSTR video. We propose an end-to-end learning framework, AWnet, mainly consisting of a FlowNet and a FusionNet that learn an adaptive weighting function in pixel domain to combine inputs in a frame recurrent fashion. To improve the reconstruction quality for cameras used in reality, we also introduce noise regularization under the same framework. Our method has demonstrated noticeable performance gains in terms of both objective PSNR measurement in simulation with different publicly available video and light-field datasets and subjective evaluation with real data captured by dual iPhone 7 and Grasshopper3 cameras. Ablation studies are further conducted to investigate and explore various aspects (such as noise regularization, camera parallax, exposure time, multiscale synthesis, etc) of our system to fully understand its capability for potential applications.
101.Feature Fusion Detector for Semantic Cognition of Remote Sensing ⬇️
The value of remote sensing images is of vital importance in many areas and needs to be refined by cognitive approaches, and remote sensing detection is an appropriate way to achieve such semantic cognition. However, this detection is challenging due to scale diversity, diversity of views, small objects, and sophisticated light and shadow backgrounds. In this article, inspired by the state-of-the-art detection framework FPN, we propose a novel approach for constructing a feature fusion module that optimizes feature context utilization in detection, and call our system LFFN (Layer-weakening Feature Fusion Network). We explore the inherent relevance of different layers to the final decision, and the incentives higher-level features provide to lower-level features. More importantly, we explore the characteristics of different backbone networks in mining basic features and exploiting the correlation among convolutional channels, and call our upgraded version advanced LFFN. In experiments on a remote sensing dataset from Google Earth, LFFN proves effective and practical for the semantic cognition of remote sensing, achieving 89% mAP, which is 4.1% higher than that of FPN. Moreover, in terms of generalization performance, LFFN achieves 79.9% mAP on VOC 2007 and 73.0% mAP on the VOC 2012 test set, and advanced LFFN obtains mAP values of 80.7% and 74.4% on VOC 2007 and 2012 respectively, outperforming the comparable state-of-the-art SSD and Faster R-CNN models.
102.Genetic Programming and Gradient Descent: A Memetic Approach to Binary Image Classification ⬇️
Image classification is an essential task in computer vision, which aims to categorise a set of images into different groups based on some visual criteria. Existing methods, such as convolutional neural networks, have been successfully utilised to perform image classification. However, such methods often require human intervention to design a model. Furthermore, such models are difficult to interpret, and it is challenging to analyse the patterns of different classes. This paper presents a hybrid (memetic) approach combining genetic programming (GP) and gradient-based optimisation for image classification to overcome the limitations mentioned above. The performance of the proposed method is compared to a baseline version (without local search) on four binary classification image datasets to provide insight into the usefulness of local search mechanisms for enhancing the performance of GP.
103.Celeb-DF: A New Dataset for DeepFake Forensics ⬇️
AI-synthesized face swapping videos, commonly known as DeepFakes, have become an emerging problem recently. Correspondingly, there is increasing interest in developing algorithms that can detect such synthesized videos. However, existing datasets of DeepFake videos suffer from low visual quality and abundant artifacts that do not reflect the reality of synthesized videos circulated on the Internet. In this work, we present the DeepFake Forensics (Celeb-DF) dataset with synthesized videos of high visual quality for the development and evaluation of DeepFake detection algorithms. The Celeb-DF dataset is generated using a refined synthesis algorithm that reduces the visual artifacts observed in existing datasets. Based on the Celeb-DF dataset, we also benchmark existing DeepFake detection algorithms.
104.Deep neural networks for automated classification of colorectal polyps on histopathology slides: A multi-institutional evaluation ⬇️
Histological classification of colorectal polyps plays a critical role in both screening for colorectal cancer and care of affected patients. In this study, we developed a deep neural network for classification of four major colorectal polyp types on digitized histopathology slides and compared its performance to local pathologists' diagnoses at the point-of-care retrieved from corresponding pathology labs. We evaluated the deep neural network on an internal dataset of 157 histopathology slides from the Dartmouth-Hitchcock Medical Center (DHMC) in New Hampshire, as well as an external dataset of 513 histopathology slides from 24 different institutions spanning 13 states in the United States. For the internal evaluation, the deep neural network had a mean accuracy of 93.5% (95% CI 89.6%-97.4%), compared with local pathologists' accuracy of 91.4% (95% CI 87.0%-95.8%). On the external test set, the deep neural network achieved an accuracy of 85.7% (95% CI 82.7%-88.7%), significantly outperforming the accuracy of local pathologists at 80.9% (95% CI 77.5%-84.3%, p<0.05) at the point-of-care. If confirmed in clinical settings, our model could assist pathologists by improving the diagnostic efficiency, reproducibility, and accuracy of colorectal cancer screenings.
105.Encoding CT Anatomy Knowledge for Unpaired Chest X-ray Image Decomposition ⬇️
Although chest X-ray (CXR) offers a 2D projection with overlapping anatomies, it is widely used for clinical diagnosis. There is clinical evidence supporting that decomposing an X-ray image into different components (e.g., bone, lung and soft tissue) improves diagnostic value. We hereby propose a decomposition generative adversarial network (DecGAN) to anatomically decompose a CXR image using only unpaired data. We leverage the anatomy knowledge embedded in CT, which features a 3D volume with clearly visible anatomies. Our key idea is to embed CT prior decomposition knowledge into the latent space of an unpaired CXR autoencoder. Specifically, we train DecGAN with a decomposition loss, adversarial losses, cycle-consistency losses and a mask loss to guarantee that the decomposed results of the latent space preserve realistic body structures. Extensive experiments demonstrate that DecGAN provides superior unsupervised CXR bone suppression results and the feasibility of modulating CXR components by latent space disentanglement. Furthermore, we illustrate the diagnostic value of DecGAN and demonstrate that it outperforms the state-of-the-art approaches in predicting 11 out of 14 common lung diseases.
106.The impact of patient clinical information on automated skin cancer detection ⬇️
Skin cancer is one of the most common types of cancer around the world. For this reason, different approaches have been proposed over the past years to assist in detecting it. Nonetheless, most of them are based only on dermoscopy images and do not take into account the patient's clinical information. In this work, we first present a new dataset that contains clinical images of skin lesions, acquired from smartphones, together with the patients' clinical information. Next, we introduce a straightforward approach to combine the clinical data and the images using different well-known deep learning models. These models are applied to the presented dataset using only the images and then combining them with the patient clinical information. We present a comprehensive study of the impact of the clinical data on the final predictions. The results obtained by combining both sources of information show a general improvement of around 7% in balanced accuracy for all models. In addition, a statistical test indicates significant differences between the models trained with and without the clinical data. The improvement achieved shows the potential of using patient clinical information in skin cancer detection and indicates that this information can be leveraged to improve skin cancer detection systems.
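A minimal sketch of the image/clinical-data fusion described above: CNN features of the lesion image are concatenated with a short vector of clinical attributes before the classifier. The backbone, the clinical feature dimension, and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class SkinLesionNet(nn.Module):
    def __init__(self, n_clinical=8, n_classes=6):
        super().__init__()
        backbone = models.resnet18(weights=None)  # torchvision >= 0.13; older versions use pretrained=False
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()               # keep the 512-d image features
        self.backbone = backbone
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim + n_clinical, 128), nn.ReLU(),
            nn.Linear(128, n_classes))

    def forward(self, image, clinical):
        img_feat = self.backbone(image)                          # (B, 512)
        return self.classifier(torch.cat([img_feat, clinical], dim=1))

model = SkinLesionNet()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 8))
print(logits.shape)  # torch.Size([2, 6])
```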
107.Brain-wise Tumor Segmentation and Patient Overall Survival Prediction ⬇️
The past few years have witnessed the prevalence of deep learning in many application scenarios, among which is medical image processing. Diagnosis and treatment of brain tumors require delicate segmentation of the tumor as a prerequisite; however, such work conventionally costs neurosurgeons a great deal of precious time. Computer vision techniques could relieve surgeons of this tedious marking procedure. In this paper, a 3D U-Net based deep learning model is trained, with the help of brain-wise normalization and patching strategies, for the brain tumor segmentation task of the BraTS 2019 competition. Dice coefficients for the enhancing tumor, tumor core, and whole tumor are 0.737, 0.807 and 0.894 respectively on the validation dataset. Furthermore, numerical features extracted from the predicted tumor labels are used for the overall survival days prediction task, with a prediction accuracy of 0.448 on the validation dataset.
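For reference, the Dice coefficient reported above can be computed as in the sketch below; the smoothing constant is an illustrative assumption.

```python
import numpy as np

def dice_coefficient(pred, target, smooth=1e-6):
    """pred, target: binary masks of the same shape (e.g. one tumor sub-region)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + smooth) / (pred.sum() + target.sum() + smooth)

a = np.zeros((8, 8), dtype=bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), dtype=bool); b[3:7, 3:7] = True
print(dice_coefficient(a, b))  # two overlapping 4x4 squares -> about 0.56
```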
108.SegMap: Segment-based mapping and localization using data-driven descriptors ⬇️
Precisely estimating a robot's pose in a prior, global map is a fundamental capability for mobile robotics, e.g. autonomous driving or exploration in disaster zones. This task, however, remains challenging in unstructured, dynamic environments, where local features are not discriminative enough and global scene descriptors only provide coarse information. We therefore present SegMap: a map representation solution for localization and mapping based on the extraction of segments in 3D point clouds. Working at the level of segments offers increased invariance to viewpoint and local structural changes, and facilitates real-time processing of large-scale 3D data. SegMap exploits a single compact data-driven descriptor for performing multiple tasks: global localization, 3D dense map reconstruction, and semantic information extraction. The performance of SegMap is evaluated in multiple urban driving and search-and-rescue experiments. We show that the learned SegMap descriptor has superior segment retrieval capabilities compared to state-of-the-art handcrafted descriptors. As a consequence, we achieve higher localization accuracy and a 6% increase in recall over the state of the art. These segment-based localizations allow us to reduce the open-loop odometry drift by up to 50%. SegMap is available open-source, along with easy-to-run demonstrations.
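A minimal sketch of the retrieval step behind segment-based global localization: descriptors of segments in the current scan are matched against descriptors stored in the prior map with a nearest-neighbor search, yielding localization candidates for subsequent geometric verification. The descriptor dimension and k are illustrative; the actual SegMap descriptor is learned from 3D segment data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
map_descriptors = rng.standard_normal((1000, 64))      # descriptors stored in the prior map
map_positions = rng.uniform(0, 500, size=(1000, 3))     # segment centroids in the map frame

index = NearestNeighbors(n_neighbors=3).fit(map_descriptors)

def localize_candidates(query_descriptors):
    """query_descriptors: (M, 64) descriptors of segments in the current scan.
    Returns candidate map positions per query segment; a full pipeline would
    follow this with geometric verification."""
    _, idx = index.kneighbors(query_descriptors)
    return map_positions[idx]            # (M, 3 neighbors, 3 coordinates)

print(localize_candidates(rng.standard_normal((5, 64))).shape)  # (5, 3, 3)
```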