This repository has been archived by the owner on Apr 21, 2024. It is now read-only.

ArXiv cs.CV --Wed, 31 Jul 2019

1.Deblurring Face Images using Uncertainty Guided Multi-Stream Semantic Networks ⬇️

We propose a novel multi-stream architecture and training methodology that exploits semantic labels for facial image deblurring. The proposed Uncertainty Guided Multi-Stream Semantic Network (UMSN) processes regions belonging to each semantic class independently and learns to combine their outputs into the final deblurred result. Pixel-wise semantic labels are obtained using a segmentation network. A predicted confidence measure is used during training to guide the network towards challenging regions of the human face such as the eyes and nose. The entire network is trained in an end-to-end fashion. Comprehensive experiments on three different face datasets demonstrate that the proposed method achieves significant improvements over the recent state-of-the-art face deblurring methods. Code is available at: this https URL

2.Deformable Filter Convolution for Point Cloud Reasoning ⬇️

Point clouds are the native output of many real-world 3D sensors. To borrow the success of 2D convolutional network architectures, a majority of popular 3D perception models voxelize the points, which can result in a loss of local geometric details that cannot be recovered. In this paper, we propose a novel learnable convolution layer for processing 3D point cloud data directly. Instead of discretizing points into fixed voxels, we deform our learnable 3D filters to match the point cloud shape. We propose to combine voxelized backbone networks with our deformable filter layer at 1) the network input stream and 2) the output prediction layers to enhance point-level reasoning. We obtain state-of-the-art results on LiDAR semantic segmentation and produce a significant gain in performance on LiDAR object detection.

3.Pay attention to the activations: a modular attention mechanism for fine-grained image recognition ⬇️

Fine-grained image recognition is central to many multimedia tasks such as search, retrieval and captioning. Unfortunately, these tasks are still challenging since the appearance of samples of the same class can differ more than that of samples from different classes. Attention has typically been implemented in neural networks by selecting the most informative regions of the image that improve classification. In contrast, in this paper, attention is not applied at the image level but to the convolutional feature activations. In essence, with our approach, the neural model learns to attend to lower-level feature activations without requiring part annotations and uses those activations to update and rectify the output likelihood distribution. The proposed mechanism is modular, architecture-independent and efficient in terms of both parameters and computation required. Experiments demonstrate that well-known networks such as Wide Residual Networks and ResNeXt, when augmented with our approach, systematically improve their classification accuracy and become more robust to changes in deformation and pose and to the presence of clutter. As a result, our proposal reaches state-of-the-art classification accuracies in CIFAR-10, the Adience gender recognition task, Stanford Dogs, and UEC-Food100 while obtaining competitive performance in ImageNet, CIFAR-100, CUB200 Birds, and Stanford Cars. In addition, we analyze the different components of our model, showing that the proposed attention modules succeed in finding the most discriminative regions of the image. Finally, as a proof of concept, we demonstrate that with only local predictions, an augmented neural network can successfully classify an image before reaching any fully connected layer, thus reducing the amount of computation by up to 10%.

4.Grid Saliency for Context Explanations of Semantic Segmentation ⬇️

Recently, there has been a growing interest in developing saliency methods that provide visual explanations of network predictions. Still, the usability of existing methods is limited to image classification models. To overcome this limitation, we extend the existing approaches to generate grid saliencies, which provide spatially coherent visual explanations for (pixel-level) dense prediction networks. As the proposed grid saliency makes it possible to spatially disentangle the object and its context, we specifically explore its potential to produce context explanations for semantic segmentation networks, discovering which context most influences the class predictions inside a target object area. We investigate the effectiveness of grid saliency on a synthetic dataset with an artificially induced bias between objects and their context as well as on the real-world Cityscapes dataset using state-of-the-art segmentation networks. Our results show that grid saliency can be successfully used to provide easily interpretable context explanations and, moreover, can be employed for detecting and localizing contextual biases present in the data.

5.Weakly Supervised Body Part Parsing with Pose based Part Priors ⬇️

Human body part parsing refers to the task of predicting the semantic segmentation mask for each body part. Fully supervised body part parsing methods achieve good performance, but require an enormous amount of effort to annotate part masks for training. In contrast to the high annotation costs required for a limited number of part mask annotations, a large number of weak labels such as poses and full body masks already exist and contain relevant information. Motivated by the possibility of using existing weak labels, we propose the first weakly supervised body part parsing framework. The basic idea is to train a parsing network with pose-generated part priors that have blank uncertain regions on estimated boundaries, and use an iterative refinement module to generate new supervision and predictions on these regions. When sufficient extra weak supervision is available, our weakly supervised results (62.0% mIoU) on Pascal-Person-Part are comparable to the fully supervised state-of-the-art results (63.6% mIoU). Furthermore, in the extended semi-supervised setting, the proposed framework outperforms the state-of-the-art methods. In addition, we show that the proposed framework can be extended to other keypoint-supervised part parsing tasks such as face parsing.

6.Efficient Method for Categorize Animals in the Wild ⬇️

Automatic species classification in camera traps would greatly help biodiversity monitoring and species analysis around the world. To accelerate the development of automatic species classification, "Microsoft AI for Earth" prepared a challenge in the FGVC6 workshop at CVPR 2019, called the "iWildCam 2019 competition". In this work, we propose an efficient method for categorizing animals in the wild. We transfer state-of-the-art ImageNet-pretrained models to the problem. To improve the generalization and robustness of the model, we utilize efficient image augmentation and regularization strategies, such as cutout, mixup and label smoothing. Finally, we use ensemble learning to increase the performance of the model. Thanks to advanced regularization strategies and ensemble learning, we reached the top 7 of 336 places on the final leaderboard. Source code of this work is available at this https URL
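
The abstract above names mixup and label smoothing but does not give their settings. The following is a minimal sketch of the two standard techniques on plain Python lists; the function names, `eps=0.1`, and `alpha=0.2` are illustrative assumptions, not values taken from the paper.

```python
import random

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: move eps of the probability mass to a uniform distribution."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]

def mixup(x_a, y_a, x_b, y_b, lam):
    """Mixup: the same convex combination is applied to inputs and to label vectors."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

def sample_lambda(alpha=0.2, rng=random):
    """Mixing coefficient drawn from Beta(alpha, alpha); alpha is an assumed value."""
    return rng.betavariate(alpha, alpha)

# Toy example: two flattened "images" with 3-class one-hot labels.
x_mix, y_mix = mixup([0.0, 1.0], [1, 0, 0], [1.0, 0.0], [0, 1, 0], lam=0.75)
y_smooth = smooth_labels([1, 0, 0], eps=0.1)
```

In a training loop, `lam` would typically come from `sample_lambda()` once per batch, and the mixed label vector would be fed to a cross-entropy loss that accepts soft targets.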

7.SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition ⬇️

Due to the availability of large-scale skeleton datasets, 3D human action recognition has recently attracted the attention of the computer vision community. Many works have focused on encoding skeleton data as skeleton image representations based on the spatial structure of the skeleton joints, in which the temporal dynamics of the sequence are encoded as variations in columns and the spatial structure of each frame is represented as rows of a matrix. To further improve such representations, we introduce a novel skeleton image representation to be used as input to Convolutional Neural Networks (CNNs), named SkeleMotion. The proposed approach encodes the temporal dynamics by explicitly computing the magnitude and orientation values of the skeleton joints. Different temporal scales are employed to compute motion values, aggregating more temporal dynamics into the representation and making it able to capture long-range joint interactions involved in actions as well as to filter noisy motion values. Experimental results demonstrate the effectiveness of the proposed representation on 3D action recognition, outperforming the state of the art on the NTU RGB+D 120 dataset.
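
As a rough illustration of the motion features described above, the sketch below computes a magnitude and a set of orientation angles for one joint at a chosen temporal scale `d`. The abstract does not define the orientation exactly; representing it as `atan2` angles in the xy, yz and zx planes is an assumption made here, as are all names.

```python
import math

def joint_motion(traj, d=1):
    """traj: list of (x, y, z) positions of one joint over time.
    Returns, for each frame t, (magnitude, (theta_xy, theta_yz, theta_zx))
    of the displacement between frame t and frame t + d."""
    out = []
    for t in range(len(traj) - d):
        dx, dy, dz = (b - a for a, b in zip(traj[t], traj[t + d]))
        mag = math.sqrt(dx * dx + dy * dy + dz * dz)
        # Assumed orientation encoding: one angle per coordinate plane.
        angles = (math.atan2(dy, dx), math.atan2(dz, dy), math.atan2(dx, dz))
        out.append((mag, angles))
    return out

# Toy trajectory of a single joint over three frames.
traj = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 2.0)]
scale1 = joint_motion(traj, d=1)  # frame-to-frame motion
scale2 = joint_motion(traj, d=2)  # motion over a longer temporal scale
```

Stacking such values for every joint (rows) and frame (columns), once per temporal scale, would give image-like arrays of the kind a CNN can consume.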

8.Bilateral Operators for Functional Maps ⬇️

A majority of shape correspondence frameworks are based on devising pointwise and pairwise constraints on the correspondence map. The functional maps framework allows for formulating these constraints in the spectral domain. In this paper, we develop a functional map framework for the shape correspondence problem by constructing pairwise constraints using pointwise descriptors. Our core observation is that every pointwise descriptor allows for the construction of a pairwise kernel operator whose low-frequency eigenfunctions depict regions of similar descriptor values at various scales of frequency. By aggregating the pairwise information from the descriptor and the intrinsic geometry of the surface encoded in the heat kernel, we construct a hybrid kernel and call it the bilateral operator. Analogous to the edge-preserving bilateral filter in image processing, the action of the bilateral operator on a function defined over the manifold yields a descriptor-dependent local smoothing of that function. By forcing the correspondence map to commute with the bilateral operator, we show that we can maximally exploit the information from a given set of pointwise descriptors in a functional map framework.

9.Deep Learning architectures for generalized immunofluorescence based nuclear image segmentation ⬇️

Separating and labeling each instance of a nucleus (instance-aware segmentation) is the key challenge in segmenting single cell nuclei on fluorescence microscopy images. Deep Neural Networks can learn the implicit transformation of a nuclear image into a probability map indicating the class membership of each pixel (nucleus or background), but the use of post-processing steps to turn the probability map into a labeled object mask is error-prone. This holds especially for nuclear images of tissue sections and nuclear images across varying tissue preparations. In this work, we aim to evaluate the performance of state-of-the-art deep learning architectures to segment nuclei in fluorescence images of various tissue origins and sample preparation types without post-processing. We compare architectures that operate on pixel-to-pixel translation and an architecture that operates on object detection and subsequent locally applied segmentation. In addition, we propose a novel strategy to create artificial images to extend the training set. We evaluate the influence of ground truth annotation quality, image scale and segmentation complexity on segmentation performance. Results show that three out of four deep learning architectures (U-Net, U-Net with ResNet34 backbone, Mask R-CNN) can segment fluorescent nuclear images on most of the sample preparation types and tissue origins with satisfactory segmentation performance. Mask R-CNN, an architecture designed to address instance-aware segmentation tasks, outperforms other architectures. Equal nuclear mean size, consistent nuclear annotations and the use of artificially generated images result in overall acceptable precision and recall across different tissues and sample preparation types.

10.FingerNet: Pushing The Limits of Fingerprint Recognition Using Convolutional Neural Network ⬇️

Fingerprint recognition has been utilized for cellphone authentication, airport security and beyond. Many different features and algorithms have been proposed to improve fingerprint recognition. In this paper, we propose an end-to-end deep learning framework for fingerprint recognition using convolutional neural networks (CNNs) which can jointly learn the feature representation and perform recognition. We train our model on a large-scale fingerprint recognition dataset, and improve over previous approaches in terms of accuracy. Our proposed model is able to achieve a very high recognition accuracy on a well-known fingerprint dataset. We believe this framework can be widely used for biometrics recognition tasks, making more scalable and accurate systems possible. We have also used a visualization technique to highlight the important areas in an input fingerprint image, that mostly impact the recognition results.

11.Preterm infants' limb-pose estimation from depth images using convolutional neural networks ⬇️

Preterm infants' limb-pose estimation is a crucial but challenging task, which may improve patients' care and support clinicians in monitoring infants' movements. Work in the literature either provides approaches to whole-body segmentation and tracking, which, however, have poor clinical value, or retrieves limb pose a posteriori from limb segmentation, increasing computational costs and introducing sources of inaccuracy. In this paper, we address the problem of limb-pose estimation from a different point of view. We propose a 2D fully-convolutional neural network for roughly detecting limb joints and joint connections, followed by a regression convolutional neural network for accurate joint and joint-connection position estimation. Joints from the same limb are then connected with a maximum bipartite matching approach. Our analysis does not require any prior modeling of infants' body structure, nor any manual intervention. For developing and testing the proposed approach, we built a dataset of four videos (video length = 90 s) recorded with a depth sensor in a neonatal intensive care unit (NICU) during actual clinical practice, achieving median root mean square distance [pixels] of 10.790 (right arm), 10.542 (left arm), 8.294 (right leg), 11.270 (left leg) with respect to the ground-truth limb pose. The idea of estimating limb pose directly from depth images may represent a future paradigm for addressing the problem of preterm infants' movement monitoring and offer all possible support to clinicians in NICUs.
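
The joint-linking step mentioned above can be illustrated with a textbook maximum bipartite matching. The sketch below uses Kuhn's augmenting-path algorithm on an unweighted graph; whether the paper uses weighted or unweighted matching is not stated in the abstract, and the candidate edges shown are hypothetical.

```python
def max_bipartite_matching(edges, n_left, n_right):
    """Kuhn's augmenting-path algorithm.
    edges[u] lists the right-side nodes that left node u may connect to.
    Returns match_right, where match_right[v] is the left node matched
    to right node v, or -1 if v is unmatched."""
    match_right = [-1] * n_right

    def try_augment(u, seen):
        for v in edges[u]:
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can move elsewhere.
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    for u in range(n_left):
        try_augment(u, set())
    return match_right

# Hypothetical candidates: 3 detected shoulders (left) and 3 detected elbows (right).
candidate_edges = {0: [0, 1], 1: [0], 2: [1, 2]}
match = max_bipartite_matching(candidate_edges, 3, 3)
```

Here shoulder 1 can only pair with elbow 0, so the algorithm reassigns shoulder 0 to elbow 1 and every joint ends up connected.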

12.RNN-based Online Handwritten Character Recognition Using Accelerometer and Gyroscope Data ⬇️

This abstract explores an RNN-based approach to the online handwritten character recognition problem. Our method uses data from an accelerometer and a gyroscope mounted on a handheld pen-like device to train and run a character prediction model. We have built a dataset of timestamped gyroscope and accelerometer data gathered during the manual process of handwriting Latin characters, labeled with the character being written; in total, the dataset consists of 1500 gyroscope and accelerometer data sequences for 8 characters of the Latin alphabet from 6 different people, and 20 characters of the Georgian alphabet, each with 1500 samples, from 5 different people, with each sequence containing the gyroscope and accelerometer data captured during the writing of a particular character, sampled once every 10 ms. We train an RNN-based neural network architecture on this dataset to predict the character being written. The model is optimized with categorical cross-entropy loss and the RMSprop optimizer and achieves high accuracy on test data.

13.Weakly Supervised Object Localization using Min-Max Entropy: an Interpretable Framework ⬇️

Weakly supervised object localization (WSOL) models aim to locate objects of interest in an image after being trained only on data with coarse image level labels. Deep learning models for WSOL rely typically on convolutional attention maps with no constraints on the regions of interest which allows them to select any region, making them vulnerable to false positive regions. This issue occurs in many application domains, e.g., medical image analysis, where interpretability is central to the prediction. In order to improve the localization reliability, we propose a deep learning framework for WSOL with pixel level localization. It is composed of two sequential sub-networks: a localizer that localizes regions of interest; followed by a classifier that classifies them. Within its end-to-end training, we incorporate the prior knowledge stating that in an agnostic-class setup an image is more likely to contain relevant --object of interest-- and irrelevant regions --noise--. Based on the conditional entropy (CE) measured at the classifier, the localizer is driven to spot relevant regions (low CE), and irrelevant regions (high CE). Our framework is able to recover large discriminative regions using our recursive erasing algorithm that we incorporate within the backpropagation during training. Moreover, the framework handles intrinsically multi-instances. Experimental results on public datasets with medical images (GlaS colon cancer) and natural images (Caltech-UCSD Birds-200-2011) show that, compared to state of the art WSOL methods, our framework can provide significant improvements in terms of image-level classification, pixel-level localization, and robustness to overfitting when dealing with few training samples. A public reproducible PyTorch implementation is provided in: this https URL .
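
To make the low-entropy versus high-entropy idea above concrete, the sketch below computes the entropy of a classifier's output distribution and combines two such terms into a single score. The combined form (a plain difference) and the function names are assumptions for illustration; the abstract does not give the paper's actual loss.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def min_max_entropy_score(p_relevant, p_irrelevant):
    """Assumed illustrative objective: small when the classifier is confident
    on the relevant region (low entropy) and uncertain on the irrelevant
    region (high entropy)."""
    return entropy(p_relevant) - entropy(p_irrelevant)

# Confident prediction on the relevant region, uniform on the irrelevant one.
score = min_max_entropy_score([1.0, 0.0], [0.5, 0.5])
```

With two classes the score is bounded below by minus the natural logarithm of 2, which this toy example attains.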

14.Object as Distribution ⬇️

Object detection is a critical part of visual scene understanding. The representation of the object in the detection task has important implications for the efficiency and feasibility of annotation, robustness to occlusion, pose, lighting, and other visual sources of semantic uncertainty, and effectiveness in real-world applications (e.g., autonomous driving). Popular object representations include 2D and 3D bounding boxes, polygons, splines, pixels, and voxels. Each has its strengths and weaknesses. In this work, we propose a new representation of objects based on the bivariate normal distribution. This distribution-based representation has the benefit of robust detection of highly overlapping objects and the potential for improved downstream tracking and instance segmentation tasks due to the statistical representation of object edges. We provide qualitative evaluation of this representation for the object detection task and quantitative evaluation of its use in a baseline algorithm for the instance segmentation task.
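
One simple way to picture a bivariate normal object representation is to fit a mean and covariance to the pixel coordinates of an object mask. The sketch below does this with maximum-likelihood estimates; how the paper actually derives or predicts the distribution is not stated in the abstract, so this is only an assumed illustration with hypothetical names.

```python
def fit_bivariate_normal(points):
    """Maximum-likelihood mean and covariance of a list of 2D points (x, y)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

# Hypothetical object mask: a block of pixels 4 wide and 2 tall.
pixels = [(x, y) for x in range(4) for y in range(2)]
mean, cov = fit_bivariate_normal(pixels)
```

The mean gives the object centre, while the covariance captures extent and orientation in five numbers, compared with four for an axis-aligned bounding box.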

15.Improved Super-Resolution Convolution Neural Network for Large Images ⬇️

Single image super-resolution (SISR) is a very popular topic nowadays, with both research value and practical value. In daily life, we crop a large image into sub-images, apply super-resolution to each, and then merge them together. Although convolutional neural networks perform very well in research settings, when we use them for super-resolution in this way, we can easily observe cutting lines in the merged pictures. To address this problem, in this paper, we propose a refined architecture of SRCNN with 'Symmetric padding', 'Random learning' and 'Residual learning'. Moreover, extensive experiments show that our model performs best among many state-of-the-art methods.

16.Unifying Structure Analysis and Surrogate-driven Function Regression for Glaucoma OCT Image Screening ⬇️

Optical Coherence Tomography (OCT) imaging plays an important role in glaucoma diagnosis in clinical practice. Early detection and timely treatment can prevent glaucoma patients from permanent vision loss. However, few automated methods have been developed based on OCT images for glaucoma study. In this paper, we present a novel framework to effectively distinguish glaucoma OCT images from normal ones. A semi-supervised learning strategy with a smoothness assumption is applied for surrogate assignment of missing function regression labels. In addition, the proposed multi-task learning network is capable of exploring the structure and function relationship from the OCT image and visual field measurement simultaneously, which boosts classification performance. Essentially, we are the first to unify structure analysis and function regression for glaucoma screening. It is also worth noting that we build the largest glaucoma OCT image dataset, involving 4877 volumes, to develop and evaluate the proposed method. Extensive experiments demonstrate that our framework outperforms the baseline methods and two glaucoma experts by a large margin, achieving 93.2%, 93.2% and 97.8% on accuracy, F1 score and AUC, respectively.

17.Distill-to-Label: Weakly Supervised Instance Labeling Using Knowledge Distillation ⬇️

Weakly supervised instance labeling using only image-level labels, in lieu of expensive fine-grained pixel annotations, is crucial in several applications including medical image analysis. In contrast to conventional instance segmentation scenarios in computer vision, the problems that we consider are characterized by a small number of training images and non-local patterns that lead to the diagnosis. In this paper, we explore the use of multiple instance learning (MIL) to design an instance label generator under this weakly supervised setting. Motivated by the observation that an MIL model can handle bags of varying sizes, we propose to repurpose an MIL model originally trained for bag-level classification to produce reliable predictions for single instances, i.e., bags of size $1$. To this end, we introduce a novel regularization strategy based on virtual adversarial training for improving MIL training, and subsequently develop a knowledge distillation technique for repurposing the trained MIL model. Using empirical studies on colon cancer and breast cancer detection from histopathological images, we show that the proposed approach produces high-quality instance-level predictions and significantly outperforms state-of-the-art MIL methods.

18.Look Further to Recognize Better: Learning Shared Topics and Category-Specific Dictionaries for Open-Ended 3D Object Recognition ⬇️

Service robots are expected to operate effectively in human-centric environments for long periods of time. In such realistic scenarios, fine-grained object categorization is as important as basic-level object categorization. We tackle this problem by proposing an open-ended object recognition approach which concurrently learns both the object categories and the local features for encoding objects. In this work, each object is represented using a set of general latent visual topics and category-specific dictionaries. The general topics encode the common patterns of all categories, while the category-specific dictionary describes the content of each category in detail. The proposed approach discovers both sets of general and specific representations in an unsupervised fashion and updates them incrementally using new object views. Experimental results show that our approach yields significant improvements over the previous state-of-the-art approaches concerning scalability and object classification performance. Moreover, our approach demonstrates the capability of learning from very few training examples in a real-world setting. Regarding computation time, the best result was obtained with a Bag-of-Words method followed by a variant of the Latent Dirichlet Allocation approach.

19.Evaluation of Distance Measures for Feature based Image Registration using AlexNet ⬇️

Image registration is a classic problem of computer vision with several applications across areas like defence, remote sensing and medicine. Feature-based image registration methods traditionally used hand-crafted feature extraction algorithms, which detect key points in an image and describe them using a region around the point. Such features are matched using a threshold either on distances or on the ratio of distances computed between the feature descriptors. The evolution of deep learning, in particular convolutional neural networks, has enabled researchers to address several problems of vision such as recognition, tracking and localization. Outputs of the convolution layers or fully connected layers of a CNN that has been trained for applications like visual recognition have proved effective when used as features in other applications such as retrieval. In this work, a deep CNN, AlexNet, is used in place of handcrafted features for feature extraction in the first stage of image registration. However, there is a need to identify a suitable distance measure and a matching method for effective results. Several distance metrics have been evaluated in the framework of nearest neighbour and nearest neighbour ratio matching methods using a benchmark dataset. Evaluation is done by comparing matching and registration performance using metrics computed from ground truth.
Keywords: Distance measures; deep learning; feature detection; feature descriptor; image matching
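
The two matching rules named in the abstract, a threshold on the nearest-neighbour distance and a threshold on the ratio of nearest to second-nearest distance, can be sketched as follows. Euclidean distance is used as one example of the distance measures under evaluation; the thresholds and descriptor values are made up for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_features(desc_a, desc_b, dist=euclidean, max_dist=None, max_ratio=None):
    """For each descriptor in desc_a, find its nearest neighbour in desc_b.
    max_dist:  keep the match only if the nearest distance is at most this
               value (nearest neighbour matching).
    max_ratio: keep the match only if nearest / second-nearest distance is
               at most this value (nearest neighbour ratio matching)."""
    matches = []
    for i, a in enumerate(desc_a):
        ranked = sorted((dist(a, b), j) for j, b in enumerate(desc_b))
        d1, j1 = ranked[0]
        if max_dist is not None and d1 > max_dist:
            continue
        if max_ratio is not None and len(ranked) > 1:
            d2 = ranked[1][0]
            if d2 == 0 or d1 / d2 > max_ratio:
                continue
        matches.append((i, j1))
    return matches

# Toy 2D descriptors; real CNN features would have far more dimensions.
desc_a = [(0.0, 0.0), (5.0, 5.0)]
desc_b = [(0.1, 0.0), (5.0, 4.0), (5.0, 6.0)]
nn = match_features(desc_a, desc_b, max_dist=2.0)
nnr = match_features(desc_a, desc_b, max_ratio=0.8)
```

The second descriptor in `desc_a` is equally close to two candidates, so the distance threshold accepts it while the ratio test rejects it as ambiguous. Swapping `dist` for another function is how a different distance measure would be compared.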

20.Tracking Holistic Object Representations ⬇️

Recent advances in visual tracking are based on siamese feature extractors and template matching. For this category of trackers, latest research focuses on better feature embeddings and similarity measures. In this work, we focus on building holistic object representations for tracking. We propose a framework that is designed to be used on top of previous trackers without any need for further training of the siamese network. The framework leverages the idea of obtaining additional object templates during the tracking process. Since the number of stored templates is limited, our method only keeps the most diverse ones. We achieve this by providing a new diversity measure in the space of siamese features. The obtained representation contains information beyond the ground truth object location provided to the system. It is then useful for tracking itself but also for further tasks which require a visual understanding of objects. Strong empirical results on tracking benchmarks indicate that our method can improve the performance and robustness of the underlying trackers while barely reducing their speed. In addition, our method is able to match current state-of-the-art results, while using a simpler and older network architecture and running three times faster.
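
The idea of keeping only the most diverse templates in a fixed-size memory can be sketched with a greedy replacement rule. The abstract does not define the paper's diversity measure, so the sum of pairwise Euclidean distances below is a stand-in, and `update_memory` and its parameters are hypothetical names.

```python
import math
from itertools import combinations

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diversity(templates):
    """Stand-in diversity score: sum of pairwise Euclidean distances."""
    return sum(distance(a, b) for a, b in combinations(templates, 2))

def update_memory(memory, candidate, capacity):
    """Add the candidate if there is room; otherwise swap it in for the stored
    template whose replacement raises diversity the most, if any does."""
    if len(memory) < capacity:
        return memory + [candidate]
    best, best_score = memory, diversity(memory)
    for i in range(len(memory)):
        trial = memory[:i] + [candidate] + memory[i + 1:]
        score = diversity(trial)
        if score > best_score:
            best, best_score = trial, score
    return best

# Toy 2D "features" arriving over time during tracking.
memory = []
for feat in [(0.0, 0.0), (1.0, 0.0), (0.1, 0.0), (0.0, 3.0)]:
    memory = update_memory(memory, feat, capacity=2)
```

The third feature is nearly a duplicate of the first and is discarded, while the fourth replaces a stored template because it spreads the set out further.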

21.Attention Filtering for Multi-person Spatiotemporal Action Detection on Deep Two-Stream CNN Architectures ⬇️

Action detection and recognition tasks have been the target of much focus in the computer vision community due to their many applications, namely, security, robotics and recommendation systems. Recently, datasets like AVA have provided multi-person, multi-label, spatiotemporal action detection and recognition challenges. Being unable to discern which portions of the input to use for classification is a limitation of two-stream CNN approaches when the vision task involves several people with several labels. We address this limitation and improve the state-of-the-art performance of two-stream CNNs. In this paper we present four contributions: our fovea attention filtering that highlights targets for classification without discarding background; a generalized binary loss function designed for the AVA dataset; miniAVA, a partition of AVA that maintains temporal continuity and class distribution with only one tenth of the dataset size; and ablation studies on alternative attention filters. Our method, using fovea attention filtering and our generalized binary loss, achieves a relative video mAP improvement of 20% over the two-stream baseline in AVA, and is competitive with the state-of-the-art in UCF101-24. We also show a relative video mAP improvement of 12.6% when using our generalized binary loss over the standard sum-of-sigmoids.

22.EmoCo: Visual Analysis of Emotion Coherence in Presentation Videos ⬇️

Emotions play a key role in human communication and public presentations. Human emotions are usually expressed through multiple modalities. Therefore, exploring multimodal emotions and their coherence is of great value for understanding emotional expressions in presentations and improving presentation skills. However, manually watching and studying presentation videos is often tedious and time-consuming. There is a lack of tool support to help conduct an efficient and in-depth multi-level analysis. Thus, in this paper, we introduce EmoCo, an interactive visual analytics system to facilitate efficient analysis of emotion coherence across facial, text, and audio modalities in presentation videos. Our visualization system features a channel coherence view and a sentence clustering view that together enable users to obtain a quick overview of emotion coherence and its temporal evolution. In addition, a detail view and word view enable detailed exploration and comparison from the sentence level and word level, respectively. We thoroughly evaluate the proposed system and visualization techniques through two usage scenarios based on TED Talk videos and interviews with two domain experts. The results demonstrate the effectiveness of our system in gaining insights into emotion coherence in presentations.

23.Covering up bias in CelebA-like datasets with Markov blankets: A post-hoc cure for attribute prior avoidance ⬇️

Attribute prior avoidance entails subconscious or willful non-modeling of (meta)attributes that datasets are oft born with, such as the 40 semantic facial attributes associated with the CelebA and CelebA-HQ datasets. The consequences of this infirmity, we discover, are especially stark in state-of-the-art deep generative models learned on these datasets that just model the pixel-space measurements, resulting in an inter-attribute bias-laden latent space. This viscerally manifests itself when we perform face manipulation experiments based on latent vector interpolations. In this paper, we address this and propose a post-hoc solution that utilizes an Ising attribute prior learned in the attribute space and showcase its efficacy via qualitative experiments.

24.Reg R-CNN: Lesion Detection and Grading under Noisy Labels ⬇️

For the task of concurrently detecting and categorizing objects, the medical imaging community commonly adopts methods developed on natural images. Current state-of-the-art object detectors are comprised of two stages: the first stage generates region proposals, the second stage subsequently categorizes them. Unlike in natural images, however, for anatomical structures of interest such as tumors, the appearance in the image (e.g., scale or intensity) links to a malignancy grade that lies on a continuous ordinal scale. While classification models discard this ordinal relation between grades by discretizing the continuous scale to an unordered "bag of categories", regression models are trained with distance metrics, which preserve the relation. This advantage becomes all the more important in the setting of label confusions on ambiguous data sets, which is the usual case with medical images. To this end, we propose Reg R-CNN, which replaces the second-stage classification model of a current object detector with a regression model. We show the superiority of our approach on a public data set with 1026 patients and a series of toy experiments. Code will be made publicly available.

25.ISEA: Image Steganalysis using Evolutionary Algorithms ⬇️

NP-hard problems have always attracted scientists' attention and are often seen in emerging, challenging issues. One of the most interesting NP-hard problems emerging in the world of data science is the curse of dimensionality (CoD). Recently, this problem has penetrated most high-technology domains, such as advanced image processing and particularly image steganalysis. Universal and smarter steganalysis algorithms produce a huge number of attributes, which makes the data hard to process. In large data sets, finding a pattern that governs the whole data takes a long time, with no guarantee of reaching the optimal pattern. In general, the purpose of researchers in image steganalysis is to distinguish stego images from cover images. In this paper, we investigate recent works on detecting stego images, particularly those that adopt evolutionary algorithms. Our work is thus categorized as supervised learning, which considers ground truth to evaluate the performance of a given algorithm. The objective is to provide a comprehensive understanding of evolutionary algorithms that attempt to solve this NP-hard problem.

26.DANTE: Deep Affinity Network for Clustering Conversational Interactants ⬇️

We propose a data-driven approach to visually detect conversational groups by identifying spatial arrangements typical of these focused social encounters. Our approach uses a novel Deep Affinity Network (DANTE) to predict the likelihood that two individuals in a scene are part of the same conversational group, considering contextual information like the position and orientation of other nearby individuals. The predicted pair-wise affinities are then used in a graph clustering framework to identify both small (e.g., dyads) and bigger groups. The results from our evaluation on two standard benchmarks suggest that the combination of powerful deep learning methods with classical clustering techniques can improve the detection of conversational groups in comparison to prior approaches. Our technique has a wide range of applications from visual scene understanding, e.g., for surveillance, to social robotics.
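
The second stage described above, turning predicted pair-wise affinities into groups, can be sketched as follows. This is a minimal illustration, not DANTE's clustering procedure: it swaps in simple thresholding plus connected components (union-find), and the affinity values, the `threshold` parameter, and the function name `cluster_by_affinity` are assumptions.

```python
import numpy as np

def cluster_by_affinity(affinity, threshold=0.5):
    """Group individuals whose pair-wise affinity reaches a threshold.

    Uses connected components over the thresholded affinity graph,
    implemented with a small union-find structure.
    """
    n = affinity.shape[0]
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if affinity[i, j] >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical affinities for four people: two dyads.
affinity = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
groups = cluster_by_affinity(affinity)
```

With these values, persons 0 and 1 form one group and persons 2 and 3 another. A real system would replace the hand-written matrix with network predictions.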

27.Detecting Spoofing Attacks Using VGG and SincNet: BUT-Omilia Submission to ASVspoof 2019 Challenge ⬇️

In this paper, we present the system description of the joint efforts of Brno University of Technology (BUT) and Omilia -- Conversational Intelligence for the ASVSpoof2019 Spoofing and Countermeasures Challenge. The primary submission for Physical access (PA) is a fusion of two VGG networks, trained on single- and two-channel features. For Logical access (LA), our primary system is a fusion of VGG and the recently introduced SincNet architecture. The results on PA show that the proposed networks yield very competitive performance in all conditions and achieve an 86% relative improvement compared to the official baseline. On the other hand, the results on LA show that although the proposed architecture and training strategy perform very well on certain spoofing attacks, they fail to generalize to certain attacks that are unseen during training.

28.Unsupervised Separation of Dynamics from Pixels ⬇️

We present an approach to learn the dynamics of multiple objects from image sequences in an unsupervised way. We introduce a probabilistic model that first generates noisy positions for each object through a separate linear state-space model, and then renders the positions of all objects in the same image through a highly non-linear process. Such a linear representation of the dynamics enables us to propose an inference method that uses exact and efficient inference tools and that can be deployed to query the model in different ways without retraining.
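
The first generative step, a linear state-space model that emits noisy positions, can be sketched for a single object as follows. The constant-velocity transition matrix, the noise parameters, and the function name `simulate_linear_ssm` are assumptions for illustration; the abstract does not specify the model's matrices, and the non-linear rendering step is omitted.

```python
import numpy as np

def simulate_linear_ssm(z0, A, C, steps, process_std=0.0, obs_std=0.0, seed=0):
    """Roll out z_t = A z_{t-1} + noise and emit positions y_t = C z_t + noise."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    positions = []
    for _ in range(steps):
        z = A @ z + process_std * rng.standard_normal(z.shape)
        y = C @ z + obs_std * rng.standard_normal(C.shape[0])
        positions.append(y)
    return np.stack(positions)

# Assumed constant-velocity dynamics; the state is (x, y, vx, vy).
dt = 1.0
A = np.array([
    [1.0, 0.0, dt, 0.0],
    [0.0, 1.0, 0.0, dt],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
# The observation matrix keeps only the position components.
C = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])

# Noise-free rollout from the origin with velocity (1, 2).
track = simulate_linear_ssm([0.0, 0.0, 1.0, 2.0], A, C, steps=3)
```

Setting `process_std` or `obs_std` above zero yields the noisy positions the abstract refers to. Because the dynamics are linear and Gaussian, exact tools such as Kalman filtering apply to them.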

29.Watch It Twice: Video Captioning with a Refocused Video Encoder ⬇️

With the rapid growth of video data and the increasing demands of applications such as intelligent video search and assistance for visually impaired people, the video captioning task has recently received a lot of attention in the computer vision and natural language processing fields. State-of-the-art video captioning methods focus mainly on encoding temporal information, but lack effective ways to remove irrelevant temporal information and neglect spatial details. The current RNN encoding module, which reads the video in a single time order, can be influenced by irrelevant temporal information, especially when it occurs at the beginning of the encoding. In addition, neglecting spatial information leads to confusion about the relationships between words and to a loss of detail. Therefore, in this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice with the predicted key frame to avoid the irrelevant temporal information that often occurs at the beginning and the end of a video. The novel spatial features represent the spatial information in different regions of a video and enrich the details of a caption. Experiments on two benchmark datasets show superior performance of the proposed method.

30.Learned Image Downscaling for Upscaling using Content Adaptive Resampler ⬇️

Deep convolutional neural network based image super-resolution (SR) models have shown superior performance in recovering the underlying high resolution (HR) images from low resolution (LR) images obtained from predefined downscaling methods. In this paper we propose a learned image downscaling method based on a content adaptive resampler (CAR) that takes the upscaling process into account. The proposed resampler network generates content adaptive image resampling kernels that are applied to the original HR input to generate pixels on the downscaled image. Moreover, a differentiable upscaling (SR) module is employed to upscale the LR result into its underlying HR counterpart. By back-propagating the reconstruction error down to the original HR input across the entire framework to adjust model parameters, the proposed framework achieves a new state-of-the-art SR performance through upscaling guided image resamplers which adaptively preserve detailed information that is essential to the upscaling. Experimental results indicate that the quality of the generated LR image is comparable to that of the traditional interpolation based method, while a significant SR performance gain is achieved by deep SR models trained jointly with the CAR model. The code is publicly available at: this https URL.

31.Data augmentation with Symbolic-to-Real Image Translation GANs for Traffic Sign Recognition ⬇️

Traffic sign recognition is an important component of many advanced driving assistance systems, and it is required for full autonomous driving. Computational performance is usually the bottleneck in using large scale neural networks for this purpose. SqueezeNet is a good candidate for efficient image classification of traffic signs, but in our experiments it does not reach high accuracy, and we believe this is due to lack of data, requiring data augmentation. Generative adversarial networks can learn the high dimensional distribution of empirical data, allowing the generation of new data points. In this paper we apply the pix2pix GAN architecture to generate new traffic sign images and evaluate the use of these images in data augmentation. We were motivated to use pix2pix to translate symbolic sign images to real ones due to the mode collapse in Conditional GANs. Through our experiments we found that data augmentation using GANs can increase classification accuracy for circular traffic signs from 92.1% to 94.0%, and for triangular traffic signs from 93.8% to 95.3%, producing an overall improvement of 2%. However, some traditional augmentation techniques can outperform GAN data augmentation, for example contrast variation in circular traffic signs (95.5%) and displacement on triangular traffic signs (96.7%). Our negative results show that while GANs can be naively used for data augmentation, they are not always the best choice, depending on the problem and variability in the data.

32.Slot Based Image Augmentation System for Object Detection ⬇️

Object Detection has been a significant topic in computer vision. With the continuous development of Deep Learning, many advanced academic and industrial results have been built on localising and classifying target objects, such as instance segmentation, video tracking and robotic vision. As the core concept of Deep Learning, Deep Neural Networks (DNNs) and their associated training are highly integrated with task-driven modelling and strongly affect detection accuracy. The main focus of improving detection performance has been proposing DNNs with extra layers and novel topological connections to extract the desired features from input data. However, training these models can be computationally expensive and laborious because of their complicated architectures and enormous numbers of parameters. The dataset is another cause of this issue and of low detection accuracy, owing to insufficient data samples or difficult instances. To address these training difficulties, this thesis presents two different approaches to improve detection performance in a relatively lightweight way. Because deep learning is intrinsically data-driven, the first approach is "slot-based image augmentation", which enriches the dataset with extra foreground and background combinations. Compared with the commonly used image flipping method, the proposed system achieves a similar mAP improvement with fewer extra images, which decreases training time. The proposed augmentation system has extra flexibility in adapting to various scenarios, and the performance-driven analysis provides an alternative perspective on conducting image augmentation.

33.A Multi-Scale Mapping Approach Based on a Deep Learning CNN Model for Reconstructing High-Resolution Urban DEMs ⬇️

The shortage of high-resolution urban digital elevation model (DEM) datasets has been a challenge for modelling urban flood and managing its risk. A solution is to develop effective approaches to reconstruct high-resolution DEMs from their low-resolution equivalents that are more widely available. However, the current high-resolution DEM reconstruction approaches mainly focus on natural topography. Few attempts have been made for urban topography which is typically an integration of complex man-made and natural features. This study proposes a novel multi-scale mapping approach based on convolutional neural network (CNN) to deal with the complex characteristics of urban topography and reconstruct high-resolution urban DEMs. The proposed multi-scale CNN model is firstly trained using urban DEMs that contain topographic features at different resolutions, and then used to reconstruct the urban DEM at a specified (high) resolution from a low-resolution equivalent. A two-level accuracy assessment approach is also designed to evaluate the performance of the proposed urban DEM reconstruction method, in terms of numerical accuracy and morphological accuracy. The proposed DEM reconstruction approach is applied to a 121 km² urbanized area in London, UK. Compared with other commonly used methods, the current CNN based approach produces superior results, providing a cost-effective innovative method to acquire high-resolution DEMs in other data-scarce environments.

34.Safe Augmentation: Learning Task-Specific Transformations from Data ⬇️

Data augmentation is widely used as a part of the training process applied to deep learning models, especially in the computer vision domain. Currently, common data augmentation techniques are designed manually. Therefore they require expert knowledge and time. Moreover, augmentations are dataset-specific, and the optimal augmentation set for a specific dataset has limited transferability to others. We present a simple and explainable method called **Safe Augmentation** that can learn task-specific data augmentation techniques that do not change the data distribution and improve the generalization of the model. We propose to use safe augmentation in two ways: for model fine-tuning and along with other augmentation techniques. Our method is model-agnostic, easy to implement, and achieves better accuracy on CIFAR-10, CIFAR-100, SVHN, Tiny ImageNet, and Cityscapes datasets compared to baseline augmentation techniques. The code is available at this https URL.

35.Increasing Shape Bias in ImageNet-Trained Networks Using Transfer Learning and Domain-Adversarial Methods ⬇️

Convolutional Neural Networks (CNNs) have become the state-of-the-art method to learn from image data. However, recent research shows that they may include a texture and colour bias in their representation, contrary to the intuition that they learn the shapes of the image content and to human biological learning. Thus, recent works have attempted to increase the shape bias in CNNs in order to train more robust and accurate networks on tasks. One such approach uses style-transfer in order to remove texture clues from the data. This work reproduces this methodology on four image classification datasets, as well as extends the method to use domain-adversarial training in order to further increase the shape bias in the learned representation. The results show the proposed method increases the robustness and shape bias of the CNNs, while it does not provide a gain in accuracy.

36.4X4 Census Transform ⬇️

This paper proposes a 4X4 Census Transform (4X4CT) to encourage further research in computer vision and visual computing. Unlike the traditional 3X3 CT, which uses a nine-pixel kernel, the proposed 4X4CT uses a sixteen-pixel kernel with four overlapping groups of 3X3 kernel size. In each overlapping group, a reference input pixel uses its eight nearest pixels to produce an eight-bit binary string, which is convertible to a grayscale integer of the 4X4CT's output pixel. Preliminary experiments demonstrated more image textural crispness and contrast than the CT, as well as offering an alternative that enables meaningful solutions to be achieved.
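
For orientation, the traditional 3X3 census transform that the abstract uses as its baseline can be sketched as below. This is the classic transform, not the proposed 4X4CT, whose grouping details the abstract does not fully specify. The bit ordering, the "neighbour darker than centre" comparison, the edge padding, and the function name `census_transform_3x3` are assumptions.

```python
import numpy as np

def census_transform_3x3(image):
    """Classic 3x3 census transform.

    Each output pixel is an 8-bit code with one bit per neighbour, set
    when that neighbour is darker than the centre pixel.
    """
    img = np.asarray(image, dtype=np.int64)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")  # replicate borders (an assumption)
    out = np.zeros((h, w), dtype=np.uint8)
    # The eight neighbours in row-major order, skipping the centre.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for dy, dx in offsets:
        neighbour = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        # Shift the code left and append the comparison bit.
        out = (out << 1) | (neighbour < img).astype(np.uint8)
    return out

# A bright centre pixel surrounded by darker pixels gets all eight bits set.
peak = np.array([[5, 5, 5],
                 [5, 9, 5],
                 [5, 5, 5]])
codes = census_transform_3x3(peak)
```

The resulting eight-bit code per pixel is what the abstract calls a binary string convertible to a grayscale integer.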

37.CoachAI: A Project for Microscopic Badminton Match Data Collection and Tactical Analysis ⬇️

Computer vision based object tracking has been used to annotate and augment sports video. For sports learning and training, video replay is often used in post-match review and training review for tactical analysis and movement analysis. For automatic and systematic competition data collection and tactical analysis, a project called CoachAI has been supported by the Ministry of Science and Technology, Taiwan. The proposed project also includes research on data visualization, connected training auxiliary devices, and a data warehouse. Deep learning techniques will be used to develop video-based real-time microscopic competition data collection based on broadcast competition video. Machine learning techniques will be used to develop a tactical analysis. To reveal data in more understandable forms and to help in pre-match training, AR/VR techniques will be used to visualize data, tactics, and so on. In addition, training auxiliary devices including smart badminton rackets and connected serving machines will be developed based on IoT technology to further utilize competition data and tactical data and boost training efficiency. In particular, the connected serving machines will be developed to perform specified tactics and to interact with players in their training.

38.End-to-end Recurrent Multi-Object Tracking and Trajectory Prediction with Relational Reasoning ⬇️

The majority of contemporary object-tracking approaches used in autonomous vehicles do not model interactions between objects. This contrasts with the fact that objects' paths are not independent: a cyclist might abruptly deviate from a previously planned trajectory in order to avoid colliding with a car. Building upon HART, a neural, class-agnostic single-object tracker, we introduce a multi-object tracking method MOHART capable of relational reasoning. Importantly, the entire system, including the understanding of interactions and relations between objects, is class-agnostic and learned simultaneously in an end-to-end fashion. We find that the addition of relational-reasoning capabilities to HART leads to consistent performance gains in tracking as well as future trajectory prediction on several real-world datasets (MOTChallenge, UA-DETRAC, and Stanford Drone dataset), particularly in the presence of ego-motion, occlusions, crowded scenes, and faulty sensor inputs. Finally, based on controlled simulations, we propose that a comparison of MOHART and HART may be used as a novel way to measure the degree to which the objects in a video depend upon each other as they move together through time.

39.2D and 3D Segmentation of uncertain local collagen fiber orientations in SHG microscopy ⬇️

Collagen fiber orientations in bones, visible with Second Harmonic Generation (SHG) microscopy, represent the inner structure and its alteration due to influences like cancer. While analyses of these orientations are valuable for medical research, it is not feasible to analyze the needed large amounts of local orientations manually. Since we have uncertain borders for these local orientations, only rough regions can be segmented instead of a pixel-wise segmentation. We analyze the effect of these uncertain borders on human performance by a user study. Furthermore, we compare a variety of 2D and 3D methods such as classical approaches like Fourier analysis with state-of-the-art deep neural networks for the classification of local fiber orientations. We present a general way to use pretrained 2D weights in 3D neural networks, such as Inception-ResNet-3D, a 3D extension of Inception-ResNet-v2. In a 10-fold cross-validation, our two-stage segmentation based on Inception-ResNet-3D and transferred 2D ImageNet weights achieves a human-comparable accuracy.

40.Open Set Domain Adaptation for Image and Action Recognition ⬇️

Since annotating and curating large datasets is very expensive, there is a need to transfer the knowledge from existing annotated datasets to unlabelled data. Data that is relevant for a specific application, however, usually differs from publicly available datasets since it is sampled from a different domain. While domain adaptation methods compensate for such a domain shift, they assume that all categories in the target domain are known and match the categories in the source domain. Since this assumption is violated under real-world conditions, we propose an approach for open set domain adaptation where the target domain contains instances of categories that are not present in the source domain. The proposed approach achieves state-of-the-art results on various datasets for image classification and action recognition. Since the approach can be used for open set and closed set domain adaptation, as well as unsupervised and semi-supervised domain adaptation, it is a versatile tool for many applications.

41.LEAF-QA: Locate, Encode & Attend for Figure Question Answering ⬇️

We introduce LEAF-QA, a comprehensive dataset of 250,000 densely annotated figures/charts, constructed from real-world open data sources, along with ~2 million question-answer (QA) pairs querying the structure and semantics of these charts. LEAF-QA highlights the problem of multimodal QA, which is notably different from conventional visual QA (VQA), and has recently gained interest in the community. Furthermore, LEAF-QA is significantly more complex than previous attempts at chart QA, viz. FigureQA and DVQA, which present only limited variations in chart data. LEAF-QA, being constructed from real-world sources, requires a novel architecture to enable question answering. To this end, LEAF-Net, a deep architecture involving chart element localization, question and answer encoding in terms of chart elements, and an attention network is proposed. Different experiments are conducted to demonstrate the challenges of QA on LEAF-QA. The proposed architecture, LEAF-Net, also considerably advances the current state-of-the-art on FigureQA and DVQA.

42.ColorMapGAN: Unsupervised Domain Adaptation for Semantic Segmentation Using Color Mapping Generative Adversarial Networks ⬇️

Due to various reasons such as atmospheric effects and differences in acquisition, it is often the case that there exists a large difference between spectral bands of satellite images collected from different geographic locations. The large shift between spectral distributions of training and test data causes the current state-of-the-art supervised learning approaches to output poor maps. We present a novel end-to-end semantic segmentation framework that is robust to such shift. The key component of the proposed framework is Color Mapping Generative Adversarial Networks (ColorMapGAN), which can generate fake training images that are semantically exactly the same as training images, but whose spectral distribution is similar to the distribution of the test images. We then use the fake images and the ground-truth for the training images to fine-tune the already trained classifier. Contrary to the existing Generative Adversarial Networks (GAN), the generator in ColorMapGAN does not have any convolutional or pooling layers. It learns to transform the colors of the training data to the colors of the test data by performing only one element-wise matrix multiplication and one matrix addition. Thanks to the architecturally simple but powerful design of ColorMapGAN, the proposed framework outperforms the existing approaches by a large margin in terms of both accuracy and computational complexity.
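
The generator operation described above, one element-wise multiplication followed by one addition, can be sketched as a plain affine colour map. This is only an illustration of that arithmetic: the abstract does not say how the learned parameters are shaped or indexed, so the per-channel `scale` and `shift` arrays, their values, and the function name `color_map` are assumptions, and no adversarial training is shown.

```python
import numpy as np

def color_map(image, scale, shift):
    """Apply an element-wise multiply and an add, then clip to the 8-bit range."""
    mapped = image.astype(np.float32) * scale + shift
    return np.clip(mapped, 0.0, 255.0)

# Hypothetical 2x2 RGB training image with constant colour (100, 100, 100).
image = np.full((2, 2, 3), 100, dtype=np.uint8)

# Hypothetical parameters, one scale and one shift per channel.
# In a GAN these would be learned rather than written by hand.
scale = np.array([1.0, 0.5, 2.0], dtype=np.float32)
shift = np.array([10.0, 0.0, 100.0], dtype=np.float32)

mapped = color_map(image, scale, shift)
```

Because the map changes only colours and never moves pixels, the ground-truth labels of the training image remain valid for the transformed image, which is what lets the classifier be fine-tuned on the fake images.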

43.Orientation-aware Semantic Segmentation on Icosahedron Spheres ⬇️

We address semantic segmentation on omnidirectional images, to leverage a holistic understanding of the surrounding scene for applications like autonomous driving systems. For the spherical domain, several methods recently adopt an icosahedron mesh, but systems are typically rotation invariant or require significant memory and parameters, thus enabling execution only at very low resolutions. In our work, we propose an orientation-aware CNN framework for the icosahedron mesh. Our representation allows for fast network operations, as our design simplifies to standard network operations of classical CNNs, but under consideration of north-aligned kernel convolutions for features on the sphere. We implement our representation and demonstrate its memory efficiency up to a level-8 resolution mesh (equivalent to 640 x 1024 equirectangular images). Finally, since our kernels operate on the tangent of the sphere, standard feature weights, pretrained on perspective data, can be directly transferred with only small need for weight refinement. In our evaluation, our orientation-aware CNN becomes a new state of the art for the recent 2D3DS dataset, and our Omni-SYNTHIA version of SYNTHIA. Rotation invariant classification and segmentation tasks are additionally presented for comparison to prior art.

44.Towards Pure End-to-End Learning for Recognizing Multiple Text Sequences from an Image ⬇️

Here we address a challenging problem: recognizing multiple text sequences from an image by pure end-to-end learning. It is twofold: 1) Multiple text sequences recognition. Each image may contain multiple text sequences of different content, location and orientation, and we try to recognize all the text sequences contained in the image. 2) Pure end-to-end (PEE) learning. We solve the problem in a pure end-to-end learning way where each training image is labeled by only text transcripts of all contained sequences, without any geometric annotations. Most existing works recognize multiple text sequences from an image in a non-end-to-end (NEE) or quasi-end-to-end (QEE) way, in which each image is trained with both text transcripts and text locations. Only recently, a PEE method was proposed to recognize text sequences from an image where the text sequence was split to several lines in the image. However, it cannot be directly applied to recognizing multiple text sequences from an image. So in this paper, we propose a pure end-to-end learning method to recognize multiple text sequences from an image. Our method directly learns multiple sequences of probability distribution conditioned on each input image, and outputs multiple text transcripts with a well-designed decoding strategy. To evaluate the proposed method, we constructed several datasets mainly based on an existing public dataset and two real application scenarios. Experimental results show that the proposed method can effectively recognize multiple text sequences from images, and outperforms CTC-based and attention-based baseline methods.

45.An Empirical Study of Propagation-based Methods for Video Object Segmentation ⬇️

While propagation-based approaches have achieved state-of-the-art performance for video object segmentation, the literature lacks a fair comparison of different methods using the same settings. In this paper, we carry out an empirical study for propagation-based methods. We view these approaches from a unified perspective and conduct a detailed ablation study for core methods, input cues, multi-object combination and training strategies. With careful designs, our improved end-to-end memory networks achieve a global mean of 76.1 on the DAVIS 2017 val set.

46.PointHop: An Explainable Machine Learning Method for Point Cloud Classification ⬇️

An explainable machine learning method for point cloud classification, called the PointHop method, is proposed in this work. The PointHop method consists of two stages: 1) local-to-global attribute building through iterative one-hop information exchange, and 2) classification and ensembles. In the attribute building stage, we address the problem of unordered point cloud data by using a space partitioning procedure and developing a robust descriptor that characterizes the relationship between a point and its one-hop neighbor in a PointHop unit. When we put multiple PointHop units in cascade, the attributes of a point will grow by taking its relationship with one-hop neighbor points into account iteratively. Furthermore, to control the rapid dimension growth of the attribute vector associated with a point, we use the Saab transform to reduce the attribute dimension in each PointHop unit. In the classification and ensemble stage, we feed the feature vector obtained from multiple PointHop units to a classifier. We explore ensemble methods to further improve the classification performance. Experimental results show that the PointHop method offers classification performance that is comparable with state-of-the-art methods while demanding much lower training complexity.

47.Temporal Localization of Moments in Video Collections with Natural Language ⬇️

In this paper, we introduce the task of retrieving relevant video moments from a large corpus of untrimmed, unsegmented videos given a natural language query. Our task poses unique challenges as a system must efficiently identify both the relevant videos and localize the relevant moments in the videos. This task is in contrast to prior work that localizes relevant moments in a single video or searches a large collection of already-segmented videos. For our task, we introduce Clip Alignment with Language (CAL), a model that aligns features for a natural language query to a sequence of short video clips that compose a candidate moment in a video. Our approach goes beyond prior work that aggregates video features over a candidate moment by allowing for finer clip alignment. Moreover, our approach is amenable to efficient indexing of the resulting clip-level representations, which makes it suitable for moment localization in large video collections. We evaluate our approach on three recently proposed datasets for temporal localization of moments in video with natural language extended to our video corpus moment retrieval setting: DiDeMo, Charades-STA, and ActivityNet-captions. We show that our CAL model outperforms the recently proposed Moment Context Network (MCN) on all criteria across all datasets on our proposed task, obtaining an 8%-85% and 11%-47% boost for average recall and median rank, respectively, and achieves 5x faster retrieval and 8x smaller index size with a 500K video corpus.

48.Temporal Attentive Alignment for Large-Scale Video Domain Adaptation ⬇️

Although various image-based domain adaptation (DA) techniques have been proposed in recent years, domain shift in videos is still not well-explored. Most previous works only evaluate performance on small-scale datasets which are saturated. Therefore, we first propose two large-scale video DA datasets with much larger domain discrepancy: UCF-HMDB_full and Kinetics-Gameplay. Second, we investigate different DA integration methods for videos, and show that simultaneously aligning and learning temporal dynamics achieves effective alignment even without sophisticated DA methods. Finally, we propose Temporal Attentive Adversarial Adaptation Network (TA3N), which explicitly attends to the temporal dynamics using domain discrepancy for more effective domain alignment, achieving state-of-the-art performance on four video DA datasets (e.g. 7.9% accuracy gain over "Source only" from 73.9% to 81.8% on "HMDB --> UCF", and 10.3% gain on "Kinetics --> Gameplay"). The code and data are released at this http URL.

49.Statistical Descriptors-based Automatic Fingerprint Identification: Machine Learning Approaches ⬇️

Identification of a person from fingerprints of good quality has been used by commercial applications and law enforcement agencies for many years; however, identification of a person from latent fingerprints is very difficult and challenging. A latent fingerprint is a fingerprint left on a surface by deposits of oils and/or perspiration from the finger. It is not usually visible to the naked eye but may be detected with special techniques such as dusting with fine powder and then lifting the pattern of powder with transparent tape. We have evaluated the quality of machine learning techniques that have been implemented in automatic fingerprint identification. In this paper, we use fingerprints of low quality from database DB1 of the Fingerprint Verification Competition (FVC 2002) to conduct our experiments. Fingerprints are processed to find their core points using the Poincare index and enhanced using a diffusion coherence filter, whose performance is known to be good in the high-curvature regions of fingerprints. Seven statistical descriptors based on the Grey-level Co-Occurrence Matrix (GLCM), computed at four different inter-pixel distances, are then extracted as features and used to train and test the REPTree, RandomTree, J48, Decision Stump and Random Forest machine learning techniques for personal identification. Experiments are conducted on 80 instances and 28 attributes. Our experiments showed that Random Forests and J48 give good results for latent fingerprints compared to the other machine learning techniques and can help improve identification accuracy.
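
The GLCM feature extraction step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not name its seven descriptors, so the three shown here (contrast, angular second moment, homogeneity), the horizontal-only pixel pairing, the symmetric normalization, and the function names `glcm` and `glcm_descriptors` are assumptions. Seven descriptors at four distances would give 7 × 4 = 28 values, which is consistent with the 28 attributes reported.

```python
import numpy as np

def glcm(image, distance, levels):
    """Normalized, symmetric grey-level co-occurrence matrix.

    Counts horizontal pixel pairs that are `distance` columns apart
    (distance must be at least 1).
    """
    img = np.asarray(image)
    left = img[:, :-distance].ravel()
    right = img[:, distance:].ravel()
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (left, right), 1.0)  # accumulate repeated pairs
    counts = counts + counts.T             # count each pair in both directions
    return counts / counts.sum()

def glcm_descriptors(p):
    """Three common statistical descriptors of a normalized GLCM."""
    i, j = np.indices(p.shape)
    return {
        "contrast": float(np.sum(p * (i - j) ** 2)),
        "angular_second_moment": float(np.sum(p ** 2)),
        "homogeneity": float(np.sum(p / (1.0 + (i - j) ** 2))),
    }

# Small hypothetical image with four grey levels (0..3).
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])

# One descriptor set per inter-pixel distance; concatenated, these would
# form the feature vector passed to a classifier.
features = {d: glcm_descriptors(glcm(image, d, levels=4)) for d in (1, 2)}
```

A real fingerprint image would first be quantized to a fixed number of grey levels, and the resulting feature vectors would then be fed to tree-based classifiers such as those listed in the abstract.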

50.Deep Learning in Video Multi-Object Tracking: A Survey ⬇️

The problem of Multiple Object Tracking (MOT) consists in following the trajectory of different objects in a sequence, usually a video. In recent years, with the rise of Deep Learning, the algorithms that provide a solution to this problem have benefited from the representational power of deep models. This paper provides a comprehensive survey on works that employ Deep Learning models to solve the task of MOT on single-camera videos. Four main steps in MOT algorithms are identified, and an in-depth review of how Deep Learning was employed in each one of these stages is presented. A complete experimental comparison of the presented works on the three MOTChallenge datasets is also provided, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.

51.Deep Learning For Face Recognition: A Critical Analysis ⬇️

Face recognition is a rapidly developing and widely applied aspect of biometric technologies. Its applications are broad, ranging from law enforcement to consumer applications, and industry efficiency and monitoring solutions. The recent advent of affordable, powerful GPUs and the creation of huge face databases have drawn research focus primarily to the development of increasingly deep neural networks designed for all aspects of face recognition tasks, ranging from detection and preprocessing to feature representation and classification in verification and identification solutions. However, despite these improvements, real-time, accurate face recognition is still a challenge, primarily due to the high computational cost associated with the use of Deep Convolutional Neural Networks (DCNN), and the need to balance accuracy requirements with time and resource constraints. Other significant issues affecting face recognition relate to occlusion, illumination and pose invariance, which cause a notable decline in accuracy in both traditional handcrafted solutions and deep neural networks. This survey provides a critical analysis and comparison of modern state-of-the-art methodologies, their benefits, and their limitations. It provides comprehensive coverage of both deep and shallow solutions, as they stand today, and highlights areas requiring future development and improvement. This review is aimed at facilitating research into novel approaches, and further development of current methodologies by scientists and engineers, whilst imparting an informative and analytical perspective on currently available solutions to end users in industry, government and consumer contexts.

52.Propose-and-Attend Single Shot Detector ⬇️

We present a simple yet effective prediction module for a one-stage detector. The main process is conducted in a coarse-to-fine manner. First, the module roughly adjusts the default boxes to well capture the extent of target objects in an image. Second, given the adjusted boxes, the module aligns the receptive field of the convolution filters accordingly, not requiring any embedding layers. Both steps build a propose-and-attend mechanism, mimicking two-stage detectors in a highly efficient manner. To verify its effectiveness, we apply the proposed module to a basic one-stage detector SSD. Our final model achieves an accuracy comparable to that of state-of-the-art detectors while using a fraction of their model parameters and computational overheads. Moreover, we found that the proposed module has two strong applications. 1) The module can be successfully integrated into a lightweight backbone, further pushing the efficiency of the one-stage detector. 2) The module also allows train-from-scratch without relying on any sophisticated base networks as previous methods do.

53.Multi-Angle Point Cloud-VAE: Unsupervised Feature Learning for 3D Point Clouds from Multiple Angles by Joint Self-Reconstruction and Half-to-Half Prediction ⬇️

Unsupervised feature learning for point clouds has been vital for large-scale point cloud understanding. Recent deep learning based methods depend on learning global geometry from self-reconstruction. However, these methods still suffer from ineffective learning of local geometry, which significantly limits the discriminability of learned features. To resolve this issue, we propose MAP-VAE to enable the learning of global and local geometry by jointly leveraging global and local self-supervision. To enable effective local self-supervision, we introduce multi-angle analysis for point clouds. In a multi-angle scenario, we first split a point cloud into a front half and a back half from each angle, and then train MAP-VAE to learn to predict a back half sequence from the corresponding front half sequence. MAP-VAE performs this half-to-half prediction using an RNN to simultaneously learn each local geometry and the spatial relationship among them. In addition, MAP-VAE also learns global geometry via self-reconstruction, where we employ a variational constraint to facilitate novel shape generation. Superior results on four shape analysis tasks show that MAP-VAE can learn more discriminative global or local features than the state-of-the-art methods.
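The front/back split described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the halves are defined, so the rule below is an assumption: the viewing direction lies in the x-y plane, and the "front" half is the half of the points with the smallest depth along that direction. The function name `split_front_back` is hypothetical.

```python
import math


def split_front_back(points, angle_deg):
    """Split a point cloud into front and back halves as seen from one angle.

    Assumption (not from the abstract): the viewing direction lies in the
    x-y plane, and 'front' is the half of the points with the smallest
    depth along that direction.
    """
    theta = math.radians(angle_deg)
    direction = (math.cos(theta), math.sin(theta))
    # Depth of a point = its projection onto the viewing direction.
    by_depth = sorted(
        points, key=lambda p: p[0] * direction[0] + p[1] * direction[1]
    )
    half = len(by_depth) // 2
    return by_depth[:half], by_depth[half:]


cloud = [(2, 0, 0), (-2, 0, 0), (1, 1, 0), (-1, -1, 0)]
front_0, back_0 = split_front_back(cloud, 0)
front_180, back_180 = split_front_back(cloud, 180)
print(front_0, back_0)
```

Viewing the same cloud from the opposite angle swaps the two halves, which is what makes each angle a distinct prediction task.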

54.Mapping road safety features from streetview imagery: A deep learning approach ⬇️

Each year, around 6 million car accidents occur in the U.S. on average. Road safety features (e.g., concrete barriers, metal crash barriers, rumble strips) play an important role in preventing or mitigating vehicle crashes. Accurate maps of road safety features are an important component of safety management systems for federal or state transportation agencies, helping traffic engineers identify locations in which to invest in safety infrastructure. In current practice, mapping road safety features is largely done manually (e.g., observations on the road or visual interpretation of streetview imagery), which is both expensive and time consuming. In this paper, we propose a deep learning approach to automatically map road safety features from streetview imagery. Unlike existing Convolutional Neural Networks (CNNs) that classify each image individually, we propose to further add a Recurrent Neural Network (Long Short Term Memory) to capture the geographic context of images (the spatial autocorrelation effect along linear road network paths). Evaluations on real world streetview imagery show that our proposed model outperforms several baseline methods.

55.Camera Exposure Control for Robust Robot Vision with Noise-Aware Image Quality Assessment ⬇️

In this paper, we propose a noise-aware exposure control algorithm for robust robot vision. Our method aims to capture the best-exposed image which can boost the performance of various computer vision and robotics tasks. For this purpose, we carefully design an image quality metric which captures complementary quality attributes and ensures light-weight computation. Specifically, our metric consists of a combination of image gradient, entropy, and noise metrics. The synergy of these measures allows preserving sharp edge and rich texture in the image while maintaining a low noise level. Using this novel metric, we propose a real-time and fully automatic exposure and gain control technique based on the Nelder-Mead method. To illustrate the effectiveness of our technique, a large set of experimental results demonstrates higher qualitative and quantitative performances when compared with conventional approaches.

56.MoBiNet: A Mobile Binary Network for Image Classification ⬇️

MobileNet and Binary Neural Networks are two of the most widely used techniques to construct deep learning models for performing a variety of tasks on mobile and embedded devices. In this paper, we present a simple yet efficient scheme to exploit MobileNet binarization at activation function and model weights. However, training a binary network from scratch with separable depth-wise and point-wise convolutions in the case of MobileNet is not trivial and is prone to divergence. To tackle this training issue, we propose a novel neural network architecture, namely MoBiNet (Mobile Binary Network), in which skip connections are manipulated to prevent information loss and vanishing gradients, thus facilitating the training process. More importantly, while existing binary neural networks often make use of cumbersome backbones such as AlexNet, ResNet, and VGG-16 with float-type pre-trained weight initialization, our MoBiNet focuses on binarizing already-compressed neural networks like MobileNet without the need for a pre-trained model to start with. Therefore, our proposal results in an effectively small model while keeping the accuracy comparable to existing ones. Experiments on the ImageNet dataset show the potential of MoBiNet, as it achieves 54.40% top-1 accuracy and dramatically reduces the computational cost with binary operators.
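To make the "binary operators" remark concrete, here is a minimal sketch of generic binary-network arithmetic, not of MoBiNet's architecture. It assumes sign binarization to +1/-1 and shows that a dot product of two binarized vectors can be computed with XNOR and a bit count. The helper names `binarize`, `pack_bits`, and `binary_dot` are hypothetical.

```python
def binarize(values):
    """Map real values to +1/-1 with the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a +1/-1 vector into an integer: +1 becomes bit 1, -1 becomes bit 0."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR + popcount."""
    mask = (1 << n) - 1
    # XNOR marks positions where the signs agree; count those positions.
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")
    # Each agreement contributes +1, each disagreement -1.
    return 2 * agree - n


weights = binarize([0.7, -1.2, 0.1, -0.4])
activations = binarize([-0.3, -0.8, 0.5, 0.9])
fast = binary_dot(pack_bits(weights), pack_bits(activations), len(weights))
reference = sum(w * a for w, a in zip(weights, activations))
print(fast, reference)
```

The bitwise result matches the ordinary +1/-1 dot product, which is why binarized layers can replace floating-point multiply-accumulates with cheap bit operations.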

57.Artistic Domain Generalisation Methods are Limited by their Deep Representations ⬇️

The cross-depiction problem refers to the task of recognising visual objects regardless of their depictions; whether photographed, painted, sketched, etc. In the past, some researchers considered cross-depiction to be domain adaptation (DA). More recent work considers cross-depiction as domain generalisation (DG), in which algorithms extend recognition from one set of domains (such as photographs and coloured artwork) to another (such as sketches). We show that fixing the last layer of AlexNet to random values provides a performance comparable to state of the art DA and DG algorithms, when tested over the PACS benchmark. With support from background literature, our results lead us to conclude that texture alone is insufficient to support generalisation; rather, higher-order representations such as structure and shape are necessary.

58.Screening Mammogram Classification with Prior Exams ⬇️

Radiologists typically compare a patient's most recent breast cancer screening exam to their previous ones in making informed diagnoses. To reflect this practice, we propose new neural network models that compare pairs of screening mammograms from the same patient. We train and evaluate our proposed models on over 665,000 pairs of images (over 166,000 pairs of exams). Our best model achieves an AUC of 0.866 in predicting malignancy in patients who underwent breast cancer screening, reducing the error rate of the corresponding baseline.

59.GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations ⬇️

Generative models are emerging as promising tools in robotics and reinforcement learning. Yet, even though tasks in these domains typically involve distinct objects, most state-of-the-art methods do not explicitly capture the compositional nature of visual scenes. Two exceptions, MONet and IODINE, decompose scenes into objects in an unsupervised fashion via a set of latent variables. Their underlying generative processes, however, do not account for component interactions. Hence, neither of them allows for principled sampling of coherent scenes. Here we present GENESIS, the first object-centric generative model of visual scenes capable of both decomposing and generating complete scenes by explicitly capturing relationships between scene components. GENESIS parameterises a spatial GMM over pixels which is encoded by component-wise latent variables that are inferred sequentially or sampled from an autoregressive prior. We train GENESIS on two publicly available datasets and probe the information in the latent representations through a set of classification tasks, outperforming several baselines.

60.Lung image segmentation by generative adversarial networks ⬇️

Lung image segmentation plays an important role in the computer-aided diagnosis and treatment of pulmonary diseases. This paper proposes a lung image segmentation method based on generative adversarial networks. We employ a variety of generative adversarial networks and use their image translation capability to perform image segmentation. The generative adversarial networks are employed to translate the original lung image to the segmented image. The segmentation method based on generative adversarial networks was tested on a real lung image data set. Experimental results show that the proposed method is effective and outperforms the state-of-the-art method.

61.Synthesis and Inpainting-Based MR-CT Registration for Image-Guided Thermal Ablation of Liver Tumors ⬇️

Thermal ablation is a minimally invasive procedure for treating small or unresectable tumors. Although CT is widely used for guiding ablation procedures, the contrast of tumors against surrounding normal tissues in CT images is often poor, aggravating the difficulty in accurate thermal ablation. In this paper, we propose a fast MR-CT image registration method to overlay a pre-procedural MR (pMR) image onto an intra-procedural CT (iCT) image for guiding the thermal ablation of liver tumors. By first using a Cycle-GAN model with mutual information constraint to generate synthesized CT (sCT) image from the corresponding pMR, pre-procedural MR-CT image registration is carried out through traditional mono-modality CT-CT image registration. At the intra-procedural stage, a partial-convolution-based network is first used to inpaint the probe and its artifacts in the iCT image. Then, an unsupervised registration network is used to efficiently align the pre-procedural CT (pCT) with the inpainted iCT (inpCT) image. The final transformation from pMR to iCT is obtained by combining the two estimated transformations, i.e., (1) from the pMR image space to the pCT image space (through sCT) and (2) from the pCT image space to the iCT image space (through inpCT). Experimental results confirm that the proposed method achieves high registration accuracy with a very fast computational speed.

62.Inertial nonconvex alternating minimizations for the image deblurring ⬇️

In image processing, Total Variation (TV) regularization models are commonly used to recover blurred images. One of the most efficient and popular methods to solve the convex TV problem is the Alternating Direction Method of Multipliers (ADMM) algorithm, recently extended using the inertial proximal point method. Although all the classical studies focus on only a convex formulation, recent articles are paying increasing attention to the nonconvex methodology due to its good numerical performance and properties. In this paper, we propose to extend the classical formulation with a novel nonconvex Alternating Direction Method of Multipliers with the Inertial technique (IADMM). Under certain assumptions on the parameters, we prove the convergence of the algorithm with the help of the Kurdyka-Łojasiewicz property. We also present numerical simulations on classical TV image reconstruction problems to illustrate the efficiency of the new algorithm and its behavior compared with the well established ADMM method.

63.Attention Guided Network for Retinal Image Segmentation ⬇️

Learning structural information is critical for producing an ideal result in retinal image segmentation. Recently, convolutional neural networks have shown a powerful ability to extract effective representations. However, convolutional and pooling operations filter out some useful structural information. In this paper, we propose an Attention Guided Network (AG-Net) to preserve the structural information and guide the expanding operation. In our AG-Net, the guided filter is exploited as a structure sensitive expanding path to transfer structural information from previous feature maps, and an attention block is introduced to exclude the noise and reduce the negative influence of background further. The extensive experiments on two retinal image segmentation tasks (i.e., blood vessel segmentation, optic disc and cup segmentation) demonstrate the effectiveness of our proposed method.

64.Not All Adversarial Examples Require a Complex Defense: Identifying Over-optimized Adversarial Examples with IQR-based Logit Thresholding ⬇️

Detecting adversarial examples currently stands as one of the biggest challenges in the field of deep learning. Adversarial attacks, which produce adversarial examples, increase the prediction likelihood of a target class for a particular data point. During this process, the adversarial example can be further optimized, even when it has already been wrongly classified with 100% confidence, thus making the adversarial example even more difficult to detect. For this kind of adversarial example, which we refer to as an over-optimized adversarial example, we discovered that the logits of the model provide solid clues on whether the data point at hand is adversarial or genuine. In this context, we first discuss the masking effect of the softmax function for the prediction made and explain why the logits of the model are more useful in detecting over-optimized adversarial examples. To identify this type of adversarial example in practice, we propose a non-parametric and computationally efficient method which relies on the interquartile range, with this method becoming more effective as the image resolution increases. We support our observations throughout the paper with detailed experiments for different datasets (MNIST, CIFAR-10, and ImageNet) and several architectures.
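The softmax masking effect and an interquartile-range threshold on logits can be sketched as follows. The abstract does not give the exact rule, so this is an assumption: the threshold is the upper Tukey fence, Q3 + 1.5 × IQR, computed over the largest logits of a set of genuine reference inputs. The function names and all logit values below are hypothetical illustration data, not results from the paper.

```python
import math
import statistics


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def iqr_threshold(genuine_max_logits, k=1.5):
    """Upper Tukey fence over the top logits of genuine inputs (assumed rule)."""
    q1, _, q3 = statistics.quantiles(genuine_max_logits, n=4)
    return q3 + k * (q3 - q1)


def is_over_optimized(logits, threshold):
    """Flag an input whose largest logit exceeds the calibrated threshold."""
    return max(logits) > threshold


# Hypothetical top logits collected from genuine inputs.
genuine_top = [8.0, 9.5, 10.0, 11.0, 12.5, 9.0, 10.5, 11.5]
threshold = iqr_threshold(genuine_top)

genuine_logits = [12.5, 2.0, 1.0]
adversarial_logits = [40.0, 2.0, 1.0]

# Softmax masks the difference: both top probabilities are close to 1.
print(softmax(genuine_logits)[0], softmax(adversarial_logits)[0])
# The raw logits still separate the two inputs.
print(is_over_optimized(genuine_logits, threshold),
      is_over_optimized(adversarial_logits, threshold))
```

Because softmax saturates, the two inputs look equally confident in probability space, while the unnormalized logit of the over-optimized input is a clear outlier.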

65.Confounder-Aware Visualization of ConvNets ⬇️

With recent advances in deep learning, neuroimaging studies increasingly rely on convolutional networks (ConvNets) to predict diagnosis based on MR images. To gain a better understanding of how a disease impacts the brain, these studies visualize the saliency maps of the ConvNet, highlighting the voxels within the brain that contribute most to the prediction. However, these saliency maps are generally confounded, i.e., some salient regions are more predictive of confounding variables (such as age) than the diagnosis. To avoid such misinterpretation, we propose in this paper an approach that aims to visualize confounder-free saliency maps that only highlight voxels predictive of the diagnosis. The approach incorporates univariate statistical tests to identify confounding effects within the intermediate features learned by the ConvNet. The influence from the subset of confounded features is then removed by a novel partial back-propagation procedure. We use this two-step approach to visualize confounder-free saliency maps extracted from synthetic and two real datasets. These experiments reveal the potential of our visualization in producing unbiased model-interpretation.

66.Exploring large scale public medical image datasets ⬇️

Rationale and Objectives: Medical artificial intelligence systems are dependent on well characterised large scale datasets. Recently released public datasets have been of great interest to the field, but pose specific challenges due to the disconnect they cause between data generation and data usage, potentially limiting the utility of these datasets.
Materials and Methods: We visually explore two large public datasets, to determine how accurate the provided labels are and whether other subtle problems exist. The ChestXray14 dataset contains 112,120 frontal chest films, and the MURA dataset contains 40,561 upper limb radiographs. A subset of around 700 images from both datasets was reviewed by a board-certified radiologist, and the quality of the original labels was determined.
Results: The ChestXray14 labels did not accurately reflect the visual content of the images, with positive predictive values mostly between 10% and 30% lower than the values presented in the original documentation. There were other significant problems, with examples of hidden stratification and label disambiguation failure. The MURA labels were more accurate, but the original normal/abnormal labels were inaccurate for the subset of cases with degenerative joint disease, with a sensitivity of 60% and a specificity of 82%.
Conclusion: Visual inspection of images is a necessary component of understanding large image datasets. We recommend that teams producing public datasets should perform this important quality control procedure and include a thorough description of their findings, along with an explanation of the data generating procedures and labelling rules, in the documentation for their datasets.
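The label-quality figures quoted in this abstract (positive predictive value, sensitivity, specificity) follow from a confusion matrix between the original labels and the reviewer's reading. A minimal sketch is given below; the function name `label_quality` is hypothetical and the counts are made-up illustration values, not data from the study.

```python
def label_quality(tp, fp, tn, fn):
    """Positive predictive value, sensitivity, and specificity from counts.

    The original dataset label is treated as the 'prediction' and the
    reviewer's reading as the reference. A metric is None when its
    denominator is zero.
    """
    def ratio(num, den):
        return num / den if den else None

    return {
        "ppv": ratio(tp, tp + fp),          # labelled positive and truly positive
        "sensitivity": ratio(tp, tp + fn),  # true positives that were labelled positive
        "specificity": ratio(tn, tn + fp),  # true negatives that were labelled negative
    }


# Made-up counts for illustration only.
metrics = label_quality(tp=30, fp=10, tn=50, fn=10)
print(metrics)
```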

67.Particle Swarm Optimisation for Evolving Deep Neural Networks for Image Classification by Evolving and Stacking Transferable Blocks ⬇️

Deep Convolutional Neural Networks (CNNs) have been widely used in image classification tasks, but the process of designing CNN architectures is very complex, so Neural Architecture Search (NAS), which automatically searches for optimal CNN architectures, has attracted more and more research interest. However, the computational cost of NAS is often too high to apply it to real-life applications. In this paper, an efficient particle swarm optimisation method named EPSOCNN is proposed to evolve CNN architectures inspired by the idea of transfer learning. EPSOCNN successfully reduces the computational cost by minimising the search space to a single block and utilising a small subset of the training set to evaluate CNNs during the evolutionary process. Meanwhile, EPSOCNN also keeps very competitive classification accuracy by stacking the evolved block multiple times to fit the whole dataset. The proposed EPSOCNN algorithm is evaluated on the CIFAR-10 dataset and compared with 13 peer competitors comprising deep CNNs crafted by hand, learned by reinforcement learning methods, and evolved by evolutionary computation approaches. It shows very promising results, outperforming all of the peer competitors with regard to classification accuracy, number of parameters, and computational cost.

68.Task Classification Model for Visual Fixation, Exploration, and Search ⬇️

Yarbus' claim that an observer's task can be decoded from eye movements has received mixed reactions. In this paper, we support the hypothesis that it is possible to decode the task. We conducted an exploratory analysis on the dataset by projecting features and data points into a scatter plot to visualize the nuanced properties of each task. Following this analysis, we eliminated highly correlated features before training SVM and AdaBoost classifiers to predict the tasks from the filtered eye movement data. We achieve an accuracy of 95.4% on this task classification problem and hence support the hypothesis that task classification is possible from a user's eye movement data.

69.Automatic Lesion Boundary Segmentation in Dermoscopic Images with Ensemble Deep Learning Methods ⬇️

Early detection of skin cancer, particularly melanoma, is crucial to enable advanced treatment. Due to the rapid growth in the number of skin cancers, there is a growing need for computerized analysis of skin lesions. The state-of-the-art publicly available datasets for skin lesions are often accompanied by a very limited amount of segmentation ground truth labeling, as it is laborious and expensive. Lesion boundary segmentation is vital to locate the lesion accurately in dermoscopic images and to diagnose different skin lesion types. In this work, we propose the use of fully automated deep learning ensemble methods for accurate lesion boundary segmentation in dermoscopic images. We trained the Mask-RCNN and DeepLabv3+ methods on the ISIC-2017 segmentation training set and evaluated the performance of the ensemble networks on the ISIC-2017 testing set. Our results showed that the best proposed ensemble method segmented the skin lesions with a Jaccard index of 79.58% for the ISIC-2017 testing set. The proposed ensemble method outperformed FrCN, FCN, U-Net, and SegNet in Jaccard index by 2.48%, 7.42%, 17.95%, and 9.96% respectively. Furthermore, the proposed ensemble method achieved an accuracy of 95.6% for some representative clinically benign cases, 90.78% for the melanoma cases, and 91.29% for the seborrheic keratosis cases on the ISIC-2017 testing set, exhibiting better performance than FrCN, FCN, U-Net, and SegNet.
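For readers unfamiliar with the metric reported above, the Jaccard index of a predicted mask and a ground-truth mask is the size of their intersection divided by the size of their union. A minimal sketch on small hand-made binary masks follows; the function name `jaccard_index` and the masks are illustrative, not from the paper.

```python
def jaccard_index(pred, truth):
    """Jaccard index (intersection over union) of two binary masks.

    Masks are equally sized nested lists of 0/1 values.
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


pred_mask = [[0, 1, 1],
             [0, 1, 0]]
truth_mask = [[0, 1, 0],
              [1, 1, 0]]
score = jaccard_index(pred_mask, truth_mask)
print(score)  # 2 overlapping pixels / 4 pixels in the union = 0.5
```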