1.Adversarial Joint Image and Pose Distribution Learning for Camera Pose Regression and Refinement pdf
In this paper we present a deep-learning based framework for direct camera pose regression and refinement using RGB information only. For this aim we introduce a novel framework for camera pose estimation, that regresses the camera pose as well as offers a solely RGB-based solution for camera pose refinement. Utilizing research results of recent camera pose regression methods, we investigate the effect of adversarial networks on convolutional neural networks (CNNs) trained for camera re-localization applications, with the goal to better learn the geometric connection between camera pose and corresponding RGB image. Similar to Generative Adversarial Networks (GANs), in addition to a camera pose regressor, mapping images to poses, we propose to train a discriminator that effectively distinguishes between regressed and ground truth poses. This pose discriminator is conditioned on features extracted from the respective input image to implicitly model the relationship between ground truth or regressed poses, and once learned can be used to update the predicted camera poses and improve the localization accuracy.
2.PifPaf: Composite Fields for Human Pose Estimation pdf
We propose a new bottom-up method for multi-person 2D human pose estimation that is particularly well suited for urban mobility such as self-driving cars and delivery robots. The new method, PifPaf, uses a Part Intensity Field (PIF) to localize body parts and a Part Association Field (PAF) to associate body parts with each other to form full human poses. Our method outperforms previous methods at low resolution and in crowded, cluttered and occluded scenes thanks to (i) our new composite field PAF encoding fine-grained information and (ii) the choice of Laplace loss for regressions which incorporates a notion of uncertainty. Our architecture is based on a fully convolutional, single-shot, box-free design. We perform on par with the existing state-of-the-art bottom-up method on the standard COCO keypoint task and produce state-of-the-art results on a modified COCO keypoint task for the transportation domain.
3.Selective Kernel Networks pdf
In standard Convolutional Neural Networks (CNNs), the receptive fields of artificial neurons in each layer are designed to share the same size. It is well-known in the neuroscience community that the receptive field size of visual cortical neurons are modulated by the stimulus, which has been rarely considered in constructing CNNs. We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer. Multiple SK units are stacked to a deep network termed Selective Kernel Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show that SKNet outperforms the existing state-of-the-art architectures with lower model complexity. Detailed analyses show that the neurons in SKNet can capture target objects with different scales, which verifies the capability of neurons for adaptively adjusting their recpeitve field sizes according to the input. The code and models are available at this https URL.
4.Unsupervised and interpretable scene discovery with Discrete-Attend-Infer-Repeat pdf
In this work we present Discrete Attend Infer Repeat (Discrete-AIR), a Recurrent Auto-Encoder with structured latent distributions containing discrete categorical distributions, continuous attribute distributions, and factorised spatial attention. While inspired by the original AIR model andretaining AIR model's capability in identifying objects in an image, Discrete-AIR provides direct interpretability of the latent codes. We show that for Multi-MNIST and a multiple-objects version of dSprites dataset, the Discrete-AIR model needs just one categorical latent variable, one attribute variable (for Multi-MNIST only), together with spatial attention variables, for efficient inference. We perform analysis to show that the learnt categorical distributions effectively capture the categories of objects in the scene for Multi-MNIST and for Multi-Sprites.
5.Inserting Videos into Videos pdf
In this paper, we introduce a new problem of manipulating a given video by inserting other videos into it. Our main task is, given an object video and a scene video, to insert the object video at a user-specified location in the scene video so that the resulting video looks realistic. We aim to handle different object motions and complex backgrounds without expensive segmentation annotations. As it is difficult to collect training pairs for this problem, we synthesize fake training pairs that can provide helpful supervisory signals when training a neural network with unpaired real data. The proposed network architecture can take both real and fake pairs as input and perform both supervised and unsupervised training in an adversarial learning scheme. To synthesize a realistic video, the network renders each frame based on the current input and previous frames. Within this framework, we observe that injecting noise into previous frames while generating the current frame stabilizes training. We conduct experiments on real-world videos in object tracking and person re-identification benchmark datasets. Experimental results demonstrate that the proposed algorithm is able to synthesize long sequences of realistic videos with a given object video inserted.
6.Multi-label Cloud Segmentation Using a Deep Network pdf
Different empirical models have been developed for cloud detection. There is a growing interest in using the ground-based sky/cloud images for this purpose. Several methods exist that perform binary segmentation of clouds. In this paper, we propose to use a deep learning architecture (U-Net) to perform multi-label sky/cloud image segmentation. The proposed approach outperforms recent literature by a large margin.
7.Constrained Mutual Convex Cone Method for Image Set Based Recognition pdf
In this paper, we propose a method for image-set classification based on convex cone models. Image set classification aims to classify a set of images, which were usually obtained from video frames or multi-view cameras, into a target object. To accurately and stably classify a set, it is essential to represent structural information of the set accurately. There are various representative image features, such as histogram based features, HLAC, and Convolutional Neural Network (CNN) features. We should note that most of them have non-negativity and thus can be effectively represented by a convex cone. This leads us to introduce the convex cone representation to image-set classification. To establish a convex cone based framework, we mathematically define multiple angles between two convex cones, and then define the geometric similarity between the cones using the angles. Moreover, to enhance the framework, we introduce a discriminant space that maximizes the between-class variance (gaps) and minimizes the within-class variance of the projected convex cones onto the discriminant space, similar to the Fisher discriminant analysis. Finally, the classification is performed based on the similarity between projected convex cones. The effectiveness of the proposed method is demonstrated experimentally by using five databases: CMU PIE dataset, ETH-80, CMU Motion of Body dataset, Youtube Celebrity dataset, and a private database of multi-view hand shapes.
8.Superpixel Contracted Graph-Based Learning for Hyperspectral Image Classification pdf
A central problem in hyperspectral image classification is obtaining high classification accuracy when using a limited amount of labelled data. In this paper we present a novel graph-based framework, which aims to tackle this problem in the presence of large scale data input. Our approach utilises a novel superpixel method, specifically designed for hyperspectral data, to define meaningful local regions in an image, which with high probability share the same classification label. We then extract spectral and spatial features from these regions and use these to produce a contracted weighted graph-representation, where each node represents a region rather than a pixel. Our graph is then fed into a graph-based semi-supervised classifier which gives the final classification. We show that using superpixels in a graph representation is an effective tool for speeding up graphical classifiers applied to hyperspectral images. We demonstrate through exhaustive quantitative and qualitative results that our proposed method produces accurate classifications when an incredibly small amount of labelled data is used. We show that our approach mitigates the major drawbacks of existing approaches, resulting in our approach outperforming several comparative state-of-the-art techniques.
9.Age prediction using a large chest X-ray dataset pdf
Age prediction based on appearances of different anatomies in medical images has been clinically explored for many decades. In this paper, we used deep learning to predict a persons age on Chest X-Rays. Specifically, we trained a CNN in regression fashion on a large publicly available dataset. Moreover, for interpretability, we explored activation maps to identify which areas of a CXR image are important for the machine (i.e. CNN) to predict a patients age, offering insight. Overall, amongst correctly predicted CXRs, we see areas near the clavicles, shoulders, spine, and mediastinum being most activated for age prediction, as one would expect biologically. Amongst incorrectly predicted CXRs, we have qualitatively identified disease patterns that could possibly make the anatomies appear older or younger than expected. A further technical and clinical evaluation would improve this work. As CXR is the most commonly requested imaging exam, a potential use case for estimating age may be found in the preventative counseling of patient health status compared to their age-expected average, particularly when there is a large discrepancy between predicted age and the real patient age.
10.Alignment Based Matching Networks for One-Shot Classification and Open-Set Recognition pdf
Deep learning for object classification relies heavily on convolutional models. While effective, CNNs are rarely interpretable after the fact. An attention mechanism can be used to highlight the area of the image that the model focuses on thus offering a narrow view into the mechanism of classification. We expand on this idea by forcing the method to explicitly align images to be classified to reference images representing the classes. The mechanism of alignment is learned and therefore does not require that the reference objects are anything like those being classified. Beyond explanation, our exemplar based cross-alignment method enables classification with only a single example per category (one-shot). Our model cuts the 5-way, 1-shot error rate in Omniglot from 2.1% to 1.4% and in MiniImageNet from 53.5% to 46.5% while simultaneously providing point-wise alignment information providing some understanding on what the network is capturing. This method of alignment also enables the recognition of an unsupported class (open-set) in the one-shot setting while maintaining an F1-score of above 0.5 for Omniglot even with 19 other distracting classes while baselines completely fail to separate the open-set class in the one-shot setting.
11.Multi-Representational Learning for Offline Signature Verification using Multi-Loss Snapshot Ensemble of CNNs pdf
Offline Signature Verification (OSV) is a challenging pattern recognition task, especially in presence of skilled forgeries that are not available during training. This study aims to tackle its challenges and meet the substantial need for generalization for OSV by examining different loss functions for Convolutional Neural Network (CNN). We adopt our new approach to OSV by asking two questions: 1. which classification loss provides more generalization for feature learning in OSV? , and 2. How integration of different losses into a unified multi-loss function lead to an improved learning framework? These questions are studied based on analysis of three loss functions, including cross entropy, Cauchy-Schwarz divergence, and hinge loss. According to complementary features of these losses, we combine them into a dynamic multi-loss function and propose a novel ensemble framework for simultaneous use of them in CNN. Our proposed Multi-Loss Snapshot Ensemble (MLSE) consists of several sequential trials. In each trial, a dominant loss function is selected from the multi-loss set, and the remaining losses act as a regularizer. Different trials learn diverse representations for each input based on signature identification task. This multi-representation set is then employed for the verification task. An ensemble of SVMs is trained on these representations, and their decisions are finally combined according to the selection of most generalizable SVM for each user. We conducted two sets of experiments based on two different protocols of OSV, i.e., writer-dependent and writer-independent on three signature datasets: GPDS-Synthetic, MCYT, and UT-SIG. Based on the writer-dependent OSV protocol, we achieved substantial improvements over the best EERs in the literature. The results of the second set of experiments also confirmed the robustness to the arrival of new users enrolled in the OSV system.
12.Bringing Blurry Alive at High Frame-Rate with an Event Camera pdf
Event-based cameras can measure intensity changes (called `{\it events}') with microsecond accuracy under high-speed motion and challenging lighting conditions. With the active pixel sensor (APS), the event camera allows simultaneous output of the intensity frames. However, the output images are captured at a relatively low frame-rate and often suffer from motion blur. A blurry image can be regarded as the integral of a sequence of latent images, while the events indicate the changes between the latent images. Therefore, we are able to model the blur-generation process by associating event data to a latent image. Based on the abundant event data and the low frame-rate easily blurred images, we propose a simple and effective approach to reconstruct a high-quality and high frame-rate shape video. Starting with a single blurry frame and its event data, we propose the \textbf{Event-based Double Integral (EDI)} model. Then, we extend it to \textbf{ multiple Event-based Double Integral (mEDI)} model to get more smooth and convincing results based on multiple images and their events. We also provide an efficient solver to minimize the proposed energy model. By optimizing the energy model, we achieve significant improvements in removing general blurs and reconstructing high temporal resolution video. The video generation is based on solving a simple non-convex optimization problem in a single scalar variable. Experimental results on both synthetic and real images demonstrate the superiority of our mEDI model and optimization method in comparison to the state-of-the-art.
13.Spiking-YOLO: Spiking Neural Network for Real-time Object Detection pdf
Over the past decade, deep neural networks (DNNs) have become a de-facto standard for solving machine learning problems. As we try to solve more advanced problems, growing demand for computing and power resources are inevitable, nearly impossible to employ DNNs on embedded systems, where available resources are limited. Given these circumstances, spiking neural networks (SNNs) are attracting widespread interest as the third generation of neural network, due to event-driven and low-powered nature. However, SNNs come at the cost of significant performance degradation largely due to complex dynamics of SNN neurons and non-differential spike operation. Thus, its application has been limited to relatively simple tasks such as image classification. In this paper, we investigate the performance degradation of SNNs in the much more challenging task of object detection. From our in-depth analysis, we introduce two novel methods to overcome a significant performance gap: channel-wise normalization and signed neuron with imbalanced threshold. Consequently, we present a spiked-based real-time object detection model, called Spiking-YOLO that provides near-lossless information transmission in a shorter period of time for deep SNN. Our experiments show that the Spiking-YOLO is able to achieve comparable results up to 97% of the original YOLO on a non-trivial dataset, PASCAL VOC.
14.Noisy Supervision for Correcting Misaligned Cadaster Maps Without Perfect Ground Truth Data pdf
In machine learning the best performance on a certain task is achieved by fully supervised methods when perfect ground truth labels are available. However, labels are often noisy, especially in remote sensing where manually curated public datasets are rare. We study the multi-modal cadaster map alignment problem for which available annotations are mis-aligned polygons, resulting in noisy supervision. We subsequently set up a multiple-rounds training scheme which corrects the ground truth annotations at each round to better train the model at the next round. We show that it is possible to reduce the noise of the dataset by iteratively training a better alignment model to correct the annotation alignment.
15.GolfDB: A Video Database for Golf Swing Sequencing pdf
The golf swing is a complex movement requiring considerable full-body coordination to execute proficiently. As such, it is the subject of frequent scrutiny and extensive biomechanical analyses. In this paper, we introduce the notion of golf swing sequencing for detecting key events in the golf swing and facilitating golf swing analysis. To enable consistent evaluation of golf swing sequencing performance, we also introduce the benchmark database GolfDB, consisting of 1400 high-quality golf swing videos, each labeled with event frames, bounding box, player name and sex, club type, and view type. Furthermore, to act as a reference baseline for evaluating golf swing sequencing performance on GolfDB, we propose a lightweight deep neural network called SwingNet, which possesses a hybrid deep convolutional and recurrent neural network architecture. SwingNet correctly detects eight golf swing events at an average rate of 76.1%, and six out of eight events at a rate of 91.8%. In line with the proposed baseline SwingNet, we advocate the use of computationally efficient models in future research to promote in-the-field analysis via deployment on readily-available mobile devices.
16.Phenotypic Profiling of High Throughput Imaging Screens with Generic Deep Convolutional Features pdf
While deep learning has seen many recent applications to drug discovery, most have focused on predicting activity or toxicity directly from chemical structure. Phenotypic changes exhibited in cellular images are also indications of the mechanism of action (MoA) of chemical compounds. In this paper, we show how pre-trained convolutional image features can be used to assist scientists in discovering interesting chemical clusters for further investigation. Our method reduces the dimensionality of raw fluorescent stained images from a high throughput imaging (HTI) screen, producing an embedding space that groups together images with similar cellular phenotypes. Running standard unsupervised clustering on this embedding space yields a set of distinct phenotypic clusters. This allows scientists to further select and focus on interesting clusters for downstream analyses. We validate the consistency of our embedding space qualitatively with t-sne visualizations, and quantitatively by measuring embedding variance among images that are known to be similar. Results suggested the usefulness of our proposed workflow using deep learning and clustering and it can lead to robust HTI screening and compound triage.
17.SceneCode: Monocular Dense Semantic Reconstruction using Learned Encoded Scene Representation pdf
Systems which incrementally create 3D semantic maps from image sequences must store and update representations of both geometry and semantic entities. However, while there has been much work on the correct formulation for geometrical estimation, state-of-the-art systems usually rely on simple semantic representations which store and update independent label estimates for each surface element (depth pixels, surfels, or voxels). Spatial correlation is discarded, and fused label maps are incoherent and noisy.
We introduce a new compact and optimisable semantic representation by training a variational auto-encoder that is conditioned on a colour image. Using this learned latent space, we can tackle semantic label fusion by jointly optimising the low-dimenional codes associated with each of a set of overlapping images, producing consistent fused label maps which preserve spatial correlation. We also show how this approach can be used within a monocular keyframe based semantic mapping system where a similar code approach is used for geometry. The probabilistic formulation allows a flexible formulation where we can jointly estimate motion, geometry and semantics in a unified optimisation.
18.DeepHuman: 3D Human Reconstruction from a Single Image pdf
We propose DeepHuman, a deep learning based framework for 3D human reconstruction from a single RGB image. Since this problem is highly intractable, we adopt a stage-wise, coarse-to-fine method consisting of three steps, namely body shape/pose estimation, surface reconstruction and frontal surface detail refinement. Once a body is estimated from the given image, our method generates a dense semantic representation from it. The representations not only encodes body shape and pose but also bridges the 2D image plane and 3D space. An image-guided volume-to-volume translation CNN is introduced to reconstruct the surface given the input image and the dense semantic representation. One key feature of our network is that it fuses different scales of image features into the 3D space through volumetric feature transformation, which helps to recover details of the subject's surface geometry. The details on the frontal areas of the surface are further refined through a normal map refinement network. The normal refinement network can be concatenated with the volume generation network using our proposed volumetric normal projection layer. We also contribute THuman, a 3D real-world human model dataset containing approximately 7000 models. The whole network is trained using training data generated from the dataset. Overall, due to the specific design of our network and the diversity in our dataset, our method enables 3D human reconstruction given only a single image and outperforms state-of-the-art approaches.
19.BLVD: Building A Large-scale 5D Semantics Benchmark for Autonomous Driving pdf
In autonomous driving community, numerous benchmarks have been established to assist the tasks of 3D/2D object detection, stereo vision, semantic/instance segmentation. However, the more meaningful dynamic evolution of the surrounding objects of ego-vehicle is rarely exploited, and lacks a large-scale dataset platform. To address this, we introduce BLVD, a large-scale 5D semantics benchmark which does not concentrate on the static detection or semantic/instance segmentation tasks tackled adequately before. Instead, BLVD aims to provide a platform for the tasks of dynamic 4D (3D+temporal) tracking, 5D (4D+interactive) interactive event recognition and intention prediction. This benchmark will boost the deeper understanding of traffic scenes than ever before. We totally yield 249,129 3D annotations, 4,902 independent individuals for tracking with the length of overall 214,922 points, 6,004 valid fragments for 5D interactive event recognition, and 4,900 individuals for 5D intention prediction. These tasks are contained in four kinds of scenarios depending on the object density (low and high) and light conditions (daytime and nighttime). The benchmark can be downloaded from our project site this https URL.
20.Quality-aware Unpaired Image-to-Image Translation pdf
Generative Adversarial Networks (GANs) have been widely used for the image-to-image translation task. While these models rely heavily on the labeled image pairs, recently some GAN variants have been proposed to tackle the unpaired image translation task. These models exploited supervision at the domain level with a reconstruction process for unpaired image translation. On the other hand, parallel works have shown that leveraging perceptual loss functions based on high level deep features could enhance the generated image quality. Nevertheless, as these GAN-based models either depended on the pretrained deep network structure or relied on the labeled image pairs, they could not be directly applied to the unpaired image translation task. Moreover, despite the improvement of the introduced perceptual losses from deep neural networks, few researchers have explored the possibility of improving the generated image quality from classical image quality measures. To tackle the above two challenges, in this paper, we propose a unified quality-aware GAN-based framework for unpaired image-to-image translation, where a quality-aware loss is explicitly incorporated by comparing each source image and the reconstructed image at the domain level. Specifically, we design two detailed implementations of the quality loss. The first method is based on a classical image quality assessment measure by defining a classical quality-aware loss. The second method proposes an adaptive deep network based loss. Finally, extensive experimental results on many real-world datasets clearly show the quality improvement of our proposed framework, and the superiority of leveraging classical image quality measures for unpaired image translation compared to the deep network based model.
21.DFineNet: Ego-Motion Estimation and Depth Refinement from Sparse, Noisy Depth Input with RGB Guidance pdf
Depth estimation is an important capability for autonomous vehicles to understand and reconstruct 3D environments as well as avoid obstacles during the execution. Accurate depth sensors such as LiDARs are often heavy, expensive and can only provide sparse depth while lighter depth sensors such as stereo cameras are noiser in comparison. We propose an end-to-end learning algorithm that is capable of using sparse, noisy input depth for refinement and depth completion. Our model also produces the camera pose as a byproduct, making it a great solution for autonomous systems. We evaluate our approach on both indoor and outdoor datasets. Empirical results show that our method performs well on the KITTI~\cite{kitti_geiger2012we} dataset when compared to other competing methods, while having superior performance in dealing with sparse, noisy input depth on the TUM~\cite{sturm12iros} dataset.
22.Did You Miss the Sign? A False Negative Alarm System for Traffic Sign Detectors pdf
Object detection is an integral part of an autonomous vehicle for its safety-critical and navigational purposes. Traffic signs as objects play a vital role in guiding such systems. However, if the vehicle fails to locate any critical sign, it might make a catastrophic failure. In this paper, we propose an approach to identify traffic signs that have been mistakenly discarded by the object detector. The proposed method raises an alarm when it discovers a failure by the object detector to detect a traffic sign. This approach can be useful to evaluate the performance of the detector during the deployment phase. We trained a single shot multi-box object detector to detect traffic signs and used its internal features to train a separate false negative detector (FND). During deployment, FND decides whether the traffic sign detector (TSD) has missed a sign or not. We are using precision and recall to measure the accuracy of FND in two different datasets. For 80% recall, FND has achieved 89.9% precision in Belgium Traffic Sign Detection dataset and 90.8% precision in German Traffic Sign Recognition Benchmark dataset respectively. To the best of our knowledge, our method is the first to tackle this critical aspect of false negative detection in robotic vision. Such a fail-safe mechanism for object detection can improve the engagement of robotic vision systems in our daily life.
23.Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation pdf
Human-object interactions (HOI) recognition and pose estimation are two closely related tasks. Human pose is an essential cue for recognizing actions and localizing the interacted objects. Meanwhile, human action and their interacted objects' localizations provide guidance for pose estimation. In this paper, we propose a turbo learning framework to perform HOI recognition and pose estimation simultaneously. First, two modules are designed to enforce message passing between the tasks, i.e. pose aware HOI recognition module and HOI guided pose estimation module. Then, these two modules form a closed loop to utilize the complementary information iteratively, which can be trained in an end-to-end manner. The proposed method achieves the state-of-the-art performance on two public benchmarks including Verbs in COCO (V-COCO) and HICO-DET datasets.
24.Unsupervised Deep Transfer Feature Learning for Medical Image Classification pdf
The accuracy and robustness of image classification with supervised deep learning are dependent on the availability of large-scale, annotated training data. However, there is a paucity of annotated data available due to the complexity of manual annotation. To overcome this problem, a popular approach is to use transferable knowledge across different domains by: 1) using a generic feature extractor that has been pre-trained on large-scale general images (i.e., transfer-learned) but which not suited to capture characteristics from medical images; or 2) fine-tuning generic knowledge with a relatively smaller number of annotated images. Our aim is to reduce the reliance on annotated training data by using a new hierarchical unsupervised feature extractor with a convolutional auto-encoder placed atop of a pre-trained convolutional neural network. Our approach constrains the rich and generic image features from the pre-trained domain to a sophisticated representation of the local image characteristics from the unannotated medical image domain. Our approach has a higher classification accuracy than transfer-learned approaches and is competitive with state-of-the-art supervised fine-tuned methods.
25.Flickr1024: A Dataset for Stereo Image Super-Resolution pdf
With the popularity of dual cameras in recently released smart phones, a growing number of super-resolution (SR) methods have been proposed to enhance the resolution of stereo image pairs. However, the lack of high-quality stereo datasets has limited the research in this area. To facilitate the training and evaluation of novel stereo SR algorithms, in this paper, we propose a large-scale stereo dataset named Flickr1024. Compared to the existing stereo datasets, the proposed dataset contains much more high-quality images and covers diverse scenarios. We train two state-of-the-art stereo SR methods (i.e., StereoSR and PASSRnet) on the KITTI2015, Middlebury, and Flickr1024 datasets. Experimental results demonstrate that our dataset can improve the performance of stereo SR algorithms. The Flickr1024 dataset is available online at: this https URL.
26.Unsupervised Person Re-identification by Soft Multilabel Learning pdf
Although unsupervised person re-identification (RE-ID) has drawn increasing research attentions due to its potential to address the scalability problem of supervised RE-ID models, it is very challenging to learn discriminative information in the absence of pairwise labels across disjoint camera views. To overcome this problem, we propose a deep model for the soft multilabel learning for unsupervised RE-ID. The idea is to learn a soft multilabel (real-valued label likelihood vector) for each unlabeled person by comparing the unlabeled person with a set of known reference persons from an auxiliary domain. We propose the soft multilabel-guided hard negative mining to learn a discriminative embedding for the unlabeled target domain by exploring the similarity consistency of the visual features and the soft multilabels of unlabeled target pairs. Since most target pairs are cross-view pairs, we develop the cross-view consistent soft multilabel learning to achieve the learning goal that the soft multilabels are consistently good across different camera views. To enable effecient soft multilabel learning, we introduce the reference agent learning to represent each reference person by a reference agent in a joint embedding. We evaluate our unified deep model on Market-1501 and DukeMTMC-reID. Our model outperforms the state-of-theart unsupervised RE-ID methods by clear margins. Code is available at this https URL.
27.SimulCap : Single-View Human Performance Capture with Cloth Simulation pdf
This paper proposes a new method for live free-viewpoint human performance capture with dynamic details (e.g., cloth wrinkles) using a single RGBD camera. Our main contributions are: (i) a multi-layer representation of garments and body, and (ii) a physics-based performance capture procedure. We first digitize the performer using multi-layer surface representation, which includes the undressed body surface and separate clothing meshes. For performance capture, we perform skeleton tracking, cloth simulation, and iterative depth fitting sequentially for the incoming frame. By incorporating cloth simulation into the performance capture pipeline, we can simulate plausible cloth dynamics and cloth-body interactions even in the occluded regions, which was not possible in previous capture methods. Moreover, by formulating depth fitting as a physical process, our system produces cloth tracking results consistent with the depth observation while still maintaining physical constraints. Results and evaluations show the effectiveness of our method. Our method also enables new types of applications such as cloth retargeting, free-viewpoint video rendering and animations.
28.Pose Graph Optimization for Unsupervised Monocular Visual Odometry pdf
Unsupervised Learning based monocular visual odometry (VO) has lately drawn significant attention for its potential in label-free leaning ability and robustness to camera parameters and environmental variations. However, partially due to the lack of drift correction technique, these methods are still by far less accurate than geometric approaches for large-scale odometry estimation. In this paper, we propose to leverage graph optimization and loop closure detection to overcome limitations of unsupervised learning based monocular visual odometry. To this end, we propose a hybrid VO system which combines an unsupervised monocular VO called NeuralBundler with a pose graph optimization back-end. NeuralBundler is a neural network architecture that uses temporal and spatial photometric loss as main supervision and generates a windowed pose graph consists of multi-view 6DoF constraints. We propose a novel pose cycle consistency loss to relieve the tensions in the windowed pose graph, leading to improved performance and robustness. In the back-end, a global pose graph is built from local and loop 6DoF constraints estimated by NeuralBundler and is optimized over SE(3). Empirical evaluation on the KITTI odometry dataset demonstrates that 1) NeuralBundler achieves state-of-the-art performance on unsupervised monocular VO estimation, and 2) our whole approach can achieve efficient loop closing and show favorable overall translational accuracy compared to established monocular SLAM systems.
29.Show, Translate and Tell pdf
Humans have an incredible ability to process and understand information from multiple sources such as images, video, text, and speech. Recent success of deep neural networks has enabled us to develop algorithms which give machines the ability to understand and interpret this information. There is a need to both broaden their applicability and develop methods which correlate visual information along with semantic content. We propose a unified model which jointly trains on images and captions, and learns to generate new captions given either an image or a caption query. We evaluate our model on three different tasks namely cross-modal retrieval, image captioning, and sentence paraphrasing. Our model gains insight into cross-modal vector embeddings, generalizes well on multiple tasks and is competitive to state of the art methods on retrieval.
30.Distance Preserving Grid Layouts pdf
Distance preserving visualization techniques have emerged as one of the fundamental tools for data analysis. One example are the techniques that arrange data instances into two-dimensional grids so that the pairwise distances among the instances are preserved into the produced layouts. Currently, the state-of-the-art approaches produce such grids by solving assignment problems or using permutations to optimize cost functions. Although precise, such strategies are computationally expensive, limited to small datasets or being dependent on specialized hardware to speed up the process. In this paper, we present a new technique, called Distance-preserving Grid (DGrid), that employs a binary space partitioning process in combination with multidimensional projections to create orthogonal regular grid layouts. Our results show that DGrid is as precise as the existing state-of-the-art techniques whereas requiring only a fraction of the running time and computational resources.
31.Graph Hierarchical Convolutional Recurrent Neural Network (GHCRNN) for Vehicle Condition Prediction pdf
The prediction of urban vehicle flow and speed can greatly facilitate people's travel, and also can provide reasonable advice for the decision-making of relevant government departments. However, due to the spatial, temporal and hierarchy of vehicle flow and many influencing factors such as weather, it is difficult to prediction. Most of the existing research methods are to extract spatial structure information on the road network and extract time series information from the historical data. However, when extracting spatial features, these methods have higher time and space complexity, and incorporate a lot of noise. It is difficult to apply on large graphs, and only considers the influence of surrounding connected road nodes on the central node, ignoring a very important hierarchical relationship, namely, similar information of similar node features and road network structures. In response to these problems, this paper proposes the Graph Hierarchical Convolutional Recurrent Neural Network (GHCRNN) model. The model uses GCN (Graph Convolutional Networks) to extract spatial feature, GRU (Gated Recurrent Units) to extract temporal feature, and uses the learnable Pooling to extract hierarchical information, eliminate redundant information and reduce complexity. Applying this model to the vehicle flow and speed data of Shenzhen and Los Angeles has been well verified, and the time and memory consumption are effectively reduced under the compared precision.
32.Mixture Modeling of Global Shape Priors and Autoencoding Local Intensity Priors for Left Atrium Segmentation pdf
Difficult image segmentation problems, for instance left atrium MRI, can be addressed by incorporating shape priors to find solutions that are consistent with known objects. Nonetheless, a single multivariate Gaussian is not an adequate model in cases with significant nonlinear shape variation or where the prior distribution is multimodal. Nonparametric density estimation is more general, but has a ravenous appetite for training samples and poses serious challenges in optimization, especially in high dimensional spaces. Here, we propose a maximum-a-posteriori formulation that relies on a generative image model by incorporating both local intensity and global shape priors. We use deep autoencoders to capture the complex intensity distribution while avoiding the careful selection of hand-crafted features. We formulate the shape prior as a mixture of Gaussians and learn the corresponding parameters in a high-dimensional shape space rather than pre-projecting onto a low-dimensional subspace. In segmentation, we treat the identity of the mixture component as a latent variable and marginalize it within a generalized expectation-maximization framework. We present a conditional maximization-based scheme that alternates between a closed-form solution for component-specific shape parameters that provides a global update-based optimization strategy, and an intensity-based energy minimization that translates the global notion of a nonlinear shape prior into a set of local penalties. We demonstrate our approach on the left atrial segmentation from gadolinium-enhanced MRI, which is useful in quantifying the atrial geometry in patients with atrial fibrillation.
33.Conditional GANs For Painting Generation pdf
We examined the use of modern Generative Adversarial Nets to generate novel images of oil paintings using the Painter By Numbers dataset. We implemented Spectral Normalization GAN (SN-GAN) and Spectral Normalization GAN with Gradient Penalty, and compared their outputs to a Deep Convolutional GAN. Visually, and quantitatively according to the Sliced Wasserstein Distance metric, we determined that the SN-GAN produced paintings that were most comparable to our training dataset. We then performed a series of experiments to add supervised conditioning to SN-GAN, the culmination of which is what we believe to be a novel architecture that can generate face paintings with user-specified characteristics.
34.Hyperspectral Image Classification with Deep Metric Learning and Conditional Random Field pdf
To improve the classification performance in the context of hyperspectral image processing, many works have been developed based on two common strategies, namely the spatial-spectral information integration and the utilization of neural networks. However, both strategies typically require more training data than the classical algorithms, aggregating the shortage of labeled samples. In this paper, we propose a novel framework that organically combines an existing spectrum-based deep metric learning model and the conditional random field algorithm. The deep metric learning model is supervised by center loss, and is used to produce spectrum-based features that gather more tightly within classes in Euclidean space. The conditional random field with Gaussian edge potentials, which is firstly proposed for image segmentation problem, is utilized to jointly account for both the geometry distance of two pixels and the Euclidean distance between their corresponding features extracted by the deep metric learning model. The final predictions are given by the conditional random field. Generally, the proposed framework is trained by spectra pixels at the deep metric learning stage, and utilizes the half handcrafted spatial features at the conditional random field stage. This settlement alleviates the shortage of training data to some extent. Experiments on two real hyperspectral images demonstrate the advantages of the proposed method in terms of both classification accuracy and computation cost.
35.Unpaired image denoising using a generative adversarial network in X-ray CT pdf
This paper proposes a deep learning-based denoising method for noisy low-dose computerized tomography (CT) images in the absence of paired training data. The proposed method uses a fidelity-embedded generative adversarial network (GAN) to learn a denoising function from unpaired training data of low-dose CT (LDCT) and standard-dose CT (SDCT) images, where the denoising function is the optimal generator in the GAN framework. Given an optimal discriminator in the GAN, the generator is optimized by minimizing a weighted sum of two losses: the Kullback-Leibler divergence between an SDCT data distribution and a generated distribution, and the
$\ell_2$ loss between the LDCT image and the corresponding generated images (or denoised image). The experimental results show that the proposed deep-learning method with unpaired datasets performs comparably to a method using paired datasets. Clinical experiment was also performed to show the validity of the proposed method for non-Gaussian noise arising in the low-dose X-ray CT.
36.Learning Robust Representations by Projecting Superficial Statistics Out pdf
Despite impressive performance as evaluated on i.i.d. holdout data, deep neural networks depend heavily on superficial statistics of the training data and are liable to break under distribution shift. For example, subtle changes to the background or texture of an image can break a seemingly powerful classifier. Building on previous work on domain generalization, we hope to produce a classifier that will generalize to previously unseen domains, even when domain identifiers are not available during training. This setting is challenging because the model may extract many distribution-specific (superficial) signals together with distribution-agnostic (semantic) signals. To overcome this challenge, we incorporate the gray-level co-occurrence matrix (GLCM) to extract patterns that our prior knowledge suggests are superficial: they are sensitive to the texture but unable to capture the gestalt of an image. Then we introduce two techniques for improving our networks' out-of-sample performance. The first method is built on the reverse gradient method that pushes our model to learn representations from which the GLCM representation is not predictable. The second method is built on the independence introduced by projecting the model's representation onto the subspace orthogonal to GLCM representation's. We test our method on the battery of standard domain generalization data sets and, interestingly, achieve comparable or better performance as compared to other domain generalization methods that explicitly require samples from the target distribution for training.
37.Active Transfer Learning for Persian Offline Signature Verification pdf
Offline Signature Verification (OSV) remains a challenging pattern recognition task, especially in the presence of skilled forgeries that are not available during the training. This challenge is aggravated when there are small labeled training data available but with large intra-personal variations. In this study, we address this issue by employing an active learning approach, which selects the most informative instances to label and therefore reduces the human labeling effort significantly. Our proposed OSV includes three steps: feature learning, active learning, and final verification. We benefit from transfer learning using a pre-trained CNN for feature learning. We also propose SVM-based active learning for each user to separate his genuine signatures from the random forgeries. We finally used the SVMs to verify the authenticity of the questioned signature. We examined our proposed active transfer learning method on UTSig: A Persian offline signature dataset. We achieved near 13% improvement compared to the random selection of instances. Our results also showed 1% improvement over the state-of-the-art method in which a fully supervised setting with five more labeled instances per user was used.
38.Demonstration of Vector Flow Imaging using Convolutional Neural Networks pdf
Synthetic Aperture Vector Flow Imaging (SA-VFI) can visualize complex cardiac and vascular blood flow patterns at high temporal resolution with a large field of view. Convolutional neural networks (CNNs) are commonly used in image and video recognition and classification. However, most recently presented CNNs also allow for making per-pixel predictions as needed in optical flow velocimetry. To our knowledge we demonstrate here for the first time a CNN architecture to produce 2D full flow field predictions from high frame rate SA ultrasound images using supervised learning. The CNN was initially trained using CFD-generated and augmented noiseless SA ultrasound data of a realistic vessel geometry. Subsequently, a mix of noisy simulated and real \textit{in vivo} acquisitions were added to increase the network's robustness. The resulting flow field of the CNN resembled the ground truth accurately with an endpoint-error percentage between 6.5% to 14.5%. Furthermore, when confronted with an unknown geometry of an arterial bifurcation, the CNN was able to predict an accurate flow field indicating its ability for generalization. Remarkably, the CNN also performed well for rotational flows, which usually requires advanced, computationally intensive VFI methods. We have demonstrated that convolutional neural networks can be used to estimate complex multidirectional flow.
39.Object tracking in video signals using Compressive Sensing pdf
Reducing the number of pixels in video signals while maintaining quality needed for recovering the trace of an object using Compressive Sensing is main subject of this work. Quality of frames, from video that contains moving object, are gradually reduced by keeping different number of pixels in each iteration, going from 45% all the way to 1%. Using algorithm for tracing object, results were satisfactory and showed mere changes in trajectory graphs, obtained from original and reconstructed videos.
40.Learning Representations from Persian Handwriting for Offline Signature Verification, a Deep Transfer Learning Approach pdf
Offline Signature Verification (OSV) is a challenging pattern recognition task, especially when it is expected to generalize well on the skilled forgeries that are not available during the training. Its challenges also include small training sample and large intra-class variations. Considering the limitations, we suggest a novel transfer learning approach from Persian handwriting domain to multi-language OSV domain. We train two Residual CNNs on the source domain separately based on two different tasks of word classification and writer identification. Since identifying a person signature resembles identifying ones handwriting, it seems perfectly convenient to use handwriting for the feature learning phase. The learned representation on the more varied and plentiful handwriting dataset can compensate for the lack of training data in the original task, i.e. OSV, without sacrificing the generalizability. Our proposed OSV system includes two steps: learning representation and verification of the input signature. For the first step, the signature images are fed into the trained Residual CNNs. The output representations are then used to train SVMs for the verification. We test our OSV system on three different signature datasets, including MCYT (a Spanish signature dataset), UTSig (a Persian one) and GPDS-Synthetic (an artificial dataset). On UT-SIG, we achieved 9.80% Equal Error Rate (EER) which showed substantial improvement over the best EER in the literature, 17.45%. Our proposed method surpassed state-of-the-arts by 6% on GPDS-Synthetic, achieving 6.81%. On MCYT, EER of 3.98% was obtained which is comparable to the best previously reported results.
41.SuperTML: Two-Dimensional Word Embedding and Transfer Learning Using ImageNet Pretrained CNN Models for the Classifications on Tabular Data pdf
Tabular data is the most commonly used form of data in industry. Gradient Boosting Trees, Support Vector Machine, Random Forest, and Logistic Regression are typically used for classification tasks on tabular data. DNN models using categorical embeddings are also applied in this task, but all attempts thus far have used one-dimensional embeddings. The recent work of Super Characters method using two-dimensional word embeddings achieved the state of art result in text classification tasks, showcasing the promise of this new approach. In this paper, we propose the SuperTML method, which borrows the idea of Super Characters method and two-dimensional embeddings to address the problem of classification on tabular data. For each input of tabular data, the features are first projected into two-dimensional embeddings like an image, and then this image is fed into fine-tuned two-dimensional CNN models for classification. Experimental results have shown that the proposed SuperTML method had achieved state-of-the-art results on both large and small datasets.
42.Application-level Studies of Cellular Neural Network-based Hardware Accelerators pdf
As cost and performance benefits associated with Moore's Law scaling slow, researchers are studying alternative architectures (e.g., based on analog and/or spiking circuits) and/or computational models (e.g., convolutional and recurrent neural networks) to perform application-level tasks faster, more energy efficiently, and/or more accurately. We investigate cellular neural network (CeNN)-based co-processors at the application-level for these metrics. While it is well-known that CeNNs can be well-suited for spatio-temporal information processing, few (if any) studies have quantified the energy/delay/accuracy of a CeNN-friendly algorithm and compared the CeNN-based approach to the best von Neumann algorithm at the application level. We present an evaluation framework for such studies. As a case study, a CeNN-friendly target-tracking algorithm was developed and mapped to an array architecture developed in conjunction with the algorithm. We compare the energy, delay, and accuracy of our architecture/algorithm (assuming all overheads) to the most accurate von Neumann algorithm (Struck). Von Neumann CPU data is measured on an Intel i5 chip. The CeNN approach is capable of matching the accuracy of Struck, and can offer approximately 1000x improvements in energy-delay product.
43.TinBiNN: Tiny Binarized Neural Network Overlay in about 5,000 4-LUTs and 5mW pdf
Reduced-precision arithmetic improves the size, cost, power and performance of neural networks in digital logic. In convolutional neural networks, the use of 1b weights can achieve state-of-the-art error rates while eliminating multiplication, reducing storage and improving power efficiency. The BinaryConnect binary-weighted system, for example, achieves 9.9% error using floating-point activations on the CIFAR-10 dataset. In this paper, we introduce TinBiNN, a lightweight vector processor overlay for accelerating inference computations with 1b weights and 8b activations. The overlay is very small -- it uses about 5,000 4-input LUTs and fits into a low cost iCE40 UltraPlus FPGA from Lattice Semiconductor. To show this can be useful, we build two embedded 'person detector' systems by shrinking the original BinaryConnect network. The first is a 10-category classifier with a 89% smaller network that runs in 1,315ms and achieves 13.6% error. The other is a 1-category classifier that is even smaller, runs in 195ms, and has only 0.4% error. In both classifiers, the error can be attributed entirely to training and not reduced precision.
44.Technically correct visualization of biological microscopic experiments pdf
The most realistic images that reflect native cellular and intracellular structure and behavior can be obtained only using brightfield microscopy. At high-intensity pulsing LED illumination, we captured the primary 12-bit-per-channel (bpc) signal from an observed sample using a brightfield microscope equipped with a high-resolution (4872x3248) camera. In order to suppress image distortions arising from light passing through the whole microscope optical path, from camera sensor defects and geometrical peculiarities of sensor sensitivity, these uncompressed 12-bpc images underwent a kind of correction after simultaneous calibration of all the parts of the experimental arrangement. Moreover, the final corrected images (from biological experiments) show the number of photons reaching each camera pixel and can be visualized in 8-bpc intensity depth after the Least Information Loss compression (Stys et al., Lect. Notes Bioinform. 9656, 2016).
45.MFAS: Multimodal Fusion Architecture Search pdf
We tackle the problem of finding good architectures for multimodal classification problems. We propose a novel and generic search space that spans a large number of possible fusion architectures. In order to find an optimal architecture for a given dataset in the proposed search space, we leverage an efficient sequential model-based exploration approach that is tailored for the problem. We demonstrate the value of posing multimodal fusion as a neural architecture search problem by extensive experimentation on a toy dataset and two other real multimodal datasets. We discover fusion architectures that exhibit state-of-the-art performance for problems with different domain and dataset size, including the NTU RGB+D dataset, the largest multi-modal action recognition dataset available.
46.Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim pdf
NVDLA is an open-source deep neural network (DNN) accelerator which has received a lot of attention by the community since its introduction by Nvidia. It is a full-featured hardware IP and can serve as a good reference for conducting research and development of SoCs with integrated accelerators. However, an expensive FPGA board is required to do experiments with this IP in a real SoC. Moreover, since NVDLA is clocked at a lower frequency on an FPGA, it would be hard to do accurate performance analysis with such a setup. To overcome these limitations, we integrate NVDLA into a real RISC-V SoC on the Amazon could FPGA using FireSim, a cycle-exact FPGA-accelerated simulator. We then evaluate the performance of NVDLA by running YOLOv3 object-detection algorithm. Our results show that NVDLA can sustain 7.5 fps when running YOLOv3. We further analyze the performance by showing that sharing the last-level cache with NVDLA can result in up to 1.56x speedup. We then identify that sharing the memory system with the accelerator can result in unpredictable execution time for the real-time tasks running on this platform. We believe this is an important issue that must be addressed in order for on-chip DNN accelerators to be incorporated in real-time embedded systems.
47.Stitching Videos from a Fisheye Lens Camera and a Wide-Angle Lens Camera for Telepresence Robots pdf
Many telepresence robots are equipped with a forward-facing camera for video communication and a downward-facing camera for navigation. In this paper, we propose to stitch videos from the FF-camera with a wide-angle lens and the DF-camera with a fisheye lens for telepresence robots. We aim at providing more compact and efficient visual feedback for the user interface of telepresence robots with user-friendly interactive experiences. To this end, we present a multi-homography-based video stitching method which stitches videos from a wide-angle camera and a fisheye camera. The method consists of video image alignment, seam cutting, and image blending. We directly align the wide-angle video image and the fisheye video image based on the multi-homography alignment without calibration, distortion correction, and unwarping procedures. Thus, we can obtain a stitched video with shape preservation in the non-overlapping regions and alignment in the overlapping area for telepresence. To alleviate ghosting effects caused by moving objects and/or moving cameras during telepresence robot driving, an optimal seam is found for aligned video composition, and the optimal seam will be updated in subsequent frames, considering spatial and temporal coherence. The final stitched video is created by image blending based on the optimal seam. We conducted a user study to demonstrate the effectiveness of our method and the superiority of telepresence robots with a stitched video as visual feedback.
48.Learning to Generate the "Unseen" via Part Synthesis and Composition pdf
Data-driven generative modeling has made remarkable progress by leveraging the power of deep neural networks. A reoccurring challenge is how to sample a rich variety of data from the entire target distribution, rather than only from the distribution of the training data. In other words, we would like the generative model to go beyond the observed training samples and learn to also generate "unseen" data. In our work, we present a generative neural network for shapes that is based on a part-based prior, where the key idea is for the network to synthesize shapes by varying both the shape parts and their compositions. Treating a shape not as an unstructured whole, but as a (re-)composable set of deformable parts, adds a combinatorial dimension to the generative process to enrich the diversity of the output, encouraging the generator to venture more into the "unseen". We show that our part-based model generates richer variety of feasible shapes compared with a baseline generative model. To this end, we introduce two quantitative metrics to evaluate the ingenuity of the generative model and assess how well generated data covers both the training data and unseen data from the same target distribution.