Skip to content
This repository has been archived by the owner on Apr 21, 2024. It is now read-only.

Latest commit

 

History

History
73 lines (73 loc) · 44.6 KB

20190523.md

File metadata and controls

73 lines (73 loc) · 44.6 KB

ArXiv cs.CV --Thu, 23 May 2019

1.Data-Efficient Image Recognition with Contrastive Predictive Coding pdf

Large scale deep learning excels when labeled images are abundant, yet data-efficient learning remains a longstanding challenge. While biological vision is thought to leverage vast amounts of unlabeled data to solve classification problems with limited supervision, computer vision has so far not succeeded in this `semi-supervised' regime. Our work tackles this challenge with Contrastive Predictive Coding, an unsupervised objective which extracts stable structure from still images. The result is a representation which, equipped with a simple linear classifier, separates ImageNet categories better than all competing methods, and surpasses the performance of a fully-supervised AlexNet model. When given a small number of labeled images (as few as 13 per class), this representation retains a strong classification performance, outperforming state-of-the-art semi-supervised methods by 10% Top-5 accuracy and supervised methods by 20%. Finally, we find our unsupervised representation to serve as a useful substrate for image detection on the PASCAL-VOC 2007 dataset, approaching the performance of representations trained with a fully annotated ImageNet dataset. We expect these results to open the door to pipelines that use scalable unsupervised representations as a drop-in replacement for supervised ones for real-world vision tasks where labels are scarce.

2.Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence pdf

Stereo matching and flow estimation are two essential tasks for scene understanding, spatially in 3D and temporally in motion. Existing approaches have been focused on the unsupervised setting due to the limited resource to obtain the large-scale ground truth data. To construct a self-learnable objective, co-related tasks are often linked together to form a joint framework. However, the prior work usually utilizes independent networks for each task, thus not allowing to learn shared feature representations across models. In this paper, we propose a single and principled network to jointly learn spatiotemporal correspondence for stereo matching and flow estimation, with a newly designed geometric connection as the unsupervised signal for temporally adjacent stereo pairs. We show that our method performs favorably against several state-of-the-art baselines for both unsupervised depth and flow estimation on the KITTI benchmark dataset.

3.Dual Active Sampling on Batch-Incremental Active Learning pdf

Recently, Convolutional Neural Networks (CNNs) have shown unprecedented success in the field of computer vision, especially on challenging image classification tasks by relying on a universal approach, i.e., training a deep model on a massive dataset of supervised examples. While unlabeled data are often an abundant resource, collecting a large set of labeled data, on the other hand, are very expensive, which often require considerable human efforts. One way to ease out this is to effectively select and label highly informative instances from a pool of unlabeled data (i.e., active learning). This paper proposed a new method of batch-mode active learning, Dual Active Sampling(DAS), which is based on a simple assumption, if two deep neural networks (DNNs) of the same structure and trained on the same dataset give significantly different output for a given sample, then that particular sample should be picked for additional training. While other state of the art methods in this field usually require intensive computational power or relying on a complicated structure, DAS is simpler to implement and, managed to get improved results on Cifar-10 with preferable computational time compared to the core-set method.

4.Oculum afficit: Ocular Affect Recognition pdf

Recognizing human affect and emotions is a problem that has a wide range of applications within both academia and industry. Affect and emotion recognition within computer vision primarily relies on images of faces. With the prevalence of portable devices (e.g. smart phones and/or smart glasses),acquiring user facial images requires focus, time, and precision. While existing systems work great for full frontal faces, they tend to not work so well with partial faces like those of the operator of the device when under use. Due to this, we propose a methodology in which we can accurately infer the overall affect of a person by looking at the ocular region of an individual.

5.Separating Overlapping Tissue Layers from Microscopy Images pdf

Manual preparation of tissue slices for microscopy imaging can introduce tissue tears and overlaps. Typically, further digital processing algorithms such as registration and 3D reconstruction from tissue image stacks cannot handle images with tissue tear/overlap artifacts, and so such images are usually discarded. In this paper, we propose an imaging model and an algorithm to digitally separate overlapping tissue data of mouse brain images into two layers. We show the correctness of our model and the algorithm by comparing our results with the ground truth.

6.WPU-Net:Boundary learning by using weighted propagation in convolution network pdf

Deep learning has driven great progress in natural and biological image processing. However, in materials science and engineering, there are often some flaws and indistinctions in material microscopic images induced from complex sample preparation, even due to the material itself, hindering the detection of target objects. In this work, we propose WPU-net that redesign the architecture and weighted loss of U-Net to force the network to integrate information from adjacent slices and pay more attention to the topology in this boundary detection task. Then, the WPU-net was applied into a typical material example, i.e., the grain boundary detection of polycrystalline material. Experiments demonstrate that the proposed method achieves promising performance compared to state-of-the-art methods. Besides, we propose a new method for object tracking between adjacent slices, which can effectively reconstruct the 3D structure of the whole material while maintaining relative accuracy.

7.Segmentation-Aware Hyperspectral Image Classification pdf

In this paper, we propose an unified hyperspectral image classification method which takes three-dimensional hyperspectral data cube as an input and produces a classification map. In the proposed method, a deep neural network which uses spectral and spatial information together with residual connections, and pixel affinity network based segmentation-aware superpixels are used together. In the architecture, segmentation-aware superpixels run on the initial classification map of deep residual network, and apply majority voting on obtained results. Experimental results show that our propoped method yields state-of-the-art results in two benchmark datasets. Moreover, we also show that the segmentation-aware superpixels have great contribution to the success of hyperspectral image classification methods in cases where training data is insufficient.

8.Multi-View Large-Scale Bundle Adjustment Method for High-Resolution Satellite Images pdf

Given enough multi-view image corresponding points (also called tie points) and ground control points (GCP), bundle adjustment for high-resolution satellite images is used to refine the orientations or most often used geometric parameters Rational Polynomial Coefficients (RPC) of each satellite image in a unified geodetic framework, which is very critical in many photogrammetry and computer vision applications. However, the growing number of high resolution spaceborne optical sensors has brought two challenges to the bundle adjustment: 1) images come from different satellite cameras may have different imaging dates, viewing angles, resolutions, etc., thus resulting in geometric and radiometric distortions in the bundle adjustment; 2) The large-scale mapping area always corresponds to vast number of bundle adjustment corrections (including RPC bias and object space point coordinates). Due to the limitation of computer memory, it is hard to refine all corrections at the same time. Hence, how to efficiently realize the bundle adjustment in large-scale regions is very important. This paper particularly addresses the multi-view large-scale bundle adjustment problem by two steps: 1) to get robust tie points among different satellite images, we design a multi-view, multi-source tie point matching algorithm based on plane rectification and epipolar constraints, which is able to compensate geometric and local nonlinear radiometric distortions among satellite datasets, and 2) to solve dozens of thousands or even millions of variables bundle adjustment corrections in the large scale bundle adjustment, we use an efficient solution with only a little computer memory. Experiments on in-track and off-track satellite datasets show that the proposed method is capable of computing sub-pixel accuracy bundle adjustment results.

9.Using Orthophoto for Building Boundary Sharpening in the Digital Surface Model pdf

Nowadays dense stereo matching has become one of the dominant tools in 3D reconstruction of urban regions for its low cost and high flexibility in generating dense 3D points. However, state-of-the-art stereo matching algorithms usually apply a semi-global matching (SGM) strategy. This strategy normally assumes the surface geometry pieceswise planar, where a smooth penalty is imposed to deal with non-texture or repeating-texture areas. This on one hand, generates much smooth surface models, while on the other hand, may partially leads to smoothing on depth discontinuities, particularly for fence-shaped regions or densely built areas with narrow streets. To solve this problem, in this work, we propose to use the line segment information extracted from the corresponding orthophoto as a pose-processing tool to sharpen the building boundary of the Digital Surface Model (DSM) generated by SGM. Two methods which are based on graph-cut and plane fitting are proposed and compared. Experimental results on several satellite datasets with ground truth show the robustness and effectiveness of the proposed DSM sharpening method.

10.A Comparison of Stereo-Matching Cost between Convolutional Neural Network and Census for Satellite Images pdf

Stereo dense image matching can be categorized to low-level feature based matching and deep feature based matching according to their matching cost metrics. Census has been proofed to be one of the most efficient low-level feature based matching methods, while fast Convolutional Neural Network (fst-CNN), as a deep feature based method, has small computing time and is robust for satellite images. Thus, a comparison between fst-CNN and census is critical for further studies in stereo dense image matching. This paper used cost function of fst-CNN and census to do stereo matching, then utilized semi-global matching method to obtain optimized disparity images. Those images are used to produce digital surface model to compare with ground truth points. It addresses that fstCNN performs better than census in the aspect of absolute matching accuracy, histogram of error distribution and matching completeness, but these two algorithms still performs in the same order of magnitude.

11.A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis pdf

Automatic analysis of scanned historical documents comprises a wide range of image analysis tasks, which are often challenging for machine learning due to a lack of human-annotated learning samples. With the advent of deep neural networks, a promising way to cope with the lack of training data is to pre-train models on images from a different domain and then fine-tune them on historical documents. In the current research, a typical example of such cross-domain transfer learning is the use of neural networks that have been pre-trained on the ImageNet database for object recognition. It remains a mostly open question whether or not this pre-training helps to analyse historical documents, which have fundamentally different image properties when compared with ImageNet. In this paper, we present a comprehensive empirical survey on the effect of ImageNet pre-training for diverse historical document analysis tasks, including character recognition, style classification, manuscript dating, semantic segmentation, and content-based retrieval. While we obtain mixed results for semantic segmentation at pixel-level, we observe a clear trend across different network architectures that ImageNet pre-training has a positive effect on classification as well as content-based retrieval.

12.Automated Segmentation for Hyperdense Middle Cerebral Artery Sign of Acute Ischemic Stroke on Non-Contrast CT Images pdf

The hyperdense middle cerebral artery (MCA) dot sign has been reported as an important factor in the diagnosis of acute ischemic stroke due to large vessel occlusion. Interpreting the initial CT brain scan in these patients requires high level of expertise, and has high inter-observer variability. An automated computerized interpretation of the urgent CT brain image, with an emphasis to pick up early signs of ischemic stroke will facilitate early patient diagnosis, triage, and shorten the door-to-revascularization time for these group of patients. In this paper, we present an automated detection method of segmenting the MCA dot sign on non-contrast CT brain image scans based on powerful deep learning technique.

13.End-to-End Learned Random Walker for Seeded Image Segmentation pdf

We present an end-to-end learned algorithm for seeded segmentation. Our method is based on the Random Walker algorithm, where we predict the edge weights of the underlying graph using a convolutional neural network. This can be interpreted as learning context-dependent diffusivities for a linear diffusion process. Besides calculating the exact gradient for optimizing these diffusivities, we also propose simplifications that sparsely sample the gradient and still yield competitive results. The proposed method achieves the currently best results on a seeded version of the CREMI neuron segmentation challenge.

14.Robust Motion Segmentation from Pairwise Matches pdf

In this paper we address a classification problem that has not been considered before, namely motion segmentation given pairwise matches only. Our contribution to this unexplored task is a novel formulation of motion segmentation as a two-step process. First, motion segmentation is performed on image pairs independently. Secondly, we combine independent pairwise segmentation results in a robust way into the final globally consistent segmentation. Our approach is inspired by the success of averaging methods. We demonstrate in simulated as well as in real experiments that our method is very effective in reducing the errors in the pairwise motion segmentation and can cope with large number of mismatches.

15.What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention pdf

Egocentric action anticipation consists in understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle the problem proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs to 1) summarize the past, and 2) formulate predictions about the future. The input video is processed considering three complimentary modalities: appearance (RGB), motion (optical flow) and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism which learns to weigh modalities in an adaptive fashion. Extensive evaluations on two large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-KITCHENS dataset including more than 2500 actions, and generalizes to EGTEA Gaze+. Our approach is also shown to generalize to the tasks of early action recognition and action recognition. At the moment of submission, our method is ranked first in the leaderboard of the EPIC-KITCHENS egocentric action anticipation challenge.

16.Spatial Sampling Network for Fast Scene Understanding pdf

We propose a network architecture to perform efficient scene understanding. This work presents three main novelties: the first is an Improved Guided Upsampling Module that can replace in toto the decoder part in common semantic segmentation networks. Our second contribution is the introduction of a new module based on spatial sampling to perform Instance Segmentation. It provides a very fast instance segmentation, needing only thresholding as post-processing step at inference time. Finally, we propose a novel efficient network design that includes the new modules and test it against different datasets for outdoor scene understanding. To our knowledge, our network is one of the themost efficient architectures for scene understanding published to date, furthermore being 8.6% more accurate than the fastest competitor on semantic segmentation and almost five times faster than the most efficient network for instance segmentation.

17.PEPSI++: Fast and Lightweight Network for Image Inpainting pdf

Generative adversarial network (GAN)-based image inpainting methods which utilize coarse-to-fine network with a contextual attention module (CAM) have shown remarkable performance. However, they require numerous computational resources such as convolution operations and network parameters due to two stacked generative networks, which results in a low speed. To address this problem, we propose a novel network structure called PEPSI: parallel extended-decoder path for semantic inpainting network, which aims at not only reducing hardware costs but also improving the inpainting performance. The PEPSI consists of a single shared encoding network and parallel decoding networks with coarse and inpainting paths. The coarse path generates a preliminary inpainting result to train the encoding network for prediction of features for the CAM. At the same time, the inpainting path results in higher inpainting quality with refined features reconstructed using the CAM. In addition, we propose a Diet-PEPSI which significantly reduces the network parameters while maintaining the performance. In the proposed method, we present a Diet-PEPSI unit (DPU) which effectively aggregates the global contextual information with a small number of parameters. Extensive experiments and comparisons with state-of-the-art image inpainting methods demonstrate that both PEPSI and Diet-PEPSI achieve significant improvements in qualitative scores and reduced computation cost.

18.Underwater Color Restoration Using U-Net Denoising Autoencoder pdf

Visual inspection of underwater structures by vehicles, e.g. remotely operated vehicles (ROVs), plays an important role in scientific, military, and commercial sectors. However, the automatic extraction of information using software tools is hindered by the characteristics of water which degrade the quality of captured videos. As a contribution for restoring the color of underwater images, Underwater Denoising Autoencoder (UDAE) model is developed using a denoising autoencoder with U-Net architecture. The proposed network takes into consideration the accuracy and the computation cost to enable real-time implementation on underwater visual tasks using end-to-end autoencoder network. Underwater vehicles perception is improved by reconstructing captured frames; hence obtaining better performance in underwater tasks. Related learning methods use generative adversarial networks (GANs) to generate color corrected underwater images, and to our knowledge this paper is the first to deal with a single autoencoder capable of producing same or better results. Moreover, image pairs are constructed for training the proposed network, where it is hard to obtain such dataset from underwater scenery. At the end, the proposed model is compared to a state-of-the-art method.

19.Attributes Guided Feature Learning for Vehicle Re-identification pdf

Vehicle Re-ID has recently attracted enthusiastic attention due to its potential applications in smart city and urban surveillance. However, it suffers from large intra-class variation caused by view variations and illumination changes, and inter-class similarity especially for different identities with the similar appearance. To handle these issues, in this paper, we propose a novel deep network architecture, which guided by meaningful attributes including camera views, vehicle types and colors for vehicle Re-ID. In particular, our network is end-to-end trained and contains three subnetworks of deep features embedded by the corresponding attributes (i.e., camera view, vehicle type and vehicle color). Moreover, to overcome the shortcomings of limited vehicle images of different views, we design a view-specified generative adversarial network to generate the multi-view vehicle images. For network training, we annotate the view labels on the VeRi-776 dataset. Note that one can directly adopt the pre-trained view (as well as type and color) subnetwork on the other datasets with only ID information, which demonstrates the generalization of our model. Extensive experiments on the benchmark datasets VeRi-776 and VehicleID suggest that the proposed approach achieves the promising performance and yields to a new state-of-the-art for vehicle Re-ID.

20.LapTool-Net: A Contextual Detector of Surgical Tools in Laparoscopic Videos Based on Recurrent Convolutional Neural Networks pdf

We propose a new multilabel classifier, called LapTool-Net to detect the presence of surgical tools in each frame of a laparoscopic video. The novelty of LapTool-Net is the exploitation of the correlation among the usage of different tools and, the tools and tasks - namely, the context of the tools' usage. Towards this goal, the pattern in the co-occurrence of the tools is utilized for designing a decision policy for a multilabel classifier based on a Recurrent Convolutional Neural Network (RCNN) architecture to simultaneously extract the spatio-temporal features. In contrast to the previous multilabel classification methods, the RCNN and the decision model are trained in an end-to-end manner using a multitask learning scheme. To overcome the high imbalance and avoid overfitting caused by the lack of variety in the training data, a high down-sampling rate is chosen based on the more frequent combinations. Furthermore, at the post-processing step, the prediction for all the frames of a video are corrected by designing a bi-directional RNN to model the long-term task's order. LapTool-net was trained using a publicly available dataset of laparoscopic cholecystectomy. The results show LapTool-Net outperforms existing methods significantly, even while using fewer training samples and a shallower architecture.

21.Segmentation-Aware Image Denoising without Knowing True Segmentation pdf

Several recent works discussed application-driven image restoration neural networks, which are capable of not only removing noise in images but also preserving their semantic-aware details, making them suitable for various high-level computer vision tasks as the pre-processing step. However, such approaches require extra annotations for their high-level vision tasks, in order to train the joint pipeline using hybrid losses. The availability of those annotations is yet often limited to a few image sets, potentially restricting the general applicability of these methods to denoising more unseen and unannotated images. Motivated by that, we propose a segmentation-aware image denoising model dubbed U-SAID, based on a novel unsupervised approach with a pixel-wise uncertainty loss. U-SAID does not need any ground-truth segmentation map, and thus can be applied to any image dataset. It generates denoised images with comparable or even better quality, and the denoised results show stronger robustness for subsequent semantic segmentation tasks, when compared to either its supervised counterpart or classical "application-agnostic" denoisers. Moreover, we demonstrate the superior generalizability of U-SAID in three-folds, by plugging its "universal" denoiser without fine-tuning: (1) denoising unseen types of images; (2) denoising as pre-processing for segmenting unseen noisy images; and (3) denoising for unseen high-level tasks. Extensive experiments demonstrate the effectiveness, robustness and generalizability of the proposed U-SAID over various popular image sets.

22.Domain Adaptation for Vehicle Detection from Bird's Eye View LiDAR Point Cloud Data pdf

Point cloud data from 3D LiDAR sensors are one of the most crucial sensor modalities for versatile safety-critical applications such as self-driving vehicles. Since the annotations of point cloud data is an expensive and time-consuming process, therefore recently the utilisation of simulated environments and 3D LiDAR sensors for this task started to get some popularity. With simulated sensors and environments, the process for obtaining an annotated synthetic point cloud data became much easier. However, the generated synthetic point cloud data are still missing the artefacts usually exist in point cloud data from real 3D LiDAR sensors. As a result, the performance of the trained models on this data for perception tasks when tested on real point cloud data is degraded due to the domain shift between simulated and real environments. Thus, in this work, we are proposing a domain adaptation framework for bridging this gap between synthetic and real point cloud data. Our proposed framework is based on the deep cycle-consistent generative adversarial networks (CycleGAN) architecture. We have evaluated the performance of our proposed framework on the task of vehicle detection from a bird's eye view (BEV) point cloud images coming from real 3D LiDAR sensors. The framework has shown competitive results with an improvement of more than 7% in average precision score over other baseline approaches when tested on real BEV point cloud images.

23.Learning Fully Dense Neural Networks for Image Semantic Segmentation pdf

Semantic segmentation is pixel-wise classification which retains critical spatial information. The "feature map reuse" has been commonly adopted in CNN based approaches to take advantage of feature maps in the early layers for the later spatial reconstruction. Along this direction, we go a step further by proposing a fully dense neural network with an encoder-decoder structure that we abbreviate as FDNet. For each stage in the decoder module, feature maps of all the previous blocks are adaptively aggregated to feed-forward as input. On the one hand, it reconstructs the spatial boundaries accurately. On the other hand, it learns more efficiently with the more efficient gradient backpropagation. In addition, we propose the boundary-aware loss function to focus more attention on the pixels near the boundary, which boosts the "hard examples" labeling. We have demonstrated the best performance of the FDNet on the two benchmark datasets: PASCAL VOC 2012, NYUDv2 over previous works when not considering training on other datasets.

24.A Neural-Symbolic Architecture for Inverse Graphics Improved by Lifelong Meta-Learning pdf

We follow the idea of formulating vision as inverse graphics and propose a new type of element for this task, a neural-symbolic capsule. It is capable of de-rendering a scene into semantic information feed-forward, as well as rendering it feed-backward. An initial set of capsules for graphical primitives is obtained from a generative grammar and connected into a full capsule network. Lifelong meta-learning continuously improves this network's detection capabilities by adding capsules for new and more complex objects it detects in a scene using few-shot learning. Preliminary results demonstrate the potential of our novel approach.

25.Looking to Relations for Future Trajectory Forecast pdf

Inferring relational behavior between road users as well as road users and their surrounding physical space is an important step toward effective modeling and prediction of navigation strategies adopted by participants in road scenes. To this end, we propose a relation-aware framework for future trajectory forecast. Our system aims to infer relational information from the interactions of road users with each other and with the environment. The first module involves visual encoding of spatio-temporal features, which captures human-human and human-space interactions over time. The following module explicitly constructs pair-wise relations from spatio-temporal interactions and identifies more descriptive relations that highly influence future motion of the target road user by considering its past trajectory. The resulting relational features are used to forecast future locations of the target, in the form of heatmaps with an additional guidance of spatial dependencies and consideration of the uncertainty. Extensive evaluations on a public benchmark dataset demonstrate the robustness and efficacy of the proposed framework as observed by performances higher than the state-of-the-art methods.

26.Efficient Plane-Based Optimization of Geometry and Texture for Indoor RGB-D Reconstruction pdf

We propose a novel approach to reconstruct RGB-D indoor scene based on plane primitives. Our approach takes as input a RGB-D sequence and a dense coarse mesh reconstructed from it, and generates a lightweight, low-polygonal mesh with clear face textures and sharp features without losing geometry details from the original scene. Compared to existing methods which only cover large planar regions in the scene, our method builds the entire scene by adaptive planes without losing geometry details and also preserves sharp features in the mesh. Experiments show that our method is more efficient to generate textured mesh from RGB-D data than state-of-the-arts.

27.Semi-Supervised Learning with Scarce Annotations pdf

While semi-supervised learning (SSL) algorithms provide an efficient way to make use of both labelled and unlabelled data, they generally struggle when the number of annotated samples is very small. In this work, we consider the problem of SSL multi-class classification with very few labelled instances. We introduce two key ideas. The first is a simple but effective one: we leverage the power of transfer learning among different tasks and self-supervision to initialize a good representation of the data without making use of any label. The second idea is a new algorithm for SSL that can exploit well such a pre-trained representation.
The algorithm works by alternating two phases, one fitting the labelled points and one fitting the unlabelled ones, with carefully-controlled information flow between them. The benefits are greatly reducing overfitting of the labelled data and avoiding issue with balancing labelled and unlabelled losses during training. We show empirically that this method can successfully train competitive models with as few as 10 labelled data points per class. More in general, we show that the idea of bootstrapping features using self-supervised learning always improves SSL on standard benchmarks. We show that our algorithm works increasingly well compared to other methods when refining from other tasks or datasets.

28.Joint Object and State Recognition using Language Knowledge pdf

The state of an object is an important piece of knowledge in robotics applications. States and objects are intertwined together, meaning that object information can help recognize the state of an image and vice versa. This paper addresses the state identification problem in cooking related images and uses state and object predictions together to improve the classification accuracy of objects and their states from a single image. The pipeline presented in this paper includes a CNN with a double classification layer and the Concept-Net language knowledge graph on top. The language knowledge creates a semantic likelihood between objects and states. The resulting object and state confidences from the deep architecture are used together with object and state relatedness estimates from a language knowledge graph to produce marginal probabilities for objects and states. The marginal probabilities and confidences of objects (or states) are fused together to improve the final object (or state) classification results. Experiments on a dataset of cooking objects show that using a language knowledge graph on top of a deep neural network effectively enhances object and state classification.

29.Borrow from Anywhere: Pseudo Multi-modal Object Detection in Thermal Imagery pdf

Can we improve detection in the thermal domain by borrowing features from rich domains like visual RGB? In this paper, we propose a pseudo-multimodal object detector trained on natural image domain data to help improve the performance of object detection in thermal images. We assume access to a large-scale dataset in the visual RGB domain and relatively smaller dataset (in terms of instances) in the thermal domain, as is common today. We propose the use of well-known image-to-image translation frameworks to generate pseudo-RGB equivalents of a given thermal image and then use a multi-modal architecture for object detection in the thermal image. We show that our framework outperforms existing benchmarks without the explicit need for paired training examples from the two domains. We also show that our framework has the ability to learn with less data from thermal domain when using our approach.

30.Fine-grained Optimization of Deep Neural Networks pdf

In recent studies, several asymptotic upper bounds on generalization errors on deep neural networks (DNNs) are theoretically derived. These bounds are functions of several norms of weights of the DNNs, such as the Frobenius and spectral norms, and they are computed for weights grouped according to either input and output channels of the DNNs. In this work, we conjecture that if we can impose multiple constraints on weights of DNNs to upper bound the norms of the weights, and train the DNNs with these weights, then we can attain empirical generalization errors closer to the derived theoretical bounds, and improve accuracy of the DNNs.
To this end, we pose two problems. First, we aim to obtain weights whose different norms are all upper bounded by a constant number, e.g. 1.0. To achieve these bounds, we propose a two-stage renormalization procedure; (i) normalization of weights according to different norms used in the bounds, and (ii) reparameterization of the normalized weights to set a constant and finite upper bound of their norms. In the second problem, we consider training DNNs with these renormalized weights. To this end, we first propose a strategy to construct joint spaces (manifolds) of weights according to different constraints in DNNs. Next, we propose a fine-grained SGD algorithm (FG-SGD) for optimization on the weight manifolds to train DNNs with assurance of convergence to minima. Experimental results show that image classification accuracy of baseline DNNs can be boosted using FG-SGD on collections of manifolds identified by multiple constraints.

31.Beyond Alternating Updates for Matrix Factorization with Inertial Bregman Proximal Gradient Algorithms pdf

Matrix Factorization is a popular non-convex objective, for which alternating minimization schemes are mostly used. They usually suffer from the major drawback that the solution is biased towards one of the optimization variables. A remedy is non-alternating schemes. However, due to a lack of Lipschitz continuity of the gradient in matrix factorization problems, convergence cannot be guaranteed. A recently developed remedy relies on the concept of Bregman distances, which generalizes the standard Euclidean distance. We exploit this theory by proposing a novel Bregman distance for matrix factorization problems, which, at the same time, allows for simple/closed form update steps. Therefore, for non-alternating schemes, such as the recently introduced Bregman Proximal Gradient (BPG) method and an inertial variant Convex--Concave Inertial BPG (CoCaIn BPG), convergence of the whole sequence to a stationary point is proved for Matrix Factorization. In several experiments, we observe a superior performance of our non-alternating schemes in terms of speed and objective value at the limit point.

32.Joint Information Preservation for Heterogeneous Domain Adaptation pdf

Domain adaptation aims to assist the modeling tasks of the target domain with knowledge of the source domain. The two domains often lie in different feature spaces due to diverse data collection methods, which leads to the more challenging task of heterogeneous domain adaptation (HDA). A core issue of HDA is how to preserve the information of the original data during adaptation. In this paper, we propose a joint information preservation method to deal with the problem. The method preserves the information of the original data from two aspects. On the one hand, although paired samples often exist between the two domains of the HDA, current algorithms do not utilize such information sufficiently. The proposed method preserves the paired information by maximizing the correlation of the paired samples in the shared subspace. On the other hand, the proposed method improves the strategy of preserving the structural information of the original data, where the local and global structural information are preserved simultaneously. Finally, the joint information preservation is integrated by distribution matching. Experimental results show the superiority of the proposed method over the state-of-the-art HDA algorithms.

33.Automated Pupillary Light Reflex Test on a Portable Platform pdf

In this paper, we introduce a portable eye imaging device denoted as lab-on-a-headset, which can automatically perform a swinging flashlight test. We utilized this device in a clinical study to obtain high-resolution recordings of eyes while they are exposed to a varying light stimuli. Half of the participants had relative afferent pupillary defect (RAPD) while the other half was a control group. In case of positive RAPD, patients pupils constrict less or do not constrict when light stimuli swings from the unaffected eye to the affected eye. To automatically diagnose RAPD, we propose an algorithm based on pupil localization, pupil size measurement, and pupil size comparison of right and left eye during the light reflex test. We validate the algorithmic performance over a dataset obtained from 22 subjects and show that proposed algorithm can achieve a sensitivity of 93.8% and a specificity of 87.5%.

34.DoPa: A Fast and Comprehensive CNN Defense Methodology against Physical Adversarial Attacks pdf

Recently, Convolutional Neural Networks (CNNs) demonstrate a considerable vulnerability to adversarial attacks, which can be easily misled by adversarial perturbations. With more aggressive methods proposed, adversarial attacks can be also applied to the physical world, causing practical issues to various CNN powered applications. Most existing defense works for physical adversarial attacks only focus on eliminating explicit perturbation patterns from inputs, ignoring interpretation and solution to CNN's intrinsic vulnerability. Therefore, most of them depend on considerable data processing costs and lack the expected versatility to different attacks. In this paper, we propose DoPa - a fast and comprehensive CNN defense methodology against physical adversarial attacks. By interpreting the CNN's vulnerability, we find that non-semantic adversarial perturbations can activate CNN with significantly abnormal activations and even overwhelm other semantic input patterns' activations. We improve the CNN recognition process by adding a self-verification stage to analyze the semantics of distinguished activation patterns with only one CNN inference involved. Based on the detection result, we further propose a data recovery methodology to defend the physical adversarial attacks. We apply such detection and defense methodology into both image and audio CNN recognition process. Experiments show that our methodology can achieve an average rate of 90% success for attack detection and 81% accuracy recovery for image physical adversarial attacks. Also, the proposed defense method can achieve a 92% detection successful rate and 77.5% accuracy recovery for audio recognition applications. Moreover, the proposed defense methods are at most 2.3x faster compared to the state-of-the-art defense methods, making them feasible to resource-constrained platforms, such as mobile devices.

35.Large-scale Distance Metric Learning with Uncertainty pdf

Distance metric learning (DML) has been studied extensively in the past decades for its superior performance with distance-based algorithms. Most of the existing methods propose to learn a distance metric with pairwise or triplet constraints. However, the number of constraints is quadratic or even cubic in the number of the original examples, which makes it challenging for DML to handle the large-scale data set. Besides, the real-world data may contain various uncertainty, especially for the image data. The uncertainty can mislead the learning procedure and cause the performance degradation. By investigating the image data, we find that the original data can be observed from a small set of clean latent examples with different distortions. In this work, we propose the margin preserving metric learning framework to learn the distance metric and latent examples simultaneously. By leveraging the ideal properties of latent examples, the training efficiency can be improved significantly while the learned metric also becomes robust to the uncertainty in the original data. Furthermore, we can show that the metric is learned from latent examples only, but it can preserve the large margin property even for the original data. The empirical study on the benchmark image data sets demonstrates the efficacy and efficiency of the proposed method.

36.Robust Optimization over Multiple Domains pdf

In this work, we study the problem of learning a single model for multiple domains. Unlike the conventional machine learning scenario where each domain can have the corresponding model, multiple domains (i.e., applications/users) may share the same machine learning model due to maintenance loads in cloud computing services. For example, a digit-recognition model should be applicable to hand-written digits, house numbers, car plates, etc. Therefore, an ideal model for cloud computing has to perform well at each applicable domain. To address this new challenge from cloud computing, we develop a framework of robust optimization over multiple domains. In lieu of minimizing the empirical risk, we aim to learn a model optimized to the adversarial distribution over multiple domains. Hence, we propose to learn the model and the adversarial distribution simultaneously with the stochastic algorithm for efficiency. Theoretically, we analyze the convergence rate for convex and non-convex models. To our best knowledge, we first study the convergence rate of learning a robust non-convex model with a practical algorithm. Furthermore, we demonstrate that the robustness of the framework and the convergence rate can be further enhanced by appropriate regularizers over the adversarial distribution. The empirical study on real-world fine-grained visual categorization and digits recognition tasks verifies the effectiveness and efficiency of the proposed framework.