1.VisualBERT: A Simple and Performant Baseline for Vision and Language ⬇️
We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and the image regions corresponding to their arguments.
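Concretely, the single-stream design amounts to feeding text-token embeddings and projected detector region features through one Transformer encoder so that self-attention can align them. Below is a minimal sketch of that input scheme, assuming pre-extracted region features; the dimensions, layer count, and class name are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    """Sketch of a VisualBERT-style joint text+region Transformer."""
    def __init__(self, vocab_size=30522, hidden=768, region_dim=2048, layers=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.img_proj = nn.Linear(region_dim, hidden)   # project detector features
        self.seg_emb = nn.Embedding(2, hidden)          # text vs. visual segment
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, token_ids, region_feats):
        t = self.tok_emb(token_ids) + self.seg_emb.weight[0]
        v = self.img_proj(region_feats) + self.seg_emb.weight[1]
        # joint self-attention lets text tokens and image regions align implicitly
        return self.encoder(torch.cat([t, v], dim=1))
```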
2.Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings ⬇️
We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved by learning a shared embedding space that can embed either modality. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of the multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.
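As a rough sketch of the disentangling idea, one can keep a separate video/text projection pair per PoS tag and fuse the per-PoS outputs into the shared retrieval space. The two-PoS split (e.g., verbs and nouns), module names, and dimensions below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PoSDisentangledEmbed(nn.Module):
    def __init__(self, vid_dim=1024, txt_dim=300, pos_dim=256, out_dim=256, n_pos=2):
        super().__init__()
        # one video/text projection pair per PoS tag (e.g., verbs and nouns)
        self.vid_heads = nn.ModuleList(nn.Linear(vid_dim, pos_dim) for _ in range(n_pos))
        self.txt_heads = nn.ModuleList(nn.Linear(txt_dim, pos_dim) for _ in range(n_pos))
        # fuse the per-PoS views into the final retrieval space
        self.vid_fuse = nn.Linear(pos_dim * n_pos, out_dim)
        self.txt_fuse = nn.Linear(pos_dim * n_pos, out_dim)

    def forward(self, video, pos_texts):
        v = [h(video) for h in self.vid_heads]                   # PoS-specific video views
        t = [h(x) for h, x in zip(self.txt_heads, pos_texts)]    # per-PoS caption parts
        return self.vid_fuse(torch.cat(v, -1)), self.txt_fuse(torch.cat(t, -1))
```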
3.Zero-shot Feature Selection via Exploiting Semantic Knowledge ⬇️
Feature selection plays an important role in pattern recognition and machine learning systems. Supervised knowledge can significantly improve performance. However, confronted with the rapid growth of newly emerging concepts, existing supervised methods may easily suffer from the scarcity of labeled training data. Therefore, this paper studies the problem of Zero-Shot Feature Selection, i.e., building a feature selection model that generalizes well to "unseen" concepts with limited training data of "seen" concepts. To address this, inspired by zero-shot learning, we use class-semantic descriptions (i.e., attributes), which provide additional semantic information about unseen concepts, as supervision. In addition, to seek more reliable discriminative features, we further propose a novel loss function (named the center-characteristic loss) which encourages the selected features to capture the central characteristics of seen concepts. Experimental results on three benchmarks demonstrate the superiority of the proposed method.
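The abstract does not give the exact form of the center-characteristic loss, so the sketch below only follows the generic center-loss pattern it evokes: features passed through a soft selection vector are pulled toward their class mean. The `weights` vector and the per-class averaging are assumptions.

```python
import torch

def center_characteristic_loss(feats, labels, weights):
    """feats: (N, D); labels: (N,); weights: (D,) soft feature-selection vector."""
    sel = feats * weights                      # apply the selection weights
    loss = 0.0
    for c in labels.unique():
        grp = sel[labels == c]
        # pull each selected feature vector toward its class center
        loss = loss + ((grp - grp.mean(0)) ** 2).sum(1).mean()
    return loss / len(labels.unique())
```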
4.Relation-Aware Pyramid Network (RapNet) for temporal action proposal ⬇️
In this technical report, we describe our solution to temporal action proposal (task 1) in the ActivityNet Challenge 2019. First, we fine-tune a ResNet-50-C3D CNN on ActivityNet v1.3, starting from a Kinetics-pretrained model, to extract snippet-level video representations; we then design a Relation-Aware Pyramid Network (RapNet) to generate temporal multiscale proposals with confidence scores. After that, we employ a two-stage snippet-level boundary adjustment scheme to re-rank the generated proposals. Ensemble methods are also used to improve the performance of our solution, which helped us achieve 2nd place.
5.A Fast and Precise Method for Large-Scale Land-Use Mapping Based on Deep Learning ⬇️
A land-use map is important data that reflects the use and transformation of land by humans and can provide a valuable reference for land-use planning. With traditional image classification methods, producing a high-spatial-resolution (HSR) land-use map at large scale is a major undertaking that requires a great deal of human labor, time, and financial expenditure. The rise of deep learning provides a new solution to these problems. This paper proposes a fast and precise method for large-scale land-use classification based on a deep convolutional neural network (DCNN). We optimize the data tiling method and the structure of the DCNN for the multi-channel data and the splicing-edge effect, which are unique to remote-sensing deep learning, and thereby improve the accuracy of land-use classification. We apply the improved method to Guangdong Province, China, using GF-1 images, and achieve a land-use classification accuracy of 81.52%. The work takes only 13 hours to complete, a task that would take human labor several months.
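The splicing-edge effect mentioned above is commonly suppressed by predicting overlapping tiles and stitching only each tile's central crop; a minimal sketch of that scheme follows, with illustrative tile and margin sizes rather than the paper's settings, and a stand-in `model` callable.

```python
import numpy as np

def predict_large_image(image, model, tile=512, margin=64):
    """Tile a large scene with overlap and keep only each tile's center crop."""
    h, w, _ = image.shape
    out = np.zeros((h, w), dtype=np.int64)
    step = tile - 2 * margin
    padded = np.pad(image, ((margin, tile), (margin, tile), (0, 0)), mode="reflect")
    for y in range(0, h, step):
        for x in range(0, w, step):
            patch = padded[y:y + tile, x:x + tile]
            pred = model(patch)                       # assumed (tile, tile) class map
            core = pred[margin:margin + step, margin:margin + step]
            out[y:y + step, x:x + step] = core[: h - y, : w - x]
    return out
```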
6.Transferable Representation Learning in Vision-and-Language Navigation ⬇️
Vision-and-Language Navigation (VLN) tasks such as Room-to-Room (R2R) require machine agents to interpret natural language instructions and learn to act in visually realistic environments to achieve navigation goals. The overall task requires competence in several perception problems: successful agents combine spatio-temporal, vision, and language understanding to produce appropriate action sequences. Our approach adapts pre-trained vision and language representations to relevant in-domain tasks, making them more effective for VLN. Specifically, the representations are adapted to solve both a cross-modal sequence alignment task and a sequence coherence task. In the sequence alignment task, the model determines whether an instruction corresponds to a sequence of visual frames. In the sequence coherence task, the model determines whether the perceptual sequences are predictive sequentially in the instruction-conditioned latent space. By transferring the domain-adapted representations, we improve competitive agents in R2R as measured by the success rate weighted by path length (SPL) metric.
7.Distinguishing Individual Red Pandas from Their Faces ⬇️
Individual identification is essential to animal behavior and ecology research and is of significant importance for protecting endangered species. Red pandas, among the world's rarest animals, are currently identified mainly by visual inspection and microelectronic chips, which are costly and inefficient. Motivated by recent advances in computer-vision-based animal identification, in this paper we propose an automatic framework for identifying individual red pandas based on their face images. We implement the framework by exploring well-established deep learning models, with the adaptations necessary for dealing effectively with red panda images. Based on a database of red panda images that we constructed, we evaluate the effectiveness of the proposed automatic individual red panda identification method. The evaluation results show the promising potential of automatically recognizing individual red pandas from their faces. We will release our database and model publicly to promote research on automatic animal identification and, in particular, on techniques for protecting red pandas.
8.Video Face Clustering with Unknown Number of Clusters ⬇️
Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing. We address the challenging problem of clustering face tracks based on their identity. Different from previous work in this area, we choose to operate in a realistic and difficult setting where: (i) the number of characters is not known a priori; and (ii) face tracks belonging to minor or background characters are not discarded.
To this end, we propose Ball Cluster Learning (BCL), a supervised approach to carve the embedding space into balls of equal size, one for each cluster. The learned ball radius is easily translated to a stopping criterion for iterative merging algorithms. This gives BCL the ability to estimate the number of clusters as well as their assignment, achieving promising results on commonly used datasets. We also present a thorough discussion of how existing metric learning literature can be adapted for this task.
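Because the learned ball radius translates directly into a stopping criterion for iterative merging, the inference stage can be sketched with an off-the-shelf agglomerative clusterer; the random embeddings, radius value, and linkage choice below are stand-ins, not BCL's trained quantities.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeddings = np.random.randn(200, 128)    # stand-in for face-track embeddings
radius = 0.8                               # would come from BCL training

clusterer = AgglomerativeClustering(
    n_clusters=None,                 # let the threshold decide the cluster count
    distance_threshold=2 * radius,   # two balls of radius r merge if centers < 2r apart
    linkage="average",
)
labels = clusterer.fit_predict(embeddings)
print("estimated number of clusters:", labels.max() + 1)
```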
9.Recognizing Part Attributes with Insufficient Data ⬇️
Recognizing attributes of objects and their parts is important to many computer vision applications. Although great progress has been made in object-level recognition, recognizing the attributes of parts remains less practical, since training data for part attribute recognition is usually scarce, especially for internet-scale applications. Furthermore, most existing part attribute recognition methods rely on part annotations, which are more expensive to obtain. To solve the data insufficiency problem and remove the dependence on part annotations, we introduce a novel Concept Sharing Network (CSN) for part attribute recognition. A great advantage of CSN is its ability to recognize a part attribute (a combination of part location and appearance pattern) that has insufficient or zero training data, by learning the part location and the appearance pattern separately from training data that usually mixes them in a single label. Extensive experiments on CUB-200-2011 [51], CelebA [35] and a newly proposed human attribute dataset demonstrate the effectiveness of CSN and its advantages over other methods, especially for attributes with few training samples. Further experiments show that CSN can also perform zero-shot part attribute recognition. The code will be made available at this https URL.
10.Convex hull algorithms based on some variational models ⬇️
Computing the convex hull of an object is a fundamental problem arising in a variety of tasks. In this work, we propose two variational convex hull models using a level set representation for 2-dimensional data. The first is an exact model, which can find the convex hull of one or multiple objects. In this model, the convex hull is characterized by the zero sublevel set of a convex level set function that is non-positive at every given point. By minimizing the area of the zero sublevel set, we can find the desired convex hull. The second is intended to find the convex hull of objects with outliers. Instead of requiring that all given points be included, this model penalizes the distance from each given point to the zero sublevel set. Existing methods are not able to handle outliers. For the solution of these models, we develop efficient numerical schemes using the alternating direction method of multipliers. Numerical examples are given to demonstrate the advantages of the proposed methods.
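In symbols, the two models just described can be written as follows; the area term is the measure of the zero sublevel set, and the penalty weight $\lambda$ in the outlier-robust variant is assumed notation, not taken from the paper.

```latex
% Exact model: the hull is the zero sublevel set of a convex level-set
% function \phi that must be non-positive at every given point x_i.
\min_{\phi\ \mathrm{convex}} \; \big|\{x : \phi(x) \le 0\}\big|
\quad \text{s.t.} \quad \phi(x_i) \le 0 \;\; \forall i.

% Outlier-robust variant: the hard constraints become a penalty on the
% distance from each point to the zero sublevel set.
\min_{\phi\ \mathrm{convex}} \; \big|\{x : \phi(x) \le 0\}\big|
\;+\; \lambda \sum_i \operatorname{dist}\!\big(x_i,\, \{x : \phi(x) \le 0\}\big).
```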
11.Deep Density-aware Count Regressor ⬇️
We seek to improve crowd counting as we perceive the limits of the currently prevalent density map estimation approach in both prediction accuracy and time efficiency. We show that a CNN regressing a global count, trained with density map supervision, can make more accurate predictions. We introduce multilayer gradient fusion for training a density-aware global count regressor. More specifically, at the training stage, a backbone network receives gradients from multiple branches to learn the density information, whereas those branches are detached to accelerate inference. By taking advantage of this method, our model improves benchmark results on public datasets and shows itself to be a new practical solution to the crowd counting problem.
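A minimal sketch of the detachable-branch idea: during training, a density head supplies extra gradients to the backbone, and at inference only the count head runs. The tiny backbone and head shapes are illustrative, not the paper's network.

```python
import torch
import torch.nn as nn

class DensityAwareCounter(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.density_head = nn.Conv2d(64, 1, 1)     # supervised by density maps
        self.count_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(64, 1))

    def forward(self, x):
        feats = self.backbone(x)
        count = self.count_head(feats)
        if self.training:                 # density branch feeds gradients only in training
            return count, self.density_head(feats)
        return count                      # branch detached to accelerate inference
```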
12.PosNeg-Balanced Anchors with Aligned Features for Single-Shot Object Detection ⬇️
We introduce a novel single-shot object detector that eases the foreground-background class imbalance by suppressing easy negatives while increasing positives. To achieve this, we propose an Anchor Promotion Module (APM), which predicts the probability of each anchor being positive and adjusts anchors' initial locations and shapes to promote both the quality and quantity of positive anchors. In addition, we design an efficient Feature Alignment Module (FAM) to extract aligned features fitting the promoted anchors, with the help of the location and shape transformation information from the APM. We assemble the two proposed modules into the backbones of VGG-16 and ResNet-101 networks with an encoder-decoder architecture. Extensive experiments on MS COCO demonstrate that our model performs competitively with alternative methods (40.0% mAP on the \textit{test-dev} set) and runs faster (28.6 \textit{fps}).
13.Question-Agnostic Attention for Visual Question Answering ⬇️
Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block). The resulting multimodal representations define an intermediate feature space for capturing the interplay between visual and semantic features that is helpful for selectively focusing on image content. In this paper, we propose a question-agnostic attention mechanism that is complementary to existing question-dependent attention mechanisms. Our proposed model parses object instances to obtain an `object map' and applies this map to the visual features to generate Question-Agnostic Attention (QAA) features. In contrast to question-dependent attention approaches that are learned end-to-end, the proposed QAA does not involve question-specific training and can easily be included in almost any existing VQA model as a generic, light-weight pre-processing step, thereby adding minimal computation overhead for training. Further, when used in complement with question-dependent attention, QAA allows the model to focus on regions containing objects that might have been overlooked by the learned attention representation. Through extensive evaluation on the VQAv1, VQAv2, and TDIUC datasets, we show that incorporating complementary QAA allows state-of-the-art VQA models to perform better, and provides a significant boost to simplistic VQA models, enabling them to perform on par with highly sophisticated fusion strategies.
14.Exploiting Sparse Semantic HD Maps for Self-Driving Vehicle Localization ⬇️
In this paper we propose a novel semantic localization algorithm that exploits multiple sensors and has precision on the order of a few centimeters. Our approach does not require detailed knowledge about the appearance of the world, and our maps require orders of magnitude less storage than maps utilized by traditional geometry- and LiDAR intensity-based localizers. This is important as self-driving cars need to operate in large environments. Towards this goal, we formulate the problem in a Bayesian filtering framework, and exploit lanes, traffic signs, as well as vehicle dynamics to localize robustly with respect to a sparse semantic map. We validate the effectiveness of our method on a new highway dataset consisting of 312km of roads. Our experiments show that the proposed approach is able to achieve 0.05m lateral accuracy and 1.12m longitudinal accuracy on average while taking up only 0.3% of the storage required by previous LiDAR intensity-based approaches.
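As a toy illustration of the Bayesian filtering cycle that the localizer builds on, the sketch below runs a 1-D histogram filter: the belief is convolved with a vehicle-dynamics motion model, then reweighted by a measurement likelihood standing in for the lane and traffic-sign observations. All values are stand-ins.

```python
import numpy as np

def predict(belief, motion_kernel):
    # convolve the belief with the vehicle-dynamics motion model
    return np.convolve(belief, motion_kernel, mode="same")

def update(belief, likelihood):
    # multiply by the semantic-map measurement likelihood and renormalize
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = np.full(100, 1 / 100)                       # uniform prior over map cells
motion = np.array([0.1, 0.8, 0.1])                   # move ~1 cell, with noise
likelihood = np.exp(-0.5 * ((np.arange(100) - 42) / 3.0) ** 2)  # stand-in observation
belief = update(predict(belief, motion), likelihood)
print("MAP cell:", belief.argmax())
```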
15.Efficient Inference of CNNs via Channel Pruning ⬇️
The deployment of Convolutional Neural Networks (CNNs) on resource-constrained platforms such as mobile devices and embedded systems has been greatly hindered by their high implementation cost, which has motivated a great deal of research interest in compressing and accelerating trained CNN models. Among the various techniques proposed in the literature, structured pruning, especially channel pruning, has gained much attention due to 1) its superior performance in memory, computation, and energy reduction; and 2) its compatibility with existing hardware and software libraries. In this paper, we investigate the intermediate results of convolutional layers and present a novel pivoted-QR-factorization-based channel pruning technique that can prune any specified number of input channels of any layer. We also explore further pruning opportunities in ResNet-like architectures by applying two tweaks to our technique. Experimental results on VGG-16 and ResNet-50 models with the ImageNet ILSVRC 2012 dataset are very impressive, with 4.29X and 2.84X computation reduction while sacrificing only about 1.40% top-5 accuracy. Compared to many prior works, the pruned models produced by our technique require up to 47.7% less computation while still achieving higher accuracies.
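The channel-selection step can be sketched with SciPy's column-pivoted QR: the pivot order ranks input channels by how much new information each column of activations contributes. The activation layout (samples x channels) and the kept count `k` are assumptions.

```python
import numpy as np
from scipy.linalg import qr

acts = np.random.randn(4096, 256)       # flattened activations: samples x channels
k = 128                                  # number of channels to keep

# column-pivoted QR ranks channels by how much new "energy" each contributes
_, _, piv = qr(acts, mode="economic", pivoting=True)
keep = np.sort(piv[:k])                  # indices of the k most informative channels
pruned = acts[:, keep]
print("kept channels:", keep[:10], "...")
```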
16.One-shot Face Reenactment ⬇️
To enable realistic shape (e.g., pose and expression) transfer, existing face reenactment methods rely on a set of target faces for learning subject-specific traits. However, in real-world scenarios end-users often have only one target face at hand, rendering existing methods inapplicable. In this work, we bridge this gap by proposing a novel one-shot face reenactment learning framework. Our key insight is that the one-shot learner should be able to disentangle and compose appearance and shape information for effective modeling. Specifically, the target face appearance and the source face shape are first projected into latent spaces with their corresponding encoders. These two latent spaces are then associated by learning a shared decoder that aggregates multi-level features to produce the final reenactment result. To further improve the synthesis quality in the mustache and hair regions, we additionally propose FusionNet, which combines the strengths of our learned decoder and a traditional warping method. Extensive experiments show that our one-shot face reenactment system achieves better transfer fidelity and identity preservation than alternatives. More remarkably, our approach, trained with only one target image per subject, achieves results competitive with those using a set of target images, demonstrating the practical merit of this work. Code, models, and an additional set of reenacted faces have been publicly released at the project page.
17.GridDehazeNet: Attention-Based Multi-Scale Network for Image Dehazing ⬇️
We propose an end-to-end trainable Convolutional Neural Network (CNN), named GridDehazeNet, for single image dehazing. GridDehazeNet consists of three modules: pre-processing, backbone, and post-processing. The trainable pre-processing module can generate learned inputs with better diversity and more pertinent features than the derived inputs produced by hand-selected pre-processing methods. The backbone module implements a novel attention-based multi-scale estimation on a grid network, which can effectively alleviate the bottleneck issue often encountered in the conventional multi-scale approach. The post-processing module helps to reduce artifacts in the final output. Experimental results indicate that GridDehazeNet outperforms the state of the art on both synthetic and real-world images. The proposed dehazing method does not rely on the atmospheric scattering model, and we provide an explanation of why it is not necessarily beneficial to take advantage of the dimension reduction offered by the atmospheric scattering model for image dehazing, even when only the dehazing results on synthetic images are concerned.
18.Image-based marker tracking and registration for intraoperative 3D image-guided interventions using augmented reality ⬇️
Augmented reality has the potential to improve operating room workflow by allowing physicians to "see" inside a patient through the projection of imaging directly onto the surgical field. For this to be useful, the acquired imaging must be quickly and accurately registered with the patient, and the registration must be maintained. Here we describe a method for projecting a CT scan with the Microsoft HoloLens and then aligning that projection to a set of fiducial markers. Radio-opaque stickers with unique QR codes are placed on an object prior to acquiring a CT scan. The locations of the markers in the CT scan are extracted, and the CT scan is converted into a 3D surface object. The 3D object is then projected using the HoloLens onto a table on which the same markers are placed. We designed an algorithm that aligns the markers on the 3D object with the markers on the table. Extracting the markers and converting the CT into a 3D object took less than 5 seconds. Aligning three markers took $0.9 \pm 0.2$ seconds and achieved an accuracy of $5 \pm 2$ mm. These findings show that it is feasible to use a combined radio-opaque optical marker, placed on a patient prior to a CT scan, to subsequently align the acquired CT scan with the patient.
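Aligning the markers on the 3D object with the markers on the table is an instance of rigid point-set registration; below is a sketch using the standard Kabsch/SVD solution, assuming the unique QR codes provide marker correspondences. The marker coordinates are toy values.

```python
import numpy as np

def rigid_align(src, dst):
    """Find R, t minimizing ||R @ src_i + t - dst_i|| over corresponding 3-D markers."""
    sc, dc = src.mean(0), dst.mean(0)
    H = (src - sc).T @ (dst - dc)                # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    t = dc - R @ sc
    return R, t

ct_markers = np.array([[0., 0, 0], [10, 0, 0], [0, 10, 0]])    # from the CT scan
table_markers = ct_markers @ np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1.]]).T + 5
R, t = rigid_align(ct_markers, table_markers)
print(np.allclose(ct_markers @ R.T + t, table_markers))        # True
```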
19.Sparse Coding of Shape Trajectories for Facial Expression and Action Recognition ⬇️
The detection and tracking of human landmarks in video streams has gained in reliability, partly due to the availability of affordable RGB-D sensors. The analysis of such time-varying geometric data plays an important role in automatic human behavior understanding. However, suitable shape representations, as well as their temporal evolution, termed trajectories, often lie on nonlinear manifolds. This places an additional constraint (i.e., nonlinearity) on the use of conventional machine learning techniques. As a solution, this paper adapts the well-known Sparse Coding and Dictionary Learning approach to study time-varying shapes on the Kendall shape spaces of 2D and 3D landmarks. We illustrate effective coding of 3D skeletal sequences for action recognition and of 2D facial landmark sequences for macro- and micro-expression recognition. To overcome the inherent nonlinearity of the shape spaces, both intrinsic and extrinsic solutions were explored. As the main result, shape trajectories give rise to more discriminative time series with desirable computational properties, including sparsity and vector space structure. Extensive experiments conducted on commonly used datasets demonstrate the competitiveness of the proposed approaches with respect to the state of the art.
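A hedged sketch of the sparse-coding step on vectorized landmark trajectories using scikit-learn; the crude centering/normalization below only approximates genuine Kendall-shape preprocessing, and all sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

trajs = np.random.randn(100, 25 * 2 * 30)    # 100 clips, 25 2-D landmarks, 30 frames
trajs -= trajs.mean(1, keepdims=True)         # crude shape normalization
trajs /= np.linalg.norm(trajs, axis=1, keepdims=True)

dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=100,
                        transform_algorithm="lasso_lars")
codes = dl.fit(trajs).transform(trajs)        # sparse codes: discriminative time-series features
print("nonzeros per trajectory:", (codes != 0).sum(1).mean())
```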
20.Bayesian Inference for Large Scale Image Classification ⬇️
Bayesian inference promises to ground and improve the performance of deep neural networks. It promises to be robust to overfitting, to simplify the training procedure and the space of hyperparameters, and to provide a calibrated measure of uncertainty that can enhance decision making, agent exploration, and prediction fairness. Markov Chain Monte Carlo (MCMC) methods enable Bayesian inference by generating samples from the posterior distribution over model parameters. Despite the theoretical advantages of Bayesian inference and the similarity between MCMC and optimization methods, the performance of sampling methods has so far lagged behind optimization methods for large-scale deep learning tasks. We aim to fill this gap and introduce ATMC, an adaptive noise MCMC algorithm that estimates and is able to sample from the posterior of a neural network. ATMC dynamically adjusts the amount of momentum and noise applied to each parameter update in order to compensate for the use of stochastic gradients. We use a ResNet architecture without batch normalization to test ATMC on the CIFAR-10 benchmark and the large-scale ImageNet benchmark and show that, despite the absence of batch normalization, ATMC outperforms a strong optimization baseline in terms of both classification accuracy and test log-likelihood. We show that ATMC is intrinsically robust to overfitting on the training data and that ATMC provides a better-calibrated measure of uncertainty compared to the optimization baseline.
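The abstract does not spell out ATMC's adaptive mechanism, so as a hedged reference point the sketch below shows the simpler SGLD update that such samplers build on: a stochastic-gradient step plus Gaussian noise scaled so that the iterates approximately sample the posterior. The step size and the toy loss are assumptions.

```python
import torch

def sgld_step(params, loss, lr=1e-4):
    """One SGLD update: theta <- theta - lr * grad(U) + N(0, 2 * lr)."""
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            noise = torch.randn_like(p) * (2 * lr) ** 0.5
            p.add_(-lr * g + noise)

w = torch.randn(10, requires_grad=True)
loss = ((w - 1.0) ** 2).sum()        # stand-in for a mini-batch negative log-posterior
sgld_step([w], loss)                  # repeated calls yield approximate posterior samples
```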
21.Bias and variance reduction and denoising for CTF Estimation ⬇️
When using an electron microscope for imaging of particles embedded in vitreous ice, the objective lens will inevitably corrupt the projection images. This corruption manifests as a band-pass filter on the micrograph. In addition, it flips the phase of several frequency bands and distorts others. As a precursor to compensating for this distortion, the corrupting point spread function, termed the contrast transfer function (CTF) in reciprocal space, must be estimated. In this paper, we present a novel method for CTF estimation. Our method is based on the multi-taper method for power spectral density estimation, which reduces the bias and variance of the estimator. Furthermore, we use known properties of the CTF and of the background of the power spectrum to increase the accuracy of our estimation. We show that the resulting estimates capture the zero-crossings of the CTF in the low-to-mid frequency range.
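The multi-taper idea behind the estimator can be sketched with SciPy's DPSS tapers: windowing the signal with several orthogonal tapers and averaging the resulting periodograms reduces the variance of the spectrum estimate. The bandwidth `NW`, the taper count, and the 1-D signal are illustrative simplifications (micrographs are 2-D).

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, nw=4, k=7):
    tapers = dpss(len(x), NW=nw, Kmax=k)             # (k, N) orthogonal DPSS tapers
    specs = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    return specs.mean(axis=0)                         # averaging over tapers cuts variance

signal = np.random.randn(1024)                        # stand-in for a micrograph profile
psd = multitaper_psd(signal)
```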
22.Deep Learning based Wearable Assistive System for Visually Impaired People ⬇️
In this paper, we propose a deep learning based assistive system to improve the environment perception experience of visually impaired (VI) people. The system is composed of a wearable terminal equipped with an RGB-D camera and an earphone, a powerful processor mainly for deep learning inference, and a smartphone for touch-based interaction. A data-driven learning approach is proposed to predict safe and reliable walkable instructions using RGB-D data and an established semantic map. This map is also used to help VI users understand their 3D surroundings and layout through well-designed touchscreen interactions. Quantitative and qualitative experimental results show that our learning based obstacle avoidance approach achieves excellent results on both indoor and outdoor datasets with low-lying obstacles. Meanwhile, user studies carried out in various scenarios show that our system improves VI users' environment perception experience.
23.Enhancing Flood Impact Analysis using Interactive Retrieval of Social Media Images ⬇️
The analysis of natural disasters such as floods in a timely manner often suffers from limited data due to a coarse distribution of sensors or sensor failures. This limitation could be alleviated by leveraging information contained in images of the event posted on social media platforms, so-called "Volunteered Geographic Information (VGI)". To save the analyst from the need to inspect all images posted online manually, we propose to use content-based image retrieval with the possibility of relevance feedback for retrieving only relevant images of the event to be analyzed. To evaluate this approach, we introduce a new dataset of 3,710 flood images, annotated by domain experts regarding their relevance with respect to three tasks (determining the flooded area, inundation depth, water pollution). We compare several image features and relevance feedback methods on that dataset, mixed with 97,085 distractor images, and are able to improve the precision among the top 100 retrieval results from 55% with the baseline retrieval to 87% after 5 rounds of feedback.
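One classical way to implement the relevance-feedback loop is a Rocchio-style query update, moving the query vector toward images marked relevant and away from those marked irrelevant. The paper compares several feedback methods, so this is only a generic sketch; the CNN features and the alpha/beta/gamma weights are assumptions.

```python
import numpy as np

def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * query
    if len(relevant):
        q += beta * relevant.mean(0)        # move toward images marked relevant
    if len(irrelevant):
        q -= gamma * irrelevant.mean(0)     # move away from images marked irrelevant
    return q / np.linalg.norm(q)

feats = np.random.randn(1000, 512)          # stand-in CNN features of candidate images
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
query = feats[0]
ranked = np.argsort(-feats @ query)         # initial retrieval by cosine similarity
query = rocchio(query, feats[ranked[:5]], feats[ranked[95:100]])  # one feedback round
```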
24.Hyper Vision Net: Kidney Tumor Segmentation Using Coordinate Convolutional Layer and Attention Unit ⬇️
The KiTS19 challenge paves the way to hasten improvements in solid kidney tumor semantic segmentation methodologies. Accurate segmentation of kidney tumors in computed tomography (CT) images is a challenging task due to non-uniform motion, similar appearance, and varied shape. Inspired by this fact, in this manuscript we present a novel kidney tumor segmentation method using a deep learning network termed the Hyper Vision Net model. Existing models mostly use a modified version of U-Net to segment the kidney tumor region. In the proposed architecture, we introduce supervision layers in the decoder part, which refine even minimal regions in the output. A dataset consisting of real arterial-phase abdominal CT scans of 300 patients (45,964 images) was provided by KiTS19 for training and validation of the proposed model. Compared with state-of-the-art segmentation methods, the results demonstrate the superiority of our approach, with training Dice scores of 0.9552 and 0.9633 for the tumor and kidney regions, respectively.
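The decoder-side supervision layers can be sketched as auxiliary prediction heads at several scales, each scored against a resized ground-truth mask; the Dice form and uniform weighting are assumptions rather than the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def deep_supervision_loss(aux_preds, mask):
    """aux_preds: list of (B,1,h,w) sigmoid maps from coarse-to-fine decoder stages."""
    loss = 0.0
    for p in aux_preds:
        m = F.interpolate(mask, size=p.shape[-2:], mode="nearest")  # resize ground truth
        loss = loss + dice_loss(p, m)
    return loss / len(aux_preds)
```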
25.Deep Learning for Visual Recognition of Environmental Enteropathy and Celiac Disease ⬇️
Physicians use biopsies to distinguish between different but histologically similar enteropathies. The range of syndromes and pathologies that can cause different gastrointestinal conditions makes this a difficult problem. Recently, deep learning has been used successfully to help diagnose cancerous tissue in histopathological images. These successes motivated the research presented in this paper, which describes a deep learning approach that distinguishes among Celiac Disease (CD), Environmental Enteropathy (EE), and normal tissue in digitized duodenal biopsies. Experimental results show accuracies of over 90% for this approach. We also interpret the neural network model using Gradient-weighted Class Activation Mappings and filter activations on input images to understand the visual explanations for the decisions made by the model.
26.WhiteNNer: Blind Image Denoising via Noise Whiteness Priors ⬇️
The accuracy of medical imaging-based diagnostics is directly impacted by the quality of the collected images. A passive approach to improving image quality lags behind improvements in imaging hardware, awaiting better sensor technology in acquisition devices. An alternative, active strategy is to utilize prior knowledge of the imaging system to directly post-process and improve the acquired images. Traditionally, priors about the image properties are taken into account to restrict the solution space. However, few techniques exploit priors about the noise properties. In this paper, we propose a neural network-based model for disentangling the signal and noise components of an input noisy image, without the need for any ground truth training data. We design a unified loss function that encodes priors about the signal as well as the noise estimate in the form of regularization terms. Specifically, by using total variation and piecewise constancy priors along with noise whiteness priors such as auto-correlation and stationarity losses, our network learns to decouple an input noisy image into the underlying signal and noise components. We compare our proposed method to Noise2Noise and Noise2Self, as well as non-local means and BM3D, on three public confocal laser endomicroscopy datasets. Experimental results demonstrate the superiority of our network compared to the state of the art in terms of PSNR and SSIM.
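As a hedged illustration of a noise-whiteness prior: white noise has near-zero autocorrelation away from lag zero, so one can penalize the off-zero lags of the estimated noise's autocorrelation, computed here via the Wiener-Khinchin identity. This sketches the idea only; it is not the paper's exact auto-correlation and stationarity losses.

```python
import torch

def whiteness_loss(noise):
    """noise: (B, 1, H, W) estimated noise component."""
    n = noise - noise.mean(dim=(-2, -1), keepdim=True)
    spec = torch.fft.rfft2(n)
    ac = torch.fft.irfft2(spec * spec.conj(), s=n.shape[-2:])   # circular autocorrelation
    ac = ac / ac[..., 0, 0].clamp_min(1e-8).unsqueeze(-1).unsqueeze(-1)
    mask = torch.ones_like(ac)
    mask[..., 0, 0] = 0                                          # ignore the zero lag
    return (ac * mask).pow(2).mean()                             # penalize off-zero lags
```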