1.Learning Continuous Face Age Progression: A Pyramid of GANs
The two underlying requirements of face age progression, i.e. aging accuracy and identity permanence, are not well studied in the literature. This paper presents a novel generative adversarial network based approach to address the issues in a coupled manner. It separately models the constraints for the intrinsic subject-specific characteristics and the age-specific facial changes with respect to the elapsed time, ensuring that the generated faces present desired aging effects while simultaneously keeping personalized properties stable. To ensure photo-realistic facial details, high-level age-specific features conveyed by the synthesized face are estimated by a pyramidal adversarial discriminator at multiple scales, which simulates the aging effects with finer details. Further, an adversarial learning scheme is introduced to simultaneously train a single generator and multiple parallel discriminators, resulting in smooth continuous face aging sequences. The proposed method is applicable even in the presence of variations in pose, expression, makeup, etc., achieving remarkably vivid aging effects. Quantitative evaluations by a COTS face recognition system demonstrate that the target age distributions are accurately recovered, and 99.88% and 99.98% age progressed faces can be correctly verified at 0.001% FAR after age transformations of approximately 28 and 23 years elapsed time on the MORPH and CACD databases, respectively. Both visual and quantitative assessments show that the approach advances the state-of-the-art.
2.Hybrid Task Cascade for Instance Segmentation
Cascade is a classic yet powerful architecture that has boosted performance on various tasks. However, how to introduce cascade to instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN only brings limited gain. In exploring a more effective approach, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation. In this work, we propose a new framework, Hybrid Task Cascade (HTC), which differs in two important aspects: (1) instead of performing cascaded refinement on these two tasks separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguish hard foreground from cluttered background. Overall, this framework can learn more discriminative features progressively while integrating complementary features together in each stage. Without bells and whistles, a single HTC obtains 38.4% and 1.5% improvement over a strong Cascade Mask R-CNN baseline on the MSCOCO dataset. More importantly, our overall system achieves 48.6 mask AP on the test-challenge dataset and 49.0 mask AP on test-dev, which is state-of-the-art performance.
3.Pedestrian Attribute Recognition: A Survey
Recognizing pedestrian attributes is an important task in the computer vision community because it plays an important role in video surveillance. Many algorithms have been proposed to handle this task. The goal of this paper is to review existing works using traditional methods or based on deep learning networks. Firstly, we introduce the background of pedestrian attribute recognition (PAR, for short), including the fundamental concepts of pedestrian attributes and corresponding challenges. Secondly, we introduce existing benchmarks, including popular datasets and evaluation criteria. Thirdly, we analyse the concepts of multi-task learning and multi-label learning, and also explain the relations between these two learning algorithms and pedestrian attribute recognition. We also review some popular network architectures which have been widely applied in the deep learning community. Fourthly, we analyse popular solutions for this task, such as attribute-group and part-based approaches, etc. Fifthly, we show some applications which take pedestrian attributes into consideration and achieve better performance. Finally, we summarize this paper and give several possible research directions for pedestrian attribute recognition. The project page of this paper can be found at the following website: this https URL.
4.Use of First and Third Person Views for Deep Intersection Classification
We explore the problem of intersection classification using monocular on-board passive vision, with the goal of classifying traffic scenes with respect to road topology. We divide the existing approaches into two broad categories according to the type of input data: (a) first person vision (FPV) approaches, which use an egocentric view sequence as the intersection is passed; and (b) third person vision (TPV) approaches, which use a single view immediately before entering the intersection. The FPV and TPV approaches each have advantages and disadvantages. Therefore, we aim to combine them into a unified deep learning framework. Experimental results show that the proposed FPV-TPV scheme outperforms previous methods and only requires minimal FPV/TPV measurements.
5.Multiple Graph Adversarial Learning
Recently, Graph Convolutional Networks (GCNs) have been widely studied for graph-structured data representation and learning. However, in many real applications, data come with multiple graphs, and it is non-trivial to adapt GCNs to deal with data representation with multiple graph structures. One main challenge for multi-graph representation is how to exploit both the structure information of each individual graph and the correlation information across multiple graphs simultaneously. In this paper, we propose a novel Multiple Graph Adversarial Learning (MGAL) framework for multi-graph representation and learning. MGAL aims to learn an optimal structure-invariant and consistent representation for multiple graphs in a common subspace via a novel adversarial learning framework, which thus incorporates both intra-graph structure information and inter-graph correlation information simultaneously. Based on MGAL, we then provide a unified network for semi-supervised learning tasks. Promising experimental results demonstrate the effectiveness of the MGAL model.
6.Simultaneous lesion and neuroanatomy segmentation in Multiple Sclerosis using deep neural networks
Segmentation of both white matter lesions and deep grey matter structures is an important task in the quantification of magnetic resonance imaging in multiple sclerosis. Typically these tasks are performed separately: in this paper we present a single CNN-based segmentation solution for providing fast, reliable segmentations of multimodal MR images into lesion classes and healthy-appearing grey- and white-matter structures. We show substantial, statistically significant improvements in both Dice coefficient and in lesion-wise specificity and sensitivity, compared to previous approaches, and agreement with individual human raters in the range of human inter-rater variability. The method is trained on data gathered from a single centre: nonetheless, it performs well on data from centres, scanners and field-strengths not represented in the training dataset. A retrospective study found that the classifier successfully identified lesions missed by the human raters.
Lesion labels were provided by human raters, while weak labels for other brain structures (including CSF, cortical grey matter, cortical white matter, cerebellum, amygdala, hippocampus, subcortical GM structures and choroid plexus) were provided by Freesurfer 5.3. The segmentations of these structures compared well, not only with Freesurfer 5.3, but also with FSL-First and Freesurfer 6.1.
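For readers unfamiliar with the Dice coefficient mentioned in this abstract, it measures the overlap between a predicted and a reference segmentation as 2|A∩B| / (|A| + |B|). The following is a minimal sketch, assuming binary masks given as flat sequences of 0/1 values; the function name and the convention for two empty masks are illustrative choices, not taken from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as equal-length 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Count voxels labelled positive in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total number of positive voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A value of 1.0 means the two masks coincide, and 0.0 means they share no positive voxel.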
7.A Recent Survey on the Applications of Genetic Programming in Image Processing
During the last two decades, Genetic Programming (GP) has been largely used to tackle optimization, classification, and automatic feature selection related tasks. The widespread use of GP is mainly due to its flexible and comprehensible tree-type structure. Similarly, research is also gaining momentum in the field of Image Processing (IP) because of its promising results over wide areas of applications ranging from medical IP to multispectral imaging. IP is mainly involved in applications such as computer vision, pattern recognition, image compression, storage and transmission, and medical diagnostics. This prevailing nature of images and the complexities of their associated algorithms gave an impetus to the exploration of GP. GP has thus been used in different ways for IP since its inception. Many interesting GP techniques have been developed and employed in the field of IP. To give the research community an extensive view of these techniques, this paper presents the diverse applications of GP in IP and provides useful resources for further research. Also, a comparison of different parameters used in ten different applications of IP is summarized in tabular form. Moreover, an analysis of different parameters used in IP related tasks is carried out to save the time needed in future for evaluating the parameters of GP. As more advancement is made in GP methodologies, its success in solving complex tasks not only related to IP but also in other fields will increase. Additionally, guidelines are provided for applying GP in IP related tasks, pros and cons of GP techniques are discussed, and some future directions are also set.
8.Extension of Convolutional Neural Network with General Image Processing Kernels
We applied pre-defined kernels, also known as filters or masks, developed for image processing to convolutional neural networks. Instead of letting neural networks find their own kernels, we used 41 different general-purpose kernels for blurring, edge detection, sharpening, discrete cosine transformation, etc. for the first layer of the convolutional neural networks. This architecture, named general filter convolutional neural network (GFNN), can reduce training time by 30% with better accuracy compared to the regular convolutional neural network (CNN). GFNN can also be trained to achieve 90% accuracy with only 500 samples. Furthermore, even though these kernels are not specialized for the MNIST dataset, we achieved 99.56% accuracy without ensembles or any other special algorithms.
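The idea of a first layer built from fixed, hand-designed kernels can be illustrated with a small sketch. The three kernels below (box blur, Sobel edge, sharpen) are standard image-processing filters chosen for illustration; the abstract does not list the 41 kernels the authors used, and the function name `fixed_first_layer` is hypothetical.

```python
import numpy as np

# Three classic 3x3 kernels standing in for the paper's (unlisted) set of 41.
FIXED_KERNELS = np.array([
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],        # box blur (normalized below)
    [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],     # Sobel horizontal-gradient edge detector
    [[0, -1, 0], [-1, 5, -1], [0, -1, 0]],    # sharpen
], dtype=float)
FIXED_KERNELS[0] /= 9.0


def fixed_first_layer(image, kernels=FIXED_KERNELS):
    """Apply each fixed kernel to a 2-D image ('valid' cross-correlation, as in CNN layers)."""
    kh, kw = kernels.shape[1:]
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((len(kernels), out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Broadcast the patch against all kernels and sum over the spatial axes.
            out[:, i, j] = np.sum(kernels * patch, axis=(1, 2))
    return out
```

Because these weights are never updated, only the later layers would need training, which is one plausible reading of why the abstract reports shorter training time.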
9.SAML-QC: a Stochastic Assessment and Machine Learning based QC technique for Industrial Printing
Recently, the advancement in industrial automation and high-speed printing has raised numerous challenges related to the printing quality inspection of final products. This paper proposes a machine vision based technique to assess the printing quality of text on industrial objects. The assessment is based on three quality defects: text misalignment, varying printing shades, and misprinted text. The proposed scheme performs the quality inspection through a stochastic assessment technique based on the second-order statistics of printing. First: the text-containing area on the printed product is identified through image processing techniques. Second: the alignment testing of the identified text-containing area is performed. Third: optical character recognition is performed to divide the text into small boxes; only the intensity value of each text-containing box is taken as a random variable, and second-order statistics are estimated to determine the varying printing defects in the text under one, two and three sigma thresholds. Fourth: K-Nearest Neighbors based supervised machine learning is performed to provide the stochastic process for misprinted text detection. Finally, the technique is deployed on an industrial image for the printing quality assessment with varying values of n and m. The results have shown that the proposed SAML-QC technique can perform real-time automated inspection for industrial printing.
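The sigma-threshold step described in this abstract can be sketched as follows. This is a minimal illustration, assuming each text box is summarized by one mean intensity and that a box is flagged when it deviates from the overall mean by more than k standard deviations; the function name and the exact decision rule are assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def flag_shade_defects(box_intensities, k=2.0):
    """Return indices of boxes whose intensity deviates more than k sigma from the mean."""
    mu = mean(box_intensities)
    sigma = pstdev(box_intensities)  # population standard deviation of the box intensities
    if sigma == 0:
        return []  # perfectly uniform print, nothing to flag
    return [i for i, v in enumerate(box_intensities) if abs(v - mu) > k * sigma]
```

Calling the function with k set to 1, 2 and 3 gives progressively stricter criteria, mirroring the one, two and three sigma thresholds named in the abstract.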
10.DCNN-GAN: Reconstructing Realistic Image from fMRI
Visualizing the perceptual content by analyzing human functional magnetic resonance imaging (fMRI) has been an active research area. However, due to its high dimensionality, complex dimensional structure, and small number of samples available, reconstructing realistic images from fMRI remains challenging. Recently with the development of convolutional neural network (CNN) and generative adversarial network (GAN), mapping multi-voxel fMRI data to complex, realistic images has been made possible. In this paper, we propose a model, DCNN-GAN, by combining a reconstruction network and GAN. We utilize the CNN for hierarchical feature extraction and the DCNN-GAN to reconstruct more realistic images. Extensive experiments have been conducted, showing that our method outperforms previous works, regarding reconstruction quality and computational cost.
11.Measuring Effectiveness of Video Advertisements
Advertisements are unavoidable in modern society. Times Square is notorious for its incessant display of advertisements. Its popularity is worldwide and smaller cities possess miniature versions of the display, such as Pittsburgh and its digital works in Oakland on Forbes Avenue. Tokyo's Ginza district recently rose to popularity due to its upscale shops and constant onslaught of advertisements to pedestrians. Advertisements arise in other mediums as well. For example, they help popular streaming services, such as Spotify, Hulu, and YouTube TV gather significant streams of revenue to reduce the cost of monthly subscriptions for consumers. Ads provide an additional source of money for companies and entire industries to allocate resources toward alternative business motives. They are attractive to companies and nearly unavoidable for consumers. One challenge for advertisers is examining an advertisement's effectiveness or usefulness in conveying a message to their targeted demographics. Rather than constructing a single, static image of content, a video advertisement possesses hundreds of frames of data with varying scenes, actors, objects, and complexity. Therefore, measuring effectiveness of video advertisements is important to impacting a billion-dollar industry. This paper explores the combination of human-annotated features and common video processing techniques to predict effectiveness ratings of advertisements collected from YouTube. This task is seen as a binary (effective vs. non-effective), four-way, and five-way machine learning classification task. The first findings in terms of accuracy and inference on this dataset, as well as some of the first ad research, on a small dataset are presented. Accuracies of 84%, 65%, and 55% are reached on the binary, four-way, and five-way tasks respectively.
12.Optical Flow augmented Semantic Segmentation networks for Automated Driving
Motion is a dominant cue in automated driving systems. Optical flow is typically computed to detect moving objects and to estimate depth using triangulation. In this paper, our motivation is to leverage the existing dense optical flow to improve the performance of semantic segmentation. To provide a systematic study, we construct four different architectures which use RGB only, flow only, RGBF concatenated and two-stream RGB + flow. We evaluate these networks on two automotive datasets namely Virtual KITTI and Cityscapes using the state-of-the-art flow estimator FlowNet v2. We also make use of the ground truth optical flow in Virtual KITTI to serve as an ideal estimator and a standard Farneback optical flow algorithm to study the effect of noise. Using the flow ground truth in Virtual KITTI, two-stream architecture achieves the best results with an improvement of 4% IoU. As expected, there is a large improvement for moving objects like trucks, vans and cars with 38%, 28% and 6% increase in IoU. FlowNet produces an improvement of 2.4% in average IoU with larger improvement in the moving objects corresponding to 26%, 11% and 5% in trucks, vans and cars. In Cityscapes, flow augmentation provided an improvement for moving objects like motorcycle and train with an increase of 17% and 7% in IoU.
13.Adversarial Pseudo Healthy Synthesis Needs Pathology Factorization
Pseudo healthy synthesis, i.e. the creation of a subject-specific `healthy' image from a pathological one, could be helpful in tasks such as anomaly detection, understanding changes induced by pathology and disease or even as data augmentation. We treat this task as a factor decomposition problem: we aim to separate what appears to be healthy and where disease is (as a map). The two factors are then recombined (by a network) to reconstruct the input disease image. We train our models in an adversarial way using either paired or unpaired settings, where we pair disease images and maps (as segmentation masks) when available. We quantitatively evaluate the quality of pseudo healthy images. We show in a series of experiments, performed in ISLES and BraTS datasets, that our method is better than conditional GAN and CycleGAN, highlighting challenges in using adversarial methods in the image translation task of pseudo healthy image generation.
14.Unsupervised Learning-based Depth Estimation aided Visual SLAM Approach
An RGB-D camera has a limited working range and struggles to measure depth accurately at long distances. Besides, an RGB-D camera is easily influenced by strong lighting and other external factors, which leads to poor accuracy in the acquired environmental depth information. Recently, deep learning technologies have achieved great success in the visual SLAM area, as they can directly learn high-level features from the visual inputs and improve the estimation accuracy of the depth information. Therefore, deep learning technologies have the potential to extend the source of the depth information and improve the performance of the SLAM system. However, the existing deep learning-based methods are mainly supervised and require a large amount of ground-truth depth data, which is hard to acquire because of practical constraints. In this paper, we first present an unsupervised learning framework, which not only uses image reconstruction for supervision but also exploits the pose estimation method to enhance the supervisory signal and add training constraints for the task of monocular depth and camera motion estimation. Furthermore, we successfully exploit our unsupervised learning framework to assist the traditional ORB-SLAM system when the initialization module of the ORB-SLAM method cannot match enough features. Qualitative and quantitative experiments have shown that our unsupervised learning framework performs the depth estimation task comparably to the supervised methods and outperforms the previous state-of-the-art approach by 13.5% on the KITTI dataset. Besides, our unsupervised learning framework can significantly accelerate the initialization process of the ORB-SLAM system and effectively improve the accuracy of environmental mapping in strong lighting and weak texture scenes.
15.Ego-motion Sensor for Unmanned Aerial Vehicles Based on a Single-Board Computer
This paper describes the design and implementation of a ground-related odometry sensor suitable for micro aerial vehicles. The sensor is based on a ground-facing camera and a single-board Linux-based embedded computer with a multimedia System on a Chip (SoC). The SoC features a hardware video encoder which is used to estimate the optical flow online. The optical flow is then used in combination with a distance sensor to estimate the vehicle's velocity. The proposed sensor is compared to a similar existing solution and evaluated in both indoor and outdoor environments.
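The combination of optical flow and a distance measurement can be illustrated with the standard pinhole-camera relation: for a downward-facing camera over flat ground, a pixel displacement d observed at height h with focal length f (in pixels) corresponds to a ground displacement of d * h / f. The sketch below assumes no rotation compensation and a flat scene; the function and parameter names are hypothetical, and the sign convention depends on how the camera is mounted.

```python
def ground_velocity(flow_px, height_m, focal_px, fps):
    """Metric (vx, vy) from mean pixel flow per frame, for a downward camera over flat ground."""
    # Metres of ground motion per pixel of image motion, times frames per second.
    scale = height_m * fps / focal_px
    return (flow_px[0] * scale, flow_px[1] * scale)
```

For example, a mean flow of 4 pixels per frame at 2 m altitude, with a 400-pixel focal length and 30 frames per second, corresponds to a speed of about 0.6 m/s along that image axis.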
16.Super-Trajectories: A Compact Yet Rich Video Representation
We propose a new video representation in terms of an over-segmentation of dense trajectories covering the whole video. Trajectories are often used to encode long-temporal information in several computer vision applications. Similar to temporal superpixels, a temporal slice of super-trajectories is a set of superpixels, but super-trajectories contain more information because they also maintain the long dense pixel-wise tracking information. The main challenge in using trajectories for any application is the accumulation of tracking error in the trajectory construction. For our problem, this results in disconnected superpixels. We exploit constraints for edges in addition to trajectory based color and position similarity. Analogous to superpixels as a preprocessing tool for images, the proposed representation has its applications for videos, especially in trajectory based video analysis.
17.Fast, Accurate and Lightweight Super-Resolution with Neural Architecture Search
Deep convolutional neural networks demonstrate impressive results in the super-resolution domain. A large body of research concentrates on improving peak signal-to-noise ratio (PSNR) by using deeper and deeper layers, which is not friendly to constrained resources. Pursuing a trade-off between restoration capacity and simplicity of a model is still non-trivial. Recently, more contributions have been devoted to this balance, and our work focuses on improving it further with automatic neural architecture search. In this paper, we handle super-resolution using a multi-objective approach and propose an elastic search method involving both macro and micro aspects based on a hybrid controller of evolutionary algorithm and reinforcement learning. Quantitative experiments show that the models generated by our methods are very competitive with, and even dominate, most state-of-the-art super-resolution methods at different levels of FLOPS.
18.RPC: A Large-Scale Retail Product Checkout Dataset
Over recent years, there has been growing interest in integrating computer vision technology into the retail industry. Automatic checkout (ACO) is one of the critical problems in this area, which aims to automatically generate the shopping list from the images of the products to purchase. The main challenge of this problem comes from the large scale and the fine-grained nature of the product categories as well as the difficulty of collecting training images that reflect the realistic checkout scenarios due to continuous update of the products. Despite its significant practical and research value, this problem is not extensively studied in the computer vision community, largely due to the lack of a high-quality dataset. To fill this gap, in this work we propose a new dataset to facilitate relevant research. Our dataset enjoys the following characteristics: (1) It is by far the largest dataset in terms of both product image quantity and product categories. (2) It includes single-product images taken in a controlled environment and multi-product images taken by the checkout system. (3) It provides different levels of annotations for the check-out images. Compared with the existing datasets, ours is closer to the realistic setting and can derive a variety of research problems. Besides the dataset, we also benchmark the performance on this dataset with various approaches. The dataset and related resources can be found at this https URL.
19.DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
As the foundation of driverless vehicles and intelligent robots, Simultaneous Localization and Mapping (SLAM) has attracted much attention these days. However, non-geometric modules of traditional SLAM algorithms are limited by data association tasks and have become a bottleneck preventing the development of SLAM. To deal with such problems, many researchers turn to deep learning for help. But most of these studies are limited to virtual datasets or specific environments, and even sacrifice efficiency for accuracy. Thus, they are not practical enough.
We propose the DF-SLAM system, which uses deep local feature descriptors obtained by a neural network as a substitute for traditional handcrafted features. Experimental results demonstrate its improvements in efficiency and stability. DF-SLAM outperforms popular traditional SLAM systems in various scenes, including challenging scenes with intense illumination changes. Its versatility and mobility fit well into the need for exploring new environments. Since we adopt a shallow network to extract local descriptors and keep the other components the same as in the original SLAM systems, our DF-SLAM can still run in real-time on GPU.
20.Unsupervised Automated Event Detection using an Iterative Clustering based Segmentation Approach
A class of vision problems, less commonly studied, consists of detecting objects in imagery obtained from physics-based experiments. These objects can span 4D (x, y, z, t) and are visible as disturbances (caused by physical phenomena) in the image, with the background distribution being approximately uniform. Such objects, occasionally referred to as 'events', can be considered as high energy blobs in the image. Unlike the images analyzed in conventional vision problems, very limited features are associated with such events, and their shape, size and count can vary significantly. This poses a challenge to the use of pre-trained models obtained from supervised approaches.
In this paper, we propose an unsupervised approach involving iterative clustering based segmentation (ICS) which can detect target objects (events) in real-time. In this approach, a test image is analyzed over several cycles, and one event is identified per cycle. Each cycle consists of the following steps: (1) image segmentation using a modified k-means clustering method, (2) elimination of empty (with no events) segments based on statistical analysis of each segment, (3) merging segments that overlap (correspond to same event), and (4) selecting the strongest event. These four steps are repeated until all the events have been identified. The ICS approach consists of a few hyper-parameters that have been chosen based on statistical study performed over a set of test images. The applicability of ICS method is demonstrated on several 2D and 3D test examples.
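The cycle structure described above (cluster, test against background statistics, pick the strongest event, remove it, repeat) can be sketched in a heavily simplified form. This is not the authors' method: plain two-cluster k-means on intensity stands in for their modified k-means, the merging step (3) is replaced by blanking a fixed window around each detected peak, and the names and parameters `n_sigma`, `radius` and `max_events` are hypothetical.

```python
import numpy as np


def two_means_threshold(values, iters=20):
    """1-D k-means with k=2 on intensities; returns the midpoint between the two centres."""
    lo, hi = float(values.min()), float(values.max())
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        low, high = values[values <= mid], values[values > mid]
        if low.size == 0 or high.size == 0:
            break
        lo, hi = float(low.mean()), float(high.mean())
    return (lo + hi) / 2.0


def detect_events(image, n_sigma=3.0, radius=2, max_events=10):
    """Identify one event per cycle until no bright cluster stands out from the background."""
    work = np.asarray(image, dtype=float).copy()
    events = []
    for _ in range(max_events):
        # (1) Segment the image into a bright and a dark cluster.
        mask = work > two_means_threshold(work)
        if not mask.any():
            break
        # (2) Discard the bright cluster if it is statistically indistinguishable
        #     from the background.
        background = work[~mask]
        if work[mask].mean() <= background.mean() + n_sigma * background.std():
            break
        # (4) Take the brightest pixel of the bright cluster as the strongest event.
        peak = np.unravel_index(np.argmax(np.where(mask, work, -np.inf)), work.shape)
        events.append(tuple(int(p) for p in peak))
        # Stand-in for step (3): blank a window around the peak so the next cycle
        # finds a different event.
        window = tuple(slice(max(p - radius, 0), p + radius + 1) for p in peak)
        work[window] = np.median(background)
    return events
```

The same code accepts 3D arrays, since the clustering operates on intensities only and the window is built per axis.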
21.Fully Convolutional Network-based Multi-Task Learning for Rectum and Rectal Cancer Segmentation
In this study, we present a fully automatic method to segment both rectum and rectal cancer based on Deep Neural Networks (DNNs) with axial T2-weighted Magnetic Resonance images. Clinically, the relative location between rectum and rectal cancer plays an important role in cancer treatment planning. Such a need motivates us to propose a fully convolutional architecture for Multi-Task Learning (MTL) to segment both rectum and rectal cancer. Moreover, we propose a bias-variance decomposition-based method which can visualize and assess regional robustness of the segmentation model. In addition, we also suggest a novel augmentation method which can improve the segmentation performance as well as reduce the training time. Overall, our proposed method is not only computationally efficient due to its fully convolutional nature but also outperforms the current state-of-the-art for rectal cancer segmentation. It also achieves high accuracy in rectum segmentation, for which no prior study has been reported. Moreover, we conclude that supplementing rectum information benefits the rectal cancer segmentation model, especially in model variance.
22.CAE-P: Compressive Autoencoder with Pruning Based on ADMM
Since the compressive autoencoder (CAE) was proposed, the autoencoder, as a simple and efficient neural network model, has achieved better performance than traditional codecs such as JPEG [3] and JPEG 2000 [4] in lossy image compression. However, it faces the problem that the bitrate, characterizing the compression ratio, cannot be optimized by general methods due to its discreteness. Current research additionally trains an entropy estimator to indirectly optimize the bitrate. In this paper, we propose the compressive autoencoder with pruning based on ADMM (CAE-P), which replaces the traditionally used entropy estimating technique with an ADMM pruning method inspired by the field of neural network architecture search and avoids the extra effort needed for training an entropy estimator. We tested our models on the natural image dataset Kodak PhotoCD and achieved better results than the original CAE model, which relies on entropy coding, along with traditional codecs. We further explored the effectiveness of the ADMM-based pruning method in CAE-P by looking into the detail of latent codes learned by the model.
23.Efficient Image Splicing Localization via Contrastive Feature Extraction
In this work, we propose a new data visualization and clustering technique for discovering discriminative structures in high-dimensional data. This technique, referred to as cPCA++, utilizes the fact that the interesting features of a "target" dataset may be obscured by high variance components during traditional PCA. By analyzing what is referred to as a "background" dataset (i.e., one that exhibits the high variance principal components but not the interesting structures), our technique is capable of efficiently highlighting the structure that is unique to the "target" dataset. Similar to another recently proposed algorithm called "contrastive PCA" (cPCA), the proposed cPCA++ method identifies important dataset specific patterns that are not detected by traditional PCA in a wide variety of settings. However, the proposed cPCA++ method is significantly more efficient than cPCA, because it does not require the parameter sweep in the latter approach. We applied the cPCA++ method to the problem of image splicing localization. In this application, we utilize authentic edges as the background dataset and the spliced edges as the target dataset. The proposed method is significantly more efficient than state-of-the-art methods, as the former does not require iterative updates of filter weights via stochastic gradient descent and backpropagation, nor the training of a classifier. Furthermore, the cPCA++ method is shown to provide performance scores comparable to the state-of-the-art Multi-task Fully Convolutional Network (MFCN).
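One way to realize the contrast between a target and a background dataset without a parameter sweep is to look for directions that maximize target variance relative to background variance, which leads to the eigenvectors of R_b^{-1} R_t, where R_t and R_b are the target and background covariance matrices. The sketch below follows that formulation as an assumption about cPCA++; the abstract itself does not spell out the objective, and the function name and the `eps` regularizer are illustrative.

```python
import numpy as np


def contrastive_directions(target, background, n_components=1, eps=1e-6):
    """Directions with high target variance relative to background variance."""
    t = target - target.mean(axis=0)
    b = background - background.mean(axis=0)
    r_t = t.T @ t / len(t)
    # Small ridge term (an assumption) keeps the background covariance invertible.
    r_b = b.T @ b / len(b) + eps * np.eye(t.shape[1])
    # Eigenvectors of R_b^{-1} R_t, computed via a linear solve instead of an explicit inverse.
    vals, vecs = np.linalg.eig(np.linalg.solve(r_b, r_t))
    order = np.argsort(-vals.real)[:n_components]
    return vecs[:, order].real
```

In a toy setting where both datasets share one dominant high-variance axis and only the target varies along a second axis, ordinary PCA on the target picks the shared axis, whereas this ratio-based criterion picks the axis unique to the target.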
24.Energy Confused Adversarial Metric Learning for Zero-Shot Image Retrieval and Clustering
Deep metric learning has been widely applied in many computer vision tasks, and recently, it has become more attractive in zero-shot image retrieval and clustering (ZSRC), where a good embedding is required such that the unseen classes can be distinguished well. Most existing works deem this 'good' embedding just to be the discriminative one and thus race to devise powerful metric objectives or hard-sample mining strategies for learning discriminative embeddings. However, in this paper, we first emphasize that the generalization ability is a core ingredient of this 'good' embedding as well and largely affects the metric performance in zero-shot settings as a matter of fact. Then, we propose the Energy Confused Adversarial Metric Learning (ECAML) framework to explicitly optimize a robust metric. It is mainly achieved by introducing an interesting Energy Confusion regularization term, which daringly breaks away from the traditional metric learning idea of discriminative objective devising, and seeks to 'confuse' the learned model so as to encourage its generalization ability by reducing overfitting on the seen classes. We train this confusion term together with the conventional metric objective in an adversarial manner. Although it seems weird to 'confuse' the network, we show that our ECAML indeed serves as an efficient regularization technique for metric learning and is applicable to various conventional metric methods. This paper empirically and experimentally demonstrates the importance of learning embeddings with good generalization, achieving state-of-the-art performance on the popular CUB, CARS, Stanford Online Products and In-Shop datasets for ZSRC tasks. Code available at this http URL.
25.Linearized Multi-Sampling for Differentiable Image Transformation pdf
We propose a novel image sampling method for differentiable image transformation in deep neural networks. The sampling schemes currently used in deep learning, such as Spatial Transformer Networks, rely on bilinear interpolation, which performs poorly under severe scale changes, and more importantly, results in poor gradient propagation. This is due to their strict reliance on direct neighbors. Instead, we propose to generate random auxiliary samples in the vicinity of each pixel in the sampled image, and create a linear approximation using their intensity values. We then use this approximation as a differentiable formula for the transformed image. However, we observe that these auxiliary samples may collapse to a single pixel under severe image transformations, and propose to address it by adding constraints to the distance between the center pixel and the auxiliary samples. We demonstrate that our approach produces more representative gradients with a wider basin of convergence for image alignment, which leads to considerable performance improvements when training networks for image registration and classification tasks, particularly under large downsampling.
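The sampling idea can be sketched for a single output pixel: draw random auxiliary samples around the sampling location, read their intensities, and fit a local linear model whose offset is the sampled value and whose slope is the spatial gradient. This is a minimal sketch under assumptions, not the authors' implementation: the Gaussian offsets, the `min_dist` rejection rule standing in for the distance constraint, and all function names are hypothetical.

```python
import random

def bilinear(img, x, y):
    """Bilinear interpolation of a 2D list `img` at real-valued (x, y), clamped to the image."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = min(int(x), w - 2), min(int(y), h - 2)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x0 + 1] * fx
    bot = img[y0 + 1][x0] * (1 - fx) + img[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bot * fy

def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def linearized_sample(img, x, y, k=8, sigma=1.0, min_dist=0.5, rng=None):
    """Fit I(x+dx, y+dy) ~ c + a*dx + b*dy over the center and k auxiliary samples.

    Returns (c, (a, b)): the sampled value and its gradient with respect to (x, y).
    """
    rng = rng or random.Random(0)
    offsets = [(0.0, 0.0)]
    while len(offsets) < k + 1:
        dx, dy = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        # Assumed constraint: reject samples too close to the center so they cannot collapse.
        if (dx * dx + dy * dy) ** 0.5 >= min_dist:
            offsets.append((dx, dy))
    # Accumulate the normal equations of the least-squares fit.
    A = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for dx, dy in offsets:
        feats = (1.0, dx, dy)
        value = bilinear(img, x + dx, y + dy)
        for i in range(3):
            rhs[i] += feats[i] * value
            for j in range(3):
                A[i][j] += feats[i] * feats[j]
    c, a, b = solve3(A, rhs)
    return c, (a, b)
```

Because the gradient comes from a fit over a neighborhood rather than from the four direct neighbors, widening `sigma` is one way to make it reflect coarser image structure.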
26.Robust Angular Local Descriptor Learning pdf
In recent years, learned local descriptors have outperformed handcrafted ones by a large margin, due to powerful deep convolutional neural network architectures such as L2-Net [1] and triplet-based metric learning [2]. However, there are two problems in the current methods, which hinder the overall performance. First, the widely used margin loss is sensitive to incorrect correspondences, which are prevalent in the existing local descriptor learning datasets. Second, the L2 distance ignores the fact that the feature vectors have been normalized to unit norm. To tackle these two problems and further boost the performance, we propose a robust angular loss which 1) uses cosine similarity instead of L2 distance to compare descriptors and 2) relies on a robust loss function that gives a smaller penalty to triplets with negative relative similarity. The resulting descriptor shows robustness on different datasets, reaching the state-of-the-art result on the Brown dataset, as well as demonstrating excellent generalization ability on the HPatches dataset and a Wide Baseline Stereo dataset.
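The link between L2 distance and cosine similarity on unit-norm descriptors can be checked numerically: for unit vectors, the squared L2 distance equals 2 - 2 cos. The sketch below verifies this identity. `relative_similarity` reflects one assumed reading of the term (similarity to the positive minus similarity to the negative); the abstract does not define it, and the robust loss itself is not reproduced here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def relative_similarity(anchor, positive, negative):
    """Assumed definition: cos(anchor, positive) - cos(anchor, negative).

    A negative value means the labeled negative looks closer than the positive,
    which can happen when the ground-truth correspondence is wrong.
    """
    return cosine_similarity(anchor, positive) - cosine_similarity(anchor, negative)

a, b = l2_normalize([3.0, 4.0]), l2_normalize([4.0, 3.0])
# For unit vectors the two quantities below agree up to rounding error.
lhs = squared_l2(a, b)
rhs = 2.0 - 2.0 * cosine_similarity(a, b)
```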
27.On Compression of Unsupervised Neural Nets by Pruning Weak Connections pdf
Unsupervised neural nets such as Restricted Boltzmann Machines (RBMs) and Deep Belief Networks (DBNs) are powerful in automatic feature extraction, unsupervised weight initialization and density estimation. In this paper, we demonstrate that the parameters of these neural nets can be dramatically reduced without affecting their performance. We describe a method to reduce the parameters required by the RBM, which is the basic building block for deep architectures. Further, we propose an unsupervised sparse deep architecture selection algorithm to form sparse deep neural networks. Experimental results show that there is virtually no loss in either generative or discriminative performance.
28.MIMIC-CXR: A large publicly available database of labeled chest radiographs pdf
Chest radiography is an extremely powerful imaging modality, allowing for a detailed inspection of a patient's thorax, but requiring specialized training for proper interpretation. With the advent of high performance general purpose computer vision algorithms, the accurate automated analysis of chest radiographs is becoming increasingly of interest to researchers. However, a key challenge in the development of these techniques is the lack of sufficient data. Here we describe MIMIC-CXR, a large dataset of 371,920 chest x-rays associated with 227,943 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 and 2016. Each imaging study can pertain to one or more images, but is most often associated with two images: a frontal view and a lateral view. Images are provided with 14 labels derived from a natural language processing tool applied to the corresponding free-text radiology reports. All images have been de-identified to protect patient privacy. The dataset is made freely available to facilitate and encourage a wide range of research in medical computer vision.
29.CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison pdf
Large, labeled datasets have driven deep learning methods to achieve expert-level performance on a variety of medical imaging tasks. We present CheXpert, a large dataset that contains 224,316 chest radiographs of 65,240 patients. We design a labeler to automatically detect the presence of 14 observations in radiology reports, capturing uncertainties inherent in radiograph interpretation. We investigate different approaches to using the uncertainty labels for training convolutional neural networks that output the probability of these observations given the available frontal and lateral radiographs. On a validation set of 200 chest radiographic studies which were manually annotated by 3 board-certified radiologists, we find that different uncertainty approaches are useful for different pathologies. We then evaluate our best model on a test set composed of 500 chest radiographic studies annotated by a consensus of 5 board-certified radiologists, and compare the performance of our model to that of 3 additional radiologists in the detection of 5 selected pathologies. On Cardiomegaly, Edema, and Pleural Effusion, the model ROC and PR curves lie above all 3 radiologist operating points. We release the dataset to the public as a standard benchmark to evaluate performance of chest radiograph interpretation models.
The dataset is freely available at this https URL .
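Training with uncertainty labels can be made concrete with a small sketch. The abstract does not enumerate the approaches it compares, so the three policies below (treat uncertain as negative, treat it as positive, or mask it out of the loss) are illustrative assumptions, as are the label encoding (1 positive, 0 negative, -1 uncertain) and the function names.

```python
import math

UNCERTAIN = -1  # assumed encoding of an uncertain label

def apply_uncertainty_policy(labels, policy):
    """Turn labels in {1, 0, -1} into (targets, mask) for a binary loss.

    policy: "zeros" maps uncertain to 0, "ones" maps it to 1,
            "ignore" keeps a dummy target and masks the entry out.
    """
    targets, mask = [], []
    for value in labels:
        if value != UNCERTAIN:
            targets.append(float(value))
            mask.append(1.0)
        elif policy == "zeros":
            targets.append(0.0)
            mask.append(1.0)
        elif policy == "ones":
            targets.append(1.0)
            mask.append(1.0)
        elif policy == "ignore":
            targets.append(0.0)
            mask.append(0.0)
        else:
            raise ValueError(f"unknown policy: {policy}")
    return targets, mask

def masked_bce(probs, targets, mask):
    """Binary cross-entropy averaged over the unmasked entries only."""
    total = 0.0
    for p, t, m in zip(probs, targets, mask):
        total += m * -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    kept = sum(mask)
    return total / kept if kept else 0.0

labels = [1, UNCERTAIN, 0]
targets, mask = apply_uncertainty_policy(labels, "ignore")
loss = masked_bce([0.5, 0.9, 0.5], targets, mask)
```

Choosing the policy per observation is one way to act on the finding that different uncertainty approaches suit different pathologies.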
30.Understanding the Impact of Label Granularity on CNN-based Image Classification pdf
In recent years, supervised learning using Convolutional Neural Networks (CNNs) has achieved great success in image classification tasks, and large scale labeled datasets have contributed significantly to this achievement. However, the definition of a label is often application dependent. For example, an image of a cat can be labeled as "cat" or perhaps more specifically "Persian cat." We refer to this as label granularity. In this paper, we conduct extensive experiments using various datasets to demonstrate and analyze how and why training based on fine-grain labeling, such as "Persian cat" can improve CNN accuracy on classifying coarse-grain classes, in this case "cat." The experimental results show that training CNNs with fine-grain labels improves both network's optimization and generalization capabilities, as intuitively it encourages the network to learn more features, and hence increases classification accuracy on coarse-grain classes under all datasets considered. Moreover, fine-grain labels enhance data efficiency in CNN training. For example, a CNN trained with fine-grain labels and only 40% of the total training data can achieve higher accuracy than a CNN trained with the full training dataset and coarse-grain labels. These results point to two possible applications of this work: (i) with sufficient human resources, one can improve CNN performance by re-labeling the dataset with fine-grain labels, and (ii) with limited human resources, to improve CNN performance, rather than collecting more training data, one may instead use fine-grain labels for the dataset. We further propose a metric called Average Confusion Ratio to characterize the effectiveness of fine-grain labeling, and show its use through extensive experimentation. Code is available at this https URL.
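Evaluating a fine-grain classifier on coarse-grain classes needs a rule for collapsing its outputs. The abstract does not say which rule is used; the sketch below assumes the simplest one, summing the predicted probabilities of all fine classes that share a coarse parent. The class names and the mapping are hypothetical.

```python
def coarse_prediction(fine_probs, fine_to_coarse):
    """Sum fine-class probabilities per coarse class and return (best_coarse, totals)."""
    totals = {}
    for fine_label, prob in fine_probs.items():
        coarse = fine_to_coarse[fine_label]
        totals[coarse] = totals.get(coarse, 0.0) + prob
    best = max(totals, key=totals.get)
    return best, totals

fine_to_coarse = {
    "Persian cat": "cat",
    "Siamese cat": "cat",
    "beagle": "dog",
}
# The single most likely fine class is "beagle", yet the coarse decision is "cat".
fine_probs = {"Persian cat": 0.3, "Siamese cat": 0.3, "beagle": 0.4}
best, totals = coarse_prediction(fine_probs, fine_to_coarse)
```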
31.Adversarial training with cycle consistency for unsupervised super-resolution in endomicroscopy pdf
In recent years, endomicroscopy has become increasingly used for diagnostic purposes and interventional guidance. It can provide intraoperative aids for real-time tissue characterization and can help to perform visual investigations aimed, for example, at discovering epithelial cancers. Due to physical constraints on the acquisition process, endomicroscopy images still have a low number of informative pixels, which hampers their quality. Post-processing techniques, such as Super-Resolution (SR), are a potential solution to increase the quality of these images. SR techniques are often supervised, requiring aligned pairs of low-resolution (LR) and high-resolution (HR) image patches to train a model. However, in our domain, the lack of HR images hinders the collection of such pairs and makes supervised training unsuitable. For this reason, we propose an unsupervised SR framework based on an adversarial deep neural network with a physically-inspired cycle consistency, designed to impose some acquisition properties on the super-resolved images. Our framework can exploit HR images, regardless of the domain they come from, to transfer the quality of the HR images to the initial LR images. This property can be particularly useful in all situations where LR/HR pairs are not available during training. Our quantitative analysis, validated using a database of 238 endomicroscopy video sequences from 143 patients, shows the ability of the pipeline to produce convincing super-resolved images. A Mean Opinion Score (MOS) study also confirms this quantitative image quality assessment.
32.Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going pdf
Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.
33.Segmentation of Lumen and External Elastic Laminae in Intravascular Ultrasound Images using Ultrasonic Backscattering Physics Initialized Multiscale Random Walks pdf
Coronary artery disease accounts for a large number of deaths across the world and clinicians generally prefer using x-ray computed tomography or magnetic resonance imaging for localizing vascular pathologies. Interventional imaging modalities like intravascular ultrasound (IVUS) are used as an adjunct for diagnosing atherosclerotic plaques in vessels, and help assess the morphological state of the vessel and plaque, which plays a significant role in treatment planning. Since speckle intensity in IVUS images is inherently stochastic and makes it hard for clinicians to see the vessel wall boundaries accurately, delineation calls for automation. In this paper we present a method for segmenting the lumen and external elastic laminae of the artery wall in IVUS images using random walks over a multiscale pyramid of Gaussian decomposed frames. The seeds for the random walker are initialized by supervised learning of ultrasonic backscattering and attenuation statistical mechanics from labelled training samples. We have experimentally evaluated the performance using 77 IVUS images acquired at 40 MHz that are available in the IVUS segmentation challenge dataset (this http URL) to obtain a Jaccard score of 0.89 ± 0.14 for lumen and 0.85 ± 0.12 for external elastic laminae segmentation over a 10-fold cross-validation study.
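For reference, the Jaccard score reported above is the intersection over union of the predicted and ground-truth masks. A minimal sketch on flattened binary masks follows; the function name and the convention of scoring two empty masks as 1.0 are assumptions.

```python
def jaccard(pred, truth):
    """Intersection over union of two equal-length binary masks (flattened)."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
score = jaccard(pred, truth)  # one shared pixel out of three marked pixels
```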
34.SUMNet: Fully Convolutional Model for Fast Segmentation of Anatomical Structures in Ultrasound Volumes pdf
Ultrasound imaging is generally employed for real-time investigation of the internal anatomy of the human body for disease identification. Delineation of the anatomical boundary of organs and pathological lesions is quite challenging due to the stochastic nature of speckle intensity in the images, which also introduces visual fatigue for the observer. This paper introduces a fully convolutional neural network based method to segment organs and pathologies in ultrasound volumes by learning the spatial relationship between closely related classes in the presence of stochastically varying speckle intensity. We propose a convolutional encoder-decoder like framework with (i) feature concatenation across matched layers in encoder and decoder and (ii) index passing based unpooling at the decoder for semantic segmentation of ultrasound volumes. We have experimentally evaluated the performance on publicly available datasets consisting of 10 intravascular ultrasound pullbacks acquired at 20 MHz and 16 freehand thyroid ultrasound volumes acquired at 11-16 MHz. We have obtained Dice scores of 0.93 ± 0.08 and 0.92 ± 0.06, respectively, following a 10-fold cross-validation experiment, while processing a frame of 256 × 384 pixels in 0.035 s and a volume of 256 × 384 × 384 voxels in 13.44 s.
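The Dice score used above is twice the overlap divided by the total size of the two masks. The sketch below computes it on flattened binary masks and shows the standard conversion from a Jaccard score J, Dice = 2J / (1 + J). The function names and the empty-mask convention are assumptions.

```python
def dice(pred, truth):
    """Dice coefficient of two equal-length binary masks (flattened)."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 2.0 * intersection / size_sum if size_sum else 1.0

def dice_from_jaccard(j):
    """Convert a Jaccard score to the equivalent Dice score."""
    return 2.0 * j / (1.0 + j)

pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
score = dice(pred, truth)  # 2 * 1 / (2 + 2)
```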
35.A Fourier Disparity Layer representation for Light Fields pdf
In this paper, we present a new Light Field representation for efficient Light Field processing and rendering called Fourier Disparity Layers (FDL). The proposed FDL representation samples the Light Field in the depth (or equivalently the disparity) dimension by decomposing the scene as a discrete sum of layers. The layers can be constructed from various types of Light Field inputs including a set of sub-aperture images, a focal stack, or even a combination of both. From our derivations in the Fourier domain, the layers are simply obtained by a regularized least square regression performed independently at each spatial frequency, which is efficiently parallelized in a GPU implementation. Our model is also used to derive a gradient descent based calibration step that estimates the input view positions and an optimal set of disparity values required for the layer construction. Once the layers are known, they can be simply shifted and filtered to produce different viewpoints of the scene while controlling the focus and simulating a camera aperture of arbitrary shape and size. Our implementation in the Fourier domain allows real time Light Field rendering. Finally, direct applications such as view interpolation or extrapolation and denoising are presented and evaluated.
36.Skeleton-based Action Recognition of People Handling Objects pdf
In visual surveillance systems, it is necessary to recognize the behavior of people handling objects such as a phone, a cup, or a plastic bag. In this paper, to address this problem, we propose a new framework for recognizing object-related human actions by graph convolutional networks using human and object poses. In this framework, we construct skeletal graphs of reliable human poses by selectively sampling the informative frames in a video, which include human joints with high confidence scores obtained in pose estimation. The skeletal graphs generated from the sampled frames represent human poses related to the object position in both the spatial and temporal domains, and these graphs are used as inputs to the graph convolutional networks. Through experiments over an open benchmark and our own data sets, we verify the validity of our framework in that our method outperforms the state-of-the-art method for skeleton-based action recognition.
37.Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos pdf
The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate this task as a problem of sequential decision making by learning an agent which regulates the temporal grounding boundaries progressively based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, and it shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset and the Charades-STA dataset while observing only 10 or fewer clips per video.
38.Salient Object Detection with Lossless Feature Reflection and Weighted Structural Loss pdf
Salient object detection (SOD), which aims to identify and locate the most salient pixels or regions in images, has been attracting more and more interest due to its various real-world applications. However, this vision task is quite challenging, especially under complex image scenes. Inspired by the intrinsic reflection of natural images, in this paper we propose a novel feature learning framework for large-scale salient object detection. Specifically, we design a symmetrical fully convolutional network (SFCN) to effectively learn complementary saliency features under the guidance of lossless feature reflection. The location information, together with contextual and semantic information, of salient objects is jointly utilized to supervise the proposed network for more accurate saliency predictions. In addition, to overcome the blurry boundary problem, we propose a new weighted structural loss function to ensure clear object boundaries and spatially consistent saliency. The coarse prediction results are effectively refined by this structural information for performance improvements. Extensive experiments on seven saliency detection datasets demonstrate that our approach achieves consistently superior performance and outperforms the very recent state-of-the-art methods by a large margin.
39.Deep Level Sets: Implicit Surface Representations for 3D Shape Inference pdf
Existing 3D surface representation approaches are unable to accurately classify pixels and their orientation lying on the boundary of an object, resulting in coarse representations which usually require post-processing steps to extract 3D surface meshes. To overcome this limitation, we propose an end-to-end trainable model that directly predicts implicit surface representations of arbitrary topology by optimising a novel geometric loss function. Specifically, we propose to represent the output as an oriented level set of a continuous embedding function, and incorporate this in a deep end-to-end learning framework by introducing a variational shape inference formulation. We investigate the benefits of our approach on the task of 3D surface prediction and demonstrate its ability to produce a more accurate reconstruction compared to voxel-based representations. We further show that our model is flexible and can be applied to a variety of shape inference problems.
40.Semantic Image Networks for Human Action Recognition pdf
In this paper, we propose the use of a semantic image, an improved representation for video analysis, principally in combination with Inception networks. The semantic image is obtained by applying localized sparse segmentation using global clustering (LSSGC) prior to the approximate rank pooling which summarizes the motion characteristics in single or multiple images. It incorporates the background information by overlaying a static background from the window onto the subsequent segmented frames. The idea is to improve the action-motion dynamics by focusing on the region which is important for action recognition and encoding the temporal variances using the frame ranking method. We also propose the sequential combination of Inception-ResNetv2 and a long short-term memory network (LSTM) to leverage the temporal variances for improved recognition performance. Extensive analysis has been carried out on the UCF101 and HMDB51 datasets, which are widely used in action recognition studies. We show that (i) the semantic image generates better activations and converges faster than its original variant, (ii) using segmentation prior to approximate rank pooling yields better recognition performance, (iii) the use of LSTM leverages the temporal variance information from approximate rank pooling to model the action behavior better than the base network, (iv) the proposed representations are adaptive, as they can be used with existing methods such as temporal segment networks to improve the recognition performance, and (v) our proposed four-stream network architecture comprising semantic images and semantic optical flows achieves state-of-the-art performance, 95.9% and 73.5% recognition accuracy on UCF101 and HMDB51, respectively.
41.Dynamic Curriculum Learning for Imbalanced Data Classification pdf
Human attribute analysis is a challenging task in the field of computer vision, since the data distribution is largely imbalanced. Common techniques such as re-sampling and cost-sensitive learning require prior knowledge to train the system. To address this problem, we propose a unified framework called Dynamic Curriculum Learning (DCL) that adaptively adjusts the sampling strategy and loss learning online within a single batch, resulting in better generalization and discrimination. Inspired by curriculum learning, DCL consists of two-level curriculum schedulers: (1) a sampling scheduler that manages the data distribution not only from imbalanced to balanced but also from easy to hard; (2) a loss scheduler that controls the learning importance between the classification loss and the metric learning loss. Learning from these two schedulers, our DCL framework achieves new state-of-the-art performance on the widely used face attribute dataset CelebA and the pedestrian attribute dataset RAP.
42.Generating Text Sequence Images for Recognition pdf
Recently, methods based on deep learning have dominated the field of text recognition. With a large amount of training data, most of them can achieve state-of-the-art performance. However, it is hard to harvest and label sufficient text sequence images from real scenes. To mitigate this issue, several methods to synthesize text sequence images have been proposed, yet they usually need complicated preceding or follow-up steps. In this work, we present a method which is able to generate infinite training data without any auxiliary pre/post-process. We tackle the generation task as an image-to-image translation one and utilize conditional adversarial networks to produce realistic text sequence images in the light of the semantic ones. Several evaluation metrics are used to assess our method, and the results demonstrate that the quality of the data is satisfactory. The code and dataset will be publicly available soon.
43.Hybrid coarse-fine classification for head pose estimation pdf
Head pose estimation, which computes the intrinsic Euler angles (yaw, pitch, roll) from a target human head, is crucial for gaze estimation, face alignment and 3D reconstruction. Traditional approaches to head pose estimation rely heavily on the accuracy of facial landmarks, and solve the correspondence problem between 2D facial landmarks and a mean 3D head model (ad-hoc fitting procedures), which seriously limits their performance, especially when the face is not clearly visible. Existing landmark-free methods, however, either treat head pose estimation as a sub-problem or introduce extra error during problem reduction. Therefore, in this paper, we present an efficient hybrid coarse-fine classification scheme to deal with the issues above. First of all, we extend previous work with stricter fine classification by increasing the number of classes. Then, we introduce our hybrid coarse-fine classification scheme into the network. Integrate regression is adopted to get the final prediction. Our proposed approach to head pose estimation is evaluated on three challenging benchmarks: we achieve state-of-the-art results on AFLW2000 and BIWI, and our approach closes the gap with the state of the art on AFLW.
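One way to turn a fine angle classification into a continuous prediction is to take the expectation of the bin centers under the softmax distribution. The sketch below reads "integrate regression" in that sense, which is an assumption; the abstract gives no formula. The bin layout (66 bins of 3 degrees covering -99 to 99) and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_angle(logits, bin_width=3.0, min_angle=-99.0):
    """Expectation of the bin centers (in degrees) under the softmax of `logits`."""
    probs = softmax(logits)
    centers = [min_angle + bin_width * (i + 0.5) for i in range(len(logits))]
    return sum(p * c for p, c in zip(probs, centers))

# A confident prediction for bin 40, whose center is -99 + 3 * 40.5 = 22.5 degrees.
logits = [0.0] * 66
logits[40] = 50.0
yaw = expected_angle(logits)
```

Narrower bins make the classification stricter, while the expectation keeps the final output continuous rather than snapped to a bin center.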
44.LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators pdf
Layout is important for graphic design and scene generation. We propose a novel Generative Adversarial Network, called LayoutGAN, that synthesizes layouts by modeling geometric relations of different types of 2D elements. The generator of LayoutGAN takes as input a set of randomly-placed 2D graphic elements and uses self-attention modules to refine their labels and geometric parameters jointly to produce a realistic layout. Accurate alignment is critical for good layouts. We thus propose a novel differentiable wireframe rendering layer that maps the generated layout to a wireframe image, upon which a CNN-based discriminator is used to optimize the layouts in image space. We validate the effectiveness of LayoutGAN in various experiments including MNIST digit generation, document layout generation, clipart abstract scene generation and tangram graphic design.
45.Real-time 3D Face-Eye Performance Capture of a Person Wearing VR Headset pdf
Teleconferencing or telepresence based on a virtual reality (VR) head-mounted display (HMD) is a very interesting and promising application, since an HMD can provide an immersive experience for users. However, in order to facilitate face-to-face communication for HMD users, real-time 3D facial performance capture of a person wearing an HMD is needed, which is a very challenging task due to the large occlusion caused by the HMD. The few existing solutions are very complex either in setting or in approach, and they lack performance capture of 3D eye gaze movement. In this paper, we propose a convolutional neural network (CNN) based solution for real-time 3D face-eye performance capture of HMD users without complex modification to devices. To address the lack of training data, we generate a massive HMD face-label dataset by data synthesis and collect a VR-IR eye dataset from multiple subjects. Then, we train a dense-fitting network for the facial region and an eye gaze network to regress 3D eye model parameters. Extensive experimental results demonstrate that our system can efficiently and effectively produce in real time a vivid personalized 3D avatar with the correct identity, pose, expression and eye motion corresponding to the HMD user.
46.Pattern Generation Strategies for Improving Recognition of Handwritten Mathematical Expressions pdf
Recognition of Handwritten Mathematical Expressions (HMEs) is a challenging problem because of the ambiguity and complexity of two-dimensional handwriting. Moreover, the lack of large training data is a serious issue, especially for academic recognition systems. In this paper, we propose pattern generation strategies that generate shape and structural variations to improve the performance of recognition systems based on a small training set. For data generation, we employ the public databases: CROHME 2014 and 2016 of online HMEs. The first strategy employs local and global distortions to generate shape variations. The second strategy decomposes an online HME into sub-online HMEs to get more structural variations. The hybrid strategy combines both these strategies to maximize shape and structural variations. The generated online HMEs are converted to images for offline HME recognition. We tested our strategies in an end-to-end recognition system constructed from a recent deep learning model: Convolutional Neural Network and attention-based encoder-decoder. The results of experiments on the CROHME 2014 and 2016 databases demonstrate the superiority and effectiveness of our strategies: our hybrid strategy achieved classification rates of 48.78% and 45.60%, respectively, on these databases. These results are competitive compared to others reported in recent literature. Our generated datasets are openly available for research community and constitute a useful resource for the HME recognition research in future.
47.Fitting 3D Shapes from Partial and Noisy Point Clouds with Evolutionary Computing pdf
Point clouds obtained from photogrammetry are noisy and incomplete models of reality. We propose an evolutionary optimization methodology that is able to approximate the underlying object geometry on such point clouds. This approach assumes a priori knowledge on the 3D structure modeled and enables the identification of a collection of primitive shapes approximating the scene. Built-in mechanisms that enforce high shape diversity and adaptive population size make this method suitable to modeling both simple and complex scenes. We focus here on the case of cylinder approximations and we describe, test, and compare a set of mutation operators designed for optimal exploration of their search space. We assess the robustness and limitations of this algorithm through a series of synthetic examples, and we finally demonstrate its general applicability on two real-life cases in vegetation and industrial settings.
48.Visual Entailment: A Novel Task for Fine-Grained Image Understanding pdf
Existing visual reasoning datasets such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning but the dataset is synthetic and consists of similar objects and sentence structures across the dataset.
In this paper, we introduce a new inference task, Visual Entailment (VE) - consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset SNLI-VE based on the Stanford Natural Language Inference corpus and Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71% accuracy and outperforms several other state-of-the-art VQA based models. Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at this https URL necla-ml/SNLI-VE.
49.Localizing dexterous surgical tools in X-ray for image-based navigation pdf
X-ray image based surgical tool navigation is fast and supplies accurate images of deep seated structures. Typically, recovering the 6 DOF rigid pose and deformation of tools with respect to the X-ray camera can be accurately achieved through intensity-based 2D/3D registration of 3D images or models to 2D X-rays. However, the capture range of image-based 2D/3D registration is inconveniently small suggesting that automatic and robust initialization strategies are of critical importance. This manuscript describes a first step towards leveraging semantic information of the imaged object to initialize 2D/3D registration within the capture range of image-based registration by performing concurrent segmentation and localization of dexterous surgical tools in X-ray images.
We present a learning-based strategy to simultaneously localize and segment dexterous surgical tools in X-ray images and demonstrate promising performance on synthetic and ex vivo data. We are currently investigating methods to use semantic information extracted by the proposed network to reliably and robustly initialize image-based 2D/3D registration.
While image-based 2D/3D registration has been an obvious focus of the CAI community, robust initialization thereof (albeit critical) has largely been neglected. This manuscript discusses learning-based retrieval of semantic information on imaged-objects as a stepping stone for such initialization and may therefore be of interest to the IPCAI community. Since results are still preliminary and only focus on localization, we target the Long Abstract category.
50.Improved Selective Refinement Network for Face Detection pdf
As a long-standing problem in computer vision, face detection has attracted much attention in recent decades for its practical applications. With the availability of the WIDER FACE face detection benchmark, much progress has been made by various algorithms in recent years. Among them, the Selective Refinement Network (SRN) face detector introduces two-step classification and regression operations selectively into an anchor-based face detector to reduce false positives and improve location accuracy simultaneously. Moreover, it designs a receptive field enhancement block to provide more diverse receptive fields. In this report, to further improve the performance of SRN, we exploit some existing techniques via extensive experiments, including a new data augmentation strategy, an improved backbone network, MS COCO pretraining, a decoupled classification module, a segmentation branch and a Squeeze-and-Excitation block. Some of these techniques bring performance improvements, while a few of them do not adapt well to our baseline. As a consequence, we present an improved SRN face detector by combining these useful techniques and obtain the best performance on the widely used WIDER FACE face detection benchmark.
51.Binary Image Selection (BISON): Interpretable Evaluation of Visual Grounding pdf
Providing systems the ability to relate linguistic and visual content is one of the hallmarks of computer vision. Tasks such as image captioning and retrieval were designed to test this ability, but come with complex evaluation measures that gauge various other abilities and biases simultaneously. This paper presents an alternative evaluation task for visual-grounding systems: given a caption the system is asked to select the image that best matches the caption from a pair of semantically similar images. The system's accuracy on this Binary Image SelectiON (BISON) task is not only interpretable, but also measures the ability to relate fine-grained text content in the caption to visual content in the images. We gathered a BISON dataset that complements the COCO Captions dataset and used this dataset in auxiliary evaluations of captioning and caption-based retrieval systems. While captioning measures suggest visual grounding systems outperform humans, BISON shows that these systems are still far away from human performance.
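The BISON protocol described above reduces to a simple accuracy computation: for each caption, the system is correct when it scores the matching image above the semantically similar distractor. The following is a minimal sketch under stated assumptions; `toy_score` and the tag-set image representation are hypothetical stand-ins for a real caption-image scoring model and are not part of the BISON dataset or tooling.

```python
def bison_accuracy(examples, score):
    """Fraction of (caption, positive, distractor) triples where the
    positive image receives a strictly higher score than the distractor."""
    correct = 0
    for caption, positive_image, distractor_image in examples:
        if score(caption, positive_image) > score(caption, distractor_image):
            correct += 1
    return correct / len(examples)


def toy_score(caption, image_tags):
    """Hypothetical scorer: count caption words that appear in the image's tag set."""
    return sum(1 for word in caption.split() if word in image_tags)


# Images are represented by made-up tag sets purely for illustration.
examples = [
    ("a dog on a beach", {"dog", "beach"}, {"dog", "park"}),
    ("two cats on a sofa", {"cat", "chair"}, {"cats", "sofa"}),
]
accuracy = bison_accuracy(examples, toy_score)
print(accuracy)  # 0.5: the toy scorer gets the first pair right and the second wrong
```

Because each decision is binary and the two images are similar, a score of 0.5 corresponds to chance, which is what makes the measure easy to interpret.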
52.Face Detection and Face Recognition In the Wild Using Off-the-Shelf Freely Available Components pdf
This paper presents an easy and efficient face detection and face recognition approach using free software components from the internet. Face detection and face recognition have wide applications in home and office security. Therefore, this work will be helpful for those searching for a free off-the-shelf face detection system. Using this system, faces can be detected in uncontrolled environments. In the detection phase, every individual face is detected, and in the recognition phase the detected faces are compared with the faces in a given data set and recognized.
53.Design of Real-time Semantic Segmentation Decoder for Automated Driving pdf
Semantic segmentation remains a computationally intensive algorithm for embedded deployment even with the rapid growth of computation power. Thus efficient network design is a critical aspect, especially for applications like automated driving, which require real-time performance. Recently, there has been a lot of research on designing efficient encoders that are mostly task agnostic. Unlike in image classification and bounding box object detection tasks, decoders are computationally expensive as well for the semantic segmentation task. In this work, we focus on efficient design of the segmentation decoder and assume that an efficient encoder is already designed to provide shared features for a multi-task learning system. We design a novel efficient non-bottleneck layer and a family of decoders which fit into a small run-time budget using VGG10 as an efficient encoder. We demonstrate on our dataset that experimentation with various design choices led to an improvement of 10% over a baseline performance.
54.Consistent Optimization for Single-Shot Object Detection pdf
We present consistent optimization for single stage object detection. Previous single stage object detectors usually rely on regular, densely sampled anchors to generate hypotheses for the optimization of the model. Through an examination of the behavior of the detector, we observe that the misalignment between the optimization target and inference configurations has hindered the performance improvement. We propose to bridge this gap by consistent optimization, which is an extension of the traditional single stage detector's optimization strategy. Consistent optimization focuses on matching the training hypotheses and the inference quality by utilizing the refined anchors during training. To evaluate its effectiveness, we explore various design choices based on the state-of-the-art RetinaNet detector. We demonstrate it is the consistent optimization, not the architecture design, that yields the performance boosts. Consistent optimization is nearly cost-free, and achieves stable performance gains independent of the model capacities or input scales. Specifically, utilizing consistent optimization improves RetinaNet from 39.1 AP to 40.1 AP on COCO dataset without any bells or whistles, which surpasses the accuracy of all existing state-of-the-art one-stage detectors when adopting ResNet-101 as backbone. The code will be made available.
55.Comparative Performance Analysis of Image De-noising Techniques pdf
Noise is an important factor which, when added to an image, reduces its quality and appearance. To enhance image quality, noise has to be removed while preserving the textural information and structural features of the image. Different types of noise corrupt images, and the selection of a denoising algorithm is application dependent. Hence, it is necessary to know what noise is present in the image in order to select the appropriate denoising algorithm. The objective of this paper is to present a brief account of the types of noise and of different noise removal algorithms. In the first section, types of noise are discussed on the basis of their additive and multiplicative nature. In the second section, a precise classification and analysis of the different potential image denoising algorithms is presented. At the end of the paper, a comparative study of all these algorithms in the context of performance evaluation is carried out, concluding with several promising directions for future research work.
56.Image De-Noising For Salt and Pepper Noise by Introducing New Enhanced Filter pdf
When an image is formed, factors such as lighting (spectra, source, and intensity) and camera characteristics (sensor response, lenses) affect the appearance of the image. Therefore, the prime factor that reduces the quality of the image is noise. It hides the important details and information of images. In order to enhance the quality of the image, the removal of noise becomes imperative, and it should not come at the cost of any loss of image information. Noise removal is one of the pre-processing stages of image processing. In this paper a new method for the enhancement of grayscale images is introduced for images corrupted by fixed-value impulse noise (salt and pepper noise). The proposed methodology ensures a better output for low and medium densities of fixed-value impulse noise as compared to other well-known filters such as the Standard Median Filter (SMF), Decision Based Median Filter (DBMF) and Modified Decision Based Median Filter (MDBMF). The main objective of the proposed method was to improve the peak signal to noise ratio (PSNR) and visual perception and to reduce the blurring of the image. The proposed algorithm replaces the noisy pixel by the trimmed mean value. When previous pixel values, 0s, and 255s are present in a particular window and all the pixel values are 0s and 255s, the remaining noisy pixels are replaced by the mean value. The gray-scale Mandrill and Lena images were tested with the proposed method. The experimental results show better peak signal to noise ratio (PSNR) and mean square error values, with better visual and human perception.
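The trimmed-mean idea in the abstract above can be illustrated with a short sketch. This is not the authors' filter: it is a generic trimmed-mean filter under the assumptions that noisy pixels are exactly 0 or 255, that a 3x3 window is used, and that a window containing only 0s and 255s falls back to the plain window mean. The function names are hypothetical. PSNR is computed with the standard definition 10·log10(peak² / MSE).

```python
import math


def trimmed_mean_filter(image, low=0, high=255):
    """Replace pixels equal to `low` or `high` (assumed salt-and-pepper noise)
    by the mean of the non-extreme values in their 3x3 neighbourhood."""
    height, width = len(image), len(image[0])
    output = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] not in (low, high):
                continue  # pixel is treated as noise-free and left unchanged
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(height, y + 2))
                      for i in range(max(0, x - 1), min(width, x + 2))]
            trimmed = [v for v in window if v not in (low, high)]
            # Assumed fallback: if every value is 0 or 255, use the window mean.
            values = trimmed if trimmed else window
            output[y][x] = round(sum(values) / len(values))
    return output


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    errors = [(r - t) ** 2
              for row_r, row_t in zip(reference, test)
              for r, t in zip(row_r, row_t)]
    mse = sum(errors) / len(errors)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


clean = [[10, 12, 10], [12, 11, 12], [10, 12, 10]]
noisy = [[10, 12, 10], [12, 255, 12], [10, 12, 10]]  # centre pixel hit by "salt"
restored = trimmed_mean_filter(noisy)
print(restored[1][1])  # 11, the mean of the eight uncorrupted neighbours
```

In this toy case the restored image equals the clean one, so `psnr(clean, restored)` is infinite, while `psnr(clean, noisy)` is finite and low.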
57.The RobotriX: An eXtremely Photorealistic and Very-Large-Scale Indoor Dataset of Sequences with Robot Trajectories and Interactions pdf
Enter the RobotriX, an extremely photorealistic indoor dataset designed to enable the application of deep learning techniques to a wide variety of robotic vision problems. The RobotriX consists of hyperrealistic indoor scenes which are explored by robot agents which also interact with objects in a visually realistic manner in that simulated world. Photorealistic scenes and robots are rendered by Unreal Engine into a virtual reality headset which captures gaze so that a human operator can move the robot and use controllers for the robotic hands; scene information is dumped on a per-frame basis so that it can be reproduced offline to generate raw data and ground truth labels. By taking this approach, we were able to generate a dataset of 38 semantic classes totaling 8M stills recorded at +60 frames per second with full HD resolution. For each frame, RGB-D and 3D information is provided with full annotations in both spaces. Thanks to the high quality and quantity of both raw information and annotations, the RobotriX will serve as a new milestone for investigating 2D and 3D robotic vision tasks with large-scale data-driven techniques.
58.Writer Independent Offline Signature Recognition Using Ensemble Learning pdf
The area of Handwritten Signature Verification has been broadly researched in the last decades, but remains an open research problem. In offline (static) signature verification, the dynamic information of the signature writing process is lost, and it is difficult to design good feature extractors that can distinguish genuine signatures from skilled forgeries. This verification task is even harder in writer-independent scenarios, which is undeniably fiscal for realistic cases. In this paper, we propose an ensemble model for the offline, writer-independent signature verification task with deep learning. We use two CNNs for feature extraction, followed by RGBT for classification and stacking to generate the final prediction vector. We have done extensive experiments on various datasets from various sources to maintain variance in the data. We achieve state-of-the-art performance on various datasets.
59.Endoscopic vs. volumetric OCT imaging of mastoid bone structure for pose estimation in minimally invasive cochlear implant surgery pdf
Purpose: The facial recess is a delicate structure that must be protected in minimally invasive cochlear implant surgery. Current research estimates the drill trajectory by using endoscopy of the unique mastoid patterns. However, missing depth information limits available features for a registration to preoperative CT data. Therefore, this paper evaluates OCT for enhanced imaging of drill holes in mastoid bone and compares OCT data to original endoscopic images.
Methods: A catheter-based OCT probe is inserted into a drill trajectory of a mastoid phantom in a translation-rotation manner to acquire the inner surface state. The images are undistorted and stitched to create volumetric data of the drill hole. The mastoid cell pattern is segmented automatically and compared to ground truth.
Results: The mastoid pattern segmented on images acquired with OCT shows a similarity of J = 73.6 % to ground truth based on endoscopic images, measured with the Jaccard metric. Leveraged by additional depth information, automated segmentation tends to be more robust and fail-safe compared to endoscopic images.
Conclusion: The feasibility of using a clinically approved OCT probe for imaging the drill hole in cochlear implantation is shown. The resulting volumetric images provide additional information on the shape of cavities in the bone structure, which will be useful for image-to-patient registration and to estimate the drill trajectory. This will be another step towards safe minimally invasive cochlear implantation.
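The Jaccard metric reported in the Results is the ratio of the intersection to the union of two binary segmentation masks, J = |A ∩ B| / |A ∪ B|. A minimal sketch follows; the tiny masks are made-up illustrations, not data from the paper, and returning 1.0 for two empty masks is a convention assumed here.

```python
def jaccard_index(mask_a, mask_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two equally sized binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Assumed convention: two empty masks are considered identical.
    return intersection / union if union else 1.0


segmented = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
print(jaccard_index(segmented, ground_truth))  # 0.5: 2 shared pixels out of 4 in the union
```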
60.Single MR Image Super-Resolution via Channel Splitting and Serial Fusion Network pdf
Spatial resolution is a critical imaging parameter in magnetic resonance imaging (MRI). Acquiring high resolution MRI data usually takes a long scanning time and is subject to motion artifacts due to hardware, physical, and physiological limitations. Single image super-resolution (SISR), especially that based on deep learning techniques, is an effective and promising alternative technique to improve the current spatial resolution of magnetic resonance (MR) images. However, deeper networks are more difficult to train effectively because the information is gradually weakened as the network deepens. This problem becomes more serious for medical images due to the degradation of training examples. In this paper, we present a novel channel splitting and serial fusion network (CSSFN) for single MR image super-resolution. Specifically, the proposed CSSFN network splits the hierarchical features into a series of subfeatures, which are then integrated together in a serial manner. Thus, the network becomes deeper and can deal with the subfeatures on different channels discriminatively. Besides, a dense global feature fusion (DGFF) is adopted to integrate the intermediate features, which further promotes the information flow in the network. Extensive experiments on several typical MR images show the superiority of our CSSFN model over other advanced SISR methods.
61.Deep Representation Learning Characterized by Inter-class Separation for Image Clustering pdf
Despite significant advances in clustering methods in recent years, the outcome of clustering a natural image dataset is still unsatisfactory due to two important drawbacks. Firstly, clustering of images needs a good feature representation of an image, and secondly, we need a robust method which can discriminate these features so that they belong to different clusters, such that intra-class variance is low and inter-class variance is high. Often these two aspects are dealt with independently, and thus the features are not sufficient to partition the data meaningfully. In this paper, we propose a method where we discover the features required for the separation of the images using a deep autoencoder. Our method learns the image representation features automatically for the purpose of clustering and also selects a coherent image and an incoherent image simultaneously for a given image, so that the feature representation learning can learn better discriminative features for grouping similar images in a cluster while separating dissimilar images across clusters. Experimental results show that our method produces significantly better results than the state-of-the-art methods, and we also show that our method generalizes better across different datasets without using any pre-trained model, unlike other existing methods.
62.Learning single-image 3D reconstruction by generative modelling of shape, pose and shading pdf
We present a unified framework tackling two problems: class-specific 3D reconstruction from a single image, and generation of new 3D shape samples. These tasks have received considerable attention recently; however, most existing approaches rely on 3D supervision, annotation of 2D images with keypoints or poses, and/or training with multiple views of each object instance. Our framework is very general: it can be trained in similar settings to existing approaches, while also supporting weaker supervision. Importantly, it can be trained purely from 2D images, without pose annotations, and with only a single view per instance. We employ meshes as an output representation, instead of voxels used in most prior work. This allows us to reason over lighting parameters and exploit shading information during training, which previous 2D-supervised methods cannot. Thus, our method can learn to generate and reconstruct concave object classes. We evaluate our approach in various settings, showing that: (i) it learns to disentangle shape from pose and lighting; (ii) using shading in the loss improves performance compared to just silhouettes; (iii) when using a standard single white light, our model outperforms state-of-the-art 2D-supervised methods, both with and without pose supervision, thanks to exploiting shading cues; (iv) performance improves further when using multiple coloured lights, even approaching that of state-of-the-art 3D-supervised methods; (v) shapes produced by our model capture smooth surfaces and fine details better than voxel-based approaches; and (vi) our approach supports concave classes such as bathtubs and sofas, which methods based on silhouettes cannot learn.
63.Learning a Deep Convolution Network with Turing Test Adversaries for Microscopy Image Super Resolution pdf
Adversarially trained deep neural networks have significantly improved performance of single image super resolution, by hallucinating photorealistic local textures, thereby greatly reducing the perception difference between a real high resolution image and its super resolved (SR) counterpart. However, application to medical imaging requires preservation of diagnostically relevant features while refraining from introducing any diagnostically confusing artifacts. We propose using a deep convolutional super resolution network (SRNet) trained for (i) minimising reconstruction loss between the real and SR images, and (ii) maximally confusing learned relativistic visual Turing test (rVTT) networks to discriminate between (a) pair of real and SR images (T1) and (b) pair of patches in real and SR selected from region of interest (T2). The adversarial loss of T1 and T2 while backpropagated through SRNet helps it learn to reconstruct pathorealism in the regions of interest such as white blood cells (WBC) in peripheral blood smears or epithelial cells in histopathology of cancerous biopsy tissues, which are experimentally demonstrated here. Experiments performed for measuring signal distortion loss using peak signal to noise ratio (pSNR) and structural similarity (SSIM) with variation of SR scale factors, impact of rVTT adversarial losses, and impact on reporting using SR on a commercially available artificial intelligence (AI) digital pathology system substantiate our claims.
64.Multisource Region Attention Network for Fine-Grained Object Recognition in Remote Sensing Imagery pdf
Fine-grained object recognition concerns the identification of the type of an object among a large number of closely related sub-categories. Multisource data analysis, that aims to leverage the complementary spectral, spatial, and structural information embedded in different sources, is a promising direction towards solving the fine-grained recognition problem that involves low between-class variance, small training set sizes for rare classes, and class imbalance. However, the common assumption of co-registered sources may not hold at the pixel level for small objects of interest. We present a novel methodology that aims to simultaneously learn the alignment of multisource data and the classification model in a unified framework. The proposed method involves a multisource region attention network that computes per-source feature representations, assigns attention scores to candidate regions sampled around the expected object locations by using these representations, and classifies the objects by using an attention-driven multisource representation that combines the feature representations and the attention scores from all sources. All components of the model are realized using deep neural networks and are learned in an end-to-end fashion. Experiments using RGB, multispectral, and LiDAR elevation data for classification of street trees showed that our approach achieved 64.2% and 47.3% accuracies for the 18-class and 40-class settings, respectively, which correspond to 13% and 14.3% improvement relative to the commonly used feature concatenation approach from multiple sources.
65.PadChest: A large chest x-ray image dataset with multi-label annotated reports pdf
We present a labeled large-scale, high resolution chest x-ray dataset for the automated exploration of medical images along with their associated reports. This dataset includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at San Juan Hospital (Spain) from 2009 to 2017, covering six different position views and additional information on image acquisition and patient demography. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. The labels generated were then validated in an independent test set achieving a 0.93 Micro-F1 score. To the best of our knowledge, this is the largest public chest x-ray database suitable for training supervised models concerning radiographs, and the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded from this http URL.
66.Spatial Broadcast Decoder: A Simple Architecture for Learning Disentangled Representations in VAEs pdf
We present a simple neural rendering architecture that helps variational autoencoders (VAEs) learn disentangled representations. Instead of the deconvolutional network typically used in the decoder of VAEs, we tile (broadcast) the latent vector across space, concatenate fixed X- and Y-"coordinate" channels, and apply a fully convolutional network with 1x1 stride. This provides an architectural prior for dissociating positional from non-positional features in the latent distribution of VAEs, yet without providing any explicit supervision to this effect. We show that this architecture, which we term the Spatial Broadcast decoder, improves disentangling, reconstruction accuracy, and generalization to held-out regions in data space. It provides a particularly dramatic benefit when applied to datasets with small objects. We also emphasize a method for visualizing learned latent spaces that helped us diagnose our models and may prove useful for others aiming to assess data representations. Finally, we show the Spatial Broadcast Decoder is complementary to state-of-the-art (SOTA) disentangling techniques and when incorporated improves their performance.
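The broadcast step described above (tile the latent vector across space, then concatenate fixed X and Y coordinate channels) can be sketched in NumPy. This is an illustrative sketch only: the coordinate range of [-1, 1], the channel-last layout, and the function name are assumptions, and the fully convolutional network that would consume the result is omitted.

```python
import numpy as np


def spatial_broadcast(z, height, width):
    """Tile a latent vector z of shape [k] over a height x width grid and
    append fixed x- and y-coordinate channels.

    Returns an array of shape [height, width, k + 2].
    """
    z = np.asarray(z, dtype=np.float32)
    tiled = np.broadcast_to(z, (height, width, z.shape[0]))
    # Assumed coordinate convention: evenly spaced values in [-1, 1].
    ys = np.linspace(-1.0, 1.0, height, dtype=np.float32)
    xs = np.linspace(-1.0, 1.0, width, dtype=np.float32)
    y_grid, x_grid = np.meshgrid(ys, xs, indexing="ij")
    return np.concatenate(
        [tiled, x_grid[..., None], y_grid[..., None]], axis=-1)


features = spatial_broadcast([0.5, -2.0], height=4, width=6)
print(features.shape)  # (4, 6, 4): two latent channels plus x and y
```

Every spatial location sees the same latent values, so any position-dependent output has to be derived from the coordinate channels, which is the architectural prior the abstract refers to.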
67.Training Neural Networks with Local Error Signals pdf
Supervised training of neural networks for classification is typically performed with a global loss function. The loss function provides a gradient for the output layer, and this gradient is back-propagated to hidden layers to dictate an update direction for the weights. An alternative approach is to train the network with layer-wise loss functions. In this paper we demonstrate, for the first time, that layer-wise training can approach the state-of-the-art on a variety of image datasets. We use single-layer sub-networks and two different supervised loss functions to generate local error signals for the hidden layers, and we show that the combination of these losses helps with optimization in the context of local learning. Using local errors could be a step towards more biologically plausible deep learning because the global error does not have to be transported back to hidden layers. A completely backprop free variant outperforms previously reported results among methods aiming for higher biological plausibility. Code is available at this https URL
68.Modeling the Biological Pathology Continuum with HSIC-regularized Wasserstein Auto-encoders pdf
A crucial challenge in image-based modeling of biomedical data is to identify trends and features that separate normality and pathology. In many cases, the morphology of the imaged object exhibits continuous change as it deviates from normality, and thus a generative model can be trained to model this morphological continuum. Moreover, given side information that correlates to a certain trend in morphological change, a latent variable model can be regularized such that its latent representation reflects this side information. In this work, we use the Wasserstein Auto-encoder to model this pathology continuum, and apply the Hilbert-Schmidt Independence Criterion (HSIC) to enforce dependency between certain latent features and the provided side information. We experimentally show that the model can provide disentangled and interpretable latent representations and also generate a continuum of morphological changes that corresponds to change in the side information.
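For readers unfamiliar with HSIC, the standard biased empirical estimator is HSIC(X, Y) = tr(K H L H) / (n - 1)², where K and L are kernel matrices over the two samples and H = I - (1/n)·11ᵀ centres them. The sketch below computes this statistic with Gaussian kernels; the kernel choice, the bandwidth `sigma`, and the synthetic data are assumptions for illustration, and the abstract does not specify how the authors turn the statistic into a regularization term.

```python
import numpy as np


def gaussian_kernel(x, sigma=1.0):
    """Gaussian (RBF) kernel matrix for n samples given as shape [n] or [n, d]."""
    x = np.asarray(x, dtype=np.float64).reshape(len(x), -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2."""
    n = len(x)
    K = gaussian_kernel(x, sigma)
    L = gaussian_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2


rng = np.random.default_rng(0)
latent = rng.normal(size=200)
side_info_dependent = latent + 0.1 * rng.normal(size=200)  # strongly related
side_info_independent = rng.normal(size=200)               # unrelated

dependent = hsic(latent, side_info_dependent)
independent = hsic(latent, side_info_independent)
print(dependent > independent)  # True: dependence yields the larger HSIC value
```

A larger value indicates stronger statistical dependence, so a training objective that rewards high HSIC between a latent feature and the side information pushes that feature to encode it.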
69.Synthesizing facial photometries and corresponding geometries using generative adversarial networks pdf
Artificial data synthesis is currently a well studied topic with useful applications in data science, computer vision, graphics and many other fields. Generating realistic data is especially challenging since human perception is highly sensitive to non realistic appearance. In recent times, new levels of realism have been achieved by advances in GAN training procedures and architectures. These successful models, however, are tuned mostly for use with regularly sampled data such as images, audio and video. Despite the successful application of the architecture on these types of media, applying the same tools to geometric data poses a far greater challenge. The study of geometric deep learning is still a debated issue within the academic community as the lack of intrinsic parametrization inherent to geometric objects prohibits the direct use of convolutional filters, a main building block of today's machine learning systems. In this paper we propose a new method for generating realistic human facial geometries coupled with overlayed textures. We circumvent the parametrization issue by imposing a global mapping from our data to the unit rectangle. We further discuss how to design such a mapping to control the mapping distortion and conserve area within the mapped image. By representing geometric textures and geometries as images, we are able to use advanced GAN methodologies to generate new geometries. We address the often neglected topic of relation between texture and geometry and propose to use this correlation to match between generated textures and their corresponding geometries. We offer a new method for training GAN models on partially corrupted data. Finally, we provide empirical evidence demonstrating our generative model's ability to produce examples of new identities independent from the training data while maintaining a high level of realism, two traits that are often at odds.
70.Cross-referencing Social Media and Public Surveillance Camera Data for Disaster Response pdf
Physical media (like surveillance cameras) and social media (like Instagram and Twitter) may both be useful in attaining on-the-ground information during an emergency or disaster situation. However, the intersection and reliability of both surveillance cameras and social media during a natural disaster are not fully understood. To address this gap, we tested whether social media is of utility when physical surveillance cameras went off-line during Hurricane Irma in 2017. Specifically, we collected and compared geo-tagged Instagram and Twitter posts in the state of Florida during times and in areas where public surveillance cameras went off-line. We report social media content and frequency to determine the utility for emergency managers or first responders during a natural disaster.
71.Machine Learning with Clos Networks pdf
We present a new methodology for improving the accuracy of small neural networks by applying the concept of a Clos network to achieve maximum expression in a smaller network. We explore the design space to show that more layers are beneficial, given the same number of parameters. We also present findings on how the ReLU nonlinearity affects accuracy in separable networks. We present results from early work with the CIFAR-10 dataset.