
ArXiv cs.CV -- Mon, 14 Oct 2019

1.Shape Constrained Network for Eye Segmentation in the Wild ⬇️

Semantic segmentation of eyes has long been a vital pre-processing step in many biometric applications. The majority of works focus only on high-resolution eye images, while little has been done to segment the eyes from low-quality images in the wild. However, this is a particularly interesting and meaningful topic, as eyes play a crucial role in conveying the emotional state and mental well-being of a person. In this work, we take two steps toward solving this problem: (1) We collect and annotate a challenging eye segmentation dataset containing 8882 eye patches from 4461 facial images of different resolutions, illumination conditions and head poses; (2) We develop a novel eye segmentation method, Shape Constrained Network (SCN), that incorporates a shape prior into the segmentation network training procedure. Specifically, we learn the shape prior from our dataset using VAE-GAN, and leverage the pre-trained encoder and discriminator to regularise the training of SegNet. To improve the accuracy and quality of predicted masks, we replace the loss of SegNet with three new losses: Intersection-over-Union (IoU) loss, shape discriminator loss and shape embedding loss. Extensive experiments show that our method outperforms state-of-the-art segmentation and landmark detection methods in terms of mean IoU (mIoU) accuracy and the quality of segmentation masks. The eye segmentation database is available at this https URL.
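
As an illustrative aside (not code from the paper), a differentiable soft-IoU loss of the kind named above is commonly written roughly as follows, assuming per-pixel foreground probabilities and binary ground-truth masks:

```python
import torch

def soft_iou_loss(pred, target, eps=1e-6):
    """Differentiable IoU loss for binary segmentation masks.

    pred   -- predicted foreground probabilities, shape (N, H, W)
    target -- binary ground-truth masks, same shape
    """
    inter = (pred * target).sum(dim=(1, 2))
    union = (pred + target - pred * target).sum(dim=(1, 2))
    iou = (inter + eps) / (union + eps)
    return 1.0 - iou.mean()  # minimising this maximises the soft IoU
```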

2.Augmented Hard Example Mining for Generalizable Person Re-Identification ⬇️

Although the performance of person re-identification (Re-ID) has been much improved by using sophisticated training methods and large-scale labelled datasets, many existing methods make the impractical assumption that information of a target domain can be utilized during training. In practice, a Re-ID system often starts running as soon as it is deployed, hence training with data from a target domain is unrealistic. To make Re-ID systems more practical, methods have been proposed that achieve high performance without information of a target domain. However, they need cumbersome tuning for training and unusual operations for testing. In this paper, we propose augmented hard example mining, which can be easily integrated into a common Re-ID training process and can utilize sophisticated models without any network modification. The method discovers hard examples on the basis of classification probabilities, and to make the examples harder, various types of augmentation are applied to the examples. Among those examples, excessively augmented ones are eliminated by a classification-based selection process. Extensive analysis shows that our method successfully selects effective examples and achieves state-of-the-art performance on publicly available benchmark datasets.

3.Face Reflectance and Geometry Modeling via Differentiable Ray Tracing ⬇️

We present a novel strategy to automatically reconstruct 3D faces from monocular images with explicitly disentangled facial geometry (pose, identity and expression), reflectance (diffuse and specular albedo), and self-shadows. The scene lights are modeled as a virtual light stage with pre-oriented area lights used in conjunction with differentiable Monte-Carlo ray tracing to optimize the scene and face parameters. With correctly disentangled self-shadows and specular reflection parameters, we can not only obtain robust facial geometry reconstruction, but also gain explicit control over these parameters, with several practical applications. We can change facial expressions with accurate resultant self-shadows or relight the scene and obtain accurate specular reflection and several other parameter combinations.

4.Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval ⬇️

Image-text retrieval of natural scenes has been a popular research topic. Since image and text are heterogeneous cross-modal data, one of the key challenges is how to learn comprehensive yet unified representations to express the multi-modal data. A natural scene image mainly involves two kinds of visual concepts, objects and their relationships, which are equally essential to image-text retrieval. Therefore, a good representation should account for both of them. In light of the recent success of scene graphs in many CV and NLP tasks for describing complex natural scenes, we propose to represent image and text with two kinds of scene graphs: visual scene graph (VSG) and textual scene graph (TSG), each of which is exploited to jointly characterize objects and relationships in the corresponding modality. The image-text retrieval task is then naturally formulated as cross-modal scene graph matching. Specifically, we design two particular scene graph encoders in our model for VSG and TSG, which can refine the representation of each node on the graph by aggregating neighborhood information. As a result, both object-level and relationship-level cross-modal features can be obtained, which favorably enables us to evaluate the similarity of image and text at both levels in a more plausible way. We achieve state-of-the-art results on Flickr30k and MSCOCO, which verifies the advantages of our graph matching based approach for image-text retrieval.

5.Methods and open-source toolkit for analyzing and visualizing challenge results ⬇️

Biomedical challenges have become the de facto standard for benchmarking biomedical image analysis algorithms. While the number of challenges is steadily increasing, surprisingly little effort has been invested in ensuring high quality design, execution and reporting for these international competitions. Specifically, results analysis and visualization in the event of uncertainties have been given almost no attention in the literature. Given these shortcomings, the contribution of this paper is two-fold: (1) We present a set of methods to comprehensively analyze and visualize the results of single-task and multi-task challenges and apply them to a number of simulated and real-life challenges to demonstrate their specific strengths and weaknesses; (2) We release the open-source framework challengeR as part of this work to enable fast and wide adoption of the methodology proposed in this paper. Our approach offers an intuitive way to gain important insights into the relative and absolute performance of algorithms, which cannot be revealed by commonly applied visualization techniques. This is demonstrated by the experiments performed within this work. Our framework could thus become an important tool for analyzing and visualizing challenge results in the field of biomedical image analysis and beyond.

6.Rosetta: Large scale system for text detection and recognition in images ⬇️

In this paper we present a deployed, scalable optical character recognition (OCR) system, which we call Rosetta, designed to process images uploaded daily at Facebook scale. Sharing of image content has become one of the primary ways to communicate information among internet users within social networks such as Facebook and Instagram, and the understanding of such media, including its textual information, is of paramount importance to facilitate search and recommendation applications. We present modeling techniques for efficient detection and recognition of text in images and describe Rosetta's system architecture. We perform extensive evaluation of presented technologies, explain useful practical approaches to build an OCR system at scale, and provide insightful intuitions as to why and how certain components work based on the lessons learnt during the development and deployment of the system.

7.CHD: Consecutive Horizontal Dropout for Human Gait Feature Extraction ⬇️

Although gait recognition and person re-identification research has made considerable progress, identification accuracy is still not high enough in some specific situations, for example, when people carry bags or change coats. To alleviate these situations, we propose a simple but effective Consecutive Horizontal Dropout (CHD) method applied to human feature extraction in deep learning networks to avoid overfitting. With CHD, we strengthen the robustness of deep learning networks for cross-view gait recognition and person re-identification. The experiments illustrate that rank-1 accuracy on the cross-view gait recognition task increased by about 10%, from 68.0% to 78.201%, and by 8%, from 83.545% to 91.364%, on the person re-identification task under the coat- or jacket-wearing condition. In addition, 100% accuracy under the NM condition was obtained for the first time with CHD. On the CASIA-B benchmark, the above accuracies are state-of-the-art.

8.Shooting Labels: 3D Semantic Labeling by Virtual Reality ⬇️

Availability of a few, large-size, annotated datasets, like ImageNet, Pascal VOC and COCO, has led deep learning to revolutionize computer vision research by achieving astonishing results in several vision tasks. We argue that new tools to facilitate generation of annotated datasets may help spread data-driven AI throughout applications and domains. In this work we propose Shooting Labels, the first 3D labeling tool for dense 3D semantic segmentation which exploits Virtual Reality to render the labeling task as easy and fun as playing a video-game. Our tool allows for semantically labeling large-scale environments very expeditiously, whatever the nature of the 3D data at hand (e.g., point clouds, meshes). Furthermore, Shooting Labels efficiently integrates multi-user annotations to improve the labeling accuracy automatically and compute a label uncertainty map. Besides, within our framework the 3D annotations can be projected into 2D images, thereby also speeding up a notoriously slow and expensive task such as pixel-wise semantic labeling. We demonstrate the accuracy and efficiency of our tool in two different scenarios: an indoor workspace provided by Matterport3D and a large-scale outdoor environment reconstructed from 1000+ KITTI images.

9.End-to-End Defect Detection in Automated Fiber Placement Based on Artificially Generated Data ⬇️

Automated fiber placement (AFP) is an advanced manufacturing technology that increases the rate of production of composite materials. At the same time, the need for adaptable and fast inline control methods for such parts grows. Existing inspection systems make use of handcrafted filter chains and feature detectors, tuned for specific measurement methods by domain experts. These methods hardly scale to new defects or different measurement devices. In this paper, we propose to formulate AFP defect detection as an image segmentation problem that can be solved in an end-to-end fashion using artificially generated training data. We employ a probabilistic graphical model to generate training images and annotations. We then train a deep neural network based on recent architectures designed for image segmentation. This leads to an appealing method that scales well with new defect types and measurement devices and requires little real-world data for training.

10.Road Damage Detection Based on Unsupervised Disparity Map Segmentation ⬇️

This paper presents a novel road damage detection algorithm based on unsupervised disparity map segmentation. Firstly, a disparity map is transformed by minimizing an energy function with respect to the stereo rig roll angle and the road disparity projection model. Instead of solving this energy minimization problem using non-linear optimization techniques, we directly find its numerical solution. The transformed disparity map is then segmented using Otsu's thresholding method, and the damaged road areas can be extracted. The proposed algorithm requires no parameters when detecting road damage. The experimental results illustrate that our proposed algorithm performs both accurately and efficiently. The pixel-level road damage detection accuracy is approximately 97.56%.
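
For reference, Otsu's thresholding (used above to segment the transformed disparity map) picks the threshold that maximises the between-class variance of the two resulting classes; a minimal NumPy sketch of the general method, not the authors' code:

```python
import numpy as np

def otsu_threshold(img, bins=256):
    """Return the threshold that maximises between-class variance."""
    hist, edges = np.histogram(img.ravel(), bins=bins)
    p = hist.astype(float) / hist.sum()          # probability mass per bin
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)                            # weight of class 0 (below threshold)
    w1 = 1.0 - w0                                # weight of class 1 (above threshold)
    cum_mean = np.cumsum(p * centers)
    mu0 = cum_mean / np.maximum(w0, 1e-12)       # mean of class 0
    mu1 = (cum_mean[-1] - cum_mean) / np.maximum(w1, 1e-12)  # mean of class 1
    sigma_b = w0 * w1 * (mu0 - mu1) ** 2         # between-class variance
    return centers[np.argmax(sigma_b)]
```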

11.Artistic Glyph Image Synthesis via One-Stage Few-Shot Learning ⬇️

Automatic generation of artistic glyph images is a challenging task that has attracted much research interest. Previous methods either are specifically designed for shape synthesis or focus on texture transfer. In this paper, we propose a novel model, AGIS-Net, to transfer both shape and texture styles in one stage with only a few stylized samples. To achieve this goal, we first disentangle the representations for content and style by using two encoders, ensuring the multi-content and multi-style generation. Then we utilize two collaboratively working decoders to generate the glyph shape image and its texture image simultaneously. In addition, we introduce a local texture refinement loss to further improve the quality of the synthesized textures. In this manner, our one-stage model is much more efficient and effective than other multi-stage stacked methods. We also propose a large-scale dataset with Chinese glyph images in various shape and texture styles, rendered from 35 professionally designed artistic fonts with 7,326 characters and 2,460 synthetic artistic fonts with 639 characters, to validate the effectiveness and extendability of our method. Extensive experiments on both English and Chinese artistic glyph image datasets demonstrate the superiority of our model in generating high-quality stylized glyph images against other state-of-the-art methods.

12.VarGFaceNet: An Efficient Variable Group Convolutional Neural Network for Lightweight Face Recognition ⬇️

To improve the discriminative and generalization ability of lightweight networks for face recognition, we propose an efficient variable group convolutional network called VarGFaceNet. Variable group convolution was introduced by VarGNet to resolve the conflict between small computational cost and the imbalance of computational intensity inside a block. We employ variable group convolution to design our network, which can support large-scale face identification while reducing computational cost and parameters. Specifically, we use a head setting to reserve essential information at the start of the network and propose a particular embedding setting to reduce the parameters of the fully-connected embedding layer. To enhance interpretation ability, we employ an equivalence of angular distillation loss to guide our lightweight network, and we apply recursive knowledge distillation to relieve the discrepancy between the teacher model and the student model. Winning the deepglint-light track of the LFR (2019) challenge demonstrates the effectiveness of our model and approach. An implementation of VarGFaceNet will be released at this https URL soon.
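
As a rough illustration of the variable group convolution idea (one reading of the VarGNet design, not the paper's code; `channels_per_group` is an illustrative parameter), the number of channels per group is held constant so the group count varies with the layer width:

```python
import torch
from torch import nn

def var_group_conv(in_ch, out_ch, channels_per_group=8, kernel_size=3):
    """Group convolution with a fixed number of channels per group.

    Instead of fixing the number of groups, the group count scales with the
    layer width, keeping the per-group computation roughly constant.
    Both in_ch and out_ch must be divisible by the resulting group count.
    """
    groups = max(in_ch // channels_per_group, 1)
    return nn.Conv2d(in_ch, out_ch, kernel_size,
                     padding=kernel_size // 2, groups=groups)

# Example: a wider layer automatically gets more groups.
conv_a = var_group_conv(32, 32)    # 4 groups
conv_b = var_group_conv(128, 128)  # 16 groups
```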

13.Estimating Solar Irradiance Using Sky Imagers ⬇️

Ground-based whole sky cameras are extensively used for localized monitoring of clouds nowadays. They capture hemispherical images of the sky at regular intervals using a fisheye lens. In this paper, we propose a framework for estimating solar irradiance from pictures taken by those imagers. Unlike pyranometers, such sky images contain information about cloud coverage and can be used to derive cloud movement. An accurate estimation of solar irradiance using solely those images is thus a first step towards short-term forecasting of solar energy generation based on cloud movement. We derive and validate our model using pyranometers co-located with our whole sky imagers. We achieve a better performance in estimating solar irradiance and in particular its short-term variations as compared to other related methods using ground-based observations.

14.Interaction Relational Network for Mutual Action Recognition ⬇️

Person-person mutual action recognition (also referred to as interaction recognition) is an important research branch of human activity analysis. Current solutions in the field are mainly dominated by CNNs, GCNs and LSTMs. These approaches often consist of complicated architectures and mechanisms to embed the relationships between the two persons on the architecture itself, to ensure the interaction patterns can be properly learned. In this paper, we propose a simpler yet very powerful architecture, named Interaction Relational Network (IRN), which utilizes minimal prior knowledge about the structure of the human body. We drive the network to identify by itself how to relate the body parts from the individuals interacting. In order to better represent the interaction, we define two different relationships, leading to specialized architectures and models for each. These multiple relationship models will then be fused into a single and special architecture, in order to leverage both streams of information for further enhancing the relational reasoning capability. Furthermore, we define important structured pair-wise operations to extract meaningful extra information from each pair of joints -- distance and motion. Ultimately, with the coupling of an LSTM, our IRN is capable of paramount sequential relational reasoning. These important extensions we made to our network can also be valuable to other problems that require sophisticated relational reasoning. Our solution is able to achieve state-of-the-art performance on the traditional interaction recognition datasets SBU and UT, and also on the mutual actions from the large-scale NTU RGB+D and NTU RGB+D 120 datasets.
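
One plausible reading of the pair-wise distance and motion cues mentioned above, shown purely for illustration (the array shapes and helper name are our own assumptions, not taken from the paper):

```python
import numpy as np

def pairwise_joint_features(seq_a, seq_b):
    """Per-frame distance and motion between every joint pair of two persons.

    seq_a, seq_b -- joint coordinates of person A and person B, shape (T, J, 3).
    Returns joint-to-joint distances (T, J, J) and their frame-to-frame
    change, i.e. relative motion, (T-1, J, J).
    """
    diff = seq_a[:, :, None, :] - seq_b[:, None, :, :]   # (T, J, J, 3)
    dist = np.linalg.norm(diff, axis=-1)                 # distance per joint pair
    motion = dist[1:] - dist[:-1]                        # change of distance over time
    return dist, motion
```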

15.An Automatic Digital Terrain Generation Technique for Terrestrial Sensing and Virtual Reality Applications ⬇️

The identification and modeling of the terrain from point cloud data is an important component of Terrestrial Remote Sensing (TRS) applications. The main focus in terrain modeling is capturing details of complex geological features of landforms. Traditional terrain modeling approaches rely on the user to exert control over terrain features. However, relying on the user input to manually develop the digital terrain becomes intractable when considering the amount of data generated by new remote sensing systems capable of producing massive aerial and ground-based point clouds from scanned environments. This article provides a novel terrain modeling technique capable of automatically generating accurate and physically realistic Digital Terrain Models (DTM) from a variety of point cloud data. The proposed method runs efficiently on large-scale point cloud data with real-time performance over large segments of terrestrial landforms. Moreover, generated digital models are designed to effectively render within a Virtual Reality (VR) environment in real time. The paper concludes with an in-depth discussion of possible research directions and outstanding technical and scientific challenges to improve the proposed approach.

16.FetusMap: Fetal Pose Estimation in 3D Ultrasound ⬇️

The advent of 3D ultrasound (US) has inspired a multitude of automated prenatal examinations. However, studies about the structuralized description of the whole fetus in 3D US are still rare. In this paper, we propose to estimate the 3D pose of the fetus in US volumes to facilitate its quantitative analyses on global and local scales. Given the great challenges in 3D US, including the high volume dimension, poor image quality, symmetric ambiguity in anatomical structures and large variations of fetal pose, our contribution is three-fold. (i) This is the first work about 3D pose estimation of the fetus in the literature. We aim to extract the skeleton of the whole fetus and assign different segments/joints with correct torso/limb labels. (ii) We propose a self-supervised learning (SSL) framework to finetune the deep network to form visually plausible pose predictions. Specifically, we leverage landmark-based registration to effectively encode case-adaptive anatomical priors and generate an evolving label proxy for supervision. (iii) To enable our 3D network to perceive better contextual cues from higher-resolution input under limited computing resources, we further adopt the gradient check-pointing (GCP) strategy to save GPU memory and improve the prediction. Extensively validated on a large 3D US dataset, our method tackles varying fetal poses and achieves promising results. 3D pose estimation of the fetus has potential to serve as a map providing navigation for many advanced studies.
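
Gradient check-pointing trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them; a minimal PyTorch sketch of the general technique (a toy backbone, not the authors' network):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBackbone(nn.Module):
    """Toy 3D backbone whose blocks recompute activations in the backward
    pass instead of storing them, reducing peak GPU memory."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv3d(8, 8, 3, padding=1), nn.ReLU())
             for _ in range(4)]
        )

    def forward(self, x):
        for block in self.blocks:
            x = checkpoint(block, x)   # activations are recomputed on backward
        return x
```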

17.DiabDeep: Pervasive Diabetes Diagnosis based on Wearable Medical Sensors and Efficient Neural Networks ⬇️

Diabetes impacts the quality of life of millions of people. However, diabetes diagnosis is still an arduous process, given that the disease develops and gets treated outside the clinic. The emergence of wearable medical sensors (WMSs) and machine learning points to a way forward to address this challenge. WMSs enable a continuous mechanism to collect and analyze physiological signals. However, disease diagnosis based on WMS data and its effective deployment on resource-constrained edge devices remain challenging due to inefficient feature extraction and vast computation cost. In this work, we propose a framework called DiabDeep that combines efficient neural networks (called DiabNNs) with WMSs for pervasive diabetes diagnosis. DiabDeep bypasses the feature extraction stage and acts directly on WMS data. It enables both an (i) accurate inference on the server, e.g., a desktop, and (ii) efficient inference on an edge device, e.g., a smartphone, based on varying design goals and resource budgets. On the server, we stack sparsely connected layers to deliver high accuracy. On the edge, we use a hidden-layer long short-term memory based recurrent layer to cut down on computation and storage. At the core of DiabDeep lies a grow-and-prune training flow: it leverages gradient-based growth and magnitude-based pruning algorithms to learn both weights and connections for DiabNNs. We demonstrate the effectiveness of DiabDeep through analyzing data from 52 participants. For server (edge) side inference, we achieve a 96.3% (95.3%) accuracy in classifying diabetics against healthy individuals, and a 95.7% (94.6%) accuracy in distinguishing among type-1/type-2 diabetic, and healthy individuals. Against conventional baselines, DiabNNs achieve higher accuracy, while reducing the model size (FLOPs) by up to 454.5x (8.9x). Therefore, the system can be viewed as pervasive and efficient, yet very accurate.

18.From Species to Cultivar: Soybean Cultivar Recognition using Multiscale Sliding Chord Matching of Leaf Images ⬇️

Leaf image recognition techniques have been actively researched for plant species identification. However, it remains unclear whether leaf patterns can provide sufficient information for cultivar recognition. This paper reports the first attempt at soybean cultivar recognition from plant leaves, which is not only a challenging research problem but also important for soybean cultivar evaluation, selection and production in agriculture. In this paper, we propose a novel multiscale sliding chord matching (MSCM) approach to extract leaf patterns that are distinctive for soybean cultivar identification. A chord is defined to slide along the contour for measuring the synchronised patterns of exterior shape and interior appearance of soybean leaf images. A multiscale sliding chord strategy is developed to extract features in a coarse-to-fine hierarchical order. A joint description that integrates the leaf descriptors from different parts of a soybean plant is proposed for further enhancing the discriminative power of cultivar description. We built a cultivar leaf image database, SoyCultivar, consisting of 1200 sample leaf images from 200 soybean cultivars for performance evaluation. Encouraging experimental results of the proposed method in comparison to the state-of-the-art leaf species recognition methods demonstrate the availability of cultivar information in soybean leaves and the effectiveness of the proposed MSCM for soybean cultivar identification, which may advance the research in leaf recognition from species to cultivar.

19.Visual Natural Language Query Auto-Completion for Estimating Instance Probabilities ⬇️

We present a new task of query auto-completion for estimating instance probabilities. We complete a user query prefix conditioned upon an image. Given the complete query, we fine-tune a BERT embedding for estimating probabilities of a broad set of instances. The resulting instance probabilities are used for selection while being agnostic to the segmentation or attention mechanism. Our results demonstrate that auto-completion using both language and vision performs better than using only language, and that fine-tuning a BERT embedding allows us to efficiently rank instances in the image. In the spirit of reproducible research, we make our data, models, and code available.

20.Predicting Auction Price of Vehicle License Plate with Deep Residual Learning ⬇️

Due to superstition, license plates with desirable combinations of characters are highly sought after in China, fetching prices that can reach into the millions in government-held auctions. Despite the high stakes involved, there has been essentially no attempt to provide price estimates for license plates. We present an end-to-end neural network model that simultaneously predicts the auction price, gives the distribution of prices and produces latent feature vectors. While both types of neural network architectures we consider outperform simpler machine learning methods, convolutional networks outperform recurrent networks for comparable training time or model complexity. The resulting model powers our online price estimator and search engine.

21.Bit Efficient Quantization for Deep Neural Networks ⬇️

Quantization for deep neural networks has afforded models for edge devices that use less on-board memory and enable efficient low-power inference. In this paper, we present a comparison of model-parameter driven quantization approaches that can achieve as low as 3-bit precision without affecting accuracy. The post-training quantization approaches are data-free, and the resulting weight values are closely tied to the dataset distribution on which the model has converged to optimality. We show quantization results for a number of state-of-the-art deep neural networks (DNNs) using large datasets like ImageNet. To better analyze quantization results, we describe the overall range and local sparsity of values afforded through various quantization schemes. We show methods to lower bit precision beyond quantization limits with object class clustering.
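
As a generic illustration of post-training, data-free weight quantization (a symmetric uniform scheme sketched under our own assumptions, not one of the specific approaches compared in the paper):

```python
import numpy as np

def quantize_weights(w, n_bits=3):
    """Symmetric uniform quantization of a weight tensor to n_bits."""
    qmax = 2 ** (n_bits - 1) - 1                 # e.g. 3 for signed 3-bit
    scale = np.abs(w).max() / qmax               # map the largest |w| to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale                              # dequantize with q * scale

w = np.random.randn(256, 128).astype(np.float32)
q, s = quantize_weights(w, n_bits=3)
w_hat = q.astype(np.float32) * s                 # reconstructed weights
print("max abs quantization error:", np.abs(w - w_hat).max())
```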

22.FastEstimator: A Deep Learning Library for Fast Prototyping and Productization ⬇️

As the complexity of state-of-the-art deep learning models increases by the month, implementation, interpretation, and traceability become ever-more-burdensome challenges for AI practitioners around the world. Several AI frameworks have risen in an effort to stem this tide, but the steady advance of the field has begun to test the bounds of their flexibility, expressiveness, and ease of use. To address these concerns, we introduce a radically flexible high-level open source deep learning framework for both research and industry. We introduce FastEstimator.

23.A Stereo Algorithm for Thin Obstacles and Reflective Objects ⬇️

Stereo cameras are a popular choice for obstacle avoidance for outdoor lightweight, low-cost robotics applications. However, they are unable to sense thin and reflective objects well. Currently, many algorithms are tuned to perform well on indoor scenes like the Middlebury dataset. When navigating outdoors, reflective objects, like windows and glass, and thin obstacles, like wires, are not well handled by most stereo disparity algorithms. Reflections, repeating patterns and objects parallel to the cameras' baseline cause mismatches between image pairs, which leads to bad disparity estimates. Thin obstacles are difficult for many sliding window based disparity methods to detect because they do not take up large portions of the pixels in the sliding window. We use a trinocular camera setup and a micropolarizer camera capable of detecting reflective objects to overcome these issues. We present a hierarchical disparity algorithm that reduces noise, separately identifies wires using semantic object triangulation in three images, and uses information about the polarization of light to estimate the disparity of reflective objects. We evaluate our approach on outdoor data that we collected. Our method contained an average of 9.27% of bad pixels compared to a typical stereo algorithm's 18.4% of bad pixels in scenes containing reflective objects. Our trinocular and semantic wire disparity methods detected 53% of wire pixels, whereas a typical two camera stereo algorithm detected 5%.

24.Global visual localization in LiDAR-maps through shared 2D-3D embedding space ⬇️

Global localization is an important and widely studied problem for many robotic applications. Place recognition approaches can be exploited to solve this task, e.g., in the autonomous driving field. While most vision-based approaches match an image against an image database, global visual localization within LiDAR-maps remains fairly unexplored, even though the path toward high definition 3D maps, produced mainly from LiDARs, is clear. In this work we leverage DNN approaches to create a shared embedding space between images and LiDAR-maps, allowing for image to 3D-LiDAR place recognition. We train a 2D and a 3D Deep Neural Network (DNN) that create embeddings, respectively from images and from point clouds, that are close to each other when they refer to the same place. An extensive experimental activity is presented to assess the effectiveness of the approach w.r.t. different learning methods, network architectures, and loss functions. All the evaluations have been performed using the Oxford Robotcar Dataset, which encompasses a wide range of weather and light conditions.
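
For intuition only, a common way to train such a shared embedding space is a cross-modal triplet-style objective that pulls matching image/point-cloud embeddings together and pushes non-matching ones apart; the sketch below is a generic formulation under our own assumptions, not the loss actually used in the paper:

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(img_emb, pcd_emb, margin=0.5):
    """Generic cross-modal triplet loss for a shared embedding space.

    img_emb, pcd_emb -- embeddings of shape (B, D); row i of both tensors
    refers to the same place (the positive pair for that row).
    """
    img_emb = F.normalize(img_emb, dim=1)
    pcd_emb = F.normalize(pcd_emb, dim=1)
    dist = torch.cdist(img_emb, pcd_emb)                # (B, B) pairwise distances
    pos = dist.diag()                                   # matching image/point-cloud pairs
    mask = 1e9 * torch.eye(len(dist), device=dist.device)
    neg = (dist + mask).min(dim=1).values               # hardest non-matching pair per image
    return F.relu(pos - neg + margin).mean()
```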

25.Road scenes analysis in adverse weather conditions by polarization-encoded images and adapted deep learning ⬇️

Object detection in road scenes is necessary to develop both autonomous vehicles and driving assistance systems. Even though deep neural networks have shown great performance on recognition tasks using conventional images, they fail to detect objects in road scenes under complex acquisition situations. In contrast, polarization images, characterizing the light wave, can robustly describe important physical properties of the object even under poor illumination or strong reflections. This paper shows how a non-conventional polarimetric imaging modality overcomes classical methods for object detection, especially in adverse weather conditions. The efficiency of the proposed method is mostly due to the high power of polarimetry to discriminate any object by its reflective properties and to the use of deep neural networks for object detection. Our goal in this work is to show that polarimetry brings a real added value compared with RGB images for object detection. Experimental results on our own dataset, composed of road scene images taken during adverse weather conditions, show that polarimetry together with deep learning can improve the state-of-the-art by about 20% to 50% on different detection tasks.

26.Inferring and Improving Street Maps with Data-Driven Automation ⬇️

Street maps are a crucial data source that help to inform a wide range of decisions, from navigating a city to disaster relief and urban planning. However, in many parts of the world, street maps are incomplete or lag behind new construction. Editing maps today involves a tedious process of manually tracing and annotating roads, buildings, and other map features.
Over the past decade, many automatic map inference systems have been proposed to automatically extract street map data from satellite imagery, aerial imagery, and GPS trajectory datasets. However, automatic map inference has failed to gain traction in practice due to two key limitations: high error rates (low precision), which manifest in noisy inference outputs, and a lack of end-to-end system design to leverage inferred data to update existing street maps.
At MIT and QCRI, we have developed a number of algorithms and approaches to address these challenges, which we combined into a new system we call Mapster. Mapster is a human-in-the-loop street map editing system that incorporates three components to robustly accelerate the mapping process over traditional tools and workflows: high-precision automatic map inference, data refinement, and machine-assisted map editing.
Through an evaluation on a large-scale dataset including satellite imagery, GPS trajectories, and ground-truth map data in forty cities, we show that Mapster makes automation practical for map editing, and enables the curation of map datasets that are more complete and up-to-date at less cost.

27.The Visual Task Adaptation Benchmark ⬇️

Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified yardstick to evaluate general visual representations hinders progress. Many sub-fields promise representations, but each has different evaluation protocols that are either too constrained (linear classification), limited in scope (ImageNet, CIFAR, Pascal-VOC), or only loosely related to representation quality (generation). We present the Visual Task Adaptation Benchmark (VTAB): a diverse, realistic, and challenging benchmark to evaluate representations. VTAB embodies one principle: good representations adapt to unseen tasks with few examples. We run a large VTAB study of popular algorithms, answering questions like: How effective are ImageNet representations on non-standard datasets? Are generative models competitive? Is self-supervision useful if one already has labels?

28.NASS-AI: Towards Digitization of Parliamentary Bills using Document Level Embedding and Bidirectional Long Short-Term Memory ⬇️

There have been several reports in the Nigerian and international media about the Senators and House of Representatives members of the Nigerian National Assembly (NASS) being the highest paid in the world. Despite this high level of parliamentary compensation and a lack of oversight, most of the legislative duties like bills introduced and vote proceedings are shrouded in mystery without an open and annotated corpus. In this paper, we present results from ongoing research on the categorization of bills introduced in the Nigerian parliament since the fourth republic (1999 - 2018). For this task, we employed a multi-step approach which involves extracting text from scanned and embedded PDFs of low to medium quality using Optical Character Recognition (OCR) tools and labeling them into eight categories. We investigate the performance of document-level embedding for feature representation of the extracted texts before using a Bidirectional Long Short-Term Memory (Bi-LSTM) for our classifier. The performance was further compared with other feature representation and machine learning techniques. We believe that these results are well-positioned to have a substantial impact on the quest to meet the basic open data charter principles.

29.Brain-inspired automated visual object discovery and detection ⬇️

Despite significant recent progress, machine vision systems lag considerably behind their biological counterparts in performance, scalability, and robustness. A distinctive hallmark of the brain is its ability to automatically discover and model objects, at multiscale resolutions, from repeated exposures to unlabeled contextual data and then to be able to robustly detect the learned objects under various nonideal circumstances, such as partial occlusion and different view angles. Replication of such capabilities in a machine would require three key ingredients: (i) access to large-scale perceptual data of the kind that humans experience, (ii) flexible representations of objects, and (iii) an efficient unsupervised learning algorithm. The Internet fortunately provides unprecedented access to vast amounts of visual data. This paper leverages the availability of such data to develop a scalable framework for unsupervised learning of object prototypes--brain-inspired flexible, scale, and shift invariant representations of deformable objects (e.g., humans, motorcycles, cars, airplanes) comprised of parts, their different configurations and views, and their spatial relationships. Computationally, the object prototypes are represented as geometric associative networks using probabilistic constructs such as Markov random fields. We apply our framework to various datasets and show that our approach is computationally scalable and can construct accurate and operational part-aware object models much more efficiently than in much of the recent computer vision literature. We also present efficient algorithms for detection and localization in new scenes of objects and their partial views.

30.Vision-Based Autonomous Vehicle Control using the Two-Point Visual Driver Control Model ⬇️

This work proposes a new self-driving framework that uses a human driver control model, whose feature-input values are extracted from images using deep convolutional neural networks (CNNs). The development of image processing techniques using CNNs along with accelerated computing hardware has recently enabled real-time detection of these feature-input values. The use of human driver models can lead to more "natural" driving behavior of self-driving vehicles. Specifically, we use the well-known two-point visual driver control model as the controller, and we use a top-down lane cost map CNN and the YOLOv2 CNN to extract feature-input values. This framework relies exclusively on inputs from low-cost sensors like a monocular camera and wheel speed sensors. We experimentally validate the proposed framework on an outdoor track using a 1/5th-scale autonomous vehicle platform.

31.Place Deduplication with Embeddings ⬇️

Thanks to the advancing mobile location services, people nowadays can post about places to share visiting experience on-the-go. A large place graph not only helps users explore interesting destinations, but also provides opportunities for understanding and modeling the real world. To improve coverage and flexibility of the place graph, many platforms import places data from multiple sources, which unfortunately leads to the emergence of numerous duplicated places that severely hinder subsequent location-related services. In this work, we take the anonymous place graph from Facebook as an example to systematically study the problem of place deduplication: We carefully formulate the problem, study its connections to various related tasks that lead to several promising basic models, and arrive at a systematic two-step data-driven pipeline based on place embedding with multiple novel techniques that works significantly better than the state-of-the-art.

32.Generative One-Shot Face Recognition ⬇️

One-shot face recognition measures the ability to identify persons after seeing them only once, and is a hallmark of human visual intelligence. It is challenging for conventional machine learning approaches to mimic this ability, since limited data can hardly represent the data variance effectively. The goal of one-shot face recognition is to learn a large-scale face recognizer that is capable of fighting off the data imbalance challenge. In this paper, we propose a novel generative adversarial one-shot face recognizer, attempting to synthesize meaningful data for one-shot classes by adapting the data variances from other normal classes. Specifically, we target building a more effective general face classifier for both normal persons and one-shot persons. Technically, we design a new loss function by formulating a knowledge transfer generator and a general classifier into a unified framework. Such a two-player minimax optimization can guide the generation of more effective data, which effectively promotes the underrepresented classes in the learned model and leads to a remarkable improvement in face recognition performance. We evaluate our proposed model on the MS-Celeb-1M one-shot learning benchmark task, where we could recognize 94.98% of the test images at a precision of 99% for the one-shot classes, while keeping an overall Top-1 accuracy of 99.80% for the normal classes. To the best of our knowledge, this is the best performance among all the published methods using this benchmark task with the same setup, including all the participants in the recent MS-Celeb-1M challenge at ICCV 2017 (this http URL).

33.The Detection of Distributional Discrepancy for Text Generation ⬇️

The text generated by neural language models is not as good as real text. This means that their distributions are different. Generative Adversarial Nets (GANs) are used to alleviate this. However, some researchers argue that GAN variants do not work at all. When both sample quality (such as Bleu) and sample diversity (such as self-Bleu) are taken into account, the GAN variants are even worse than a well-adjusted language model. However, Bleu and self-Bleu cannot precisely measure this distributional discrepancy. In fact, how to measure the distributional discrepancy between real text and generated text is still an open problem. In this paper, we theoretically propose two metric functions to measure the distributional difference between real text and generated text. In addition, a method is put forward to estimate them. First, we evaluate a language model with these two functions and find that the difference is huge. Then, we try several methods to use the detected discrepancy signal to improve the generator. However, the difference becomes even bigger than before. Experimenting on two existing language GANs, the distributional discrepancy between real text and generated text increases with more adversarial learning rounds. This demonstrates that both of these language GANs fail.

34.Training-Free Uncertainty Estimation for Neural Networks ⬇️

Uncertainty estimation is an essential step in the evaluation of the robustness for deep learning models in computer vision, especially when applied in risk-sensitive areas. However, most state-of-the-art deep learning models either fail to obtain uncertainty estimation or need significant modification (e.g., formulating a proper Bayesian treatment) to obtain it. None of the previous methods are able to take an arbitrary model off the shelf and generate uncertainty estimation without retraining or redesigning it. To address this gap, we perform the first systematic exploration into training-free uncertainty estimation.
We propose three simple and scalable methods to analyze the variance of output from a trained network under tolerable perturbations: infer-transformation, infer-noise, and infer-dropout. They operate solely during inference, without the need to re-train, re-design, or fine-tune the model, as typically required by other state-of-the-art uncertainty estimation methods. Surprisingly, even without involving such perturbations in training, our methods produce comparable or even better uncertainty estimation when compared to other training-required state-of-the-art methods. Last but not least, we demonstrate that the uncertainty from our proposed methods can be used to improve the neural network training.
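
As an illustration of the inference-time-only idea (an "infer-noise"-style sketch under our own assumptions; the noise level and sample count are illustrative, not the paper's settings):

```python
import torch

@torch.no_grad()
def infer_noise_uncertainty(model, x, n_samples=20, sigma=0.01):
    """Training-free uncertainty estimate: run a trained model several times
    on noise-perturbed copies of the input and measure the variance of its
    outputs. No retraining, redesign, or fine-tuning of the model."""
    model.eval()
    outputs = torch.stack(
        [model(x + sigma * torch.randn_like(x)) for _ in range(n_samples)]
    )
    return outputs.mean(dim=0), outputs.var(dim=0)   # prediction, uncertainty
```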

35.Sampling the "Inverse Set" of a Neuron: An Approach to Understanding Neural Nets ⬇️

With the recent success of deep neural networks in computer vision, it is important to understand the internal working of these networks. What does a given neuron represent? The concepts captured by a neuron may be hard to understand or express in simple terms. The approach we propose in this paper is to characterize the region of input space that excites a given neuron to a certain level; we call this the inverse set. This inverse set is a complicated high dimensional object that we explore by an optimization-based sampling approach. Inspection of samples of this set by a human can reveal regularities that help to understand the neuron. This goes beyond approaches that are limited to finding a single image that maximally activates the neuron, or to sampling images with Markov chain Monte Carlo, which is very slow, generates samples with little diversity and lacks control over the activation value of the generated samples. Our approach also allows us to explore the intersection of inverse sets of several neurons and other variations.

36.Video-Based Convolutional Attention for Person Re-Identification ⬇️

In this paper we consider the problem of video-based person re-identification, which is the task of associating videos of the same person captured by different and non-overlapping cameras. We propose a Siamese framework in which video frames of the person to re-identify and of the candidate one are processed by two identical networks which produce a similarity score. We introduce an attention mechanism to capture the relevant information both at frame level (spatial information) and at video level (temporal information given by the importance of a specific frame within the sequence). One of the novelties of our approach is given by a joint concurrent processing of both frame and video levels, providing in such a way a very simple architecture. Despite this fact, our approach achieves better performance than the state-of-the-art on the challenging iLIDS-VID dataset.

37.Expression, Affect, Action Unit Recognition: Aff-Wild2, Multi-Task Learning and ArcFace ⬇️

Affective computing has been largely limited in terms of available data resources. The need to collect and annotate diverse in-the-wild datasets has become apparent with the rise of deep learning models, as the default approach to address any computer vision task. Some in-the-wild databases have been recently proposed. However: i) their size is small, ii) they are not audiovisual, iii) only a small part is manually annotated, iv) they contain a small number of subjects, or v) they are not annotated for all main behavior tasks (valence-arousal estimation, action unit detection and basic expression classification). To address these, we substantially extend the largest available in-the-wild database (Aff-Wild) to study continuous emotions such as valence and arousal. Furthermore, we annotate parts of the database with basic expressions and action units. As a consequence, for the first time, this allows the joint study of all three types of behavior states. We call this database Aff-Wild2. We conduct extensive experiments with CNN and CNN-RNN architectures that use visual and audio modalities; these networks are trained on Aff-Wild2 and their performance is then evaluated on 10 publicly available emotion databases. We show that the networks achieve state-of-the-art performance for the emotion recognition tasks. Additionally, we adapt the ArcFace loss function in the emotion recognition context and use it for training two new networks on Aff-Wild2 and then re-train them in a variety of diverse expression recognition databases. The networks are shown to improve the existing state-of-the-art. The database, emotion recognition models and source code are available at this http URL.

38.Improving a Quality of 3D Object Detection by Spatial Transformation Mechanism ⬇️

We present an endpoint box regression module (epBRM), which is designed for predicting precise 3D bounding boxes using raw LiDAR 3D point clouds. The proposed epBRM is built with a sequence of small networks and is computationally lightweight. Our approach can improve 3D object detection performance by predicting more precise 3D bounding box coordinates. The proposed approach requires 40 minutes of training to improve the detection performance. Moreover, epBRM adds less than 12 ms to network inference time for up to 20 objects. The proposed approach utilizes a spatial transformation mechanism to simplify the box regression task. Adopting the spatial transformation mechanism into epBRM makes it possible to improve the quality of detection with a small-sized network. We conduct an in-depth analysis of the effect of various spatial transformation mechanisms applied to raw LiDAR 3D point clouds. We also evaluate the proposed epBRM by applying it to several state-of-the-art 3D object detection systems. We evaluate our approach on the KITTI dataset, a standard 3D object detection benchmark for autonomous vehicles. The proposed epBRM enhances the overlaps between ground truth bounding boxes and detected bounding boxes, and improves 3D object detection. Our proposed method, evaluated on the KITTI test server, outperforms current state-of-the-art approaches.

39.Addressing Failure Prediction by Learning Model Confidence ⬇️

Assessing reliably the confidence of a deep neural network and predicting its failures is of primary importance for the practical deployment of these models. In this paper, we propose a new target criterion for model confidence, corresponding to the True Class Probability (TCP). We show how using the TCP is more suited than relying on the classic Maximum Class Probability (MCP). We provide in addition theoretical guarantees for TCP in the context of failure prediction. Since the true class is by essence unknown at test time, we propose to learn TCP criterion on the training set, introducing a specific learning scheme adapted to this context. Extensive experiments are conducted for validating the relevance of the proposed approach. We study various network architectures, small and large scale datasets for image classification and semantic segmentation. We show that our approach consistently outperforms several strong methods, from MCP to Bayesian uncertainty, as well as recent approaches specifically designed for failure prediction.
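
To make the two criteria concrete (an illustrative sketch of the definitions only, not the paper's learned TCP estimator):

```python
import numpy as np

def mcp_and_tcp(probs, true_labels):
    """probs: softmax outputs, shape (N, C); true_labels: ground-truth indices (N,).

    MCP uses the highest predicted probability as confidence; TCP uses the
    probability assigned to the *true* class, which is unknown at test time
    and therefore has to be learned from the training set."""
    mcp = probs.max(axis=1)
    tcp = probs[np.arange(len(true_labels)), true_labels]
    return mcp, tcp
```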

40.CompareNet: Anatomical Segmentation Network with Deep Non-local Label Fusion ⬇️

Label propagation is a popular technique for anatomical segmentation. In this work, we propose a novel deep framework for label propagation based on non-local label fusion. Our framework, named CompareNet, incorporates subnets for both extracting discriminating features and learning the similarity measure, which lead to accurate segmentation. We also introduce voxel-wise classification as a unary potential to the label fusion function, for alleviating the search failure issue of the existing non-local fusion strategies. Moreover, CompareNet is end-to-end trainable, and all the parameters are learnt together for optimal performance. By evaluating CompareNet on two public datasets, IBSRv2 and MICCAI 2012, for brain segmentation, we show it outperforms state-of-the-art methods in accuracy, while being robust to pathologies.

41.Dynamic Spectral Residual Superpixels ⬇️

We consider the problem of segmenting an image into superpixels in the context of $k$-means clustering, in which we wish to decompose an image into local, homogeneous regions corresponding to the underlying objects. Our novel approach builds upon the widely used Simple Linear Iterative Clustering (SLIC), and incorporates a measure of objects' structure based on the spectral residual of an image. Based on this combination, we propose a modified initialisation scheme and search metric, which help keep fine details. This combination leads to better adherence to object boundaries and prevents unnecessary segmentation of large, uniform areas, while remaining computationally tractable in comparison to other methods. We demonstrate through numerical and visual experiments that our approach outperforms the state-of-the-art techniques.
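
For background, the spectral residual of an image (the structural cue the method above incorporates) is obtained by removing the smooth part of the log-amplitude spectrum and transforming back to the spatial domain; a minimal NumPy/SciPy sketch of the classic formulation, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray):
    """Spectral-residual saliency map of a grayscale image (float array)."""
    f = np.fft.fft2(gray)
    log_amp = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    residual = log_amp - uniform_filter(log_amp, size=3)   # remove smooth spectrum
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(saliency, sigma=2.5)             # smooth the map
```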

42.Measuring robustness of Visual SLAM ⬇️

Simultaneous localization and mapping (SLAM) is an essential component of robotic systems. In this work we perform a feasibility study of RGB-D SLAM for the task of indoor robot navigation. Recent visual SLAM methods, e.g. ORBSLAM2 \cite{mur2017orb}, demonstrate really impressive accuracy, but the experiments in the papers are usually conducted on just a few sequences, which makes it difficult to reason about the robustness of the methods. Another problem is that all available RGB-D datasets contain trajectories with very complex camera motions. In this work we extensively evaluate ORBSLAM2 to better understand the state-of-the-art. First, we conduct experiments on the popular publicly available datasets for RGB-D SLAM across the conventional metrics. We perform statistical analysis of the results and find correlations between the metrics and the attributes of the trajectories. Then, we introduce a new large and diverse HomeRobot dataset where we model the motions of a simple home robot. Our dataset is created using physically-based rendering with realistic lighting and contains scenes composed by human designers. It includes thousands of sequences, which is two orders of magnitude more than in previous works. We find that while in many cases the accuracy of SLAM is very good, the robustness is still an issue.

43.A Generative Approach Towards Improved Robotic Detection of Marine Litter ⬇️

This paper presents an approach to address data scarcity problems in underwater image datasets for visual detection of marine debris. The proposed approach relies on a two-stage variational autoencoder (VAE) and a binary classifier to evaluate the generated imagery for quality and realism. From the images generated by the two-stage VAE, the binary classifier selects "good quality" images and augments the given dataset with them. Lastly, a multi-class classifier is used to evaluate the impact of the augmentation process by measuring the accuracy of an object detector trained on combinations of real and generated trash images. Our results show that the classifier trained with the augmented data outperforms the one trained only with the real data. This approach will not only be valid for the underwater trash classification problem presented in this paper, but it will also be useful for any data-dependent task for which collecting more images is challenging or infeasible.

44.Aff-Wild Database and AffWildNet ⬇️

In the context of HCI, building an automatic system that recognizes the affect of human facial expressions in real-world conditions is crucial for making machines interact naturally with humans. However, existing facial emotion databases usually contain expressions recorded in limited scenarios under well-controlled conditions. Aff-Wild is currently the largest database of spontaneous facial expressions in the wild annotated with valence and arousal. The first contribution of this project is the extension of the Aff-Wild database, fulfilled by collecting YouTube videos containing spontaneous facial expressions in the wild, annotating the videos with valence and arousal values ranging in [-1,1], detecting faces in frames using the FFLD2 detector, and partitioning the whole dataset into train, validation and test sets with 527,056, 94,223 and 135,145 frames, respectively. Diversity is guaranteed regarding age, ethnicity and values of valence and arousal, and the ratio of male to female subjects is close to 1. Regarding the techniques used to build the automatic system, deep learning stands out, since almost all winning methods in emotion challenges adopt DNN techniques. The second contribution of this project is an end-to-end DNN constructed with joint CNN and RNN blocks that estimates valence and arousal for each frame in sequential data. VGGFace, ResNet and DenseNet with the corresponding pre-trained models for the CNN block, and LSTM, GRU, IndRNN and an attention mechanism for the RNN block, are experimented with, aiming to find the best combination. Fine-tuning and transfer learning techniques are also tried out. By comparing the CCC evaluation values on the test data, the best model is found to be pre-trained VGGFace connected with a 2-layer GRU with an attention mechanism. The model's test performance is 0.555 CCC for valence with sequence length 80 and 0.499 CCC for arousal with sequence length 70.

45.Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing ⬇️

As a key technology for enabling Artificial Intelligence (AI) applications in the 5G era, Deep Neural Networks (DNNs) have quickly attracted widespread attention. However, it is challenging to run computation-intensive DNN-based tasks on mobile devices due to limited computation resources. What's worse, traditional cloud-assisted DNN inference is heavily hindered by significant wide-area network latency, leading to poor real-time performance as well as low quality of user experience. To address these challenges, in this paper, we propose Edgent, a framework that leverages edge computing for DNN collaborative inference through device-edge synergy. Edgent exploits two design knobs: (1) DNN partitioning that adaptively partitions computation between device and edge for the purpose of coordinating the powerful cloud resource and the proximal edge resource for real-time DNN inference; (2) DNN right-sizing that further reduces computing latency via early exiting inference at an appropriate intermediate DNN layer. In addition, considering the potential network fluctuation in real-world deployment, Edgent is properly designed to specialize for both static and dynamic network environments. Specifically, in a static environment where the bandwidth changes slowly, Edgent derives the best configurations with the assistance of regression-based prediction models, while in a dynamic environment where the bandwidth varies dramatically, Edgent generates the best execution plan through an online change point detection algorithm that maps the current bandwidth state to the optimal configuration. We implement an Edgent prototype based on a Raspberry Pi and a desktop PC, and extensive experimental evaluations demonstrate Edgent's effectiveness in enabling on-demand low-latency edge intelligence.
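
To illustrate the "right-sizing" idea of exiting at an intermediate layer when the early prediction is already confident, here is a toy PyTorch sketch under our own assumptions (the layer sizes, the 0.9 threshold and a batch size of 1 are illustrative, not Edgent's actual design):

```python
import torch
from torch import nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    """Toy network with an intermediate exit: if the early classifier is
    confident enough, inference stops there and the later layers are skipped."""
    def __init__(self, n_classes=10, threshold=0.9):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
        self.exit1 = nn.Linear(64, n_classes)       # early exit head
        self.stage2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.exit2 = nn.Linear(64, n_classes)       # final head
        self.threshold = threshold

    def forward(self, x):                           # assumes batch size 1
        h = self.stage1(x)
        p1 = F.softmax(self.exit1(h), dim=-1)
        if p1.max() >= self.threshold:              # confident: exit early
            return p1
        return F.softmax(self.exit2(self.stage2(h)), dim=-1)
```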

46.Map Matching Algorithm for Large-scale Datasets ⬇️

GPS receivers embedded in cell phones and connected vehicles generate a series of location measurements that can be used for various analytical purposes. A common pre-processing step for this data is so-called map matching. The goal of map matching is to infer the trajectory that the device followed in a road network from a potentially sparse series of noisy location measurements. Although accurate and robust map matching algorithms based on probabilistic models exist, they are computationally heavy and thus impractical for processing large datasets. In this paper, we present a scalable map-matching algorithm based on Dijkstra's shortest path method that is both accurate and applicable to large datasets. Our experiments on a publicly available dataset showed that the proposed method achieves accuracy comparable to that of existing map matching methods using only a fraction of the computational resources. As a result, our algorithm can be used to efficiently process large datasets of noisy and potentially sparse location data that would be unexploitable using existing techniques due to their high computational requirements.
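
The routing core such a Dijkstra-based matcher would repeatedly call between candidate projections of consecutive GPS fixes can be sketched as follows; the graph layout and candidate-selection step around it are assumptions, not the paper's code:

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest-path cost in a road graph given as
    {node: [(neighbor, edge_length_m), ...]}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")
```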

47.Communications and Networking Technologies for Intelligent Drone Cruisers ⬇️

Future mobile communication networks require Aerial Base Stations (ABS) with fast mobility and long-term hovering capabilities. At present, unmanned aerial vehicles (UAVs) or drones do not have long flight times and are mainly used for monitoring, surveillance, and image post-processing. On the other hand, the traditional airship is too large and not easy to take off and land. Therefore, we propose to develop an "Artificial Intelligence (AI) Drone-Cruiser" base station that can help 5G mobile communication systems and beyond quickly recover the network after a disaster and handle the instant communication demands created by flash crowds. The drone-cruiser base station can overcome the communication problems for three types of flash crowds, such as in stadiums, parades, and large plazas, so that an appropriate number of aerial base stations can be accurately deployed to meet large and dynamic traffic demands. Artificial intelligence can solve these problems by analyzing the collected data and then adjusting the system parameters in the framework of a Self-Organizing Network (SON) to achieve the goals of self-configuration, self-optimization, and self-healing. With the help of AI technologies, 5G networks can become more intelligent. This paper aims to provide a new type of service, On-Demand Aerial Base Station as a Service. This work needs to overcome the following five technical challenges: innovative design of drone-cruisers for long-time hovering, crowd estimation and prediction, rapid 3D wireless channel learning and modeling, 3D placement of aerial base stations, and the integration of WiFi front-haul and millimeter wave/WiGig back-haul networks.

48.Spectral Graph Wavelet Transform as Feature Extractor for Machine Learning in Neuroimaging ⬇️

Graph Signal Processing has become a very useful framework for signal operations and representations defined on irregular domains. Exploiting transformations that are defined on graph models can be highly beneficial when the graph encodes relationships between signals. In this work, we present the benefits of using Spectral Graph Wavelet Transform (SGWT) as a feature extractor for machine learning on brain graphs. First, we consider a synthetic regression problem in which the smooth graph signals are generated as input with additive noise, and the target is derived from the input without noise. This enables us to optimize the spectrum coverage using different wavelet shapes. Finally, we present the benefits obtained by SGWT on a functional Magnetic Resonance Imaging (fMRI) open dataset on human subjects, with several graphs and wavelet shapes, by demonstrating significant performance improvements compared to the state of the art.
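
A small NumPy sketch of computing spectral graph wavelet coefficients by eigendecomposition of the graph Laplacian, suitable for small brain graphs; the band-pass kernel below is one common choice and not necessarily the wavelet shape used in the paper:

```python
import numpy as np

def sgwt_features(L, f, scales, kernel=lambda x: x * np.exp(-x)):
    """W_f(s, n) = sum_l g(s * lambda_l) <u_l, f> u_l(n), where (lambda_l, u_l)
    are eigenpairs of the graph Laplacian L and f is the graph signal."""
    lam, U = np.linalg.eigh(L)          # L = U diag(lam) U^T
    f_hat = U.T @ f                     # graph Fourier transform of the signal
    return np.stack([U @ (kernel(s * lam) * f_hat) for s in scales])

# One coefficient vector per scale -> feature matrix for a downstream regressor.
```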

49.Single Image BRDF Parameter Estimation with a Conditional Adversarial Network ⬇️

Creating plausible surfaces is an essential component in achieving a high degree of realism in rendering. To relieve artists, who create these surfaces in a time-consuming, manual process, automated retrieval of the spatially-varying Bidirectional Reflectance Distribution Function (SVBRDF) from a single mobile phone image is desirable. By leveraging a deep neural network, this casual capturing method can be achieved. The trained network estimates per-pixel normal, base color, metallic and roughness parameters of the Disney BRDF. The input image is taken with a mobile phone lit by the camera flash. The network is trained to compensate for environment lighting and thus learns to reduce artifacts introduced by other light sources. The training losses comprise a multi-scale discriminator with an additional perceptual loss, a rendering loss using a differentiable renderer, and a parameter loss. Besides local precision, this loss formulation generates material texture maps which are globally more consistent. The network is set up as a generator trained in an adversarial fashion to ensure that only plausible maps are produced. The estimated parameters not only reproduce the material faithfully in rendering but also capture the style of hand-authored materials, thanks to the more global loss terms compared to previous works, without requiring additional post-processing. Both the resolution and the quality are improved.

50.Improving Generalization and Robustness with Noisy Collaboration in Knowledge Distillation ⬇️

Inspired by trial-to-trial variability in the brain that can result from multiple noise sources, we introduce variability through noise at different levels in a knowledge distillation framework. We introduce "Fickle Teacher", which provides variable supervision signals to the student for the same input. We observe that the response variability from the teacher results in a significant generalization improvement in the student. We further propose "Soft-Randomization" as a novel technique for improving robustness to input variability in the student. This minimizes the dissimilarity between the student's distribution on noisy data and the teacher's distribution on clean data. We show that soft-randomization, even with low noise intensity, improves robustness significantly with minimal drop in generalization. Lastly, we propose a new technique, "Messy-collaboration", which introduces target variability, whereby the student and/or teacher are trained with randomly corrupted labels. We find that supervision from a corrupted teacher improves the adversarial robustness of the student significantly while preserving its generalization and natural robustness. Our extensive empirical results verify the effectiveness of adding constructive noise in the knowledge distillation framework for improving the generalization and robustness of the model.
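
A minimal PyTorch sketch of the core idea behind "Soft-Randomization": match the student's distribution on noise-perturbed inputs to the teacher's distribution on clean inputs. The temperature, noise level and distillation form here are generic assumptions, not the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def soft_randomization_loss(student, teacher, x, noise_std=0.05, T=4.0):
    """KL divergence between student(noisy input) and teacher(clean input)."""
    with torch.no_grad():
        teacher_logits = teacher(x)                   # teacher sees clean data
    noisy_x = x + noise_std * torch.randn_like(x)     # student sees perturbed data
    student_logits = student(noisy_x)
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
```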

51.A sub-Riemannian model of the visual cortex with frequency and phase ⬇️

In this paper we present a novel model of the primary visual cortex (V1) based on orientation, frequency and phase selective behavior of the V1 simple cells. We start from the first level mechanisms of visual perception: receptive profiles. The model interprets V1 as a fiber bundle over the 2-dimensional retinal plane by introducing orientation, frequency and phase as intrinsic variables. Each receptive profile on the fiber is mathematically interpreted as a rotated, frequency modulated and phase shifted Gabor function. We start from the Gabor function and show that it induces in a natural way the model geometry and the associated horizontal connectivity modeling the neural connectivity patterns in V1. We provide an image enhancement algorithm employing the model framework. The algorithm is capable of exploiting not only orientation but also frequency and phase information existing intrinsically in a 2-dimensional input image. We provide the experimental results corresponding to the enhancement algorithm.
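
A short NumPy sketch of a rotated, frequency-modulated and phase-shifted 2D Gabor function of the kind used as a receptive-profile model above; the exact parameterization and normalization are a common convention, not necessarily the paper's:

```python
import numpy as np

def gabor_profile(size=64, theta=0.0, freq=0.1, phase=0.0, sigma=8.0):
    """Gaussian envelope times an oriented cosine carrier with a phase shift."""
    y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
    x_r = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates by theta
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + y_r ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * freq * x_r + phase)
    return envelope * carrier
```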

52.Multi-modal Deep Analysis for Multimedia ⬇️

With the rapid development of the Internet and multimedia services in the past decade, a huge amount of user-generated and service-provider-generated multimedia data has become available. These data are heterogeneous and multi-modal in nature, imposing great challenges for processing and analyzing them. Multi-modal data consist of a mixture of various types of data from different modalities such as text, images, videos, audio, etc. In this article, we present a deep and comprehensive overview of multi-modal analysis in multimedia. We introduce two scientific research problems, data-driven correlational representation and knowledge-guided fusion for multimedia analysis. To address these two scientific problems, we investigate them from the following aspects: 1) multi-modal correlational representation: multi-modal fusion of data across different modalities, and 2) multi-modal data and knowledge fusion: multi-modal fusion of data with domain knowledge. More specifically, on data-driven correlational representation, we highlight three important categories of methods: multi-modal deep representation, multi-modal transfer learning, and multi-modal hashing. On knowledge-guided fusion, we discuss approaches for fusing knowledge with data and four exemplar applications that require various kinds of domain knowledge, including multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining and multi-modal recommendation. Finally, we bring forward our insights and future research directions.

53.Adversarial Pulmonary Pathology Translation for Pairwise Chest X-ray Data Augmentation ⬇️

Recent works show that Generative Adversarial Networks (GANs) can be successfully applied to chest X-ray data augmentation for lung disease recognition. However, the implausible and distorted pathology features generated from the less than perfect generator may lead to wrong clinical decisions. Why not keep the original pathology region? We proposed a novel approach that allows our generative model to generate high quality plausible images that contain undistorted pathology areas. The main idea is to design a training scheme based on an image-to-image translation network to introduce variations of new lung features around the pathology ground-truth area. Moreover, our model is able to leverage both annotated disease images and unannotated healthy lung images for the purpose of generation. We demonstrate the effectiveness of our model on two tasks: (i) we invite certified radiologists to assess the quality of the generated synthetic images against real and other state-of-the-art generative models, and (ii) data augmentation to improve the performance of disease localisation.

54.Scene-level Pose Estimation for Multiple Instances of Densely Packed Objects ⬇️

This paper introduces key machine learning operations that allow the realization of robust, joint 6D pose estimation of multiple instances of objects either densely packed or in unstructured piles from RGB-D data. The first objective is to learn semantic and instance-boundary detectors without manual labeling. An adversarial training framework in conjunction with physics-based simulation is used to achieve detectors that behave similarly in synthetic and real data. Given the stochastic output of such detectors, candidates for object poses are sampled. The second objective is to automatically learn a single score for each pose candidate that represents its quality in terms of explaining the entire scene via a gradient boosted tree. The proposed method uses features derived from surface and boundary alignment between the observed scene and the object model placed at hypothesized poses. Scene-level, multi-instance pose estimation is then achieved by an integer linear programming process that selects hypotheses that maximize the sum of the learned individual scores, while respecting constraints, such as avoiding collisions. To evaluate this method, a dataset of densely packed objects with challenging setups for state-of-the-art approaches is collected. Experiments on this dataset and a public one show that the method significantly outperforms alternatives in terms of 6D pose accuracy while trained only with synthetic datasets.
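
A toy version of the scene-level selection step: choose a subset of pose hypotheses maximizing the summed learned scores while avoiding colliding pairs. Brute force over subsets stands in for the integer linear program used in the paper and is only workable for small candidate sets:

```python
from itertools import combinations

def select_hypotheses(scores, collisions, max_instances):
    """scores: list of learned per-hypothesis scores;
    collisions: set of (i, j) index pairs whose poses collide."""
    n = len(scores)
    best_subset, best_score = (), float("-inf")
    for k in range(1, min(max_instances, n) + 1):
        for subset in combinations(range(n), k):
            if any((i, j) in collisions or (j, i) in collisions
                   for i, j in combinations(subset, 2)):
                continue  # violates the no-collision constraint
            total = sum(scores[i] for i in subset)
            if total > best_score:
                best_subset, best_score = subset, total
    return best_subset, best_score
```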

55.Coarse-To-Fine Visual Localization Using Semantic Compact Map ⬇️

Robust visual localization for urban vehicles remains challenging and unsolved. The limitations of computation efficiency and memory size have made it harder for large-scale applications. Since semantic information serves as a stable and compact representation of the environment, we propose a coarse-to-fine localization system based on a semantic compact map. Pole-like objects are stored in the compact map and then extracted from semantically segmented images as observations. Localization is performed by a particle filter, followed by a pose alignment module that decouples translation and rotation to achieve better accuracy. We evaluate our system on both synthetic and realistic datasets and compare it with two baselines, a state-of-the-art semantic feature-based system and a traditional SIFT feature-based system. Experiments demonstrate that even with a significantly small map, such as a 10 KB map for a 3.7 km long trajectory, our system provides accuracy comparable to the baselines.

56.Deep Learning for Prostate Pathology ⬇️

The current study detects different morphologies related to prostate pathology using deep learning models; these models were evaluated on 2,121 hematoxylin and eosin (H&E) stained histology images that spanned a variety of image qualities, origins (whole-slide, tissue microarray, whole mount, Internet), scanning machines, timestamps, H&E staining protocols, and institutions. All histology images were captured using bright-field microscopy. As a use case, these models were applied to the annotation tasks in clinician-oriented (cMDX) reports for prostatectomy specimens. The true positive rate (TPR) for slides with prostate cancer was 99.1%. The F1-scores of Gleason patterns reported in cMDX reports range between 0.795 and 1.0 at the case level (n=55). The TPR was 93.6% for the cribriform morphology and 72.6% for the ductal morphology. The R-squared for the relative tumor volume was 0.987 between the ground truth and the prediction. Our models cover the major prostate pathologies and successfully accomplish the annotation tasks.

57.Coloring the Black Box: Visualizing neural network behavior with a self-introspective model ⬇️

The following work presents how autoencoding all the possible hidden activations of a network for a given problem can provide insight about its structure, behavior, and vulnerabilities. The method, termed self-introspection, can show that a trained model exhibits similar activation patterns (albeit randomly distributed due to initialization) when shown data belonging to the same category, and that classification errors occur in fringe areas where the activations are not as clearly defined, suggesting some form of random, slowly varying, implicit encoding occurring within deep networks that can be observed with this representation. Additionally, obtaining a low-dimensional representation of all the activations allows for (1) real-time model evaluation in the context of a multiclass classification problem, (2) the rearrangement of all hidden layers by their relevance in obtaining a specific output, and (3) a framework in which studying possible counter-measures to noise and adversarial attacks is possible. Self-introspection can show how damaged input data can modify the hidden activations, producing an erroneous response. A few illustrative examples are implemented for feedforward and convolutional models on the MNIST and CIFAR-10 datasets, showcasing its capabilities as a model evaluation framework.

58.Estimating localized complexity of white-matter wiring with GANs ⬇️

In-vivo examination of the physical connectivity of axonal projections through the white matter of the human brain is made possible by diffusion weighted magnetic resonance imaging (dMRI). Analysis of dMRI commonly considers derived scalar metrics such as fractional anisotropy as proxies for "white matter integrity," and differences in such measures have been observed to correlate significantly with various neurological diagnoses and clinical measures such as executive function, presence of multiple sclerosis, and genetic similarity.
The analysis of such voxel measures is confounded in areas of more complicated fiber wiring due to crossing, kissing, and dispersing fibers. Recently, Volz et al. introduced a simple probabilistic measure of the count of distinct fiber populations within a voxel, which was shown to reduce variance in group comparisons. We propose a complementary measure that considers the complexity of a voxel in context of its local region, with an aim to quantify the localized wiring complexity of every part of white matter. This allows, for example, identification of particularly ambiguous regions of the brain for tractographic approaches of modeling global wiring connectivity.
Our method builds on recent advances in image inpainting, in which the task is to plausibly fill in a missing region of an image. Specifically, we form a Bayesian estimate of the heteroscedastic aleatoric uncertainty of a region of white matter by inpainting it from its context. We define the localized wiring complexity of white matter as how accurately and confidently a well-trained model can predict the missing patch. In our results, we observe low aleatoric uncertainty along major neuronal pathways, which increases at junctions and towards cortex boundaries. This directly quantifies the difficulty of lesion inpainting of dMRI images at all parts of white matter.
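
The standard per-voxel heteroscedastic aleatoric loss such an inpainting network would be trained with can be sketched as below; it assumes the network outputs both a mean and a log-variance for the masked region, which is a common formulation rather than the paper's exact one:

```python
import torch

def heteroscedastic_loss(pred_mean, pred_log_var, target):
    """0.5 * exp(-s) * ||y - mu||^2 + 0.5 * s, with s = log(sigma^2),
    averaged over all masked voxels."""
    return torch.mean(0.5 * torch.exp(-pred_log_var) * (target - pred_mean) ** 2
                      + 0.5 * pred_log_var)
```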

59.Automatic Segmentation of Muscle Tissue and Inter-muscular Fat in Thigh and Calf MRI Images ⬇️

Magnetic resonance imaging (MRI) of thigh and calf muscles is one of the most effective techniques for estimating fat infiltration into muscular dystrophies. The infiltration of adipose tissue into the diseased muscle region varies in its severity across, and within, patients. In order to efficiently quantify the infiltration of fat, accurate segmentation of muscle and fat is needed. An estimation of the amount of infiltrated fat is typically done visually by experts. Several algorithmic solutions have been proposed for automatic segmentation. While these methods may work well in mild cases, they struggle in moderate and severe cases due to the high variability in the intensity of infiltration, and the tissue's heterogeneous nature. To address these challenges, we propose a deep-learning approach, producing robust results with high Dice Similarity Coefficient (DSC) of 0.964, 0.917 and 0.933 for muscle-region, healthy muscle and inter-muscular adipose tissue (IMAT) segmentation, respectively.
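
For reference, the Dice Similarity Coefficient reported above, as a small NumPy function (the standard DSC definition, not code from the paper):

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    """DSC = 2|A intersect B| / (|A| + |B|) for binary masks."""
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)
```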

60.Deep Imitation Learning of Sequential Fabric Smoothing Policies ⬇️

Sequential pulling policies to flatten and smooth fabrics have applications from surgery to manufacturing to home tasks such as bed making and folding clothes. Due to the complexity of fabric states and dynamics, we apply deep imitation learning to learn policies that, given color or depth images of a rectangular fabric sample, estimate pick points and pull vectors to spread the fabric to maximize coverage. To generate data, we develop a fabric simulator and an algorithmic demonstrator that has access to complete state information. We train policies in simulation using domain randomization and dataset aggregation (DAgger) on three tiers of difficulty in the initial randomized configuration. We present results comparing five baseline policies to learned policies and report systematic comparisons of color vs. depth images as inputs. In simulation, learned policies achieve comparable or superior performance to analytic baselines. In 120 physical experiments with the da Vinci Research Kit (dVRK) surgical robot, policies trained in simulation attain 86% and 69% final coverage for color and depth inputs, respectively, suggesting the feasibility of learning fabric smoothing policies from simulation. Supplementary material is available at this https URL fabric-smoothing.
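
A minimal sketch of the DAgger training loop referred to above, with placeholder interfaces for the policy, the algorithmic demonstrator and the fabric simulator (these names and signatures are assumptions, not the paper's code):

```python
def dagger(policy, expert, simulator, n_iterations=10, rollouts_per_iter=20):
    """Roll out the current policy, relabel visited states with the expert's
    actions, aggregate the data, and retrain the policy on the aggregate."""
    dataset = []                                  # (observation, expert_action) pairs
    for _ in range(n_iterations):
        for _ in range(rollouts_per_iter):
            obs = simulator.reset()
            done = False
            while not done:
                dataset.append((obs, expert.action(simulator.state())))
                obs, done = simulator.step(policy.action(obs))
        policy.fit(dataset)                       # supervised retraining
    return policy
```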

61.Training Multiscale-CNN for Large Microscopy Image Classification in One Hour ⬇️

Existing approaches to training neural networks that use large images require either cropping or down-sampling the data during pre-processing, using small batch sizes, or splitting the model across devices, mainly due to the prohibitively limited memory capacity available on GPUs and emerging accelerators. These techniques often lead to longer time to convergence or time to train (TTT), and in some cases, lower model accuracy. CPUs, on the other hand, can leverage significant amounts of memory. While much work has been done on parallelizing neural network training on multiple CPUs, little attention has been given to tuning neural network training with large images on CPUs. In this work, we train a multi-scale convolutional neural network (M-CNN) to classify large biomedical images for high-content screening in one hour. The ability to leverage the large memory capacity of CPUs enables us to scale to larger batch sizes without having to crop or down-sample the input images. In conjunction with large batch sizes, we find that a generalized methodology of linearly scaling the learning rate allows us to train the M-CNN to state-of-the-art (SOTA) accuracy of 99% within one hour. We achieve this fast time to convergence using 128 two-socket Intel Xeon 6148 processor nodes with 192GB DDR4 memory connected with a 100Gbps Intel Omnipath architecture.
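
The linear learning-rate scaling rule mentioned above is simple enough to state directly; the numbers in the example are illustrative, not the paper's hyperparameters:

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """lr = base_lr * (batch_size / base_batch_size)."""
    return base_lr * batch_size / base_batch_size

# e.g. a reference lr of 0.001 at batch size 32 becomes 0.016 at batch size 512
print(scaled_learning_rate(0.001, 32, 512))
```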

62.ErrorNet: Learning error representations from limited data to improve vascular segmentation ⬇️

Deep convolutional neural networks have proved effective in segmenting lesions and anatomies in various medical imaging modalities. However, in the presence of small sample sizes and domain shift, these models often produce masks with non-intuitive segmentation mistakes. In this paper, we propose a segmentation framework called ErrorNet, which learns to correct these segmentation mistakes through the repeated process of injecting systematic segmentation errors into a segmentation mask based on a learned shape prior, followed by attempting to predict the injected error. During inference, ErrorNet corrects the segmentation mistakes by adding the predicted error map to the initial segmentation mask. ErrorNet has advantages over alternatives based on domain adaptation or CRF-based post-processing, because it requires neither domain-specific parameter tuning nor any data from the target domains. We have evaluated ErrorNet using five public datasets for the task of retinal vessel segmentation. The selected datasets differ in size and patient population, allowing us to evaluate the effectiveness of ErrorNet in handling small sample size and domain shift problems. Our experiments demonstrate that ErrorNet outperforms a base segmentation model, a CRF-based post-processing scheme, and a domain adaptation method, with a greater performance gain in the presence of the aforementioned dataset limitations.
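
The inference-time correction step described above is a single addition; a minimal sketch follows, where the final re-binarization threshold is an assumption rather than a detail taken from the paper:

```python
import numpy as np

def errornet_inference(initial_mask, predicted_error_map):
    """Add the predicted error map to the initial soft mask and re-binarize."""
    corrected = np.clip(initial_mask + predicted_error_map, 0.0, 1.0)
    return (corrected >= 0.5).astype(np.uint8)
```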

63.Video Summarization using Keyframe Extraction and Video Skimming ⬇️

Video is one of the richest sources of information, and the consumption of online and offline videos has reached an unprecedented level in the last few years. A fundamental challenge of extracting information from videos is that a viewer has to go through the complete video to understand the context, as opposed to an image where the viewer can extract information from a single frame. In this work, we employ different algorithmic methodologies, including local features and deep neural networks, along with multiple clustering methods, to find an effective way of summarizing a video through interesting keyframe extraction.
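
One of the clustering-based strategies of this kind can be sketched as follows: cluster per-frame feature vectors (for example CNN embeddings or aggregated local descriptors) and keep the frame closest to each cluster centre as a keyframe. The feature choice and cluster count are assumptions, not the paper's exact settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def keyframes_by_clustering(frame_features, n_keyframes=10):
    """frame_features: array of shape (n_frames, feature_dim)."""
    km = KMeans(n_clusters=n_keyframes, n_init=10).fit(frame_features)
    keyframe_ids = []
    for c, centre in enumerate(km.cluster_centers_):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(frame_features[members] - centre, axis=1)
        keyframe_ids.append(int(members[np.argmin(dists)]))  # closest to centre
    return sorted(keyframe_ids)
```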

64.Combining Geometric and Topological Information in Image Segmentation ⬇️

A fundamental problem in computer vision is image segmentation, where the goal is to delineate the boundary of the object in the image. The focus of this work is on the segmentation of grayscale images and its purpose is two-fold. First, we conduct an in-depth study comparing active contour and topologically-based methods, two popular approaches for boundary detection of 2-dimensional images. Certain properties of the image dataset may favor one method over the other, both from an interpretability perspective as well as through evaluation of performance measures. Second, we propose the use of topological knowledge to assist an active contour method, which can potentially incorporate prior shape information. The latter is known to be extremely sensitive to algorithm initialization, and thus, we use a topological model to provide an automatic initialization. In addition, our proposed model can handle objects in images with more complex topological structures. We demonstrate this on artificially-constructed image datasets from computer vision, as well as real medical image data.

65.Improving sample diversity of a pre-trained, class-conditional GAN by changing its class embeddings ⬇️

Mode collapse is a well-known issue with Generative Adversarial Networks (GANs) and is a byproduct of unstable GAN training. We propose to improve the sample diversity of a pre-trained class-conditional generator by modifying its class embeddings in the direction of maximizing the log probability outputs of a classifier pre-trained on the same dataset. We improved the sample diversity of state-of-the-art ImageNet BigGANs at both 128x128 and 256x256 resolutions. By replacing the embeddings, we can also synthesize plausible images for Places365 using a BigGAN pre-trained on ImageNet.
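
A sketch of the embedding update described above: gradient-ascend the classifier's log-probability of the target class with respect to the class embedding while the generator and classifier stay frozen. The `generator` and `classifier` interfaces, step count and learning rate are assumptions, not BigGAN's actual API:

```python
import torch

def diversify_class_embedding(generator, classifier, embedding, class_id,
                              steps=50, lr=0.01, batch=8, z_dim=128):
    """Optimize a single class embedding to maximize classifier log-probability
    of class_id over generated samples."""
    embedding = embedding.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([embedding], lr=lr)
    for _ in range(steps):
        z = torch.randn(batch, z_dim)
        images = generator(z, embedding.expand(batch, -1))   # frozen generator
        log_probs = torch.log_softmax(classifier(images), dim=1)[:, class_id]
        loss = -log_probs.mean()          # ascent on log-probability
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return embedding.detach()
```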