Skip to content
This repository has been archived by the owner on Apr 21, 2024. It is now read-only.

Latest commit

 

History

History
87 lines (87 loc) · 58.6 KB

20191223.md

File metadata and controls

87 lines (87 loc) · 58.6 KB

ArXiv cs.CV --Mon, 23 Dec 2019

1.Identity Document to Selfie Face Matching Across Adolescence ⬇️

Matching live images (``selfies'') to images from ID documents is a problem that can arise in various applications. A challenging instance of the problem arises when the face image on the ID document is from early adolescence and the live image is from later adolescence. We explore this problem using a private dataset called Chilean Young Adult (CHIYA) dataset, where we match live face images taken at age 18-19 to face images on ID documents created at ages 9 to 18. State-of-the-art deep learning face matchers (e.g., ArcFace) have relatively poor accuracy for document-to-selfie face matching. To achieve higher accuracy, we fine-tune the best available open-source model with triplet loss for a few-shot learning. Experiments show that our approach achieves higher accuracy than the DocFace+ model recently developed for this problem. Our fine-tuned model was able to improve the true acceptance rate for the most difficult (largest age span) subset from 62.92% to 96.67% at a false acceptance rate of 0.01%. Our fine-tuned model is available for use by other researchers.

2.TreyNet: A Neural Model for Text Localization, Transcription and Named Entity Recognition in Full Pages ⬇️

In the last years, the consolidation of deep neural network architectures for information extraction in document images has brought big improvements in the performance of each of the tasks involved in this process, consisting of text localization, transcription, and named entity recognition. However, this process is traditionally performed with separate methods for each task. In this work we propose an end-to-end model that jointly performs handwritten text detection, transcription, and named entity recognition at page level, capable of benefiting from shared features for these tasks. We exhaustively evaluate our approach on different datasets, discussing its advantages and limitations compared to sequential approaches.

3.Advanced Variations of Two-Dimensional Principal Component Analysis for Face Recognition ⬇️

The two-dimensional principal component analysis (2DPCA) has become one of the most powerful tools of artificial intelligent algorithms. In this paper, we review 2DPCA and its variations, and propose a general ridge regression model to extract features from both row and column directions. To enhance the generalization ability of extracted features, a novel relaxed 2DPCA (R2DPCA) is proposed with a new ridge regression model. R2DPCA generates a weighting vector with utilizing the label information, and maximizes a relaxed criterion with applying an optimal algorithm to get the essential features. The R2DPCA-based approaches for face recognition and image reconstruction are also proposed and the selected principle components are weighted to enhance the role of main components. Numerical experiments on well-known standard databases indicate that R2DPCA has high generalization ability and can achieve a higher recognition rate than the state-of-the-art methods, including in the deep learning methods such as CNNs, DBNs, and DNNs.

4.Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks ⬇️

Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations. In this paper, we study the compositionality of action by looking into the dynamics of subject-object interactions. We propose a novel model which can explicitly reason about the geometric relations between constituent objects and an agent performing an action. To train our model, we collect dense object box annotations on the Something-Something dataset. We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set. The novel aspects of our model are applicable to activities with prominent object interaction dynamics and to objects which can be tracked using state-of-the-art approaches; for activities without clearly defined spatial object-agent interactions, we rely on baseline scene-level spatio-temporal representations. We show the effectiveness of our approach not only on the proposed compositional action recognition task, but also in a few-shot compositional setting which requires the model to generalize across both object appearance and action category.

5.Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification ⬇️

Semmelhack et al. (2014) have achieved high classification accuracy in distinguishing swim bouts of zebrafish using a Support Vector Machine (SVM). Convolutional Neural Networks (CNNs) have reached superior performance in various image recognition tasks over SVMs, but these powerful networks remain a black box. Reaching better transparency helps to build trust in their classifications and makes learned features interpretable to experts. Using a recently developed technique called Deep Taylor Decomposition, we generated heatmaps to highlight input regions of high relevance for predictions. We find that our CNN makes predictions by analyzing the steadiness of the tail's trunk, which markedly differs from the manually extracted features used by Semmelhack et al. (2014). We further uncovered that the network paid attention to experimental artifacts. Removing these artifacts ensured the validity of predictions. After correction, our best CNN beats the SVM by 6.12%, achieving a classification accuracy of 96.32%. Our work thus demonstrates the utility of AI explainability for CNNs.

6.MFPN: A Novel Mixture Feature Pyramid Network of Multiple Architectures for Object Detection ⬇️

Feature pyramids are widely exploited in many detectors to solve the scale variation problem for object detection. In this paper, we first investigate the Feature Pyramid Network (FPN) architectures and briefly categorize them into three typical fashions: top-down, bottom-up and fusing-splitting, which have their own merits for detecting small objects, large objects, and medium-sized objects, respectively. Further, we design three FPNs of different architectures and propose a novel Mixture Feature Pyramid Network (MFPN) which inherits the merits of all these three kinds of FPNs, by assembling the three kinds of FPNs in a parallel multi-branch architecture and mixing the features. MFPN can significantly enhance both one-stage and two-stage FPN-based detectors with about 2 percent Average Precision(AP) increment on the MS-COCO benchmark, at little sacrifice in running time latency. By simply assembling MFPN with the one-stage and two-stage baseline detectors, we achieve competitive single-model detection results on the COCO detection benchmark without bells and whistles.

7.Vertex Feature Encoding and Hierarchical Temporal Modeling in a Spatial-Temporal Graph Convolutional Network for Action Recognition ⬇️

This paper extends the Spatial-Temporal Graph Convolutional Network (ST-GCN) for skeleton-based action recognition by introducing two novel modules, namely, the Graph Vertex Feature Encoder (GVFE) and the Dilated Hierarchical Temporal Convolutional Network (DH-TCN). On the one hand, the GVFE module learns appropriate vertex features for action recognition by encoding raw skeleton data into a new feature space. On the other hand, the DH-TCN module is capable of capturing both short-term and long-term temporal dependencies using a hierarchical dilated convolutional network. Experiments have been conducted on the challenging NTU RGB-D-60 and NTU RGB-D 120 datasets. The obtained results show that our method competes with state-of-the-art approaches while using a smaller number of layers and parameters; thus reducing the required training time and memory.

8.Adversarial Representation Active Learning ⬇️

Active learning aims to develop label-efficient algorithms by querying the most informative samples to be labeled by an oracle. The design of efficient training methods that require fewer labels is an important research direction that allows more effective use of computational and human resources for labeling and training deep neural networks. In this work, we demonstrate how we can use recent advances in deep generative models, to outperform the state-of-the-art in achieving the highest classification accuracy using as few labels as possible. Unlike previous approaches, our approach uses not only labeled images to train the classifier but also unlabeled images and generated images for co-training the whole model. Our experiments show that the proposed method significantly outperforms existing approaches in active learning on a wide range of datasets (MNIST, CIFAR-10, SVHN, CelebA, and ImageNet).

9.DeepSFM: Structure From Motion Via Deep Bundle Adjustment ⬇️

Structure from motion (SfM) is an essential computer vision problem which has not been well handled by deep learning. One of the promising trends is to apply explicit structural constraint, e.g. 3D cost volume, into the this http URL this work, we design a physical driven architecture, namely DeepSFM, inspired by traditional Bundle Adjustment (BA), which consists of two cost volume based architectures for depth and pose estimation respectively, iteratively running to improve this http URL each cost volume, we encode not only photo-metric consistency across multiple input images, but also geometric consistency to ensure that depths from multiple views agree with each other.The explicit constraints on both depth (structure) and pose (motion), when combined with the learning components, bring the merit from both traditional BA and emerging deep learning technology.Extensive experiments on various datasets show that our model achieves the state-of-the-art performance on both depth and pose estimation with superior robustness against less number of inputs and the noise in initialization.

10.Controllable Face Aging ⬇️

Motivated by the following two observations: 1) people are aging differently under different conditions for changeable facial attributes, e.g., skin color may become darker when working outside, and 2) it needs to keep some unchanged facial attributes during the aging process, e.g., race and gender, we propose a controllable face aging method via attribute disentanglement generative adversarial network. To offer fine control over the synthesized face images, first, an individual embedding of the face is directly learned from an image that contains the desired facial attribute. Second, since the image may contain other unwanted attributes, an attribute disentanglement network is used to separate the individual embedding and learn the common embedding that contains information about the face attribute (e.g., race). With the common embedding, we can manipulate the generated face image with the desired attribute in an explicit manner. Experimental results on two common benchmarks demonstrate that our proposed generator achieves comparable performance on the aging effect with state-of-the-art baselines while gaining more flexibility for attribute control. Code is available at supplementary material.

11.Segmentations-Leak: Membership Inference Attacks and Defenses in Semantic Image Segmentation ⬇️

Today's success of state of the art methods for semantic segmentation is driven by large datasets. Data is considered an important asset that needs to be protected, as the collection and annotation of such datasets comes at significant efforts and associated costs. In addition, visual data might contain private or sensitive information, that makes it equally unsuited for public release. Unfortunately, recent work on membership inference in the broader area of adversarial machine learning and inference attacks on machine learning models has shown that even black box classifiers leak information on the dataset that they were trained on. We present the first attacks and defenses for complex, state of the art models for semantic segmentation. In order to mitigate the associated risks, we also study a series of defenses against such membership inference attacks and find effective counter measures against the existing risks. Finally, we extensively evaluate our attacks and defenses on a range of relevant real-world datasets: Cityscapes, BDD100K, and Mapillary Vistas.

12.IRS: A Large Synthetic Indoor Robotics Stereo Dataset for Disparity and Surface Normal Estimation ⬇️

Indoor robotics localization, navigation and interaction heavily rely on scene understanding and reconstruction. Compared to monocular vision which usually does not explicitly introduce any geometrical constraint, stereo vision based schemes are more promising and robust to produce accurate geometrical information, such as surface normal and depth/disparity. Besides, deep learning models trained with large-scale datasets have shown their superior performance in many stereo vision tasks. However, existing stereo datasets rarely contain the high-quality surface normal and disparity ground truth, which hardly satisfy the demand of training a prospective deep model for indoor scenes. To this end, we introduce a large-scale synthetic indoor robotics stereo (IRS) dataset with over 100K stereo RGB images and high-quality surface normal and disparity maps. Leveraging the advanced rendering techniques of our customized rendering engine, the dataset is considerably close to the real-world captured images and covers several visual effects, such as brightness changes, light reflection/transmission, lens flare, vivid shadow, etc. We compare the data distribution of IRS with existing stereo datasets to illustrate the typical visual attributes of indoor scenes. In addition, we present a new deep model DispNormNet to simultaneously infer surface normal and disparity from stereo images. Compared to existing models trained on other datasets, DispNormNet trained with IRS produces much better estimation of surface normal and disparity for indoor scenes.

13.Spatial and Temporal Consistency-Aware Dynamic Adaptive Streaming for 360-Degree Videos ⬇️

The 360-degree video allows users to enjoy the whole scene by interactively switching viewports. However, the huge data volume of the 360-degree video limits its remote applications via network. To provide high quality of experience (QoE) for remote web users, this paper presents a tile-based adaptive streaming method for 360-degree videos. First, we propose a simple yet effective rate adaptation algorithm to determine the requested bitrate for downloading the current video segment by considering the balance between the buffer length and video quality. Then, we propose to use a Gaussian model to predict the field of view at the beginning of each requested video segment. To deal with the circumstance that the view angle is switched during the display of a video segment, we propose to download all the tiles in the 360-degree video with different priorities based on a Zipf model. Finally, in order to allocate bitrates for all the tiles, a two-stage optimization algorithm is proposed to preserve the quality of tiles in FoV and guarantee the spatial and temporal smoothness. Experimental results demonstrate the effectiveness and advantage of the proposed method compared with the state-of-the-art methods. That is, our method preserves both the quality and the smoothness of tiles in FoV, thus providing the best QoE for users.

14.Bridging adversarial samples and adversarial networks ⬇️

Generative adversarial networks have achieved remarkable performance on various tasks but suffer from training instability. In this paper, we investigate this problem from the perspective of adversarial samples. We find that adversarial training on fake samples has been implemented in vanilla GAN but that on real samples does not exist, which makes adversarial training unsymmetric. Consequently, discriminator is vulnerable to adversarial perturbation and the gradient given by discriminator contains uninformative adversarial noise. Adversarial noise can not improve the fidelity of generated samples but can drastically change the prediction of discriminator, which can hinder generator from catching the pattern of real samples and cause instability in training. To this end, we further incorporate adversarial training of discriminator on real samples into vanilla GANs. This scheme can make adversarial training symmetric and make discriminator more robust. Robust discriminator can give more informative gradient with less adversarial noise, which can stabilize training and accelerate convergence. We validate the proposed method on image generation on CIFAR-10 , CelebA, and LSUN with varied network architectures. Experiments show that training is stabilized and FID scores of generated samples are improved by $10% \sim 50%$ relative to the baseline with additional $25%$ computation cost.

15.AdaBits: Neural Network Quantization with Adaptive Bit-Widths ⬇️

Deep neural networks with adaptive configurations have gained increasing attention due to the instant and flexible deployment of these models on platforms with different resource budgets. In this paper, we investigate a novel option to achieve this goal by enabling adaptive bit-widths of weights and activations in the model. We first examine the benefits and challenges of training quantized model with adaptive bit-widths, and then experiment with several approaches including direct adaptation, progressive training and joint training. We discover that joint training is able to produce comparable performance on the adaptive model as individual models. We further propose a new technique named Switchable Clipping Level (S-CL) to further improve quantized models at the lowest bit-width. With our proposed techniques applied on a bunch of models including MobileNet-V1/V2 and ResNet-50, we demonstrate that bit-width of weights and activations is a new option for adaptively executable deep neural networks, offering a distinct opportunity for improved accuracy-efficiency trade-off as well as instant adaptation according to the platform constraints in real-world applications.

16.JSNet: Joint Instance and Semantic Segmentation of 3D Point Clouds ⬇️

In this paper, we propose a novel joint instance and semantic segmentation approach, which is called JSNet, in order to address the instance and semantic segmentation of 3D point clouds simultaneously. Firstly, we build an effective backbone network to extract robust features from the raw point clouds. Secondly, to obtain more discriminative features, a point cloud feature fusion module is proposed to fuse the different layer features of the backbone network. Furthermore, a joint instance semantic segmentation module is developed to transform semantic features into instance embedding space, and then the transformed features are further fused with instance features to facilitate instance segmentation. Meanwhile, this module also aggregates instance features into semantic feature space to promote semantic segmentation. Finally, the instance predictions are generated by applying a simple mean-shift clustering on instance embeddings. As a result, we evaluate the proposed JSNet on a large-scale 3D indoor point cloud dataset S3DIS and a part dataset ShapeNet, and compare it with existing approaches. Experimental results demonstrate our approach outperforms the state-of-the-art method in 3D instance segmentation with a significant improvement in 3D semantic prediction and our method is also beneficial for part segmentation. The source code for this work is available at this https URL.

17.ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard ⬇️

Chinese scene text reading is one of the most challenging problems in computer vision and has attracted great interest. Different from English text, Chinese has more than 6000 commonly used characters and Chinesecharacters can be arranged in various layouts with numerous fonts. The Chinese signboards in street view are a good choice for Chinese scene text images since they have different backgrounds, fonts and layouts. We organized a competition called ICDAR2019-ReCTS, which mainly focuses on reading Chinese text on signboard. This report presents the final results of the competition. A large-scale dataset of 25,000 annotated signboard images, in which all the text lines and characters are annotated with locations and transcriptions, were released. Four tasks, namely character recognition, text line recognition, text line detection and end-to-end recognition were set up. Besides, considering the Chinese text ambiguity issue, we proposed a multi ground truth (multi-GT) evaluation method to make evaluation fairer. The competition started on March 1, 2019 and ended on April 30, 2019. 262 submissions from 46 teams are received. Most of the participants come from universities, research institutes, and tech companies in China. There are also some participants from the United States, Australia, Singapore, and Korea. 21 teams submit results for Task 1, 23 teams submit results for Task 2, 24 teams submit results for Task 3, and 13 teams submit results for Task 4. The official website for the competition is this http URL.

18.AtomNAS: Fine-Grained End-to-End Neural Architecture Search ⬇️

Designing of search space is a critical problem for neural architecture search (NAS) algorithms. We propose a fine-grained search space comprised of atomic blocks, a minimal search unit much smaller than the ones used in recent NAS algorithms. This search space facilitates direct selection of channel numbers and kernel sizes in convolutions. In addition, we propose a resource-aware architecture search algorithm which dynamically selects atomic blocks during training. The algorithm is further accelerated by a dynamic network shrinkage technique. Instead of a search-and-retrain two-stage paradigm, our method can simultaneously search and train the target architecture in an end-to-end manner. Our method achieves state-of-the-art performance under several FLOPS configurations on ImageNet with a negligible searching cost. We open our entire codebase at: this https URL

19.AutoScale: Learning to Scale for Crowd Counting ⬇️

Crowd counting in images is a widely explored but challenging task. Though recent convolutional neural network (CNN) methods have achieved great progress, it is still difficult to accurately count and even to precisely localize people in very dense regions. A major issue is that dense regions usually consist of many instances of small size, and thus exhibit very different density patterns compared with sparse regions. Localizing or detecting dense small objects is also very delicate. In this paper, instead of processing image pyramid and aggregating multi-scale features, we propose a simple yet effective Learning to Scale (L2S) module to cope with significant scale variations in both regression and localization. Specifically, L2S module aims to automatically scale dense regions into similar and reasonable scale levels. This alleviates the density pattern shift for density regression methods and facilitates the localization of small instances. Besides, we also introduce a novel distance label map combined with a customized adapted cross-entropy loss for precise person localization. Extensive experiments demonstrate that the proposed method termed AutoScale consistently improves upon state-of-the-art methods in both regression and localization benchmarks on three widely used datasets. The proposed AutoScale also demonstrates a noteworthy transferability under cross-dataset validation on different datasets.

20.Exploring the Capacity of Sequential-free Box Discretization Network for Omnidirectional Scene Text Detection ⬇️

Omnidirectional scene text detection has received increasing research attention. Previous methods directly predict words or text lines of quadrilateral shapes. However, most methods neglect the significance of consistent labeling, which is important to maintain a stable training process, especially when a large amount of data are included. For the first time, we solve the problem in this paper by proposing a novel method termed Sequential-free Box Discretization (SBD). The proposed SBD first discretizes the quadrilateral box into several key edges, which contains all potential horizontal and vertical positions. In order to decode accurate vertex positions, a simple yet effective matching procedure is proposed to reconstruct the quadrilateral bounding boxes. It departs from the learning ambiguity which has a significant influence during the learning process. Exhaustive ablation studies have been conducted to quantitatively validate the effectiveness of our proposed method. More importantly, built upon SBD, we provide a detailed analysis of the impact of a collection of refinements, in the hope to inspire others to build state-of-the-art networks. Combining both SBD and these useful refinements, we achieve state-of-the-art performance on various benchmarks, including ICDAR 2015, and MLT. Our method also wins the first place in text detection task of the recent ICDAR2019 Robust Reading Challenge on Reading Chinese Text on Signboard, further demonstrating its powerful generalization ability. Code is available at this https URL.

21.C2FNAS: Coarse-to-Fine Neural Architecture Search for 3D Medical Image Segmentation ⬇️

3D convolution neural networks (CNN) have been proved very successful in parsing organs or tumours in 3D medical images, but it remains sophisticated and time-consuming to choose or design proper 3D networks given different task contexts. Recently, Neural Architecture Search (NAS) is proposed to solve this problem by searching for the best network architecture automatically. However, the inconsistency between search stage and deployment stage often exists in NAS algorithms due to memory constraints and large search space, which could become more serious when applying NAS to some memory and time consuming tasks, such as 3D medical image segmentation. In this paper, we propose coarse-to-fine neural architecture search (C2FNAS) to automatically search a 3D segmentation network from scratch without inconsistency on network size or input size. Specifically, we divide the search procedure into two stages: 1) the coarse stage, where we search the macro-level topology of the network, i.e. how each convolution module is connected to other modules; 2) the fine stage, where we search at micro-level for operations in each cell based on previous searched macro-level topology. The coarse-to-fine manner divides the search procedure into two consecutive stages and meanwhile resolves the inconsistency. We evaluate our method on 10 public datasets from Medical Segmentation Decalthon (MSD) challenge, and achieve state-of-the-art performance with the network searched using one dataset, which demonstrates the effectiveness and generalization of our searched models.

22.Learning Semantic Neural Tree for Human Parsing ⬇️

The majority of existing human parsing methods formulate the task as semantic segmentation, which regard each semantic category equally and fail to exploit the intrinsic physiological structure of human body, resulting in inaccurate results. In this paper, we design a novel semantic neural tree for human parsing, which uses a tree architecture to encode physiological structure of human body, and designs a coarse to fine process in a cascade manner to generate accurate results. Specifically, the semantic neural tree is designed to segment human regions into multiple semantic subregions (e.g., face, arms, and legs) in a hierarchical way using a new designed attention routing module. Meanwhile, we introduce the semantic aggregation module to combine multiple hierarchical features to exploit more context information for better performance. Our semantic neural tree can be trained in an end-to-end fashion by standard stochastic gradient descent (SGD) with back-propagation. Several experiments conducted on four challenging datasets for both single and multiple human parsing, i.e., LIP, PASCAL-Person-Part, CIHP and MHP-v2, demonstrate the effectiveness of the proposed method. Code can be found at this https URL.

23.Divide and Conquer: an Accurate Machine Learning Algorithm to Process Split Videos on a Parallel Processing Infrastructure ⬇️

Every day the number of traffic cameras in cities rapidly increase and huge amount of video data are generated. Parallel processing infrastruture, such as Hadoop, and programming models, such as MapReduce, are being used to promptly process that amount of data. The common approach for video processing by using Hadoop MapReduce is to process an entire video on only one node, however, in order to avoid parallelization problems, such as load imbalance, we propose to process videos by splitting it into equal parts and processing each resulting chunk on a different node. We used some machine learning techniques to detect and track the vehicles. However, video division may produce inaccurate results. To solve this problem we proposed a heuristic algorithm to avoid process a vehicle in more than one chunk.

24.Line Drawings of Natural Scenes Guide Visual Attention ⬇️

Visual search is an important strategy of the human visual system for fast scene perception. The guided search theory suggests that the global layout or other top-down sources of scenes play a crucial role in guiding object searching. In order to verify the specific roles of scene layout and regional cues in guiding visual attention, we executed a psychophysical experiment to record the human fixations on line drawings of natural scenes with an eye-tracking system in this work. We collected the human fixations of ten subjects from 498 natural images and of another ten subjects from the corresponding 996 human-marked line drawings of boundaries (two boundary maps per image) under free-viewing condition. The experimental results show that with the absence of some basic features like color and luminance, the distribution of the fixations on the line drawings has a high correlation with that on the natural images. Moreover, compared to the basic cues of regions, subjects pay more attention to the closed regions of line drawings which are usually related to the dominant objects of the scenes. Finally, we built a computational model to demonstrate that the fixation information on the line drawings can be used to significantly improve the performances of classical bottom-up models for fixation prediction in natural scenes. These results support that Gestalt features and scene layout are important cues for guiding fast visual object searching.

25.High Resolution Millimeter Wave Imaging For Self-Driving Cars ⬇️

Recent years have witnessed much interest in expanding the use of networking signals beyond communication to sensing, localization, robotics, and autonomous systems. This paper explores how we can leverage recent advances in 5G millimeter wave (mmWave) technology for imaging in self-driving cars. Specifically, the use of mmWave in 5G has led to the creation of compact phased arrays with hundreds of antenna elements that can be electronically steered. Such phased arrays can expand the use of mmWave beyond vehicular communications and simple ranging sensors to a full-fledged imaging system that enables self-driving cars to see through fog, smog, snow, etc. Unfortunately, using mmWave signals for imaging in self-driving cars is challenging due to the very low resolution, the presence of fake artifacts resulting from multipath reflections and the absence of portions of the car due to specularity. This paper presents HawkEye, a system that can enable high resolution mmWave imaging in self driving cars. HawkEye addresses the above challenges by leveraging recent advances in deep learning known as Generative Adversarial Networks (GANs). HawkEye introduces a GAN architecture that is customized to mmWave imaging and builds a system that can significantly enhance the quality of mmWave images for self-driving cars.

26.Deep Exemplar Networks for VQA and VQG ⬇️

In this paper, we consider the problem of solving semantic tasks such as Visual Question Answering' (VQA), where one aims to answers related to an image and Visual Question Generation' (VQG), where one aims to generate a natural question pertaining to an image. Solutions for VQA and VQG tasks have been proposed using variants of encoder-decoder deep learning based frameworks that have shown impressive performance. Humans however often show generalization by relying on exemplar based approaches. For instance, the work by Tversky and Kahneman suggests that humans use exemplars when making categorizations and decisions. In this work, we propose the incorporation of exemplar based approaches towards solving these problems. Specifically, we incorporate exemplar based approaches and show that an exemplar based module can be incorporated in almost any of the deep learning architectures proposed in the literature and the addition of such a block results in improved performance for solving these tasks. Thus, just as the incorporation of attention is now considered de facto useful for solving these tasks, similarly, incorporating exemplars also can be considered to improve any proposed architecture for solving this task. We provide extensive empirical analysis for the same through various architectures, ablations, and state of the art comparisons.

27.LS-Net: Fast Single-Shot Line-Segment Detector ⬇️

In low-altitude Unmanned Aerial Vehicle (UAV) flights, power lines are considered as one of the most threatening hazards and one of the most difficult obstacles to avoid. In recent years, many vision-based techniques have been proposed to detect power lines to facilitate self-driving UAVs and automatic obstacle avoidance. However, most of the proposed methods are typically based on a common three-step approach: (i) edge detection, (ii) the Hough transform, and (iii) spurious line elimination based on power line constrains. These approaches not only are slow and inaccurate but also require a huge amount of effort in post-processing to distinguish between power lines and spurious lines. In this paper, we introduce LS-Net, a fast single-shot line-segment detector, and apply it to power line detection. The LS-Net is by design fully convolutional and consists of three modules: (i) a fully convolutional feature extractor, (ii) a classifier, and (iii) a line segment regressor. Due to the unavailability of large datasets with annotations of power lines, we render synthetic images of power lines using the Physically Based Rendering (PBR) approach and propose a series of effective data augmentation techniques to generate more training data. With a customized version of the VGG-16 network as the backbone, the proposed approach outperforms existing state-of-the-art approaches. In addition, the LS-Net can detect power lines in near real-time (20.4 FPS). This suggests that our proposed approach has a promising role in automatic obstacle avoidance and as a valuable component of self-driving UAVs, especially for automatic autonomous power line inspection.

28.secml: A Python Library for Secure and Explainable Machine Learning ⬇️

We present secml, an open-source Python library for secure and explainable machine learning. It implements the most popular attacks against machine learning, including not only test-time evasion attacks to generate adversarial examples against deep neural networks, but also training-time poisoning attacks against support vector machines and many other algorithms. These attacks enable evaluating the security of learning algorithms and of the corresponding defenses under both white-box and black-box threat models. To this end, secml provides built-in functions to compute security evaluation curves, showing how quickly classification performance decreases against increasing adversarial perturbations of the input data. secml also includes explainability methods to help understand why adversarial attacks succeed against a given model, by visualizing the most influential features and training prototypes contributing to each decision. It is distributed under the Apache License 2.0, and hosted at this https URL.

29.Automated and Network Structure Preserving Segmentation of Optical Coherence Tomography Angiograms ⬇️

Optical coherence tomography angiography (OCTA) is a novel non-invasive imaging modality for the visualisation of microvasculature in vivo. OCTA has encountered broad adoption in retinal research. OCTA potential in the assessment of pathological conditions and the reproducibility of studies relies on the quality of the image analysis. However, automated segmentation of parafoveal OCTA images is still an open problem in the field. In this study, we generate the first open dataset of retinal parafoveal OCTA images with associated ground truth manual segmentations. Furthermore, we establish a standard for OCTA image segmentation by surveying a broad range of state-of-the-art vessel enhancement and binarisation procedures. We provide the most comprehensive comparison of these methods under a unified framework to date. Our results show that, for the set of images considered, the U-Net machine learning (ML) architecture achieves the best performance with a Dice similarity coefficient of 0.89. For applications where manually segmented data is not available to retrain this ML approach, our findings suggest that optimal oriented flux is the best handcrafted filter enhancement method for OCTA images from those considered. Furthermore, we report on the importance of preserving network connectivity in the segmentation to enable vascular network phenotyping. We introduce a new metric for network connectivity evaluations in segmented angiograms and report an accuracy of up to 0.94 in preserving the morphological structure of the network in our segmentations. Finally, we release our data and source code to support standardisation efforts in OCTA image segmentation.

30.Attributed Relational SIFT-based Regions Graph (ARSRG): concepts and applications ⬇️

Graphs are widely adopted tools for encoding information. Generally, they are applied to disparate research fields where data needs to be represented in terms of local and spatial connections. In this context, a structure for ditigal image representation, called Attributed Relational SIFT-based Regions Graph (ARSRG), previously introduced, is presented. ARSRG has not been explored in detail in previous works and for this reason the goal is to investigate unknown aspects. The study is divided into two parts. A first, theoretical, introducing formal definitions, not yet specified previously, with purpose to clarify its structural configuration. A second, experimental, which provides fundamental elements about its adaptability and flexibility regarding different applications. The theoretical vision combined with the experimental one shows how the structure is adaptable to image representation including contents of different nature.

31.HiLLoC: Lossless Image Compression with Hierarchical Latent Variable Models ⬇️

We make the following striking observation: fully convolutional VAE models trained on 32x32 ImageNet can generalize well, not just to 64x64 but also to far larger photographs, with no changes to the model. We use this property, applying fully convolutional models to lossless compression, demonstrating a method to scale the VAE-based 'Bits-Back with ANS' algorithm for lossless compression to large color photographs, and achieving state of the art for compression of full size ImageNet images. We release Craystack, an open source library for convenient prototyping of lossless compression using probabilistic models, along with full implementations of all of our compression results.

32.A Calibration Scheme for Non-Line-of-Sight Imaging Setups ⬇️

The recent years have given rise to a large number of techniques for "looking around corners", i.e., for reconstructing occluded objects from time-resolved measurements of indirect light reflections off a wall. While the direct view of cameras is routinely calibrated in computer vision applications, the calibration of non-line-of-sight setups has so far relied on manual measurement of the most important dimensions (device positions, wall position and orientation, etc.). In this paper, we propose a semi-automatic method for calibrating such systems that relies on mirrors as known targets. A roughly determined initialization is refined in order to optimize a spatio-temporal consistency. Our system is general enough to be applicable to a variety of sensing scenarios ranging from single sources/detectors via scanning arrangements to large-scale arrays. It is robust towards bad initialization and the achieved accuracy is proportional to the depth resolution of the camera system. We demonstrate this capability with a real-world setup and despite a large number of dead pixels and very low temporal resolution achieve a result that outperforms a manual calibration.

33.Heterogeneous tissue characterization using ultrasound: a comparison of fractal analysis backscatter models on liver tumors ⬇️

Assessing tumor tissue heterogeneity via ultrasound has recently been suggested for predicting early response to treatment. The ultrasound backscattering characteristics can assist in better understanding the tumor texture by highlighting local concentration and spatial arrangement of tissue scatterers. However, it is challenging to quantify the various tissue heterogeneities ranging from fine-to-coarse of the echo envelope peaks in tumor texture. Local parametric fractal features extracted via maximum likelihood estimation from five well-known statistical model families are evaluated for the purpose of ultrasound tissue characterization. The fractal dimension (self-similarity measure) was used to characterize the spatial distribution of scatterers, while the Lacunarity (sparsity measure) was applied to determine scatterer number density. Performance was assessed based on 608 cross-sectional clinical ultrasound RF images of liver tumors (230 and 378 demonstrating respondent and non-respondent cases, respectively). Crossvalidation via leave-one-tumor-out and with different k-folds methodologies using a Bayesian classifier were employed for validation. The fractal properties of the backscattered echoes based on the Nakagami model (Nkg) and its extend four-parameter Nakagami-generalized inverse Gaussian (NIG) distribution achieved best results - with nearly similar performance - for characterizing liver tumor tissue. Accuracy, sensitivity and specificity for the Nkg/NIG were: 85.6%/86.3%, 94.0%/96.0%, and 73.0%/71.0%, respectively. Other statistical models, such as the Rician, Rayleigh, and K-distribution were found to not be as effective in characterizing the subtle changes in tissue texture as an indication of response to treatment. Employing the most relevant and practical statistical model could have potential consequences for the design of an early and effective clinical therapy.

34.Transfer Learning with Edge Attention for Prostate MRI Segmentation ⬇️

Prostate cancer is one of the common diseases in men, and it is the most common malignant tumor in developed countries. Studies have shown that the male prostate incidence rate is as high as 2.5% to 16%, Currently, the inci-dence of prostate cancer in Asia is lower than that in the West, but it is increas-ing rapidly. If prostate cancer can be found as early as possible and treated in time, it will have a high survival rate. Therefore, it is of great significance for the diagnosis and treatment of prostate cancer. In this paper, we propose a trans-fer learning method based on deep neural network for prostate MRI segmenta-tion. In addition, we design a multi-level edge attention module using wavelet decomposition to overcome the difficulty of ambiguous boundary in prostate MRI segmentation tasks. The prostate images were provided by MICCAI Grand Challenge-Prostate MR Image Segmentation 2012 (PROMISE 12) challenge dataset.

35.Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks ⬇️

The success of deep neural networks in many real-world applications is leading to new challenges in building more efficient architectures. One effective way of making networks more efficient is neural network compression. We provide an overview of existing neural network compression methods that can be used to make neural networks more efficient by changing the architecture of the network. First, we introduce a new way to categorize all published compression methods, based on the amount of data and compute needed to make the methods work in practice. These are three 'levels of compression solutions'. Second, we provide a taxonomy of tensor factorization based and probabilistic compression methods. Finally, we perform an extensive evaluation of different compression techniques from the literature for models trained on ImageNet. We show that SVD and probabilistic compression or pruning methods are complementary and give the best results of all the considered methods. We also provide practical ways to combine them.

36.Triple Generative Adversarial Networks ⬇️

Generative adversarial networks (GANs) have shown promise in image generation and classification given limited supervision. Existing methods extend the unsupervised GAN framework to incorporate supervision heuristically. Specifically, a single discriminator plays two incompatible roles of identifying fake samples and predicting labels and it only estimates the data without considering the labels. The formulation intrinsically causes two problems: (1) the generator and the discriminator (i.e., the classifier) may not converge to the data distribution at the same time; and (2) the generator cannot control the semantics of the generated samples. In this paper, we present the triple generative adversarial network (Triple-GAN), which consists of three players---a generator, a classifier, and a discriminator. The generator and the classifier characterize the conditional distributions between images and labels, and the discriminator solely focuses on identifying fake image-label pairs. We design compatible objective functions to ensure that the distributions characterized by the generator and the classifier converge to the data distribution. We evaluate Triple-GAN in two challenging settings, namely, semi-supervised learning and the extreme low data regime. In both settings, Triple-GAN can achieve state-of-the-art classification results among deep generative models and generate meaningful samples in a specific class simultaneously.

37.A Comprehensive Study and Comparison of Core Technologies for MPEG 3D Point Cloud Compression ⬇️

Point cloud based 3D visual representation is becoming popular due to its ability to exhibit the real world in a more comprehensive and immersive way. However, under a limited network bandwidth, it is very challenging to communicate this kind of media due to its huge data volume. Therefore, the MPEG have launched the standardization for point cloud compression (PCC), and proposed three model categories, i.e., TMC1, TMC2, and TMC3. Because the 3D geometry compression methods of TMC1 and TMC3 are similar, TMC1 and TMC3 are further merged into a new platform namely TMC13. In this paper, we first introduce some basic technologies that are usually used in 3D point cloud compression, then review the encoder architectures of these test models in detail, and finally analyze their rate distortion performance as well as complexity quantitatively for different cases (i.e., lossless geometry and lossless color, lossless geometry and lossy color, lossy geometry and lossy color) by using 16 benchmark 3D point clouds that are recommended by MPEG. Experimental results demonstrate that the coding efficiency of TMC2 is the best on average (especially for lossy geometry and lossy color compression) for dense point clouds while TMC13 achieves the optimal coding performance for sparse and noisy point clouds with lower time complexity.

38.Understanding Deep Neural Network Predictions for Medical Imaging Applications ⬇️

Computer-aided detection has been a research area attracting great interest in the past decade. Machine learning algorithms have been utilized extensively for this application as they provide a valuable second opinion to the doctors. Despite several machine learning models being available for medical imaging applications, not many have been implemented in the real-world due to the uninterpretable nature of the decisions made by the network. In this paper, we investigate the results provided by deep neural networks for the detection of malaria, diabetic retinopathy, brain tumor, and tuberculosis in different imaging modalities. We visualize the class activation mappings for all the applications in order to enhance the understanding of these networks. This type of visualization, along with the corresponding network performance metrics, would aid the data science experts in better understanding of their models as well as assisting doctors in their decision-making process.

39.Non-congruent non-degenerate curves with identical signatures ⬇️

We construct examples of non-congruent, non-degenerate simple planar closed curves with identical Euclidean signatures, thus disproving a claim made in Hickman (J. Math Imaging Vis. 43:206-213, 2012) that all such curves must be congruent. Our examples include closed $C^\infty$ curves of the same length and the same symmetry group. We show a general mechanism for constructing such examples by exploiting the self-intersection points of the signature. We state an updated congruence criterion for simple closed non-degenerate curves and confirm that for curves with simple signatures the claim made by Hickman holds.

40.Interactive Open-Ended Learning for 3D Object Recognition ⬇️

The thesis contributes in several important ways to the research area of 3D object category learning and recognition. To cope with the mentioned limitations, we look at human cognition, in particular at the fact that human beings learn to recognize object categories ceaselessly over time. This ability to refine knowledge from the set of accumulated experiences facilitates the adaptation to new environments. Inspired by this capability, we seek to create a cognitive object perception and perceptual learning architecture that can learn 3D object categories in an open-ended fashion. In this context, ``open-ended'' implies that the set of categories to be learned is not known in advance, and the training instances are extracted from actual experiences of a robot, and thus become gradually available, rather than being available since the beginning of the learning process. In particular, this architecture provides perception capabilities that will allow robots to incrementally learn object categories from the set of accumulated experiences and reason about how to perform complex tasks. This framework integrates detection, tracking, teaching, learning, and recognition of objects. An extensive set of systematic experiments, in multiple experimental settings, was carried out to thoroughly evaluate the described learning approaches. Experimental results show that the proposed system is able to interact with human users, learn new object categories over time, as well as perform complex tasks. The contributions presented in this thesis have been fully implemented and evaluated on different standard object and scene datasets and empirically evaluated on different robotic platforms.

41.UrbanLoco: A Full Sensor Suite Dataset for Mapping and Localization in Urban Scenes ⬇️

Mapping and localization is a critical module of autonomous driving, and significant achievements have been reached in this field. Beyond Global Navigation Satellite System (GNSS), research in point cloud registration, visual feature matching, and inertia navigation has greatly enhanced the accuracy and robustness of mapping and localization in different scenarios. However, highly urbanized scenes are still challenging: LIDAR- and camera-based methods perform poorly with numerous dynamic objects; the GNSS-based solutions experience signal loss and multipath problems; the inertia measurement units (IMU) suffer from drifting. Unfortunately, current public datasets either do not adequately address this urban challenge or do not provide enough sensor information related to mapping and localization. Here we present UrbanLoco: a mapping/localization dataset collected in highly-urbanized environments with a full sensor-suite. The dataset includes 13 trajectories collected in San Francisco and Hong Kong, covering a total length of over 40 kilometers. Our dataset includes a wide variety of urban terrains: urban canyons, bridges, tunnels, sharp turns, etc. More importantly, our dataset includes information from LIDAR, cameras, IMU, and GNSS receivers. Now the dataset is publicly available through the link in the footnote. Dataset Link: this https URL.

42.An Application of Generative Adversarial Networks for Super Resolution Medical Imaging ⬇️

Acquiring High Resolution (HR) Magnetic Resonance (MR) images requires the patient to remain still for long periods of time, which causes patient discomfort and increases the probability of motion induced image artifacts. A possible solution is to acquire low resolution (LR) images and to process them with the Super Resolution Generative Adversarial Network (SRGAN) to create an HR version. Acquiring LR images requires a lower scan time than acquiring HR images, which allows for higher patient comfort and scanner throughput. This work applies SRGAN to MR images of the prostate to improve the in-plane resolution by factors of 4 and 8. The term 'super resolution' in the context of this paper defines the post processing enhancement of medical images as opposed to 'high resolution' which defines native image resolution acquired during the MR acquisition phase. We also compare the SRGAN to three other models: SRCNN, SRResNet, and Sparse Representation. While the SRGAN results do not have the best Peak Signal to Noise Ratio (PSNR) or Structural Similarity (SSIM) metrics, they are the visually most similar to the original HR images, as portrayed by the Mean Opinion Score (MOS) results.

43.Anisotropic Super Resolution in Prostate MRI using Super Resolution Generative Adversarial Networks ⬇️

Acquiring High Resolution (HR) Magnetic Resonance (MR) images requires the patient to remain still for long periods of time, which causes patient discomfort and increases the probability of motion induced image artifacts. A possible solution is to acquire low resolution (LR) images and to process them with the Super Resolution Generative Adversarial Network (SRGAN) to create a super-resolved version. This work applies SRGAN to MR images of the prostate and performs three experiments. The first experiment explores improving the in-plane MR image resolution by factors of 4 and 8, and shows that, while the PSNR and SSIM (Structural SIMilarity) metrics are lower than the isotropic bicubic interpolation baseline, the SRGAN is able to create images that have high edge fidelity. The second experiment explores anisotropic super-resolution via synthetic images, in that the input images to the network are anisotropically downsampled versions of HR images. This experiment demonstrates the ability of the modified SRGAN to perform anisotropic super-resolution, with quantitative image metrics that are comparable to those of the anisotropic bicubic interpolation baseline. Finally, the third experiment applies a modified version of the SRGAN to super-resolve anisotropic images obtained from the through-plane slices of the volumetric MR data. The output super-resolved images contain a significant amount of high frequency information that make them visually close to their HR counterparts. Overall, the promising results from each experiment show that super-resolution for MR images is a successful technique and that producing isotropic MR image volumes from anisotropic slices is an achievable goal.