1.Beyond Human Parts: Dual Part-Aligned Representations for Person Re-Identification ⬇️
Person re-identification is a challenging task due to various complex factors. Recent studies have attempted to integrate human parsing results or externally defined attributes to help capture human parts or important object regions. On the other hand, there still exist many useful contextual cues that do not fall into the scope of predefined human parts or attributes. In this paper, we address the missed contextual cues by exploiting both the accurate human parts and the coarse non-human parts. In our implementation, we apply a human parsing model to extract the binary human part masks *and* a self-attention mechanism to capture the soft latent (non-human) part masks. We verify the effectiveness of our approach with new state-of-the-art performance on three challenging benchmarks: Market-1501, DukeMTMC-reID and CUHK03. Our implementation is available at this https URL.
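As a rough illustration of the dual-part idea, the sketch below pools backbone features with binary parsing masks for the accurate human parts and with a self-attention map for the latent non-human parts; module names, dimensions, and pooling choices are our assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class DualPartAligned(nn.Module):
    """Sketch: combine accurate human-part masks with latent (non-human)
    part masks discovered by self-attention. Illustrative only."""
    def __init__(self, channels, embed_dim=128):
        super().__init__()
        self.query = nn.Conv2d(channels, embed_dim, 1)
        self.key = nn.Conv2d(channels, embed_dim, 1)

    def forward(self, feat, part_masks):
        # feat: (B, C, H, W) backbone features
        # part_masks: (B, P, H, W) binary masks from a human parsing model
        B, C, H, W = feat.shape
        # Accurate human-part descriptors: mask-weighted average pooling.
        m = part_masks.flatten(2)                                      # (B, P, HW)
        m = m / (m.sum(-1, keepdim=True) + 1e-6)
        human_parts = torch.bmm(m, feat.flatten(2).transpose(1, 2))    # (B, P, C)
        # Latent (non-human) parts: soft masks via self-attention.
        q = self.query(feat).flatten(2)                                # (B, E, HW)
        k = self.key(feat).flatten(2)                                  # (B, E, HW)
        attn = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (B, HW, HW)
        latent = torch.bmm(attn, feat.flatten(2).transpose(1, 2))      # (B, HW, C)
        latent = latent.mean(1, keepdim=True)                          # pooled latent part
        return torch.cat([human_parts, latent], dim=1)                 # (B, P+1, C)
```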
2.Torchreid: A Library for Deep Learning Person Re-Identification in Pytorch ⬇️
Person re-identification (re-ID), which aims to re-identify people across different camera views, has been significantly advanced by deep learning in recent years, particularly with convolutional neural networks (CNNs). In this paper, we present Torchreid, a software library built on PyTorch that allows fast development and end-to-end training and evaluation of deep re-ID models. As a general-purpose framework for person re-ID research, Torchreid provides (1) unified data loaders that support 15 commonly used re-ID benchmark datasets covering both image and video domains, (2) streamlined pipelines for quick development and benchmarking of deep re-ID models, and (3) implementations of the latest re-ID CNN architectures along with their pre-trained models to facilitate reproducibility as well as future research. With a high-level modularity in its design, Torchreid offers great flexibility, allowing easy extension to new datasets, CNN models and loss functions.
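A typical workflow looks like the following sketch, based on the library's documented API (argument names may differ across Torchreid versions):

```python
import torchreid

# Unified data loader over a standard re-ID benchmark.
datamanager = torchreid.data.ImageDataManager(
    root='reid-data', sources='market1501',
    height=256, width=128, batch_size_train=32, batch_size_test=100
)

# Build a CNN with ImageNet-pretrained weights.
model = torchreid.models.build_model(
    name='resnet50', num_classes=datamanager.num_train_pids,
    loss='softmax', pretrained=True
)

optimizer = torchreid.optim.build_optimizer(model, optim='adam', lr=0.0003)
scheduler = torchreid.optim.build_lr_scheduler(
    optimizer, lr_scheduler='single_step', stepsize=20
)

# Streamlined end-to-end train/eval pipeline.
engine = torchreid.engine.ImageSoftmaxEngine(
    datamanager, model, optimizer=optimizer, scheduler=scheduler
)
engine.run(save_dir='log/resnet50', max_epoch=60, eval_freq=10, test_only=False)
```

Here `engine.run` handles the training and evaluation loop end to end; swapping `sources` or `name` is all that is needed to benchmark another dataset or architecture.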
3.Gaze360: Physically Unconstrained Gaze Estimation in the Wild ⬇️
Understanding where people are looking is an informative social cue. In this work, we present Gaze360, a large-scale gaze-tracking dataset and method for robust 3D gaze estimation in unconstrained images. Our dataset consists of 238 subjects in indoor and outdoor environments with labelled 3D gaze across a wide range of head poses and distances. It is the largest publicly available dataset of its kind in both subject count and variety, made possible by a simple and efficient collection method. Our proposed 3D gaze model extends existing models to include temporal information and to directly output an estimate of gaze uncertainty. We demonstrate the benefits of our model via an ablation study, and show its generalization performance via a cross-dataset evaluation against other recent gaze benchmark datasets. We furthermore propose a simple self-supervised approach to improve cross-dataset domain adaptation. Finally, we demonstrate an application of our model for estimating customer attention in a supermarket setting. Our dataset and models are available at this http URL.
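The abstract does not spell out how the gaze uncertainty is trained; one standard choice for producing an uncertainty estimate alongside a regression output is the quantile (pinball) loss, sketched below as an assumption rather than the paper's exact formulation:

```python
import torch

def pinball_loss(pred, sigma, target, tau=0.1):
    """Quantile (pinball) loss for regression with an uncertainty output.
    pred, sigma, target: (B, 3) gaze vectors plus a predicted spread.
    Generic sketch, not necessarily the paper's formulation."""
    q_lo, q_hi = pred - sigma, pred + sigma   # lower/upper quantile estimates
    e_lo, e_hi = target - q_lo, target - q_hi
    # Standard pinball penalty for the tau and (1 - tau) quantiles.
    loss_lo = torch.maximum(tau * e_lo, (tau - 1) * e_lo)
    loss_hi = torch.maximum((1 - tau) * e_hi, -tau * e_hi)
    return (loss_lo + loss_hi).mean()
```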
4.Predictive Coding Networks Meet Action Recognition ⬇️
Action recognition is a key problem in computer vision that labels videos with a set of predefined actions. Capturing both semantic content and motion along the video frames is key to achieving high accuracy on this task. Most state-of-the-art methods rely on RGB frames for extracting the semantics and on pre-computed optical flow fields as a motion cue; the two are then combined using deep neural networks. Yet, it has been argued that such models are not able to leverage the motion information extracted from the optical flow, and that the optical flow instead allows for better recognition of people and objects in the video. This underscores the need to explore different cues or models that can extract motion in a more informative fashion. To tackle this issue, we propose to explore the predictive coding network, the so-called PredNet, a recurrent neural network that propagates predictive coding errors across layers and time steps. We analyze whether PredNet can better capture motion in videos by estimating, over time, the representations extracted from pre-trained networks for action recognition. In this way, the model relies only on the video frames and does not need pre-processed optical flow as input. We report the effectiveness of our proposed model on the UCF101 and HMDB51 datasets.
5.Attacking Optical Flow ⬇️
Deep neural nets achieve state-of-the-art performance on the problem of optical flow estimation. Since optical flow is used in several safety-critical applications like self-driving cars, it is important to gain insights into the robustness of those techniques. Recently, it has been shown that adversarial attacks easily fool deep neural networks to misclassify objects. The robustness of optical flow networks to adversarial attacks, however, has not been studied so far. In this paper, we extend adversarial patch attacks to optical flow networks and show that such attacks can compromise their performance. We show that corrupting a small patch of less than 1% of the image size can significantly affect optical flow estimates. Our attacks lead to noisy flow estimates that extend significantly beyond the region of the attack, in many cases even completely erasing the motion of objects in the scene. While networks using an encoder-decoder architecture are very sensitive to these attacks, we found that networks using a spatial pyramid architecture are less affected. We analyse the success and failure of attacking both architectures by visualizing their feature maps and comparing them to classical optical flow techniques which are robust to these attacks. We also demonstrate that such attacks are practical by placing a printed pattern into real scenes.
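A minimal sketch of such a patch attack might look as follows, assuming a differentiable PyTorch flow estimator with a `flow_net(img1, img2) -> flow` interface (an illustrative assumption):

```python
import torch

def attack_flow(flow_net, frames, patch, pos, steps=100, lr=1e-2):
    """Sketch of an adversarial patch attack on an optical flow network.
    Any differentiable flow estimator in PyTorch would do here."""
    patch = patch.clone().requires_grad_(True)
    opt = torch.optim.Adam([patch], lr=lr)
    img1, img2 = frames
    with torch.no_grad():
        clean_flow = flow_net(img1, img2)
    y, x = pos
    h, w = patch.shape[-2:]
    for _ in range(steps):
        a1, a2 = img1.clone(), img2.clone()
        a1[..., y:y+h, x:x+w] = patch   # paste the same patch into both frames
        a2[..., y:y+h, x:x+w] = patch
        adv_flow = flow_net(a1, a2)
        # Maximize end-point error w.r.t. the clean prediction.
        loss = -(adv_flow - clean_flow).norm(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            patch.clamp_(0, 1)          # keep the patch a valid image
    return patch.detach()
```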
6.Unsupervised particle sorting for high-resolution single-particle cryo-EM ⬇️
Single-particle cryo-Electron Microscopy (EM) has become a popular technique for determining the structure of challenging biomolecules that are inaccessible to other technologies. Recent advances in automation, both in data collection and data processing, have significantly lowered the barrier for non-expert users to successfully execute the structure determination workflow. Many critical data processing steps, however, still require expert user intervention in order to converge to the correct high-resolution structure. In particular, strategies to identify homogeneous populations of particles rely heavily on subjective criteria that are not always consistent or reproducible among different users. Here, we explore the use of unsupervised strategies for particle sorting that are compatible with the autonomous operation of the image processing pipeline. More specifically, we show that particles can be successfully sorted based on a simple statistical model for the distribution of scores assigned during refinement. This represents an important step towards the development of automated workflows for protein structure determination using single-particle cryo-EM.
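For illustration, one simple instance of such a statistical model is a two-component mixture fitted to the per-particle refinement scores; the choice of a Gaussian mixture here is our assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def sort_particles(scores):
    """Sketch: fit a simple statistical model (here a two-component Gaussian
    mixture, our assumption) to per-particle refinement scores and keep the
    particles assigned to the higher-scoring mode."""
    scores = np.asarray(scores).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
    labels = gmm.predict(scores)
    good = int(np.argmax(gmm.means_.ravel()))  # component with higher mean score
    return labels == good                      # boolean keep-mask over particles

# Example usage:
# keep = sort_particles(refinement_scores); particles = particles[keep]
```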
7.Human Action Recognition in Drone Videos using a Few Aerial Training Examples ⬇️
Drones are enabling new forms of human action surveillance due to their low cost and fast mobility. However, using deep neural networks for automatic aerial action recognition is difficult due to the enormous number of aerial human action videos needed for training. Collecting a large set of human action aerial videos is costly, time-consuming and difficult. In this paper, we explore two alternative data sources to improve aerial action classification when only a few training aerial examples are available. As a first data source, we resort to video games: we collect a large number of paired ground and aerial videos of human actions from video games. For the second data source, we generate discriminative fake aerial examples using conditional Wasserstein Generative Adversarial Networks. We integrate features from both game action videos and fake aerial examples with the few available aerial training examples using disjoint multitask learning. We validate the proposed approach on several aerial action datasets and demonstrate that game videos and generated fake aerial examples can be extremely useful for improved action recognition in real aerial videos when only a few aerial training examples are available.
8.Towards Automatic Annotation for Semantic Segmentation in Drone Videos ⬇️
Semantic segmentation is a crucial task for robot navigation and safety. However, it requires huge amounts of pixelwise annotations to yield accurate results. While recent progress in computer vision algorithms has been heavily boosted by large ground-level datasets, the labeling time has hampered progress in low altitude UAV applications, mostly due to the difficulty imposed by large object scales and pose variations. Motivated by the lack of a large video aerial dataset, we introduce a new one, with high resolution (4K) images and manually-annotated dense labels every 50 frames. To help the video labeling process, we make an important step towards automatic annotation and propose SegProp, an iterative flow-based method with geometric constraints to propagate the semantic labels to frames that lack human annotations. This results in a dataset with more than 50k annotated frames - the largest of its kind, to the best of our knowledge. Our experiments show that SegProp surpasses current state-of-the-art label propagation methods by a significant margin. Furthermore, when training a semantic segmentation deep neural net using the automatically annotated frames, we obtain a compelling overall performance boost at test time of 16.8% mean F-measure over a baseline trained only with manually-labeled frames.
Our Ruralscapes dataset, the label propagation code and a fast segmentation tool are available at our website: this https URL
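A bare-bones sketch of flow-based label propagation in the spirit of SegProp (without its iterative geometric constraints) might look like this; `flow_fn` is an assumed wrapper around any dense optical flow estimator:

```python
import numpy as np
import cv2

def propagate_labels(label_a, frames, flow_fn):
    """Sketch: propagate a label map from an annotated frame through a list
    of subsequent frames via backward warping along dense optical flow.
    flow_fn(img_from, img_to) -> (H, W, 2) flow is an assumed interface,
    e.g. a wrapper around cv2.calcOpticalFlowFarneback."""
    labels = [label_a]
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = flow_fn(nxt, prev)   # flow from the next frame back to the previous
        h, w = prev.shape[:2]
        grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
        # Backward warp: each pixel in the next frame samples the previous
        # label map at its corresponding location.
        map_x = (grid_x + flow[..., 0]).astype(np.float32)
        map_y = (grid_y + flow[..., 1]).astype(np.float32)
        warped = cv2.remap(labels[-1], map_x, map_y, interpolation=cv2.INTER_NEAREST)
        labels.append(warped)
    return labels[1:]
```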
9.Vehicle detection and counting from VHR satellite images: efforts and open issues ⬇️
Detection of new infrastructure (commercial, logistics, industrial or residential) from satellite images constitutes a proven method to investigate and follow economic and urban growth. The level of activity or exploitation of these sites can hardly be determined by building inspection, but can be inferred from vehicle presence in nearby streets and parking lots. We present in this paper two deep learning-based models for vehicle counting from optical satellite images coming from the Pleiades sensor at 50-cm spatial resolution. Both segmentation (Tiramisu) and detection (YOLO) architectures were investigated. These networks were adapted, trained and validated on a data set including 87k vehicles, annotated using an interactive semi-automatic tool developed by the authors. Experimental results show that both segmentation and detection models achieve a precision higher than 85% with high recall (76.4% and 71.9% for Tiramisu and YOLO, respectively).
10.Deep Set-to-Set Matching and Learning ⬇️
Matching two sets of items, known as the set-to-set matching problem, has recently been raised. The difficulty of set-to-set matching over ordinary data matching lies in exchangeability: both the pair of sets and the items within each set should be exchangeable in 1) set-feature extraction and 2) the set-matching score. In this paper, we propose a deep learning architecture for set-to-set matching that overcomes these difficulties via two novel modules: 1) a cross-set transformation and 2) a cross-similarity function. The former provides exchangeable set features through interactions between the two sets in intermediate layers, and the latter provides exchangeable set matching by calculating the cross-feature similarity of items between the two sets. We evaluate the methods through experiments on two industrial applications: fashion set recommendation and group re-identification. Through these experiments, we show that the proposed methods perform better than a baseline given by an extension of the Set Transformer, the state-of-the-art set-input function.
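For illustration, a minimal exchangeable cross-similarity score could be sketched as below; this ignores the paper's cross-set transformation and learned aggregation, which we do not attempt to reproduce:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossSimilarity(nn.Module):
    """Sketch of an exchangeable set-matching score: the cross-feature
    similarity of items between two sets, aggregated symmetrically so that
    swapping the sets (or permuting items) leaves the score unchanged."""
    def forward(self, set_a, set_b):
        # set_a: (B, N, D), set_b: (B, M, D) item features
        a = F.normalize(set_a, dim=-1)
        b = F.normalize(set_b, dim=-1)
        sim = torch.bmm(a, b.transpose(1, 2))  # (B, N, M) item-to-item similarities
        # Averaging over all cross pairs is invariant to item order and to
        # exchanging the two sets.
        return sim.mean(dim=(1, 2))            # (B,) matching score
```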
11.4-Connected Shift Residual Networks ⬇️
The shift operation was recently introduced as an alternative to spatial convolutions. The operation moves subsets of activations horizontally and/or vertically. Spatial convolutions are then replaced with shift operations followed by point-wise convolutions, significantly reducing computational costs. In this work, we investigate how shifts should best be applied to high accuracy CNNs. We apply shifts of two different neighbourhood groups to ResNet on ImageNet: the originally introduced 8-connected (8C) neighbourhood shift and the less well studied 4-connected (4C) neighbourhood shift. We find that when replacing ResNet's spatial convolutions with shifts, both shift neighbourhoods give equal ImageNet accuracy, showing the sufficiency of small neighbourhoods for large images. Interestingly, when incorporating shifts to all point-wise convolutions in residual networks, 4-connected shifts outperform 8-connected shifts. Such a 4-connected shift setup gives the same accuracy as full residual networks while reducing the number of parameters and FLOPs by over 40%. We then highlight that without spatial convolutions, ResNet's downsampling/upsampling bottleneck channel structure is no longer needed. We show a new, 4C shift-based residual network, much shorter than the original ResNet yet with a higher accuracy for the same computational cost. This network is the highest accuracy shift-based network yet shown, demonstrating the potential of shifting in deep neural networks.
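A minimal 4-connected shift can be sketched as follows; the equal channel grouping and zero padding are common conventions, assumed rather than taken from the paper:

```python
import torch

def shift_4c(x):
    """Sketch of a 4-connected shift: split channels into five groups and
    move four of them one pixel up/down/left/right (the fifth stays put),
    zero-padding at the borders."""
    g = x.shape[1] // 5
    out = torch.zeros_like(x)
    out[:, 0*g:1*g, :-1, :] = x[:, 0*g:1*g, 1:, :]   # shift up
    out[:, 1*g:2*g, 1:, :]  = x[:, 1*g:2*g, :-1, :]  # shift down
    out[:, 2*g:3*g, :, :-1] = x[:, 2*g:3*g, :, 1:]   # shift left
    out[:, 3*g:4*g, :, 1:]  = x[:, 3*g:4*g, :, :-1]  # shift right
    out[:, 4*g:] = x[:, 4*g:]                        # centre group unshifted
    return out

# In a shift layer, shift_4c is followed by a 1x1 (point-wise) convolution,
# which replaces the spatial convolution.
```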
12.Weakly-Supervised Completion Moment Detection using Temporal Attention ⬇️
Monitoring the progression of an action towards completion offers fine-grained insight into the actor's behaviour. In this work, we target detecting the completion moment of actions, that is, the moment when the action's goal has been successfully accomplished. This has potential applications from surveillance to assistive living and human-robot interactions. Previous efforts required human annotations of the completion moment for training (i.e. full supervision). In this work, we present an approach for moment detection from weak video-level labels. Given both complete and incomplete sequences of the same action, we learn temporal attention, along with accumulated completion prediction from all frames in the sequence. We also demonstrate how the approach can be used when completion moment supervision is available. We evaluate and compare our approach on actions from three datasets, namely HMDB, UCF101 and RGBD-AC, and show that temporal attention improves detection in both weakly-supervised and fully-supervised settings.
13.WeatherNet: Recognising weather and visual conditions from street-level images using deep residual learning ⬇️
Extracting information related to weather and visual conditions at a given time and space is indispensable for scene awareness, which strongly impacts our behaviours, from simply walking in a city to riding a bike, driving a car, or autonomous driver assistance. Despite the significance of this subject, it has not yet been fully addressed by machine intelligence relying on deep learning and computer vision to detect the multiple labels of weather and visual conditions with a unified method that can easily be used in practice. What has been achieved to date are rather sectorial models that address a limited number of labels and do not cover the wide spectrum of weather and visual conditions; moreover, weather and visual conditions are often addressed individually. In this paper, we introduce a novel framework to automatically extract this information from street-level images relying on deep learning and computer vision, using a unified method without any pre-defined constraints on the processed images. A pipeline of four deep Convolutional Neural Network (CNN) models, so-called WeatherNet, is trained, relying on residual learning using the ResNet50 architecture, to extract various weather and visual conditions: dawn/dusk, day, and night for time detection; glare for lighting conditions; and clear, rainy, snowy, and foggy for weather conditions. WeatherNet shows strong performance in extracting this information from user-defined images or video streams, with applications including, but not limited to, autonomous vehicles and driver-assistance systems, behaviour tracking, safety-related research, and better understanding cities through images for policy-makers.
14.Hetero-Center Loss for Cross-Modality Person Re-Identification ⬇️
Cross-modality person re-identification is a challenging problem that retrieves a given pedestrian image in RGB modality among gallery images in infrared modality. The task addresses the limitation of RGB-based person re-ID in dark environments. Existing research mainly focuses on enlarging inter-class feature differences to solve the problem. However, few studies investigate improving intra-class cross-modality similarity, which is important for this issue. In this paper, we propose a novel loss function, called Hetero-Center loss (HC loss), to reduce intra-class cross-modality variations. Specifically, HC loss supervises the network to learn cross-modality invariant information by constraining the intra-class center distance between the two heterogeneous modalities. With the joint supervision of Cross-Entropy (CE) loss and HC loss, the network is trained to achieve two vital objectives: inter-class discrepancy and intra-class cross-modality similarity. Besides, we propose a simple and high-performance network architecture to learn local feature representations for cross-modality person re-identification, which can serve as a baseline for future research. Extensive experiments indicate the effectiveness of the proposed methods, which outperform state-of-the-art methods by a wide margin.
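A minimal sketch of the HC loss idea, with batch composition and weighting left as assumptions:

```python
import torch

def hetero_center_loss(feat_rgb, feat_ir, labels_rgb, labels_ir):
    """Sketch of the Hetero-Center (HC) loss idea: for each identity, pull
    the RGB-modality feature center and the infrared-modality feature center
    together. Minimal version; batch composition details are assumptions."""
    loss, count = 0.0, 0
    for c in labels_rgb.unique():
        rgb_c = feat_rgb[labels_rgb == c]
        ir_c = feat_ir[labels_ir == c]
        if len(rgb_c) == 0 or len(ir_c) == 0:
            continue  # identity not present in both modalities in this batch
        center_rgb = rgb_c.mean(dim=0)
        center_ir = ir_c.mean(dim=0)
        loss = loss + (center_rgb - center_ir).pow(2).sum()
        count += 1
    return loss / max(count, 1)

# Total objective: cross_entropy + lambda * hetero_center_loss(...)
```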
15.Structure Matters: Towards Generating Transferable Adversarial Images ⬇️
Recent works on adversarial examples for image classification focus on directly modifying pixels with minor perturbations. The small-perturbation requirement is imposed to ensure that the generated adversarial examples remain natural and realistic to humans, which, however, restricts the attack space, limiting attack ability and transferability, especially for systems protected by a defense mechanism. In this paper, we propose the novel concepts of structure patterns and structure-aware perturbations that relax the small-perturbation constraint while still keeping images natural. The key idea of our approach is to allow perceptible deviation in adversarial examples while keeping the structure patterns that are central to a human classifier. Built upon these concepts, we propose a *structure-preserving attack (SPA)* for generating natural adversarial examples with extremely high transferability. Empirical results on the MNIST and CIFAR10 datasets show that SPA adversarial images can easily bypass strong PGD-based adversarial training and are still effective against SPA-based adversarial training. Further, they transfer well to other target models with little or no loss in attack success rate, thus exhibiting competitive black-box attack performance. Our code is available at this https URL.
16.A low-power end-to-end hybrid neuromorphic framework for surveillance applications ⬇️
With the success of deep learning, object recognition systems that can be deployed for real-world applications are becoming commonplace. However, inference that needs to take place largely at the 'edge' (not processed on servers) is a highly computational and memory-intensive workload, making it intractable for low-power mobile nodes and remote security applications. To address this challenge, this paper proposes a low-power (5W) end-to-end neuromorphic framework for object tracking and classification using event-based cameras, which possess desirable properties such as low power consumption (5-14 mW) and high dynamic range (120 dB). Unlike traditional event-by-event processing, this work uses a mixed frame-and-event approach to obtain energy savings with high performance. Using a frame-based region proposal method based on the density of foreground events, hardware-friendly object tracking is implemented using the apparent object velocity while tackling occlusion scenarios. For low-power classification of the tracked objects, the event camera is interfaced to IBM TrueNorth, which is time-multiplexed to tackle up to eight instances for a traffic monitoring application. The frame-based object track input is converted back to spikes for TrueNorth classification via the energy-efficient deep network (EEDN) pipeline. Using originally collected datasets, we train the TrueNorth model on the hardware track outputs, instead of using ground truth object locations as commonly done, and demonstrate the efficacy of our system in handling practical surveillance scenarios. Finally, we compare the proposed methodologies to state-of-the-art event-based systems for object tracking and classification, and demonstrate the use case of our neuromorphic approach for low-power applications without sacrificing performance.
17.J Regularization Improves Imbalanced Multiclass Segmentation ⬇️
We propose a new loss formulation to further advance the multiclass segmentation of cluttered cells under weakly supervised conditions.
We improve the separation of touching and adjacent cells, obtaining sharp segmentation boundaries with high adequacy, when we add Youden's $J$ statistic as a regularization term to the cross entropy loss. This regularization intrinsically handles class imbalance, eliminating the need to explicitly use weights to balance training. Simulations demonstrate this capability and show how the regularization leads to better results by helping advance the optimization when cross entropy stalls.
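Concretely, a soft multiclass relaxation of this objective might look like the sketch below; relaxing $J$ from hard counts to soft probabilities and the weight `lam` are our assumptions, not necessarily the paper's formulation:

```python
import torch
import torch.nn.functional as F

def j_regularized_ce(logits, target, num_classes, lam=1.0, eps=1e-6):
    """Sketch: cross entropy plus a soft per-class Youden's J penalty,
    J = sensitivity + specificity - 1, computed from soft predictions."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)                      # (B, C, H, W)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    tp = (probs * onehot).sum(dims)
    fn = ((1 - probs) * onehot).sum(dims)
    tn = ((1 - probs) * (1 - onehot)).sum(dims)
    fp = (probs * (1 - onehot)).sum(dims)
    j = tp / (tp + fn + eps) + tn / (tn + fp + eps) - 1.0
    return ce + lam * (1.0 - j.mean())   # maximize J by penalizing (1 - J)
```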
We build upon our previous work on multiclass segmentation by adding yet another training class representing gaps between adjacent cells.
This addition helps the classifier identify narrow gaps as background and no longer as touching regions.
We present results of our methods for 2D and 3D images, from bright field to confocal stacks containing different types of cells, and we show that they accurately segment individual cells after training with a limited number of annotated images, some of which are poorly annotated.
18.Self-Correction for Human Parsing ⬇️
Labeling pixel-level masks for fine-grained semantic segmentation tasks, e.g. human parsing, remains a challenging task. The ambiguous boundaries between different semantic parts, and categories with similar appearance, are often confusing, leading to unexpected noise in ground truth masks. To tackle the problem of learning with label noise, this work introduces a purification strategy, called Self-Correction for Human Parsing (SCHP), to progressively promote the reliability of the supervised labels as well as of the learned models. In particular, starting from a model trained with inaccurate annotations as initialization, we design a cyclical learning scheduler to infer more reliable pseudo-masks by iteratively aggregating the currently learned model with the former optimal one in an online manner. Besides, the correspondingly corrected labels can in turn further boost model performance. In this way, the models and the labels reciprocally become more robust and accurate during the self-correction learning cycles. Benefiting from the superiority of SCHP, we achieve the best performance on two popular single-person human parsing benchmarks, the LIP and Pascal-Person-Part datasets. Our overall system ranks 1st in the CVPR 2019 LIP Challenge. Code is available at this https URL.
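One self-correction cycle could be sketched as follows; the momentum-style weight aggregation and the `infer_fn` helper are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def schp_cycle(model, ema_model, loader, optimizer, infer_fn, momentum=0.9):
    """Sketch of one self-correction cycle: train on the current labels,
    aggregate the new weights into a running 'optimal' model, then use it
    to re-infer more reliable pseudo-masks for the next cycle."""
    model.train()
    for images, masks in loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), masks)
        loss.backward()
        optimizer.step()
    # Online aggregation of the current model with the former optimal one.
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(momentum).add_(p, alpha=1 - momentum)
    # Refresh pseudo-masks with the aggregated model for the next cycle.
    new_masks = [infer_fn(ema_model, images) for images, _ in loader]
    return new_masks
```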
19.A Review of Visual Trackers and Analysis of its Application to Mobile Robot ⬇️
Computer vision has received significant attention in recent years and is one of the key means by which robots obtain information about the external environment. Visual trackers can provide the necessary physical and environmental parameters for a mobile robot, and their performance is tied to the actual application of the robot. This study provides a comprehensive survey of visual trackers. Following a brief introduction, we first analyze the basic framework and difficulties of visual trackers. Then the structure of generative and discriminative methods is introduced, and the feature descriptors, modeling methods, and learning methods used in trackers are summarized. We then review and evaluate the state-of-the-art progress on discriminative trackers from three directions: correlation filters, deep learning, and convolutional features. Finally, we analyze the research directions of visual trackers for mobile robots and outline future trends for visual trackers on mobile robots.
20.Assessment of the Local Tchebichef Moments Method for Texture Classification by Fine Tuning Extraction Parameters ⬇️
In this paper we use machine learning to study the application of Local Tchebichef Moments (LTM) to the problem of texture classification. The original LTM method was proposed by Mukundan (2014).
The LTM method can be used for texture analysis in many different ways, either using the moment values directly, or more simply creating a relationship between the moment values of different orders, producing a histogram similar to those of Local Binary Pattern (LBP) based methods. The original method was not fully tested with large datasets, and there are several parameters that should be characterised for performance. Among these parameters are the kernel size, the moment orders and the weights for each moment.
We implemented the LTM method in a flexible way in order to allow for the modification of the parameters that can affect its performance. Using four subsets from the Outex dataset (a popular benchmark for texture analysis), we used Random Forests to create models and to classify texture images, recording the standard metrics for each classifier. We repeated the process using several variations of the LBP method for comparison. This allowed us to find the best combination of orders and weights for the LTM method for texture classification.
21.Drivers Drowsiness Detection using Condition-Adaptive Representation Learning Framework ⬇️
We propose a condition-adaptive representation learning framework for driver drowsiness detection based on a 3D deep convolutional neural network. The proposed framework consists of four models: spatio-temporal representation learning, scene condition understanding, feature fusion, and drowsiness detection. The spatio-temporal representation learning extracts features that can describe motions and appearances in video simultaneously. The scene condition understanding classifies conditions related to the driver and the driving situation, such as whether glasses are worn, the illumination conditions while driving, and the motion of facial elements such as the head, eyes, and mouth. The feature fusion generates a condition-adaptive representation using the two features extracted from the above models. The detection model recognizes the driver's drowsiness status using the condition-adaptive representation. The condition-adaptive representation learning framework can extract features that are more discriminative for each scene condition than a general representation, so the drowsiness detection method can provide more accurate results across various driving situations. The proposed framework is evaluated on the NTHU Drowsy Driver Detection video dataset. The experimental results show that our framework outperforms existing drowsiness detection methods based on visual analysis.
22.Notable Site Recognition on Mobile Devices using Deep Learning with Crowd-sourced Imagery ⬇️
In this work we design a mobile system that is able to automatically recognise sites of interest and project relevant information to a user navigating the city. First, we build a collection of notable sites to bootstrap our system using Wikipedia. We then exploit online services such as Google Images and Flickr to collect large collections of crowdsourced imagery describing those sites. These images are then used to train minimal deep learning architectures that can effectively be transmitted and deployed to mobile devices, becoming accessible to users through a dedicated application. Through an evaluation with a series of online and real-world experiments, we identify a number of key challenges that make the successful deployment of a site recognition system difficult, and highlight the importance of incorporating mobile contextual information to facilitate the visual recognition task. Similarity in the feature maps of objects that undergo identification, the presence of noise in crowdsourced imagery, and arbitrary user-induced inputs are among the factors that impede correct classification for deep learning models. We show how curating the training data through the application of a class-specific image denoising method, and incorporating information such as user location, orientation and attention patterns, allows for significant improvement in classification accuracy, yielding a system that can effectively be used to recognise sites in the wild two out of three times.
23.CPWC: Contextual Point Wise Convolution for Object Recognition ⬇️
Convolutional layers are a major driving force behind the successes of deep learning. Pointwise convolution (PWC) is a 1x1 convolutional filter primarily used for parameter reduction. However, PWC ignores the spatial information around the points it processes. This design is deliberate, in order to reduce the overall parameters and computations. We hypothesize, however, that this shortcoming of PWC has a significant impact on network performance. We propose an alternative design for pointwise convolution that uses spatial information from the input efficiently. Our design significantly improves the performance of networks without substantially increasing the number of parameters and computations. We experimentally show that our design results in significant improvement in network performance for classification as well as detection.
24.The SWAX Benchmark: Attacking Biometric Systems with Wax Figures ⬇️
A face spoofing attack occurs when an intruder attempts to impersonate someone who carries a gainful authentication clearance. It is a trending topic due to the increasing demand for biometric authentication on mobile devices, in high-security areas, and elsewhere. This work introduces a new database named the Sense Wax Attack dataset (SWAX), comprising real human and wax figure images and videos geared towards the problem of face spoofing detection. The dataset consists of more than 1800 face images and 110 videos of 55 people/waxworks, arranged in training, validation and test sets with a large range of expression, illumination and pose variations. Experiments performed with baseline methods show that, despite the progress in recent years, face spoofing detection methods are still vulnerable to high-quality violation attempts.
25.Conquering the CNN Over-Parameterization Dilemma: A Volterra Filtering Approach for Action Recognition ⬇️
The importance of inference in Machine Learning (ML) has led to an explosive number of different proposals, particularly in Deep Learning. In an attempt to reduce the complexity of Convolutional Neural Networks, we propose a Volterra filter-inspired network architecture. This architecture introduces controlled non-linearities in the form of interactions between delayed input samples of the data. We propose a cascaded implementation of the Volterra filter so as to significantly reduce the number of parameters required to carry out the same classification task as a conventional neural network. We demonstrate an efficient parallel implementation of this new Volterra network, along with its remarkable performance while retaining a relatively simple and potentially more tractable structure. Furthermore, we show a rather sophisticated adaptation of this network to nonlinearly fuse the RGB (spatial) information and the optical flow (temporal) information of a video sequence for action recognition. The proposed approach is evaluated on the UCF-101 and HMDB-51 datasets for action recognition, and is shown to outperform the state of the art when trained on the datasets from scratch (i.e. without pre-training on a larger dataset).
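A second-order Volterra-style layer can be sketched as below; the low-rank factorization of the quadratic kernel is our simplification of the paper's cascaded design:

```python
import torch
import torch.nn as nn

class Volterra2D(nn.Module):
    """Sketch of a second-order Volterra layer: a linear convolution plus
    controlled multiplicative interactions between shifted copies of the
    input, approximating y = W1*x + x^T W2 x over a local window."""
    def __init__(self, in_ch, out_ch, k=3, rank=2):
        super().__init__()
        self.linear = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        # Each quadratic term is a product of two learned linear projections,
        # a common low-rank surrogate for the full second-order kernel.
        self.quad_a = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for _ in range(rank))
        self.quad_b = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for _ in range(rank))

    def forward(self, x):
        y = self.linear(x)
        for a, b in zip(self.quad_a, self.quad_b):
            y = y + a(x) * b(x)   # second-order interaction term
        return y
```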
26.Establishing an Evaluation Metric to Quantify Climate Change Image Realism ⬇️
With success on controlled tasks, generative models are being increasingly applied to humanitarian applications [1,2]. In this paper, we focus on the evaluation of a conditional generative model that illustrates the consequences of climate change-induced flooding to encourage public interest and awareness on the issue. Because metrics for comparing the realism of different modes in a conditional generative model do not exist, we propose several automated and human-based methods for evaluation. To do this, we adapt several existing metrics, and assess the automated metrics against gold standard human evaluation. We find that using Fréchet Inception Distance (FID) with embeddings from an intermediary Inception-V3 layer that precedes the auxiliary classifier produces results most correlated with human realism. While insufficient alone to establish a human-correlated automatic evaluation metric, we believe this work begins to bridge the gap between human and automated generative evaluation procedures.
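For reference, the FID computation itself is standard; only the choice of the Inception-V3 layer supplying `feats_*` differs in the paper's proposal:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Frechet distance between two Gaussians fitted to feature activations.
    feats_*: (N, D) arrays from some Inception-V3 layer; the paper uses an
    intermediary layer preceding the auxiliary classifier rather than the
    usual pool3 features."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # numerical noise can introduce tiny imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))
```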
27.The Practicality of Stochastic Optimization in Imaging Inverse Problems ⬇️
In this work we investigate the practicality of stochastic gradient descent, and recently introduced variants with variance-reduction techniques, in imaging inverse problems. Such algorithms have been shown in the machine learning literature to have optimal complexities in theory, and to provide great improvement empirically over deterministic gradient methods. Surprisingly, in some tasks such as image deblurring, many such methods fail to converge faster than accelerated deterministic gradient methods, even in terms of epoch counts. We investigate this phenomenon and propose a theory-inspired mechanism to characterize whether an inverse problem is better solved by stochastic optimization techniques. We derive conditions on the structure of the inverse problem for it to be a suitable application of stochastic gradient methods, using standard tools in numerical linear algebra. Based on our analysis, we provide practitioners with convenient ways to examine whether they should use stochastic gradient methods or classical deterministic gradient methods to solve a given inverse problem. Our results also provide guidance on appropriately choosing minibatch partition schemes. Finally, we propose an accelerated primal-dual SGD algorithm to tackle another key bottleneck of stochastic optimization: the heavy computation of proximal operators. The proposed method has a fast convergence rate in practice, and is able to efficiently handle non-smooth regularization terms coupled with linear operators.
28.Image processing in DNA ⬇️
The main obstacles for the practical deployment of DNA-based data storage platforms are the prohibitively high cost of synthetic DNA and the large number of errors introduced during synthesis. In particular, synthetic DNA products contain both individual oligo (fragment) symbol errors as well as missing DNA oligo errors, with rates that exceed those of modern storage systems by orders of magnitude. These errors can be corrected either through the use of a large number of redundant oligos or through cycles of writing, reading, and rewriting of information that eliminate the errors. Both approaches add to the overall storage cost and are hence undesirable. Here we propose the first method for storing quantized images in DNA that uses signal processing and machine learning techniques to deal with error and cost issues without resorting to the use of redundant oligos or rewriting. Our methods rely on decoupling the RGB channels of images, performing specialized quantization and compression on the individual color channels, and using new discoloration detection and image inpainting techniques. We demonstrate the performance of our approach experimentally on a collection of movie posters stored in DNA.
29.Scanner Invariant Multiple Sclerosis Lesion Segmentation from MRI ⬇️
This paper presents a simple and effective generalization method for magnetic resonance imaging (MRI) segmentation when data is collected from multiple MRI scanning sites and as a consequence is affected by (site-)domain shifts. We propose to integrate a traditional encoder-decoder network with a regularization network. This added network includes an auxiliary loss term which is responsible for the reduction of the domain shift problem and for the resulting improved generalization. The proposed method was evaluated on multiple sclerosis lesion segmentation from MRI data. We tested the proposed model on an in-house clinical dataset including 117 patients from 56 different scanning sites. In the experiments, our method showed better generalization performance than other baseline networks.
30.Image recovery from rotational and translational invariants ⬇️
We introduce a framework for recovering an image from its rotationally and translationally invariant features based on autocorrelation analysis. This work is an instance of the multi-target detection statistical model, which is mainly used to study the mathematical and computational properties of single-particle reconstruction using cryo-electron microscopy (cryo-EM) at low signal-to-noise ratios. We demonstrate with synthetic numerical experiments that an image can be reconstructed from rotationally and translationally invariant features and show that the reconstruction is robust to noise. These results constitute an important step towards the goal of structure determination of small biomolecules using cryo-EM.
31.Learning Adaptive Regularization for Image Labeling Using Geometric Assignment ⬇️
We study the inverse problem of model parameter learning for pixelwise image labeling, using the linear assignment flow and training data with ground truth. This is accomplished by a Riemannian gradient flow on the manifold of parameters that determine the regularization properties of the assignment flow. Using the symplectic partitioned Runge-Kutta method for numerical integration, it is shown that deriving the sensitivity conditions of the parameter learning problem and its discretization commute. A convenient property of our approach is that learning is based on exact inference. Carefully designed experiments demonstrate the performance of our approach, the expressiveness of the mathematical model as well as its limitations, from the viewpoint of statistical learning and optimal control.
32.A Locating Model for Pulmonary Tuberculosis Diagnosis in Radiographs ⬇️
Objective: We propose an end-to-end CNN-based locating model for pulmonary tuberculosis (TB) diagnosis in radiographs. This model makes full use of the chest radiograph (X-ray) for its improved accessibility, reduced cost and high accuracy for TB disease. Methods: Several specialized improvements are proposed for the detection task in the medical field. A false positive (FP) restrictor head is introduced for FP reduction. Anchor-oriented network heads are proposed in the position regression section. An optimization of the loss function is designed for hard example mining. Results: The experimental results show that when the threshold of intersection over union (IoU) is set to 0.3, the average precision (AP) on two test data sets provided by different hospitals reaches 0.9023 and 0.9332. Ablation experiments show that hard example mining and the change of regressor heads contribute most in this work, but FP restriction is necessary in a CAD diagnosis system. Conclusion: The results prove the high precision and good generalization ability of our proposed model compared to previous works. Significance: We are the first to make full use of the feature extraction ability of CNNs in the TB diagnostic field and to explore the localization of TB, whereas previous works focus on the weaker task of healthy-sick subject classification.
33.Fixed Pattern Noise Reduction for Infrared Images Based on Cascade Residual Attention CNN ⬇️
Existing fixed pattern noise reduction (FPNR) methods are easily affected by the motion state of the scene and the working condition of the image sensor, which leads to over-smoothing effects, ghosting artifacts, and slow convergence. To address these issues, we design an innovative cascade convolutional neural network (CNN) model with residual skip connections to realize single-frame blind FPNR operation without any parameter tuning. Moreover, a coarse-fine convolution (CF-Conv) unit is introduced to extract complementary features at various scales and fuse them to capture more spatial information. Inspired by the success of the visual attention mechanism, we further propose a particular spatial-channel noise attention unit (SCNAU) to separate scene details from fixed pattern noise more thoroughly and recover the real scene more accurately. Experimental results on test data demonstrate that the proposed cascade CNN-FPNR method outperforms existing FPNR methods in both visual effect and quantitative assessment.
34.Towards best practice in explaining neural network decisions with LRP ⬇️
Within the last decade, neural network based predictors have demonstrated impressive - and at times super-human - capabilities. This performance is often paid for with an opaque prediction process, and has thus sparked numerous contributions in the novel field of explainable artificial intelligence (XAI). In this paper, we focus on a popular and widely used method of XAI, Layer-wise Relevance Propagation (LRP). Since its initial proposal, LRP has evolved as a method, and a best practice for applying it has tacitly emerged, based on humanly observed evidence. We investigate - and for the first time quantify - the effect of this current best practice on feedforward neural networks in a visual object detection setting. The results verify that the current, layer-dependent approach to LRP applied in recent literature better represents the model's reasoning, and at the same time increases the object localization and class discriminativity of LRP.
35.Improving Siamese Networks for One Shot Learning using Kernel Based Activation functions ⬇️
The lack of a large amount of training data has always been a constraining factor in solving many problems in machine learning, making One Shot Learning one of the most intriguing ideas in the field. It aims to learn information about object categories from one, or only a few, training examples. In deep learning, this is usually accomplished by a proper objective function, i.e., the loss function, and embedding extraction, i.e., the architecture. In this paper, we discuss metric-based deep learning architectures for one shot learning, such as Siamese neural networks, and present a method to improve their accuracy using Kafnets (kernel-based non-parametric activation functions for neural networks) by learning proper embeddings with relatively fewer epochs. Using kernel activation functions, we are able to achieve strong results that exceed those of ReLU-based deep learning models in terms of embedding structure, loss convergence, and accuracy.
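A minimal kernel activation function in the spirit of Kafnets might look as follows; the dictionary range, bandwidth rule, and initialization are assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

class KAF(nn.Module):
    """Sketch of a kernel activation function: each feature's activation is
    a learned mixture of Gaussian kernels over a fixed dictionary of points."""
    def __init__(self, num_features, dict_size=20, boundary=3.0):
        super().__init__()
        d = torch.linspace(-boundary, boundary, dict_size)
        self.register_buffer('dict', d)
        # Learnable per-feature mixing coefficients.
        self.alpha = nn.Parameter(torch.randn(num_features, dict_size) * 0.1)
        # Bandwidth derived from dictionary spacing (an illustrative rule).
        self.gamma = (1.0 / (2 * (d[1] - d[0]) ** 2)).item()

    def forward(self, x):
        # x: (B, num_features) -> kernel expansion (B, num_features, dict_size)
        k = torch.exp(-self.gamma * (x.unsqueeze(-1) - self.dict) ** 2)
        return (k * self.alpha).sum(-1)
```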
36.Robust Training with Ensemble Consensus ⬇️
Since deep neural networks are over-parametrized, they may memorize noisy examples. We address this memorization issue in the presence of annotation noise. From the fact that deep neural networks cannot generalize to neighborhoods of the features acquired via memorization, we find that noisy examples do not consistently incur small losses on the network in the presence of perturbation. Based on this, we propose a novel training method called Learning with Ensemble Consensus (LEC), whose goal is to prevent overfitting to noisy examples by eliminating those identified via the consensus of an ensemble of perturbed networks. One of the proposed LECs, LTEC, outperforms the current state-of-the-art methods on MNIST, CIFAR-10, and CIFAR-100 despite its memory efficiency.
37.Trident Segmentation CNN: A Spatiotemporal Transformation CNN for Punctate White Matter Lesions Segmentation in Preterm Neonates ⬇️
Accurate segmentation of punctate white matter lesions (PWML) in preterm neonates by an automatic algorithm can better assist doctors in diagnosis. However, existing algorithms have many limitations, such as low detection accuracy and large resource consumption. In this paper, a novel spatiotemporal transformation deep learning method called Trident Segmentation CNN (TS-CNN) is proposed to segment PWML in MR images. It can convert spatial information into temporal information, which reduces the consumption of computing resources. Furthermore, a new improved training loss called Self-balancing Focal Loss (SBFL) is proposed to balance the loss during the training process. The whole model is evaluated on a dataset of 704 MR images. Overall, the method achieves a median DSC, sensitivity, specificity, and Hausdorff distance of 0.6355, 0.7126, 0.9998, and 24.5836 mm, which outperforms the state-of-the-art algorithm. (The code is now available on this https URL)
38.Face representation by deep learning: a linear encoding in a parameter space? ⬇️
Recently, Convolutional Neural Networks (CNNs) have achieved tremendous performance on face recognition, and one popular perspective regarding CNNs' success is that CNNs could learn discriminative face representations from face images with complex image feature encoding. However, it is still unclear what the intrinsic mechanism of face representation in CNNs is. In this work, we investigate this problem by formulating face images as points in a shape-appearance parameter space, and our results demonstrate that: (i) The encoding and decoding of the neuron responses (representations) to face images in CNNs could be achieved under a linear model in the parameter space, in agreement with the recent discovery in primate IT face neurons, but different from the aforementioned perspective on CNNs' face representation with complex image feature encoding; (ii) The linear model for face encoding and decoding in the parameter space could achieve performance on face recognition and verification close to or even better than state-of-the-art CNNs, which might shed new light on design strategies for face recognition systems; (iii) The neuron responses to face images in CNNs could not be adequately modelled by the axis model, a model recently proposed for face modelling in primate IT cortex. All these results might shed some light on the often-criticized black-box nature behind CNNs' tremendous performance on face recognition.
39.Single Versus Union: Non-parallel Support Vector Machine Frameworks ⬇️
Considering the classification problem, we summarize nonparallel support vector machines with nonparallel hyperplanes into two types of frameworks. The first type constructs the hyperplanes separately: it solves a series of small optimization problems to obtain a series of hyperplanes, but it is hard to measure the loss of each sample. The other type constructs all the hyperplanes simultaneously, solving one big optimization problem with the ascertained loss of each sample. We give the characteristics of each framework and compare them carefully. In addition, based on the second framework, we construct a max-min distance-based nonparallel support vector machine for the multiclass classification problem, called NSVM. It constructs hyperplanes with a large distance margin by solving an optimization problem. Experimental results on benchmark data sets and human face databases show the advantages of our NSVM.
40.Convolutional Prototype Learning for Zero-Shot Recognition ⬇️
Zero-shot learning (ZSL) has received increasing attention in recent years, especially in areas of fine-grained object recognition, retrieval, and image captioning. The key to ZSL is to transfer knowledge from the seen to the unseen classes via auxiliary class attribute vectors. However, the popularly learned projection functions in previous works cannot generalize well since they assume distribution consistency between seen and unseen domains at the sample level. Besides, the provided non-visual and unique class attributes can significantly degrade recognition performance in semantic space. In this paper, we propose a simple yet effective convolutional prototype learning (CPL) framework for zero-shot recognition. By assuming distribution consistency at the task level, our CPL is capable of transferring knowledge smoothly to recognize unseen samples. Furthermore, inside each task, discriminative visual prototypes are learned via a distance-based training mechanism. Consequently, we can perform recognition in visual space, instead of semantic space. An extensive set of experiments is then carefully designed and presented, demonstrating that CPL is more effective than currently available alternatives under various settings.
41.Penalizing small errors using an Adaptive Logarithmic Loss ⬇️
Loss functions are error metrics that quantify the difference between a prediction and its corresponding ground truth. Fundamentally, they define a functional landscape for traversal by gradient descent. Although numerous loss functions have been proposed to date in order to handle various machine learning problems, little attention has been given to enhancing these functions to better traverse the loss landscape. In this paper, we simultaneously and significantly mitigate two prominent problems in medical image segmentation namely: i) class imbalance between foreground and background pixels and ii) poor loss function convergence. To this end, we propose an adaptive logarithmic loss function. We compare this loss function with the existing state-of-the-art on the ISIC 2018 dataset, the nuclei segmentation dataset as well as the DRIVE retinal vessel segmentation dataset. We measure the performance of our methodology on benchmark metrics and demonstrate state-of-the-art performance. More generally, we show that our system can be used as a framework for better training of deep neural networks.
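Since the abstract does not give the loss's exact form, the sketch below only illustrates the general idea of a logarithmic penalty that keeps gradients meaningful for small errors, here applied to a soft Dice score; all constants are illustrative:

```python
import torch

def adaptive_log_loss(pred, target, smooth=1.0, gamma=0.3, eps=1e-7):
    """Hedged sketch: a logarithmic penalty on the soft Dice score so that
    small residual errors still receive a meaningful gradient. The exact
    adaptive form in the paper is not given in the abstract; the smoothing
    term and exponent here are illustrative choices."""
    pred = torch.sigmoid(pred).flatten(1)
    target = target.flatten(1).float()
    inter = (pred * target).sum(1)
    dice = (2 * inter + smooth) / (pred.sum(1) + target.sum(1) + smooth)
    # -log amplifies the penalty near dice = 1 compared to the linear
    # (1 - dice) penalty, so small errors are not ignored.
    neg_log = -torch.log(dice.clamp(min=eps))
    return (neg_log ** gamma).mean()
```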
42.A deep active learning system for species identification and counting in camera trap images ⬇️
Biodiversity conservation depends on accurate, up-to-date information about wildlife population distributions. Motion-activated cameras, also known as camera traps, are a critical tool for population surveys, as they are cheap and non-intrusive. However, extracting useful information from camera trap images is a cumbersome process: a typical camera trap survey may produce millions of images that require slow, expensive manual review. Consequently, critical information is often lost due to resource limitations, and critical conservation questions may be answered too slowly to support decision-making. Computer vision is poised to dramatically increase efficiency in image-based biodiversity surveys, and recent studies have harnessed deep learning techniques for automatic information extraction from camera trap images. However, the accuracy of results depends on the amount, quality, and diversity of the data available to train models, and the literature has focused on projects with millions of relevant, labeled training images. Many camera trap projects do not have a large set of labeled images and hence cannot benefit from existing machine learning techniques. Furthermore, even projects that do have labeled data from similar ecosystems have struggled to adopt deep learning methods because image classification models overfit to specific image backgrounds (i.e., camera locations). In this paper, we focus not on automating the labeling of camera trap images, but on accelerating this process. We combine the power of machine intelligence and human intelligence to build a scalable, fast, and accurate active learning system to minimize the manual work required to identify and count animals in camera trap images. Our proposed scheme can match state-of-the-art accuracy on a 3.2 million image dataset with as few as 14,100 manual labels, which means decreasing manual labeling effort by over 99.5%.
43.Discriminative Neural Clustering for Speaker Diarisation ⬇️
This paper proposes a novel method for supervised data clustering. The clustering procedure is modelled by a discriminative sequence-to-sequence neural network that learns from examples. The effectiveness of the Transformer-based Discriminative Neural Clustering (DNC) model is validated on a speaker diarisation task using the challenging AMI data set, where audio segments need to be clustered into an unknown number of speakers. The AMI corpus contains only 147 meetings as training examples for the DNC model, which is very limited for training an encoder-decoder neural network. Data scarcity is mitigated through three data augmentation schemes proposed in this paper, including Diaconis Augmentation, a novel technique proposed for discriminative embeddings trained using cosine similarities. A comparison between DNC and the commonly used spectral clustering algorithm for speaker diarisation shows that the DNC approach outperforms its unsupervised counterpart by 29.4% relative. Furthermore, DNC requires no explicit definition of a similarity measure between samples, which is a significant advantage considering that such a measure might be difficult to specify.
44.Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight ⬇️
We propose a joint simulation and real-world learning framework for mapping navigation instructions and raw first-person observations to continuous control. Our model estimates the need for environment exploration, predicts the likelihood of visiting environment positions during execution, and controls the agent to both explore and visit high-likelihood positions. We introduce Supervised Reinforcement Asynchronous Learning (SuReAL). Learning uses both simulation and real environments without requiring autonomous flight in the physical environment during training, and combines supervised learning for predicting positions to visit and reinforcement learning for continuous control. We evaluate our approach on a natural language instruction-following task with a physical quadcopter, and demonstrate effective execution and exploration behavior.
45.GANspection ⬇️
Generative Adversarial Networks (GANs) have been used extensively and quite successfully for unsupervised learning. As GANs do not approximate an explicit probability distribution, it is interesting to inspect the latent space representations they learn. The current work seeks to push the boundaries of such inspection methods to understand in more detail the manifold being learned by GANs. Various interpolation and extrapolation techniques, along with vector arithmetic, are used to understand the learned manifold. We show through experiments that GANs indeed learn a data probability distribution rather than memorizing images/data. Further, we prove that GANs encode semantically relevant information in the learned probability distribution. The experiments have been performed on two publicly available datasets: Large Scale Scene Understanding (LSUN) and CelebA.
46.Icentia11K: An Unsupervised Representation Learning Dataset for Arrhythmia Subtype Discovery ⬇️
We release the largest public ECG dataset of continuous raw signals for representation learning containing 11 thousand patients and 2 billion labelled beats. Our goal is to enable semi-supervised ECG models to be made as well as to discover unknown subtypes of arrhythmia and anomalous ECG signal events. To this end, we propose an unsupervised representation learning task, evaluated in a semi-supervised fashion. We provide a set of baselines for different feature extractors that can be built upon. Additionally, we perform qualitative evaluations on results from PCA embeddings, where we identify some clustering of known subtypes indicating the potential for representation learning in arrhythmia sub-type discovery.