1.Learning to Train with Synthetic Humans ⬇️
Neural networks need big annotated datasets for training. However, manual annotation can be too expensive or even infeasible for certain tasks, like multi-person 2D pose estimation with severe occlusions. A remedy for this is synthetic data with perfect ground truth. Here we explore two variations of synthetic data for this challenging problem: a dataset with purely synthetic humans and a real dataset augmented with synthetic humans. We then study which approach better generalizes to real data, as well as the influence of virtual humans in the training loss. Using the augmented dataset, without considering synthetic humans in the loss, leads to the best results. We observe that not all synthetic samples are equally informative for training, while the informative samples are different for each training stage. To exploit this observation, we employ an adversarial student-teacher framework: the teacher improves the student by providing the hardest samples for its current state as a challenge. Experiments show that the student-teacher framework outperforms normal training on the purely synthetic dataset.
2.Captioning Near-Future Activity Sequences ⬇️
Most of the existing works on human activity analysis focus on recognition or early recognition of the activity labels from complete or partial observations. Similarly, existing video captioning approaches focus on the observed events in videos. Predicting the labels and the captions of future activities where no frames of the predicted activities have been observed is a challenging problem, with important applications that require anticipatory response. In this work, we propose a system that can infer the labels and the captions of a sequence of future activities. Our proposed network for label prediction of a future activity sequence is similar to a hybrid Siamese network with three branches, where the first branch takes visual features from the objects present in the scene, the second branch takes observed activity features, and the third branch captures the last observed activity features. The predicted labels and the observed scene context are then mapped to meaningful captions using a sequence-to-sequence learning-based method. Experiments on three challenging activity analysis datasets and a video description dataset demonstrate that both our label prediction framework and our captioning framework outperform the state of the art.
3.Effects of Illumination on the Categorization of Shiny Materials ⬇️
The present research was designed to examine how patterns of illumination influence the perceptual categorization of metal, shiny black, and shiny white materials. The stimuli depicted three possible objects that were illuminated by five possible HDRI light maps, which varied in their overall distributions of illuminant directions and intensities. The surfaces included a low roughness chrome material, a shiny black material, and a shiny white material with both diffuse and specular components. Observers rated each stimulus by adjusting four sliders to indicate their confidence that the depicted material was metal, shiny black, shiny white or something else, and these adjustments were constrained so that the sum of all four settings was always 100%. The results revealed that the metal and shiny black categories are easily confused. For example, metal materials with low intensity light maps or a narrow range of illuminant directions are often judged as shiny black, whereas shiny black materials with high intensity light maps or a wide range of illuminant directions are often judged as metal. A spherical harmonic analysis was performed on the different light maps in an effort to quantitatively predict how they would bias observers' judgments of metal and shiny black surfaces.
4.Pothole Detection Based on Disparity Transformation and Road Surface Modeling ⬇️
Pothole detection is one of the most important tasks for road maintenance. Computer vision approaches are generally based on either 2D road image analysis or 3D road surface modeling. However, these two categories are always used independently. Furthermore, the pothole detection accuracy is still far from satisfactory. Therefore, in this paper, we present a robust pothole detection algorithm that is both accurate and computationally efficient. A dense disparity map is first transformed to better distinguish between damaged and undamaged road areas. To achieve greater disparity transformation efficiency, golden section search and dynamic programming are utilized to estimate the transformation parameters. Otsu's thresholding method is then used to extract potential undamaged road areas from the transformed disparity map. The disparities in the extracted areas are modeled by a quadratic surface using least squares fitting. To improve disparity map modeling robustness, the surface normal is also integrated into the surface modeling process. Furthermore, random sample consensus is utilized to reduce the effects caused by outliers. By comparing the difference between the actual and modeled disparity maps, the potholes can be detected accurately. Finally, the point clouds of the detected potholes are extracted from the reconstructed 3D road surface. The experimental results show that the successful detection accuracy of the proposed system is around 98.7% and the overall pixel-level accuracy is approximately 99.6%.
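The surface-modeling step described above can be illustrated with a minimal sketch. It assumes a synthetic disparity map and a ready-made mask of undamaged road pixels; the function names, the threshold, and the sign convention (potholes having smaller disparity than the modeled surface) are hypothetical choices for illustration, and the plain least-squares fit leaves out the surface-normal term and the random sample consensus step mentioned in the abstract.

```python
import numpy as np

def design_matrix(u, v):
    # Columns of a quadratic surface d = c0 + c1*u + c2*v + c3*u^2 + c4*u*v + c5*v^2
    return np.column_stack([np.ones_like(u), u, v, u**2, u * v, v**2])

def fit_quadratic_surface(u, v, d):
    # Ordinary least-squares estimate of the six surface coefficients
    coeffs, *_ = np.linalg.lstsq(design_matrix(u, v), d, rcond=None)
    return coeffs

def pothole_mask(disparity, road_mask, threshold):
    # Fit the surface on pixels assumed undamaged, then flag pixels whose
    # actual disparity falls below the modeled disparity by more than threshold
    h, w = disparity.shape
    vv, uu = np.mgrid[0:h, 0:w].astype(float)
    u, v, d = uu.ravel(), vv.ravel(), disparity.ravel()
    keep = road_mask.ravel()
    coeffs = fit_quadratic_surface(u[keep], v[keep], d[keep])
    modeled = (design_matrix(u, v) @ coeffs).reshape(h, w)
    return (modeled - disparity) > threshold

# Synthetic road: disparity grows smoothly with the image row, plus one dent
h, w = 40, 60
vv, uu = np.mgrid[0:h, 0:w].astype(float)
disparity = 5.0 + 0.5 * vv + 0.001 * vv**2
disparity[20:25, 30:36] -= 2.0            # hypothetical pothole region

road_mask = np.ones((h, w), dtype=bool)
road_mask[20:25, 30:36] = False           # stand-in for the thresholding result

mask = pothole_mask(disparity, road_mask, threshold=1.0)
print(int(mask.sum()), "pixels flagged")
```

In this toy setting the road is exactly quadratic, so the fit is essentially exact; on real disparity maps a robust estimator would be needed to keep outliers from skewing the surface.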
5.An Evaluation of Action Recognition Models on EPIC-Kitchens ⬇️
We benchmark contemporary action recognition models (TSN, TRN, and TSM) on the recently introduced EPIC-Kitchens dataset and release pretrained models on GitHub (this https URL) for others to build upon. In contrast to popular action recognition datasets like Kinetics, Something-Something, UCF101, and HMDB51, EPIC-Kitchens is shot from an egocentric perspective and captures daily actions in situ. In this report, we aim to understand how well these models can tackle the challenges present in this dataset, such as its long-tail class distribution, unseen environment test set, and multiple tasks (verb, noun, and action classification). We discuss the models' shortcomings and avenues for future research.
6.Adversarial Camera Alignment Network for Unsupervised Cross-camera Person Re-identification ⬇️
In person re-identification (Re-ID), supervised methods usually need a large amount of expensive label information, while unsupervised ones are still unable to deliver satisfactory identification performance. In this paper, we introduce a novel person Re-ID task called unsupervised cross-camera person Re-ID, which only needs the within-camera (intra-camera) label information but not cross-camera (inter-camera) labels, which are more expensive to obtain. In real-world applications, the intra-camera label information can be easily captured by tracking algorithms or a few manual annotations. In this situation, the main challenge becomes the distribution discrepancy across different camera views, caused by varying body pose, occlusion, image resolution, illumination conditions, and background noise in different cameras. To address this situation, we propose a novel Adversarial Camera Alignment Network (ACAN) for unsupervised cross-camera person Re-ID. It consists of the camera-alignment task and the supervised within-camera learning task. To achieve the camera alignment, we develop a Multi-Camera Adversarial Learning (MCAL) scheme to map images of different cameras into a shared subspace. In particular, we investigate two different schemes to conduct the multi-camera adversarial task: the existing GRL (i.e., gradient reversal layer) scheme and the proposed scheme called "other camera equiprobability" (OCE). Based on this shared subspace, we then leverage the within-camera labels to train the network. Extensive experiments on five large-scale datasets demonstrate the superiority of ACAN over multiple state-of-the-art unsupervised methods that take advantage of labeled source domains and images generated by GAN-based models. In particular, we verify that the proposed multi-camera adversarial task does contribute to the significant improvement.
7.Distilling Knowledge From a Deep Pose Regressor Network ⬇️
This paper presents a novel method to distill knowledge from a deep pose regressor network for efficient Visual Odometry (VO). Standard distillation relies on "dark knowledge" for successful knowledge transfer. As this knowledge is not available in pose regression and the teacher prediction is not always accurate, we propose to emphasize the knowledge transfer only when we trust the teacher. We achieve this by using the teacher loss as a confidence score which places variable relative importance on the teacher prediction. We inject this confidence score into the main training task via the Attentive Imitation Loss (AIL) and when learning the intermediate representation of the teacher through the Attentive Hint Training (AHT) approach. To the best of our knowledge, this is the first work which successfully distills the knowledge from a deep pose regression network. Our evaluation on the KITTI and Malaga datasets shows that we can keep the student prediction close to the teacher with up to 92.95% parameter reduction and a 2.12x speed-up in computation time.
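The idea of using the teacher loss as a confidence score can be sketched as follows. This is only an illustration under stated assumptions: the batch-wise min-max normalization, the mixing weight `alpha`, and the function name are hypothetical and are not taken from the paper, whose exact formulation of AIL may differ.

```python
import numpy as np

def attentive_imitation_loss(student, teacher, target, alpha=0.5):
    # Per-sample squared error of the teacher against the ground truth
    teacher_err = np.sum((teacher - target) ** 2, axis=1)

    # Assumed confidence: 1 for the most accurate teacher prediction in the
    # batch, 0 for the least accurate one (min-max normalization)
    span = teacher_err.max() - teacher_err.min()
    if span > 0:
        confidence = 1.0 - (teacher_err - teacher_err.min()) / span
    else:
        confidence = np.ones_like(teacher_err)

    # Task term (student vs. ground truth) and imitation term (student vs. teacher)
    task = np.sum((student - target) ** 2, axis=1)
    imitation = np.sum((student - teacher) ** 2, axis=1)

    # The imitation term only counts where the teacher is trusted
    loss = np.mean(alpha * task + (1.0 - alpha) * confidence * imitation)
    return loss, confidence

# Toy batch of two 2D "poses": the teacher is exact on the first, wrong on the second
target = np.array([[0.0, 0.0], [0.0, 0.0]])
teacher = np.array([[0.0, 0.0], [1.0, 1.0]])
student = np.array([[1.0, 0.0], [1.0, 0.0]])

loss, confidence = attentive_imitation_loss(student, teacher, target)
print(loss, confidence)
```

With this weighting, the second sample contributes no imitation term because the teacher's own error is the largest in the batch.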
8.Learning the Model Update for Siamese Trackers ⬇️
Siamese approaches address the visual tracking problem by extracting an appearance template from the current frame, which is used to localize the target in the next frame. In general, this template is linearly combined with the accumulated template from the previous frame, resulting in an exponential decay of information over time. While such an approach to updating has led to improved results, its simplicity limits the potential gain likely to be obtained by learning to update. Therefore, we propose to replace the handcrafted update function with a method which learns to update. We use a convolutional neural network, called UpdateNet, which given the initial template, the accumulated template and the template of the current frame aims to estimate the optimal template for the next frame. The UpdateNet is compact and can easily be integrated into existing Siamese trackers. We demonstrate the generality of the proposed approach by applying it to two Siamese trackers, SiamFC and DaSiamRPN. Extensive experiments on VOT2016, VOT2018, LaSOT, and TrackingNet datasets demonstrate that our UpdateNet effectively predicts the new target template, outperforming the standard linear update. On the large-scale TrackingNet dataset, our UpdateNet improves the results of DaSiamRPN with an absolute gain of 3.9% in terms of success score.
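The exponential decay caused by the linear update can be made concrete with a short sketch. The update rate, the template shapes, and the function names below are assumptions chosen for illustration; the sketch shows the handcrafted baseline update only, not UpdateNet itself.

```python
import numpy as np

def linear_update(accumulated, current, rate):
    # Handcrafted update: convex combination of the accumulated and current templates
    return (1.0 - rate) * accumulated + rate * current

def effective_weights(num_frames, rate):
    # Weight each frame's template ends up with after num_frames - 1 updates:
    # frame i > 0 contributes rate * (1 - rate)^(num_frames - 1 - i)
    weights = rate * (1.0 - rate) ** np.arange(num_frames - 1, -1, -1.0)
    # The initial template (frame 0) is never scaled by rate
    weights[0] = (1.0 - rate) ** (num_frames - 1)
    return weights

rng = np.random.default_rng(0)
templates = rng.normal(size=(6, 4, 4))    # six hypothetical 4x4 templates
rate = 0.1

accumulated = templates[0]
for template in templates[1:]:
    accumulated = linear_update(accumulated, template, rate)

weights = effective_weights(len(templates), rate)
closed_form = np.tensordot(weights, templates, axes=1)
print(np.allclose(accumulated, closed_form), weights)
```

Apart from the initial template, each older frame is down-weighted by a further factor of (1 - rate) per step, which is the fixed decay schedule that a learned update can replace.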
9.Learning Lightweight Lane Detection CNNs by Self Attention Distillation ⬇️
Training deep models for lane detection is challenging due to the very subtle and sparse supervisory signals inherent in lane annotations. Without learning from much richer context, these models often fail in challenging scenarios, e.g., severe occlusion, ambiguous lanes, and poor lighting conditions. In this paper, we present a novel knowledge distillation approach, i.e., Self Attention Distillation (SAD), which allows a model to learn from itself and gain substantial improvement without any additional supervision or labels. Specifically, we observe that attention maps extracted from a model trained to a reasonable level would encode rich contextual information. The valuable contextual information can be used as a form of 'free' supervision for further representation learning through performing top-down and layer-wise attention distillation within the network itself. SAD can be easily incorporated in any feedforward convolutional neural network (CNN) and does not increase the inference time. We validate SAD on three popular lane detection benchmarks (TuSimple, CULane and BDD100K) using lightweight models such as ENet, ResNet-18 and ResNet-34. The lightest model, ENet-SAD, performs comparably to or even surpasses existing algorithms. Notably, ENet-SAD has 20x fewer parameters and runs 10x faster compared to the state-of-the-art SCNN, while still achieving compelling performance in all benchmarks. Our code is available at this https URL.
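A layer-wise attention distillation term of the kind described above might look like the following sketch. The attention definition (channel-summed squared activations followed by a spatial softmax), the squared-error loss, and the assumption that all layers share one spatial size are illustrative choices made here, not details confirmed by the abstract.

```python
import numpy as np

def attention_map(features):
    # features: (channels, height, width) activations of one layer.
    # Sum squared activations over channels, then normalize with a spatial softmax.
    energy = np.sum(features ** 2, axis=0)
    flat = energy.ravel()
    flat = flat - flat.max()              # subtract max for numerical stability
    weights = np.exp(flat)
    return (weights / weights.sum()).reshape(energy.shape)

def self_attention_distillation_loss(feature_maps):
    # Each layer's attention map is pulled toward the map of the next, deeper layer
    maps = [attention_map(f) for f in feature_maps]
    return sum(np.mean((shallow - deep) ** 2)
               for shallow, deep in zip(maps[:-1], maps[1:]))

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 6, 10)) for _ in range(3)]   # hypothetical activations

sad_loss = self_attention_distillation_loss(layers)
same_loss = self_attention_distillation_loss([layers[0]] * 3)
print(sad_loss, same_loss)
```

In a real network the layers usually differ in resolution, so the maps would have to be resized to a common size before comparison.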
10.A Structural Graph-Based Method for MRI Analysis ⬇️
The importance of imaging exams, such as Magnetic Resonance Imaging (MRI), for the diagnosis and follow-up of pediatric pathologies and the assessment of anatomical structures' development has been increasingly highlighted in recent times. Manual analysis of MRIs is time-consuming, subjective, and requires significant expertise. To mitigate this, automatic techniques are necessary. Most techniques focus on adult subjects, while pediatric MRI has specific challenges such as the ongoing anatomical and histological changes related to normal development of the organs, reduced signal-to-noise ratio due to the smaller bodies, motion artifacts and cooperation issues, especially in long exams, which can in many cases preclude common analysis methods developed for use in adults. Therefore, the development of a robust technique to aid in pediatric MRI analysis is necessary. This paper presents the current development of a new method based on the learning and matching of structural relational graphs (SRGs). The experiments were performed on liver MRI sequences of one patient from ICr-HC-FMUSP, and preliminary results showcased the viability of the project. Future experiments are expected to culminate in an application for pediatric liver substructure and brain tumor segmentation.
11.DAWN: Dual Augmented Memory Network for Unsupervised Video Object Tracking ⬇️
Psychological studies have found that the human visual tracking system involves learning, memory, and planning. Despite recent successes, not many works have focused on memory and planning in deep learning based tracking. We are thus interested in memory-augmented networks, where an external memory remembers the evolving appearance of the target (foreground) object without backpropagation for updating weights. Our Dual Augmented Memory Network (DAWN) is unique in remembering both target and background, and using an improved attention LSTM memory to guide the focus on memorized features. DAWN is effective in unsupervised tracking in handling total occlusion, severe motion blur, abrupt changes in target appearance, multiple object instances, and similar foreground and background features. We present extensive quantitative and qualitative experimental comparisons with state-of-the-art methods, including top contenders in recent VOT challenges. Notably, despite the straightforward implementation, DAWN is ranked third in both the VOT2016 and VOT2017 challenges, with an excellent success rate among all VOT fast trackers running at fps > 10 in unsupervised tracking in both challenges. We propose DAWN-RPN, where we simply add our memory and attention LSTM modules to the state-of-the-art SiamRPN, and report an immediate performance gain, thus demonstrating that DAWN can work well with and directly benefit other models in handling difficult cases as well.
12.L2G Auto-encoder: Understanding Point Clouds by Local-to-Global Reconstruction with Hierarchical Self-Attention ⬇️
The auto-encoder is an important architecture for understanding point clouds through an encoding and decoding procedure of self-reconstruction. Current auto-encoders mainly focus on learning global structure through global shape reconstruction, while ignoring the learning of local structures. To resolve this issue, we propose the Local-to-Global auto-encoder (L2G-AE) to simultaneously learn the local and global structure of point clouds by local-to-global reconstruction. Specifically, L2G-AE employs an encoder to encode the geometry information of multiple scales in a local region at the same time. In addition, we introduce a novel hierarchical self-attention mechanism to highlight the important points, scales and regions at different levels in the information aggregation of the encoder. Simultaneously, L2G-AE employs a recurrent neural network (RNN) as decoder to reconstruct a sequence of scales in a local region, based on which the global point cloud is incrementally reconstructed. Our superior results in shape classification, retrieval and upsampling show that L2G-AE can understand point clouds better than state-of-the-art methods.
13.Entry-Exit event detection and learning ⬇️
The notion of Entry-Exit Surveillance provides scope for monitoring subjects entering and exiting 'private areas' (places such as washrooms and changing rooms where cameras are forbidden). The proposal here is to design a conceptual model that accurately detects the type of event, such as entry or exit of subjects at the entrances, and is robust to possible occlusions between subjects. This paper presents a novel Entry-Exit event detection method that analyzes the three-dimensional layout of the camera view as well as the transition of subjects through frames and determines the type of event as entry, exit, or miscellaneous. Extensive experiments on the benchmark EnEx, CAVIAR and PAMELA-UANDES datasets demonstrate the efficacy of the model.
14.Scale Matters: Temporal Scale Aggregation Network for Precise Action Localization in Untrimmed Videos ⬇️
Temporal action localization is a recently emerging task, aiming to localize video segments from untrimmed videos that contain specific actions. Despite the remarkable recent progress, most two-stage action localization methods still suffer from imprecise temporal boundaries of action proposals. This work proposes a novel integrated temporal scale aggregation network (TSA-Net). Our main insight is that ensembling convolution filters with different dilation rates can effectively enlarge the receptive field with low computational cost, which inspires us to devise the multi-dilation temporal convolution (MDC) block. Furthermore, to tackle video action instances with different durations, TSA-Net consists of multiple branches of sub-networks. Each of them adopts stacked MDC blocks with different dilation parameters, achieving a temporal receptive field specially optimized for actions of a specific duration. We follow the formulation of boundary point detection, detecting three kinds of critical points (i.e., starting / mid-point / ending) and pairing them for proposal generation. Comprehensive evaluations are conducted on two challenging video benchmarks, THUMOS14 and ActivityNet-1.3. Our proposed TSA-Net demonstrates clearly and consistently better performance and sets a new state of the art on both benchmarks. For example, our new record on THUMOS14 is 46.9% while the previous best is 42.8% under mAP@0.5.
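The claim that dilation enlarges the receptive field at low cost can be checked with simple arithmetic. The sketch below assumes stride-1 temporal convolutions and assumes that an MDC-style block runs its dilated filters in parallel, so that the block's receptive field is that of its largest dilation; the kernel size, dilation rates, and function names are hypothetical.

```python
def dilated_receptive_field(kernel_size, dilation):
    # Temporal extent covered by one dilated 1D convolution
    return (kernel_size - 1) * dilation + 1

def block_receptive_field(kernel_size, dilations):
    # Parallel branches: the block sees as far as its widest branch
    return max(dilated_receptive_field(kernel_size, d) for d in dilations)

def stacked_receptive_field(kernel_size, blocks):
    # Stride-1 stacking: each block adds (its extent - 1) time steps
    field = 1
    for dilations in blocks:
        field += block_receptive_field(kernel_size, dilations) - 1
    return field

plain = stacked_receptive_field(3, [[1]] * 3)            # three ordinary conv layers
multi = stacked_receptive_field(3, [[1, 2, 4]] * 3)      # three multi-dilation blocks
print(plain, multi)
```

Each dilated filter has the same number of weights as an undilated one of the same kernel size, so the wider temporal context comes from the spacing of the taps rather than from larger kernels.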
15.AdvGAN++ : Harnessing latent layers for adversary generation ⬇️
Adversarial examples are fabricated examples, indistinguishable from the original image, that mislead neural networks and drastically lower their performance. The recently proposed AdvGAN, a GAN-based approach, takes the input image as a prior for generating adversaries to target a model. In this work, we show how latent features can serve as better priors than input images for adversary generation by proposing AdvGAN++, a version of AdvGAN that achieves higher attack rates than AdvGAN and at the same time generates perceptually realistic images on the MNIST and CIFAR-10 datasets.
16.Real Time Visual Tracking using Spatial-Aware Temporal Aggregation Network ⬇️
More powerful feature representations derived from deep neural networks benefit visual tracking algorithms widely. However, the lack of exploitation of temporal information prevents tracking algorithms from adapting to appearance changes or resisting drift. This paper proposes a correlation filter based tracking method which aggregates historical features in a spatially aligned and scale-aware paradigm. The features of historical frames are sampled and aggregated to the search frame according to a pixel-level alignment module based on deformable convolutions. In addition, we use a feature pyramid structure to handle motion estimation at different scales, and address the different demands on feature granularity between tracking losses and deformation offset learning. With this design, the tracker, named Spatial-Aware Temporal Aggregation network (SATA), is able to assemble appearance and motion contexts of various scales over a time period, resulting in better performance compared to a single static image. Our tracker achieves leading performance on OTB2013, OTB2015, VOT2015, VOT2016 and LaSOT, and operates at a real-time speed of 26 FPS, which indicates our method is effective and practical. Our code will be made publicly available at this https URL.
17.Indices Matter: Learning to Index for Deep Image Matting ⬇️
We show that existing upsampling operators can be unified with the notion of the index function. This notion is inspired by an observation in the decoding process of deep image matting where indices-guided unpooling can recover boundary details much better than other upsampling operators such as bilinear interpolation. By looking at the indices as a function of the feature map, we introduce the concept of learning to index, and present a novel index-guided encoder-decoder framework where indices are self-learned adaptively from data and are used to guide the pooling and upsampling operators, without the need of supervision. At the core of this framework is a flexible network module, termed IndexNet, which dynamically predicts indices given an input. Due to its flexibility, IndexNet can be used as a plug-in applying to any off-the-shelf convolutional networks that have coupled downsampling and upsampling stages.
We demonstrate the effectiveness of IndexNet on the task of natural image matting, where the quality of learned indices can be visually observed from predicted alpha mattes. Results on the Composition-1k matting dataset show that our model built on MobileNetv2 exhibits at least 16.1% improvement over the seminal VGG-16 based deep matting baseline, with less training data and lower model capacity. Code and models have been made available at: this https URL
18.Recognizing Image Objects by Relational Analysis Using Heterogeneous Superpixels and Deep Convolutional Features ⬇️
Superpixel-based methodologies have become increasingly popular in computer vision, especially when the computation is too expensive in time or memory to perform with a large number of pixels or features. However, rarely is superpixel segmentation examined within the context of deep convolutional neural network architectures. This paper presents a novel neural architecture that exploits the superpixel feature space. The visual feature space is organized using superpixels to provide the neural network with a substructure of the images. As the superpixels associate the visual feature space with parts of the objects in an image, the visual feature space is transformed into a structured vector representation per superpixel. It is shown that it is feasible to learn superpixel features using capsules and it is potentially beneficial to perform image analysis in such a structured manner. This novel deep learning architecture is examined in the context of an image classification task, highlighting explicit interpretability (explainability) of the network's decision making. The results are compared against a baseline deep neural model, as well as among superpixel capsule networks with a variety of hyperparameter settings.
19.Fitting, Comparison, and Alignment of Trajectories on Positive Semi-Definite Matrices with Application to Action Recognition ⬇️
In this paper, we tackle the problem of action recognition using body skeletons extracted from video sequences. Our approach lies in the continuity of recent works representing video frames by Gramian matrices that describe a trajectory on the Riemannian manifold of positive-semidefinite matrices of fixed rank. In comparison with previous works, the manifold of fixed-rank positive-semidefinite matrices is here endowed with a different metric, and we resort to different algorithms for the curve fitting and temporal alignment steps. We evaluated our approach on three publicly available datasets (UTKinect-Action3D, KTH-Action and UAV-Gesture). The results of the proposed approach are competitive with respect to state-of-the-art methods, while only involving body skeletons.
20.Neural Architecture based on Fuzzy Perceptual Representation For Online Multilingual Handwriting Recognition ⬇️
Due to the omnipresence of mobile devices, online handwriting has become the most important input modality for smartphones and tablet devices. To increase online handwriting recognition performance, deeper neural networks have been used extensively. In this context, our paper addresses the problem of online handwritten script recognition based on a feature extraction system and a deep learning system for sequence classification. Many solutions have appeared to facilitate the recognition of handwriting. Accordingly, we combined an existing method with new classifiers in order to obtain a flexible system. Good results are achieved compared to existing online character and word recognition systems on Latin and Arabic scripts. The performance of our two proposed systems is assessed using five databases. Indeed, the recognition rate exceeds 98%.
21.Multi-Scale Learned Iterative Reconstruction ⬇️
Model-based learned iterative reconstruction methods have recently been shown to outperform classical reconstruction methods. Applicability of these methods to large scale inverse problems is however limited by the available memory for training and extensive training times. As a possible solution to these restrictions we propose a multi-scale learned iterative reconstruction algorithm that computes iterates on discretisations of increasing resolution. This procedure does not only reduce memory requirements, it also considerably speeds up reconstruction and training times, but most importantly is scalable to large scale inverse problems, like those that arise in 3D tomographic imaging. Feasibility of the proposed method to speed up training and computation times in comparison to established learned reconstruction methods in 2D is demonstrated for low dose computed tomography (CT), for which we utilise the data base of abdominal CT scans provided for the 2016 AAPM low-dose CT grand challenge.
22.ConCORDe-Net: Cell Count Regularized Convolutional Neural Network for Cell Detection in Multiplex Immunohistochemistry Images ⬇️
In digital pathology, cell detection and classification are often prerequisites to quantify cell abundance and explore tissue spatial heterogeneity. However, these tasks are particularly challenging for multiplex immunohistochemistry (mIHC) images due to high levels of variability in staining, expression intensity, and inherent noise as a result of preprocessing artefacts. We propose a deep learning method to detect and classify cells in mIHC whole-tumour slide images of breast cancer. Inspired by inception-v3, we developed the Cell COunt RegularizeD Convolutional neural Network (ConCORDe-Net), which integrates conventional dice overlap and a new cell count loss function for optimizing cell detection, followed by a multi-stage convolutional neural network for cell classification. In total, 20447 cells belonging to five cell classes were annotated by experts from 175 patches extracted from 6 whole-tumour mIHC images. These patches were randomly split into training, validation and testing sets. Using ConCORDe-Net, we obtained a cell detection F1 score of 0.873, which is the best score compared to three state-of-the-art methods. In particular, ConCORDe-Net excels at detecting closely located and weakly stained cells compared to other methods. Incorporating cell count loss in the objective function regularizes the network to learn weak gradient boundaries and separate weakly stained cells from background artefacts. Moreover, a cell classification accuracy of 96.5% was achieved. These results support that incorporating problem-specific knowledge such as cell count into deep learning-based cell detection architectures improves the robustness of the algorithm.
23.Exact and Fast Inversion of the Approximate Discrete Radon Transform ⬇️
We give an exact inversion formula for the approximate discrete Radon transform introduced in [Brady, SIAM J. Comput., 27(1), 107--119] that is of cost $O(N \log N)$ for a square 2D image with $N$ pixels.
24.Space-adaptive anisotropic bivariate Laplacian regularization for image restoration ⬇️
In this paper we present a new regularization term for variational image restoration which can be regarded as a space-variant anisotropic extension of the classical isotropic Total Variation (TV) regularizer. The proposed regularizer comes from the statistical assumption that the gradients of the target image are locally distributed according to space-variant bivariate Laplacian distributions. The highly flexible variational structure of the corresponding regularizer encodes several free parameters which hold the potential for faithfully modelling the local geometry in the image and describing local orientation preferences. For an automatic estimation of such parameters, we design a robust maximum likelihood approach and report results on its reliability on synthetic data and natural images. A minimization algorithm based on the Alternating Direction Method of Multipliers (ADMM) is presented for the efficient numerical solution of the proposed variational model. Some experimental results are reported which demonstrate the high quality of the restorations achievable by the proposed model, in particular with respect to classical Total Variation regularization.
25.Uncertainty Quantification in Computer-Aided Diagnosis: Make Your Model say "I don't know" for Ambiguous Cases ⬇️
We evaluate two different methods for the integration of prediction uncertainty into diagnostic image classifiers to increase patient safety in deep learning. In the first method, Monte Carlo sampling is applied with dropout at test time to get a posterior distribution of the class labels (Bayesian ResNet). The second method extends ResNet to a probabilistic approach by predicting the parameters of the posterior distribution and sampling the final result from it (Variational ResNet). The variance of the posterior is used as a metric for uncertainty. Both methods are trained on a data set of optical coherence tomography scans showing four different retinal conditions. Our results show that cases in which the classifier predicts incorrectly correlate with a higher uncertainty. Mean uncertainty of incorrectly diagnosed cases was between 4.6 and 8.1 times higher than mean uncertainty of correctly diagnosed cases. Modeling of the prediction uncertainty in computer-aided diagnosis with deep learning yields more reliable results and is anticipated to increase patient safety.
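The first method, Monte Carlo sampling with dropout at test time, can be sketched on a toy model. The linear classifier, the dropout rate, the number of samples, and all names below are assumptions for illustration; they stand in for the ResNet used in the work and are not taken from it.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, weights, drop_prob, num_samples, rng):
    # Run num_samples stochastic forward passes with dropout kept active
    probs = np.empty((num_samples, weights.shape[1]))
    for t in range(num_samples):
        keep = rng.random(x.shape) >= drop_prob
        x_dropped = x * keep / (1.0 - drop_prob)   # inverted dropout scaling
        probs[t] = softmax(x_dropped @ weights)
    # Mean approximates the predictive distribution; variance serves as uncertainty
    return probs.mean(axis=0), probs.var(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=16)                 # hypothetical feature vector of one scan
weights = rng.normal(size=(16, 4))      # toy classifier over four classes

mean, variance = mc_dropout_predict(x, weights, drop_prob=0.5,
                                    num_samples=200, rng=rng)
_, variance_no_dropout = mc_dropout_predict(x, weights, drop_prob=0.0,
                                            num_samples=20, rng=rng)
print(mean.argmax(), variance)
```

A case whose predicted class has high variance across the passes could then be flagged for manual review instead of being reported as a confident diagnosis.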
26.Deformable Medical Image Registration Using a Randomly-Initialized CNN as Regularization Prior ⬇️
We present deformable unsupervised medical image registration using a randomly-initialized deep convolutional neural network (CNN) as regularization prior. Conventional registration methods predict a transformation by minimizing dissimilarities between an image pair. The minimization is usually regularized with manually engineered priors, which limits the potential of the registration. By learning transformation priors from a large dataset, CNNs have achieved great success in deformable registration. However, learned methods are restricted to domain-specific data and the required amounts of medical data are difficult to obtain. Our approach uses the idea of deep image priors to combine convolutional networks with conventional registration methods based on manually engineered priors. The proposed method is applied to brain MRI scans. We show that our approach registers image pairs with state-of-the-art accuracy by providing dense, pixel-wise correspondence maps. It does not rely on prior training and is therefore not limited to a specific image domain.
27.An amplified-target loss approach for photoreceptor layer segmentation in pathological OCT scans ⬇️
Segmenting anatomical structures such as the photoreceptor layer in retinal optical coherence tomography (OCT) scans is challenging in pathological scenarios. Supervised deep learning models trained with standard loss functions are usually able to characterize only the most common disease appearance from a training set, resulting in suboptimal performance and poor generalization when dealing with unseen lesions. In this paper we propose to overcome this limitation by means of an augmented target loss function framework. We introduce a novel amplified-target loss that explicitly penalizes errors within the central area of the input images, based on the observation that most of the challenging disease appearance is usually located in this area. We experimentally validated our approach using a data set with OCT scans of patients with macular diseases. We observe increased performance compared to the models that use only the standard losses. Our proposed loss function strongly supports the segmentation model to better distinguish photoreceptors in highly pathological scenarios.
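One way to read "penalizing errors within the central area" is a standard loss plus an extra term computed on a central crop. The sketch below follows that reading only; the use of mean squared error, the crop fraction, the weight, and all function names are assumptions and not the loss defined in the paper.

```python
def mse(pred, target):
    """Mean squared error over two equally sized 2D lists."""
    n = sum(len(row) for row in pred)
    total = sum((p - t) ** 2
                for pr, tr in zip(pred, target)
                for p, t in zip(pr, tr))
    return total / n


def central_crop(img, fraction=0.5):
    """Return the centered sub-image covering `fraction` of each dimension."""
    h, w = len(img), len(img[0])
    ch, cw = max(1, int(h * fraction)), max(1, int(w * fraction))
    top, left = (h - ch) // 2, (w - cw) // 2
    return [row[left:left + cw] for row in img[top:top + ch]]


def center_amplified_loss(pred, target, weight=1.0, fraction=0.5):
    """Standard loss on the full image plus a weighted loss on the center
    (hypothetical formulation for illustration)."""
    base = mse(pred, target)
    center = mse(central_crop(pred, fraction), central_crop(target, fraction))
    return base + weight * center


if __name__ == "__main__":
    target = [[0.0] * 4 for _ in range(4)]
    center_err = [row[:] for row in target]
    center_err[1][1] = 1.0          # one wrong pixel inside the central 2x2
    corner_err = [row[:] for row in target]
    corner_err[0][0] = 1.0          # one wrong pixel outside the center
    print(center_amplified_loss(center_err, target))  # 0.3125
    print(center_amplified_loss(corner_err, target))  # 0.0625
```

The same single-pixel mistake costs five times more when it falls in the center, which is the behaviour the abstract motivates.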
28.Network with Sub-Networks ⬇️
We introduce network with sub-networks, a neural network whose weight layers can be removed on demand during inference to form sub-networks. This method provides selectivity in the number of weight layers. To develop parameters that can be used in both the base model and the sub-network models, the weights and biases are first copied from the sub-models to the base model. Each model is forwarded separately. Gradients from both networks are averaged and used to update both networks. In our experiments, the base model achieves test accuracy comparable to regularly trained models while retaining the ability to remove weight layers.
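The gradient-averaging step can be sketched on a deliberately tiny example. This is an interpretation, not the authors' implementation: the "base model" is a two-layer scalar network `w2 * (w1 * x)`, the "sub-model" drops the second layer and keeps only `w1 * x`, and the shared parameter `w1` is updated with the average of the gradients from both models. The squared-error loss, the learning rate, and all names are assumptions.

```python
def train_step(w1, w2, x, t, lr=0.05):
    """One update where the shared layer w1 receives the averaged gradient
    from the base model and the sub-model; w2 exists only in the base model."""
    base_out = w2 * (w1 * x)      # base model: both layers
    sub_out = w1 * x              # sub-model: second layer removed

    # Gradients of the squared error (out - t)^2 for each model.
    g_w1_base = 2.0 * (base_out - t) * w2 * x
    g_w2_base = 2.0 * (base_out - t) * w1 * x
    g_w1_sub = 2.0 * (sub_out - t) * x

    g_w1 = 0.5 * (g_w1_base + g_w1_sub)   # average over the two models
    return w1 - lr * g_w1, w2 - lr * g_w2_base


def fit(x=1.0, t=2.0, steps=2000):
    """Train the toy pair of models on a single (x, t) example."""
    w1, w2 = 0.5, 0.5
    for _ in range(steps):
        w1, w2 = train_step(w1, w2, x, t)
    return w1, w2


if __name__ == "__main__":
    w1, w2 = fit()
    # Both the base model and the sub-model should now map x=1 close to t=2.
    print(round(w2 * w1, 4), round(w1, 4))
```

Because the shared layer must serve both models, training drives `w1` to solve the task alone and the removable layer toward an identity-like mapping in this toy setup.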
29.Integrating Spatial Configuration into Heatmap Regression Based CNNs for Landmark Localization ⬇️
In many medical image analysis applications, often only a limited amount of training data is available, which makes training of convolutional neural networks (CNNs) challenging. In this work on anatomical landmark localization, we propose a CNN architecture that learns to split the localization task into two simpler sub-problems, reducing the need for large training datasets. Our fully convolutional SpatialConfiguration-Net (SCN) dedicates one component to locally accurate but ambiguous candidate predictions, while the other component improves robustness to ambiguities by incorporating the spatial configuration of landmarks. In our experimental evaluation, we show that the proposed SCN outperforms related methods in terms of landmark localization error on size-limited datasets.
30.Learning Variations in Human Motion via Mix-and-Match Perturbation ⬇️
Human motion prediction is a stochastic process: Given an observed sequence of poses, multiple future motions are plausible. Existing approaches to modeling this stochasticity typically combine a random noise vector with information about the previous poses. This combination, however, is done in a deterministic manner, which gives the network the flexibility to learn to ignore the random noise. In this paper, we introduce an approach to stochastically combine the root of variations with previous pose information, which forces the model to take the noise into account. We exploit this idea for motion prediction by incorporating it into a recurrent encoder-decoder network with a conditional variational autoencoder block that learns to exploit the perturbations. Our experiments demonstrate that our model yields high-quality pose sequences that are much more diverse than those from state-of-the-art stochastic motion prediction techniques.
31.Road Context-aware Intrusion Detection System for Autonomous Cars ⬇️
Security is of primary importance to vehicles. The viability of performing remote intrusions onto the in-vehicle network has been manifested. In regard to unmanned autonomous cars, limited work has been done to detect intrusions for them, while existing intrusion detection systems (IDSs) have limitations against strong adversaries. In this paper, we consider the very nature of autonomous cars and leverage the road context to build a novel IDS, named Road context-aware IDS (RAIDS). When a computer-controlled car is driving through continuous roads, road contexts and genuine frames transmitted on the car's in-vehicle network should resemble a regular and intelligible pattern. RAIDS hence employs a lightweight machine learning model to extract road contexts from sensory information (e.g., camera images and distance sensor values) that are used to generate control signals for maneuvering the car. With such ongoing road context, RAIDS validates corresponding frames observed on the in-vehicle network. Anomalous frames that substantially deviate from road context will be discerned as intrusions. We have implemented a prototype of RAIDS with neural networks, and conducted experiments on a Raspberry Pi with extensive datasets and meaningful intrusion cases. Evaluations show that RAIDS significantly outperforms state-of-the-art IDS without using road context by up to 99.9% accuracy and short response time.
32.Combining learned skills and reinforcement learning for robotic manipulations ⬇️
Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision. The supervised approach of imitation learning can handle short tasks but suffers from compounding errors and the need for many demonstrations for longer and more complex tasks. Reinforcement learning (RL) can find solutions beyond demonstrations but requires tedious and task-specific reward engineering for multi-step problems. In this work we address the difficulties of both methods and explore their combination. To this end, we propose RL policies operating on pre-trained skills that can learn composite manipulations using no intermediate rewards and no demonstrations of full tasks. We also propose an efficient training of basic skills from few synthetic demonstrated trajectories by exploring recent CNN architectures and data augmentation. We show successful learning of policies for composite manipulation tasks such as making a simple breakfast. Notably, our method achieves high success rates on a real robot, while using synthetic training data only.
33.AutoML: A Survey of the State-of-the-Art ⬇️
Deep learning has penetrated all aspects of our lives and brought us great convenience. However, the process of building a high-quality deep learning system for a specific task is not only time-consuming but also requires lots of resources and relies on human expertise, which hinders the development of deep learning in both industry and academia. To alleviate this problem, a growing number of research projects focus on automated machine learning (AutoML). In this paper, we provide a comprehensive and up-to-date study on the state-of-the-art AutoML. First, we introduce the AutoML techniques in detail according to the machine learning pipeline. Then we summarize existing Neural Architecture Search (NAS) research, which is one of the most popular topics in AutoML. We also compare the models generated by NAS algorithms with human-designed models. Finally, we present several open problems for future research.
34.Greedy AutoAugment ⬇️
A major problem in data augmentation is the number of possibilities in the search space of operations. The search space includes mixtures of all of the possible data augmentation techniques, the magnitude of these operations, and the probability of applying data augmentation for each image. In this paper, we propose Greedy AutoAugment as a highly efficient searching algorithm to find the best augmentation policies. We combine the searching process with a simple procedure to increase the size of training data. Our experiments show that the proposed method can be used as a reliable addition to the ANN infrastructures for increasing the accuracy of classification results.
35.Attention-guided Low-light Image Enhancement ⬇️
Low-light image enhancement is a challenging task since various factors, including brightness, contrast, artifacts and noise, should be handled simultaneously and effectively. To address such a difficult problem, this paper proposes a novel attention-guided enhancement solution and delivers the corresponding end-to-end multi-branch CNNs. The key to our method is the computation of two attention maps to guide the exposure enhancement and denoising respectively. In particular, the first attention map distinguishes underexposed regions from normally exposed regions, while the second attention map distinguishes noise from real-world textures. Under their guidance, the proposed multi-branch enhancement network can work in an adaptive way. Other contributions of this paper include the "decomposition/multi-branch-enhancement/fusion" design of the enhancement network, the reinforcement-net for contrast enhancement, and the proposed large-scale low-light enhancement dataset. We evaluate the proposed method through extensive experiments, and the results demonstrate that our solution outperforms state-of-the-art methods by a large margin. We additionally show that our method is flexible and effective for other image processing tasks.
36.Robustifying deep networks for image segmentation ⬇️
Purpose: The purpose of this study is to investigate the robustness of a commonly-used convolutional neural network for image segmentation with respect to visually-subtle adversarial perturbations, and suggest new methods to make these networks more robust to such perturbations. Materials and Methods: In this retrospective study, the accuracy of brain tumor segmentation was studied in subjects with low- and high-grade gliomas. A three-dimensional UNet model was implemented to segment four different MR series (T1-weighted, post-contrast T1-weighted, T2-weighted, and T2-weighted FLAIR) into four pixelwise labels (Gd-enhancing tumor, peritumoral edema, necrotic and non-enhancing tumor, and background). We developed attack strategies based on the Fast Gradient Sign Method (FGSM), iterative FGSM (i-FGSM), and targeted iterative FGSM (ti-FGSM) to produce effective attacks. Additionally, we explored the effectiveness of distillation and adversarial training via data augmentation to counteract adversarial attacks. Robustness was measured by comparing the Dice coefficient for each attack method using Wilcoxon signed-rank tests. Results: Attacks based on FGSM, i-FGSM, and ti-FGSM were effective in significantly reducing the quality of image segmentation with reductions in Dice coefficient by up to 65%. For attack defenses, distillation performed significantly better than adversarial training approaches. However, all defense approaches performed worse compared to unperturbed test images. Conclusion: Segmentation networks can be adversely affected by targeted attacks that introduce visually minor (and potentially undetectable) modifications to existing images. With an increasing interest in applying deep learning techniques to medical imaging data, it is important to quantify the ramifications of adversarial inputs (either intentional or unintentional).
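To make the attack and the metric concrete, here is a minimal sketch of single-step FGSM and the Dice coefficient on a toy problem. The per-pixel logistic "segmenter", its parameters `w` and `b`, the pixel values, and the step size `eps` are hypothetical and stand in for the three-dimensional UNet used in the study. FGSM moves each input by `eps` in the direction of the sign of the loss gradient with respect to that input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dice(pred_mask, true_mask):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for binary masks."""
    inter = sum(p * t for p, t in zip(pred_mask, true_mask))
    denom = sum(pred_mask) + sum(true_mask)
    return 1.0 if denom == 0 else 2.0 * inter / denom


def segment(pixels, w, b):
    """Toy per-pixel segmenter: foreground when sigmoid(w*x + b) >= 0.5."""
    return [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in pixels]


def fgsm(pixels, labels, w, b, eps):
    """Single-step FGSM against the toy segmenter with cross-entropy loss."""
    adversarial = []
    for x, y in zip(pixels, labels):
        grad = (sigmoid(w * x + b) - y) * w   # d(cross-entropy)/dx
        sign = (grad > 0) - (grad < 0)
        adversarial.append(x + eps * sign)
    return adversarial


if __name__ == "__main__":
    pixels = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
    labels = [1, 1, 1, 0, 0, 0]
    w, b = 10.0, -5.0
    clean = dice(segment(pixels, w, b), labels)
    attacked = dice(segment(fgsm(pixels, labels, w, b, eps=0.15), w, b), labels)
    print(clean, round(attacked, 4))
```

On this toy input the perturbation flips the two pixels closest to the decision boundary, so the Dice coefficient drops from 1.0 to about 0.667. Iterative variants repeat the same step with a smaller `eps`.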
37.Deep Optics for Single-shot High-dynamic-range Imaging ⬇️
High-dynamic-range (HDR) imaging is crucial for many computer graphics and vision applications. Yet, acquiring HDR images with a single shot remains a challenging problem. Whereas modern deep learning approaches are successful at hallucinating plausible HDR content from a single low-dynamic-range (LDR) image, saturated scene details often cannot be faithfully recovered. Inspired by recent deep optical imaging approaches, we interpret this problem as jointly training an optical encoder and electronic decoder where the encoder is parameterized by the point spread function (PSF) of the lens, the bottleneck is the sensor with a limited dynamic range, and the decoder is a convolutional neural network (CNN). The lens surface is then jointly optimized with the CNN in a training phase; we fabricate this optimized optical element and attach it as a hardware add-on to a conventional camera during inference. In extensive simulations and with a physical prototype, we demonstrate that this end-to-end deep optical imaging approach to single-shot HDR imaging outperforms both purely CNN-based approaches and other PSF engineering approaches.
38.Improving localization-based approaches for breast cancer screening exam classification ⬇️
We trained and evaluated a localization-based deep CNN for breast cancer screening exam classification on over 200,000 exams (over 1,000,000 images). Our model achieves an AUC of 0.919 in predicting malignancy in patients undergoing breast cancer screening, reducing the error rate of the baseline (Wu et al., 2019a) by 23%. In addition, the model generates bounding boxes for benign and malignant findings, providing interpretable predictions.
39.StructureNet: Hierarchical Graph Networks for 3D Shape Generation ⬇️
The ability to generate novel, diverse, and realistic 3D shapes along with associated part semantics and structure is central to many applications requiring high-quality 3D assets or large volumes of realistic training data. A key challenge towards this goal is how to accommodate diverse shape variations, including both continuous deformations of parts as well as structural or discrete alterations which add to, remove from, or modify the shape constituents and compositional structure. Such object structure can typically be organized into a hierarchy of constituent object parts and relationships, represented as a hierarchy of n-ary graphs. We introduce StructureNet, a hierarchical graph network which (i) can directly encode shapes represented as such n-ary graphs; (ii) can be robustly trained on large and complex shape families; and (iii) can be used to generate a great diversity of realistic structured shape geometries. Technically, we accomplish this by drawing inspiration from recent advances in graph neural networks to propose an order-invariant encoding of n-ary graphs, considering jointly both part geometry and inter-part relations during network training. We extensively evaluate the quality of the learned latent spaces for various shape families and show significant advantages over baseline and competing methods. The learned latent spaces enable several structure-aware geometry processing applications, including shape generation and interpolation, shape editing, or shape structure discovery directly from un-annotated images, point clouds, or partial scans.