1.Image Captioning with Sparse Recurrent Neural Network ⬇️
The Recurrent Neural Network (RNN) has become the de facto model for tackling a wide variety of language generation problems and has achieved state-of-the-art (SOTA) performance. However, despite its impressive results, the large number of parameters in RNN models makes deployment on mobile and embedded devices infeasible. Driven by this problem, many works have proposed pruning methods to reduce the size of RNN models. In this work, we propose an end-to-end pruning method for image captioning models equipped with visual attention. Our proposed method achieves sparsity levels of up to 97.5% without significant performance loss relative to the baseline (around 1% loss at 40x compression of the GRU model). Our method is also simple to use and tune, facilitating faster development times for neural network practitioners. We perform extensive experiments on the popular MS-COCO dataset to empirically validate the efficacy of our proposed method.
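As a rough illustration of the kind of compression described above, the sketch below applies plain one-shot magnitude pruning to a GRU layer at 97.5% sparsity. It is a generic example under that assumption, not the authors' end-to-end attention-aware method; the function name and layer sizes are ours.

```python
import torch
import torch.nn as nn

def magnitude_prune(module: nn.Module, sparsity: float = 0.975) -> None:
    """Zero out the smallest-magnitude weights of a module in place (generic sketch)."""
    for name, param in module.named_parameters():
        if "weight" not in name:
            continue  # leave biases dense
        k = int(sparsity * param.numel())
        if k == 0:
            continue
        # Threshold at the k-th smallest absolute value; everything below it is removed.
        threshold = param.detach().abs().flatten().kthvalue(k).values
        mask = (param.detach().abs() > threshold).float()
        param.data.mul_(mask)

gru = nn.GRU(input_size=512, hidden_size=512, batch_first=True)
magnitude_prune(gru, sparsity=0.975)  # roughly 40x fewer non-zero recurrent weights
```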
2.Facial age estimation by deep residual decision making ⬇️
Residual representation learning simplifies the optimization problem of learning complex functions and has been widely used by traditional convolutional neural networks. However, it has not been applied to deep neural decision forest (NDF). In this paper we incorporate residual learning into NDF and the resulting model achieves state-of-the-art level accuracy on three public age estimation benchmarks while requiring less memory and computation. We further employ gradient-based technique to visualize the decision-making process of NDF and understand how it is influenced by facial image inputs. The code and pre-trained models will be available at this https URL.
3.Fast Video Object Segmentation via Mask Transfer Network ⬇️
Accuracy and processing speed are two important factors that affect the use of video object segmentation (VOS) in real applications. With the advanced techniques of deep neural networks, the accuracy has been significantly improved; however, the speed is still far below real-time requirements because of complicated network designs, such as the requirement of a first-frame fine-tuning step. To overcome this limitation, we propose a novel mask transfer network (MTN), which can greatly boost the processing speed of VOS while also achieving reasonable accuracy. The basic idea of MTN is to transfer the reference mask to the target frame via an efficient global pixel matching strategy. The global pixel matching between the reference frame and the target frame ensures good matching results. To enhance the matching speed, we perform the matching on a downsampled feature map with 1/32 of the original frame size. At the same time, to preserve the detailed mask information in such a small feature map, a mask network is designed to encode the annotated mask information with 512 channels. Finally, an efficient feature warping method is used to transfer the encoded reference mask to the target frame. Based on this design, our method avoids the fine-tuning step on the first frame and does not rely on temporal cues or particular object categories. Therefore, it runs very fast, can be conveniently trained only with images, and is robust to unseen objects. Experiments on the DAVIS datasets demonstrate that MTN achieves a speed of 37 fps and shows competitive accuracy in comparison to state-of-the-art methods.
4.Cross-sensor Pore Detection in High-resolution Fingerprint Images using Unsupervised Domain Adaptation ⬇️
With the emergence of high-resolution fingerprint sensors, the research community has focused on level-3 fingerprint features, especially pores, for providing the next generation automated fingerprint recognition system (AFRS). Following the recent success of deep learning approaches in various computer vision tasks, researchers have explored learning-based approaches for pore detection in high-resolution fingerprint images. These learning-based approaches provide better performance than the hand-crafted feature-based approaches. However, the domain adaptability of existing learning-based pore detection methods has not been examined in the past. In this paper, we present the first study of the domain adaptability of existing learning-based pore detection methods. For this purpose, we have generated an in-house ground truth dataset, referred to as IITI-HRF-GT, using a 1000 dpi fingerprint sensor and evaluated the performance of the existing learning-based pore detection approaches on it. Further, we have also proposed an approach for detecting pores in a cross-sensor scenario, referred to as DeepDomainPore, using an unsupervised domain adaptation technique. Specifically, DeepDomainPore is a combination of a convolutional neural network (CNN) based pore detection approach, DeepResPore, and an unsupervised domain adaptation approach included during the training process. The domain adaptation in DeepDomainPore is achieved by embedding a gradient reversal layer between DeepResPore and a domain classifier network. The results of all the existing and the proposed learning-based pore detection approaches are evaluated on IITI-HRF-GT. DeepDomainPore provides a true detection rate of 88.09% and an F-score of 83.94% on IITI-HRF-GT. Most importantly, the proposed approach achieves state-of-the-art performance on the cross-sensor dataset.
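The gradient reversal layer mentioned in this abstract is a standard construction from adversarial domain adaptation. A minimal sketch follows; the module names and sizes are illustrative assumptions, not the DeepDomainPore implementation.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    """Tiny illustrative domain classifier placed behind the reversal layer."""
    def __init__(self, feat_dim=128, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, features):
        # Gradients flowing back through this point are flipped, pushing the
        # feature extractor toward domain-invariant representations.
        reversed_feat = GradientReversal.apply(features, self.lambd)
        return self.head(reversed_feat)
```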
5.Explainable Video Action Reasoning via Prior Knowledge and State Transitions ⬇️
Human action analysis and understanding in videos is an important and challenging task. Although substantial progress has been made in past years, the explainability of existing methods is still limited. In this work, we propose a novel action reasoning framework that uses prior knowledge to explain semantic-level observations of video state changes. Our method takes advantage of both classical reasoning and modern deep learning approaches. Specifically, prior knowledge is defined as the information of a target video domain, including a set of objects, attributes and relationships in the target video domain, as well as relevant actions defined by the temporal attribute and relationship changes (i.e. state transitions). Given a video sequence, we first generate a scene graph on each frame to represent concerned objects, attributes and relationships. Then those scene graphs are linked by tracking objects across frames to form a spatio-temporal graph (also called video graph), which represents semantic-level video states. Finally, by sequentially examining each state transition in the video graph, our method can detect and explain how those actions are executed with prior knowledge, just like the logical manner of thinking by humans. Compared to previous works, the action reasoning results of our method can be explained by both logical rules and semantic-level observations of video content changes. Besides, the proposed method can be used to detect multiple concurrent actions with detailed information, such as who (particular objects), when (time), where (object locations) and how (what kind of changes). Experiments on a re-annotated dataset CAD-120 show the effectiveness of our method.
6.CASIA-SURF: A Large-scale Multi-modal Benchmark for Face Anti-spoofing ⬇️
Face anti-spoofing is essential to prevent face recognition systems from a security breach. Much progress has been made thanks to the availability of face anti-spoofing benchmark datasets in recent years. However, existing face anti-spoofing benchmarks have a limited number of subjects ($\le 170$) and modalities ($\le 2$), which hinders the further development of the academic community. To facilitate face anti-spoofing research, we introduce a large-scale multi-modal dataset, namely CASIA-SURF, which is the largest publicly available dataset for face anti-spoofing in terms of both subjects and modalities. Specifically, it consists of $1,000$ subjects with $21,000$ videos, and each sample has $3$ modalities (i.e., RGB, Depth and IR). We also provide comprehensive evaluation metrics, diverse evaluation protocols, training/validation/testing subsets and a measurement tool, developing a new benchmark for face anti-spoofing. Moreover, we present a novel multi-modal multi-scale fusion method as a strong baseline, which performs feature re-weighting to select the more informative channel features while suppressing the less useful ones for each modality across different scales. Extensive experiments have been conducted on the proposed dataset to verify its significance and generalization capability. The dataset is available at this https URL
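The channel re-weighting idea in the baseline can be sketched with a squeeze-and-excitation style gate applied to each modality before fusion. The code below is an illustrative assumption about the mechanism (names and sizes are ours), not the released CASIA-SURF baseline.

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """Squeeze-and-excitation style gate that re-weights channels of one modality."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.gate(x).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) channel weights
        return x * w                                   # emphasize informative channels

class MultiModalFusion(nn.Module):
    """Re-weight each modality's channels, then concatenate for the shared trunk."""
    def __init__(self, channels=64):
        super().__init__()
        self.gates = nn.ModuleList(ChannelReweight(channels) for _ in range(3))

    def forward(self, rgb, depth, ir):
        feats = [g(x) for g, x in zip(self.gates, (rgb, depth, ir))]
        return torch.cat(feats, dim=1)
```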
7.Self-supervised blur detection from synthetically blurred scenes ⬇️
Blur detection aims at segmenting the blurred areas of a given image. Recent deep learning-based methods approach this problem by learning an end-to-end mapping between the blurred input and a binary mask representing the localization of its blurred areas. Nevertheless, the effectiveness of such deep models is limited due to the scarcity of datasets annotated in terms of blur segmentation, as blur annotation is labour intensive. In this work, we bypass the need for such annotated datasets for end-to-end learning, and instead rely on object proposals and a model for blur generation in order to produce a dataset of synthetically blurred images. This allows us to perform self-supervised learning over the generated image and ground truth blur mask pairs using CNNs, defining a framework that can be employed in purely self-supervised, weakly supervised or semi-supervised configurations. Interestingly, experimental results of such setups over the largest blur segmentation datasets available show that this approach achieves state of the art results in blur segmentation, even without ever observing any real blurred image.
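A minimal sketch of the data-generation step, assuming a simple Gaussian blur model and a binary region standing in for an object proposal; the paper's actual blur generation model may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize_blur_pair(image, proposal_mask, sigma=3.0):
    """Blur the region given by a binary proposal mask; return (image, mask) training pair."""
    # Blur each channel, then paste the blurred pixels only inside the proposal region.
    blurred = np.stack([gaussian_filter(image[..., c], sigma)
                        for c in range(image.shape[-1])], axis=-1)
    mask = proposal_mask.astype(bool)
    out = image.copy()
    out[mask] = blurred[mask]
    return out, mask.astype(np.float32)   # input image and ground-truth blur mask

# Toy usage with random data standing in for a real photo and an object proposal
rng = np.random.default_rng(0)
img = rng.random((128, 128, 3))
proposal = np.zeros((128, 128)); proposal[32:96, 40:100] = 1
blurred_img, gt_mask = synthesize_blur_pair(img, proposal)
```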
8.Attention-based Fusion for Outfit Recommendation ⬇️
This paper describes an attention-based fusion method for outfit recommendation which fuses the information in the product image and description to capture the most important, fine-grained product features into the item representation. We experiment with different kinds of attention mechanisms and demonstrate that the attention-based fusion improves item understanding. We outperform state-of-the-art outfit recommendation results on three benchmark datasets.
9.BioFaceNet: Deep Biophysical Face Image Interpretation ⬇️
In this paper we present BioFaceNet, a deep CNN that learns to decompose a single face image into biophysical parameter maps, diffuse and specular shading maps, as well as estimating the spectral power distribution of the scene illuminant and the spectral sensitivity of the camera. The network comprises a fully convolutional encoder for estimating the spatial maps with a fully connected branch for estimating the vector quantities. The network is trained using a self-supervised appearance loss computed via a model-based decoder. The task is highly underconstrained, so we impose a number of model-based priors: skin spectral reflectance is restricted to a biophysical model, we impose a statistical prior on camera spectral sensitivities, a physical constraint on illumination spectra, a sparsity prior on specular reflections, and direct supervision on diffuse shading using a rough shape proxy. We show convincing qualitative results on in-the-wild data and introduce a benchmark for quantitative evaluation on this new task.
10.Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding ⬇️
Weakly supervised referring expression grounding aims at localizing the referential object in an image according to the linguistic query, where the mapping between the referential object and query is unknown in the training stage. To address this problem, we propose a novel end-to-end adaptive reconstruction network (ARN). It builds the correspondence between image region proposals and the query in an adaptive manner: adaptive grounding and collaborative reconstruction. Specifically, we first extract the subject, location and context features to represent the proposals and the query respectively. Then, we design the adaptive grounding module to compute the matching score between each proposal and the query by a hierarchical attention model. Finally, based on the attention scores and proposal features, we reconstruct the input query with a collaborative loss of language reconstruction loss, adaptive reconstruction loss, and attribute classification loss. This adaptive mechanism helps our model alleviate the variance of different referring expressions. Experiments on four large-scale datasets show that ARN outperforms existing state-of-the-art methods by a large margin. Qualitative results demonstrate that the proposed ARN can better handle situations where multiple objects of a particular category are situated together.
11.Online Sensor Hallucination via Knowledge Distillation for Multimodal Image Classification ⬇️
We deal with the problem of information fusion driven satellite image/scene classification and propose a generic hallucination architecture, considering that all the available sensor information is present during training while some of the image modalities may be absent during testing. It is well known that different sensors are capable of capturing complementary information for a given geographical area, and a classification module incorporating information from all the sources is expected to produce improved performance compared to considering only a subset of the modalities. However, classical classifier systems inherently require all the features used to train the module to be present for the test instances as well, which may not always be possible for typical remote sensing applications (say, disaster management). As a remedy, we provide a robust solution in terms of a hallucination module that can approximate the missing modalities from the available ones during the decision-making stage. In order to ensure better knowledge transfer during modality hallucination, we explicitly incorporate concepts of knowledge distillation to exploit the privileged (side) information in our framework and subsequently introduce an intuitive modular training approach. The proposed network is evaluated extensively on a large-scale corpus of PAN-MS image pairs (scene recognition) as well as on a benchmark hyperspectral image dataset (image classification), where we follow different experimental scenarios and find that the proposed hallucination-based module is indeed capable of capturing the multi-source information, despite the explicit absence of some of the sensor information, and aids in improved scene characterization.
12.Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video ⬇️
Recent work has shown that CNN-based depth and ego-motion estimators can be learned using unlabelled monocular videos. However, the performance is limited by unidentified moving objects that violate the underlying static scene assumption in geometric image reconstruction. More significantly, due to lack of proper constraints, networks output scale-inconsistent results over different samples, i.e., the ego-motion network cannot provide full camera trajectories over a long video sequence because of the per-frame scale ambiguity. This paper tackles these challenges by proposing a geometry consistency loss for scale-consistent predictions, and an induced self-discovered mask for handling moving objects and occlusions. Since we do not leverage multi-task learning like recent works, our framework is much simpler and more efficient. Extensive evaluation results demonstrate that our depth estimator achieves the state-of-the-art performance on the standard KITTI and Make3D datasets. Moreover, we show that our ego-motion network is able to predict a globally scale-consistent camera trajectory for long video sequences, and the resulting visual odometry accuracy is competitive with the state-of-the-art model that is trained using stereo videos. To the best of our knowledge, this is the first work to show that deep networks trained using monocular video snippets can predict globally scale-consistent camera trajectories over a long video sequence.
13.A Global-Local Embedding Module for Fashion Landmark Detection ⬇️
Detecting fashion landmarks is a fundamental technique for visual clothing analysis. Due to the large variation and non-rigid deformation of clothes, localizing fashion landmarks suffers from large spatial variances across poses, scales, and styles. Therefore, understanding contextual knowledge of clothes is required for accurate landmark detection. To that end, in this paper, we propose a fashion landmark detection network with a global-local embedding module. The global-local embedding module is based on a non-local operation for capturing long-range dependencies and a subsequent convolution operation for adopting local neighborhood relations. With this processing, the network can consider both global and local contextual knowledge for a clothing image. We demonstrate that our proposed method has an excellent ability to learn advanced deep feature representations for fashion landmark detection. Experimental results on two benchmark datasets show that the proposed network outperforms the state-of-the-art methods. Our code is available at this https URL.
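A minimal sketch of such a global-local block, pairing a standard non-local (self-attention) operation with a subsequent 3x3 convolution; module names and sizes are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GlobalLocalBlock(nn.Module):
    """Non-local (self-attention) operation followed by a local 3x3 convolution."""
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)
        self.phi = nn.Conv2d(channels, inner, 1)
        self.g = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)
        self.local = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)        # (B, HW, C')
        k = self.phi(x).flatten(2)                          # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)            # (B, HW, C')
        attn = torch.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        x = x + self.out(y)        # global, long-range dependencies
        return self.local(x)       # local neighborhood relations

block = GlobalLocalBlock(64)
features = block(torch.randn(2, 64, 32, 32))
```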
14.Fingerspelling recognition in the wild with iterative visual attention ⬇️
Sign language recognition is a challenging gesture sequence recognition problem, characterized by quick and highly coarticulated motion. In this paper we focus on recognition of fingerspelling sequences in American Sign Language (ASL) videos collected in the wild, mainly from YouTube and Deaf social media. Most previous work on sign language recognition has focused on controlled settings where the data is recorded in a studio environment and the number of signers is limited. Our work aims to address the challenges of real-life data, reducing the need for detection or segmentation modules commonly used in this domain. We propose an end-to-end model based on an iterative attention mechanism, without explicit hand detection or segmentation. Our approach dynamically focuses on increasingly high-resolution regions of interest. It outperforms prior work by a large margin. We also introduce a newly collected data set of crowdsourced annotations of fingerspelling in the wild, and show that performance can be further improved with this additional data set.
15.SPair-71k: A Large-scale Benchmark for Semantic Correspondence ⬇️
Establishing visual correspondences under large intra-class variations, which is often referred to as semantic correspondence or semantic matching, remains a challenging problem in computer vision. Despite its significance, however, most of the datasets for semantic correspondence are limited to a small number of image pairs with similar viewpoints and scales. In this paper, we present a new large-scale benchmark dataset of semantically paired images, SPair-71k, which contains 70,958 image pairs with diverse variations in viewpoint and scale. Compared to previous datasets, it is significantly larger and contains more accurate and richer annotations. We believe this dataset will provide a reliable testbed to study the problem of semantic correspondence and will help to advance research in this area. We provide the results of recent methods on our new dataset as baselines for further research. Our benchmark is available online at this http URL.
16.Orthogonal Center Learning with Subspace Masking for Person Re-Identification ⬇️
Person re-identification aims to identify whether pairs of images belong to the same person or not. This problem is challenging due to large differences in camera views, lighting and background. One mainstream approach to learning CNN features is to design loss functions which reinforce both class separation and intra-class compactness. In this paper, we propose a novel Orthogonal Center Learning method with Subspace Masking for person re-identification. We make the following contributions: (i) we develop a center learning module to learn the class centers by simultaneously reducing the intra-class differences and inter-class correlations through orthogonalization; (ii) we introduce a subspace masking mechanism to enhance the generalization of the learned class centers; and (iii) we devise a way to integrate average pooling and max pooling in a regularizing manner that fully exploits their strengths. Extensive experiments show that our proposed method consistently outperforms the state-of-the-art methods on the large-scale ReID datasets including Market-1501, DukeMTMC-ReID, CUHK03 and MSMT17.
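Contribution (i) can be sketched as a center loss combined with an orthogonality penalty on the class centers. The formulation below is an illustrative assumption, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalCenterLoss(nn.Module):
    """Intra-class compactness plus inter-class de-correlation of learned class centers."""
    def __init__(self, num_classes, feat_dim, ortho_weight=1.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.ortho_weight = ortho_weight

    def forward(self, features, labels):
        # Pull each feature toward its class center (intra-class compactness).
        compact = (features - self.centers[labels]).pow(2).sum(dim=1).mean()
        # Penalize correlation between different class centers (push toward orthogonality).
        c = F.normalize(self.centers, dim=1)
        gram = c @ c.t()
        off_diag = gram - torch.eye(gram.size(0), device=gram.device)
        ortho = off_diag.pow(2).mean()
        return compact + self.ortho_weight * ortho

loss_fn = OrthogonalCenterLoss(num_classes=751, feat_dim=256)
loss = loss_fn(torch.randn(32, 256), torch.randint(0, 751, (32,)))
```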
17.Adversarial Representation Learning for Text-to-Image Matching ⬇️
For many computer vision applications such as image captioning, visual question answering, and person search, learning discriminative feature representations at both image and text level is an essential yet challenging problem. Its challenges originate from the large word variance in the text domain as well as the difficulty of accurately measuring the distance between the features of the two modalities. Most prior work focuses on the latter challenge, by introducing loss functions that help the network learn better feature representations but fail to account for the complexity of the textual input. With that in mind, we introduce TIMAM: a Text-Image Modality Adversarial Matching approach that learns modality-invariant feature representations using adversarial and cross-modal matching objectives. In addition, we demonstrate that BERT, a publicly-available language model that extracts word embeddings, can successfully be applied in the text-to-image matching domain. The proposed approach achieves state-of-the-art cross-modal matching performance on four widely-used publicly-available datasets resulting in absolute improvements ranging from 2% to 5% in terms of rank-1 accuracy.
18.Image Harmonization Datasets: HCOCO, HAdobe5k, HFlickr, and Hday2night ⬇️
Image composition is an important operation in image processing, but the inconsistency between foreground and background significantly degrades the quality of the composite image. Image harmonization, which aims to make the foreground compatible with the background, is a promising yet challenging task. However, the lack of a high-quality public dataset for image harmonization significantly hinders the development of image harmonization techniques. Therefore, we create synthesized composite images based on the existing COCO (resp., Adobe5k, day2night) dataset, leading to our HCOCO (resp., HAdobe5k, Hday2night) dataset. To enrich the diversity of our datasets, we also generate synthesized composite images based on our collected Flickr images, leading to our HFlickr dataset. All four datasets are released at this https URL.
19.Consistent Cross-view Matching for Unsupervised Person Re-identification ⬇️
Most existing unsupervised person re-identification methods focus on learning an identity-discriminative feature embedding for efficiently representing images of different persons. However, higher-order relationships across the entire camera network are often ignored, leading to contradictory outputs when the results of different camera pairs are combined. In this paper, we address this problem by proposing a consistent cross-view matching framework for unsupervised person re-identification by exploiting more reliable positive image pairs in a camera network. Specifically, we first construct a bipartite graph for each pair of cameras, in which each node denotes a person, and then graph matching is used to obtain optimal global matches across camera pairs. Thereafter, loop-consistent and transitive-inference-consistent constraints are introduced into the cross-view matches, which consider similarity relationships across the entire camera network to increase confidence in the matched/non-matched pairs. We then train distance metric models for each camera pair using the reliably matched image pairs. Finally, we embed the cross-view matching method into an iterative updating framework that iterates between the consistent cross-view matching and the cross-view distance metric learning. We demonstrate the superiority of the proposed method over the state-of-the-art unsupervised person re-identification methods on three benchmark datasets: Market1501, MARS and DukeMTMC-VideoReID.
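The per-camera-pair graph matching step can be sketched with the Hungarian algorithm on a feature-distance cost matrix; the loop and transitive consistency constraints described above would then filter these matches (not shown). Function names and data are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_camera_pair(feats_a, feats_b):
    """Globally optimal one-to-one matching between persons seen in two cameras."""
    cost = cdist(feats_a, feats_b, metric="euclidean")   # bipartite edge costs
    rows, cols = linear_sum_assignment(cost)             # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))

# Toy usage: 5 people in camera A, 6 in camera B, 128-d embeddings
rng = np.random.default_rng(0)
matches = match_camera_pair(rng.random((5, 128)), rng.random((6, 128)))
```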
20.Self-Supervised Representation Learning via Neighborhood-Relational Encoding ⬇️
In this paper, we propose a novel self-supervised representation learning method that takes advantage of a neighborhood-relational encoding (NRE) among the training data. Conventional unsupervised learning methods have focused only on training deep networks to understand the primitive characteristics of the visual data, mainly to be able to reconstruct the data from a latent space. They often neglect the relation among the samples, which can serve as an important metric for self-supervision. Different from previous work, NRE aims at preserving the local neighborhood structure on the data manifold and is therefore less sensitive to outliers. We integrate our NRE component with an encoder-decoder structure for learning to represent samples considering their local neighborhood information. Such a discriminative and unsupervised representation learning scheme is adaptable to different computer vision tasks due to its independence from intense annotation requirements. We evaluate our proposed method on different tasks, including classification, detection, and segmentation based on the learned latent representations. In addition, we adopt the auto-encoding capability of our proposed method for applications such as defense against adversarial example attacks and video anomaly detection. Results confirm that the performance of our method is better than or at least comparable with the state-of-the-art for each specific application, but with a generic and self-supervised approach.
21.Improving Visual Feature Extraction in Glacial Environments ⬇️
Glacial science could benefit tremendously from autonomous robots, but previous glacial robots have had perception issues in these colorless and featureless environments, specifically with visual feature extraction. Glaciologists use near-infrared imagery to reveal the underlying heterogeneous spatial structure of snow and ice, and we theorize that this hidden near-infrared structure could produce more and higher quality features than available in visible light. We took a custom camera rig to Igloo Cave at Mt. St. Helens to test our theory. The camera rig contains two identical machine vision cameras, one of which was outfitted with multiple filters to see only near-infrared light. We extracted features from short video clips taken inside Igloo Cave using three popular feature extractors (FAST, SIFT, and SURF). We quantified the number of features and their quality for visual navigation using feature correspondence and the epipolar constraint. Our results indicate that near-infrared imagery produces more features that tend to be of higher quality than those of visible-light imagery.
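A minimal sketch of the evaluation idea, assuming SIFT features (FAST and SURF are analogous) and scoring quality as the fraction of matches consistent with a RANSAC-fitted fundamental matrix, i.e. the epipolar constraint. It requires OpenCV 4.4+ for SIFT_create and is not the authors' exact pipeline.

```python
import cv2
import numpy as np

def epipolar_inlier_ratio(img1, img2):
    """Detect SIFT features in two frames and score match quality via the epipolar constraint."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return 0, 0.0
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(des1, des2)
    if len(matches) < 8:
        return len(matches), 0.0
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Fit a fundamental matrix with RANSAC; inliers satisfy the epipolar constraint.
    _, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    ratio = float(inlier_mask.sum()) / len(matches) if inlier_mask is not None else 0.0
    return len(matches), ratio
```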
22.An Effective and Efficient Method for Detecting Hands in Egocentric Videos for Rehabilitation Applications ⬇️
Objective: Individuals with spinal cord injury (SCI) report upper limb function as their top recovery priority. To accurately represent the true impact of new interventions on patient function and independence, evaluation should occur in a natural setting. Wearable cameras can be used to monitor hand function at home, using computer vision to automatically analyze the resulting videos (egocentric video). A key step in this process, hand detection, is difficult to do robustly and reliably, hindering deployment of a complete monitoring system in the home and community. We propose an accurate and efficient hand detection method that uses a simple combination of existing detection and tracking algorithms. Methods: Detection, tracking, and combination methods were evaluated on a new hand detection dataset, consisting of 167,622 frames of egocentric videos collected on 17 individuals with SCI performing activities of daily living in a home simulation laboratory. Results: The F1-scores for the best detector and tracker alone (SSD and Median Flow) were 0.90$\pm$0.07 and 0.42$\pm$0.18, respectively. The best combination method, in which a detector was used to initialize and reset a tracker, resulted in an F1-score of 0.87$\pm$0.07 while being two times faster than the fastest detector alone. Conclusion: The combination of the fastest detector and best tracker improved the accuracy over online trackers while improving the speed of detectors. Significance: The method proposed here, in combination with wearable cameras, will help clinicians directly measure hand function in a patient's daily life at home, enabling independence after SCI.
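The detector-initializes-and-resets-a-tracker logic can be sketched as follows; detect() and make_tracker() are hypothetical stand-ins for the SSD detector and Median Flow tracker, and the re-detection interval is an assumption of this sketch.

```python
def track_hands(frames, detect, make_tracker, redetect_every=30):
    """Run a fast tracker between periodic detector calls; the detector (re)initializes the tracker.

    `detect(frame) -> box or None` and `make_tracker()` (an object with .init/.update)
    are hypothetical stand-ins for the SSD detector and Median Flow tracker.
    """
    tracker, boxes = None, []
    for i, frame in enumerate(frames):
        if tracker is None or i % redetect_every == 0:
            box = detect(frame)                      # slow but accurate
            if box is not None:
                tracker = make_tracker()
                tracker.init(frame, box)             # reset the tracker on every detection
        else:
            ok, box = tracker.update(frame)          # fast per-frame update
            if not ok:
                tracker, box = None, None            # fall back to the detector next frame
        boxes.append(box)
    return boxes
```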
23.Transfer Learning from Partial Annotations for Whole Brain Segmentation ⬇️
Brain MR image segmentation is a key task in neuroimaging studies. It is commonly conducted using standard computational tools, such as FSL, SPM and multi-atlas segmentation, which are often registration-based and suffer from high computational cost. Recently, there has been increased interest in using deep neural networks for brain image segmentation, which have demonstrated advantages in both speed and performance. However, neural network-based approaches normally require a large amount of manual annotation for optimising the massive number of network parameters. For 3D networks used in volumetric image segmentation, this has become a particular challenge, as a 3D network consists of many more parameters than its 2D counterpart. Manual annotation of 3D brain images is extremely time-consuming and requires extensive involvement of trained experts. To address the challenge of limited manual annotations, here we propose a novel multi-task learning framework for brain image segmentation, which utilises a large amount of automatically generated partial annotations together with a small set of manually created full annotations for network training. Our method yields a high performance comparable to state-of-the-art methods for whole brain segmentation.
24.Self-supervised Recurrent Neural Network for 4D Abdominal and In-utero MR Imaging ⬇️
Accurately estimating and correcting motion artifacts is crucial for 3D image reconstruction of abdominal and in-utero magnetic resonance imaging (MRI). The state-of-the-art methods are based on slice-to-volume registration (SVR), where multiple 2D image stacks are acquired in three orthogonal orientations. In this work, we present a novel reconstruction pipeline that only needs one orientation of 2D MRI scans and can reconstruct the full high-resolution image without masking or registration steps. The framework consists of two main stages: respiratory motion estimation using a self-supervised recurrent neural network, which learns the respiratory signals that are naturally embedded in the asymmetry relationship of neighbouring slices and clusters them according to respiratory state; and super-resolution (SR) reconstruction, in which we train a 3D deconvolutional network on the sparsely selected 2D images using an integrated reconstruction and total variation loss. We evaluate the classification accuracy on 5 simulated images and compare our results with the SVR method on adult abdominal and in-utero MRI scans. The results show that the proposed pipeline can accurately estimate the respiratory state and reconstruct 4D SR volumes with better or similar performance to the 3D SVR pipeline with less than 20% of sparsely selected slices. The method has great potential to transform 4D abdominal and in-utero MRI in clinical practice.
25.SMART tracking: Simultaneous anatomical imaging and real-time passive device tracking for MR-guided interventions ⬇️
Purpose: This study demonstrates a proof of concept of a method for simultaneous anatomical imaging and real-time (SMART) passive device tracking for MR-guided interventions.
Methods: Phase Correlation template matching was combined with a fast undersampled radial multi-echo acquisition using the white marker phenomenon after the first echo. In this way, the first echo provides anatomical contrast, whereas the other echoes provide white marker contrast to allow accurate device localization using fast simulations and template matching. This approach was tested on tracking of five 0.5 mm steel markers in an agarose phantom and on insertion of an MRI-compatible 20 Gauge titanium needle in ex vivo porcine tissue. The locations of the steel markers were quantitatively compared to the marker locations as found on a CT scan of the same phantom.
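A minimal sketch of phase correlation template matching, assuming the template and the acquired patch have the same size; it returns the translation at the peak of the inverse FFT of the normalized cross-power spectrum. This is a generic illustration, not the paper's combined white-marker pipeline.

```python
import numpy as np

def phase_correlation(image, template, eps=1e-8):
    """Locate `template` in `image` (same shape) via the normalized cross-power spectrum."""
    F1 = np.fft.fft2(image)
    F2 = np.fft.fft2(template)
    cross_power = F1 * np.conj(F2)
    cross_power /= (np.abs(cross_power) + eps)      # keep phase only
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Wrap shifts larger than half the image back to negative offsets.
    shifts = [p - s if p > s // 2 else p for p, s in zip(peak, correlation.shape)]
    return tuple(shifts)
```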
Results: The average pairwise error between the MRI and CT locations was 0.30 mm for tracking of stationary steel spheres and 0.29 mm during motion. Qualitative evaluation of the tracking of needle insertions showed that tracked positions were stable throughout needle insertion and retraction.
Conclusions: The proposed SMART tracking method provided accurate passive tracking of devices at high framerates, inclusion of real-time anatomical scanning, and the capability of automatic slice positioning. Furthermore, the method does not require specialized hardware and could therefore be applied to track any rigid metal device that causes appreciable magnetic field distortions.
26.Method and System for Image Analysis to Detect Cancer ⬇️
Breast cancer is the most common cancer and is the leading cause of cancer death among women worldwide. Detection of breast cancer, while it is still small and confined to the breast, provides the best chance of effective treatment. Computer Aided Detection (CAD) systems that detect cancer from mammograms will help in reducing the human errors that lead to missing breast carcinoma. The literature is rich in scientific papers on methods of CAD design, yet lacks a complete system architecture to deploy those methods. On the other hand, commercial CADs are developed and deployed only on vendors' mammography machines, with no public access. This paper presents a complete CAD; it is complete since it combines, on one hand, the rigor of algorithm design and assessment (method), and, on the other hand, the implementation and deployment of a system architecture for public accessibility (system). (1) We develop a novel algorithm for image enhancement so that mammograms acquired from any digital mammography machine look qualitatively of the same clarity to radiologists' inspection and are quantitatively standardized for the detection algorithms. (2) We develop novel algorithms for mass and microcalcification detection with accuracy superior to both literature results and the majority of approved commercial systems. (3) We design, implement, and deploy a system architecture that is computationally effective to allow for deploying these algorithms to the cloud for public access.
27.CAMEL: A Weakly Supervised Learning Framework for Histopathology Image Segmentation ⬇️
Histopathology image analysis plays a critical role in cancer diagnosis and treatment. To automatically segment the cancerous regions, fully supervised segmentation algorithms require labor-intensive and time-consuming labeling at the pixel level. In this research, we propose CAMEL, a weakly supervised learning framework for histopathology image segmentation using only image-level labels. Using multiple instance learning (MIL)-based label enrichment, CAMEL splits the image into latticed instances and automatically generates instance-level labels. After label enrichment, the instance-level labels are further assigned to the corresponding pixels, producing the approximate pixel-level labels and making fully supervised training of segmentation models possible. CAMEL achieves comparable performance with the fully supervised approaches in both instance-level classification and pixel-level segmentation on CAMELYON16 and a colorectal adenoma dataset. Moreover, the generality of the automatic labeling methodology may benefit future weakly supervised learning studies for histopathology image analysis.
28.Heterogeneous Domain Adaptation via Soft Transfer Network ⬇️
Heterogeneous domain adaptation (HDA) aims to facilitate the learning task in a target domain by borrowing knowledge from a heterogeneous source domain. In this paper, we propose a Soft Transfer Network (STN), which jointly learns a domain-shared classifier and a domain-invariant subspace in an end-to-end manner, for addressing the HDA problem. The proposed STN not only aligns the discriminative directions of domains but also matches both the marginal and conditional distributions across domains. To circumvent negative transfer, STN aligns the conditional distributions by using a soft-label strategy for unlabeled target data, which prevents the hard assignment of each unlabeled target sample to a single, possibly incorrect category. Further, STN introduces an adaptive coefficient to gradually increase the importance of the soft labels, since they become more and more accurate as the number of iterations increases. We perform experiments on the transfer tasks of image-to-image, text-to-image, and text-to-text. Experimental results testify that STN significantly outperforms several state-of-the-art approaches.
29.A Coarse-to-Fine Multi-stream Hybrid Deraining Network for Single Image Deraining ⬇️
Single image deraining is still a very challenging task due to its ill-posed nature in reality. Recently, researchers have tried to fix this issue by training CNN-based end-to-end models, but they still cannot extract the negative rain streaks from rainy images precisely, which usually leads to an over de-rained or under de-rained result. To handle this issue, this paper proposes a new coarse-to-fine single image deraining framework termed Multi-stream Hybrid Deraining Network (shortly, MH-DerainNet). To obtain the negative rain streaks during the training process more accurately, we present a new module named dual path residual dense block, i.e., a Residual path and a Dense path. The Residual path is used to reuse common features from the previous layers while the Dense path can explore new features. In addition, to concatenate differently scaled features, we also apply the idea of multi-stream with shortcuts between cascaded dual path residual dense block based streams. To obtain more distinct derained images, we combine the SSIM loss and perceptual loss to preserve the per-pixel similarity as well as the global structures so that the deraining result is more accurate. Extensive experiments on both synthetic and real rainy images demonstrate that our MH-DerainNet can deliver significant improvements over several recent state-of-the-art methods.
30.O-MedAL: Online Active Deep Learning for Medical Image Analysis ⬇️
Active Learning methods create an optimized and labeled training set from unlabeled data. We introduce a novel Online Active Deep Learning method for Medical Image Analysis, extending our MedAL active learning framework with new results in this paper. Experiments on three medical image datasets show that our novel online active learning model requires significantly fewer labelings, is more accurate, and is more robust to class imbalances than existing methods. Our method is also more accurate and computationally efficient than the baseline model. Compared to random sampling and uncertainty sampling, the method uses 275 and 200 (out of 768) fewer labeled examples, respectively. For Diabetic Retinopathy detection, our method attains a 5.88% accuracy improvement over the baseline model when 80% of the dataset is labeled, and the model reaches baseline accuracy when only 40% is labeled.
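One ingredient of such a loop is an acquisition step that picks the most informative unlabeled examples; the entropy-based sketch below is a generic illustration of uncertainty sampling, not the paper's exact acquisition strategy.

```python
import numpy as np

def select_most_uncertain(probs, k=16):
    """Pick the k unlabeled examples with the highest predictive entropy."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-k:]

# Toy usage: softmax outputs of the current model on an unlabeled pool
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
to_label = select_most_uncertain(probs, k=32)   # indices to send to the annotator
```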
31.Revealing Backdoors, Post-Training, in DNN Classifiers via Novel Inference on Optimized Perturbations Inducing Group Misclassification ⬇️
Recently, a special type of data poisoning (DP) attack targeting Deep Neural Network (DNN) classifiers, known as a backdoor, was proposed. These attacks do not seek to degrade classification accuracy, but rather to have the classifier learn to classify to a target class whenever the backdoor pattern is present in a test example. Launching backdoor attacks does not require knowledge of the classifier or its training process - it only needs the ability to poison the training set with (a sufficient number of) exemplars containing a sufficiently strong backdoor pattern (labeled with the target class). Here we address post-training detection of backdoor attacks in DNN image classifiers, seldom considered in existing works, wherein the defender does not have access to the poisoned training set, but only to the trained classifier itself, as well as to clean examples from the classification domain. This is an important scenario because a trained classifier may be the basis of e.g. a phone app that will be shared with many users. Detecting backdoors post-training may thus reveal a widespread attack. We propose a purely unsupervised anomaly detection (AD) defense against imperceptible backdoor attacks that: i) detects whether the trained DNN has been backdoor-attacked; ii) infers the source and target classes involved in a detected attack; iii) we even demonstrate it is possible to accurately estimate the backdoor pattern. We test our AD approach, in comparison with alternative defenses, for several backdoor patterns, data sets, and attack settings and demonstrate its favorability. Our defense essentially requires setting a single hyperparameter (the detection threshold), which can e.g. be chosen to fix the system's false positive rate.
32.Domain-Agnostic Learning with Anatomy-Consistent Embedding for Cross-Modality Liver Segmentation ⬇️
Domain Adaptation (DA) has the potential to greatly help the generalization of deep learning models. However, the current literature usually assumes that knowledge is transferred from the source domain to a specific known target domain. Domain Agnostic Learning (DAL) proposes a new task of transferring knowledge from the source domain to data from multiple heterogeneous target domains. In this work, we propose the Domain-Agnostic Learning framework with Anatomy-Consistent Embedding (DALACE) that works on both domain-transfer and task-transfer to learn a disentangled representation, aiming to not only be invariant to different modalities but also preserve anatomical structures for the DA and DAL tasks in cross-modality liver segmentation. We validated and compared our model with state-of-the-art methods, including CycleGAN, Task Driven Generative Adversarial Network (TD-GAN), and Domain Adaptation via Disentangled Representations (DADR). For the DA task, our DALACE model outperformed CycleGAN, TD-GAN, and DADR with DSC of 0.847 compared to 0.721, 0.793 and 0.806. For the DAL task, our model improved the performance with DSC of 0.794 from 0.522, 0.719 and 0.742 by CycleGAN, TD-GAN, and DADR. Further, we visualized the success of disentanglement, which added human interpretability of the learned meaningful representations. Through ablation analysis, we specifically showed the concrete benefits of disentanglement for downstream tasks and the role of supervision for better disentangled representation with segmentation consistency to be invariant to domains with the proposed Domain-Agnostic Module (DAM) and to preserve anatomical information with the proposed Anatomy-Preserving Module (APM).
33.Adversarial regression training for visualizing the progression of chronic obstructive pulmonary disease with chest x-rays ⬇️
Knowledge of what spatial elements of medical images deep learning methods use as evidence is important for model interpretability, trustworthiness, and validation. There is a lack of such techniques for models in regression tasks. We propose a method, called visualization for regression with a generative adversarial network (VR-GAN), for formulating adversarial training specifically for datasets containing regression target values characterizing disease severity. We use a conditional generative adversarial network where the generator attempts to learn to shift the output of a regressor through creating disease effect maps that are added to the original images. Meanwhile, the regressor is trained to predict the original regression value for the modified images. A model trained with this technique learns to provide visualizations of how the image would appear at different stages of the disease. We analyze our method on a dataset of chest x-rays associated with pulmonary function tests, used for diagnosing chronic obstructive pulmonary disease (COPD). For validation, we compute the difference of two registered x-rays of the same patient at different time points and correlate it to the generated disease effect map. The proposed method outperforms a technique based on classification and provides realistic-looking images, making modifications to images following what radiologists usually observe for this disease. Implementation code is available at this https URL.
34.Embracing Imperfect Datasets: A Review of Deep Learning Solutions for Medical Image Segmentation ⬇️
The medical imaging literature has witnessed remarkable progress in high-performing segmentation models based on convolutional neural networks. Despite the new performance highs, the recent advanced segmentation models still require large, representative, and high quality annotated datasets. However, rarely do we have a perfect training dataset, particularly in the field of medical imaging, where data and annotations are both expensive to acquire. Recently, a large body of research has studied the problem of medical image segmentation with imperfect datasets, tackling two major dataset limitations: scarce annotations where only limited annotated data is available for training, and weak annotations where the training data has only sparse annotations, noisy annotations, or image-level annotations. In this article, we provide a detailed review of the solutions above, summarizing both the technical novelties and empirical results. We further compare the benefits and requirements of the surveyed methodologies and provide our recommended solutions to the problems of scarce and weak annotations. We hope this review increases the community awareness of the techniques to handle imperfect datasets.
35.Complex Deep Learning Models for Denoising of Human Heart ECG signals ⬇️
Effective and powerful methods for denoising real electrocardiogram (ECG) signals are important for wearable sensors and devices. Deep Learning (DL) models have been used extensively in image processing and other domains with great success, but only very recently have been used in processing ECG signals. This paper presents several DL models, namely Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), and Restricted Boltzmann Machine (RBM), together with the more conventional filtering methods (low pass filtering, high pass filtering, Notch filtering) and the standard wavelet-based technique, for denoising ECG signals. These methods are trained, tested and evaluated on different synthetic and real ECG datasets taken from the MIT PhysioNet database and for different simulation conditions (i.e. various lengths of the ECG signals, single or multiple records). The results show the CNN model is a performant model that can be used for off-line ECG denoising applications where it is satisfactory to train on a clean part of an ECG signal from an ECG record, and then to test on the same ECG signal, which would have some high level of noise added to it. However, for real-time or near-real-time applications, this task becomes more cumbersome, as the clean part of an ECG signal is likely to be very limited in size. Therefore, the solution put forth in this work is to train a CNN model on 1-second noisy artificial multi-heartbeat ECG data (i.e. ECG at effort), generated initially from a few sequences of real heartbeat ECG data (i.e. ECG at rest). The trained CNN model can then be used in real-life situations to denoise ECG signals.
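A minimal sketch of a 1D convolutional denoiser trained on noisy/clean segment pairs, in the spirit of the CNN model described; the architecture, sizes, and synthetic data below are illustrative assumptions, not the paper's models or the PhysioNet recordings.

```python
import torch
import torch.nn as nn

class ECGDenoiser(nn.Module):
    """Small 1D CNN mapping a noisy ECG segment to a clean one (illustrative sizes)."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=9, padding=4))

    def forward(self, x):          # x: (batch, 1, samples)
        return self.net(x)

# One training step on synthetic noisy/clean pairs (stand-ins for real ECG data)
model, loss_fn = ECGDenoiser(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.sin(torch.linspace(0, 20, 360)).repeat(8, 1, 1)   # fake 1-second segments
noisy = clean + 0.1 * torch.randn_like(clean)
opt.zero_grad(); loss = loss_fn(model(noisy), clean); loss.backward(); opt.step()
```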
36.Visualization of Very Large High-Dimensional Data Sets as Minimum Spanning Trees ⬇️
Here, we introduce a new data visualization and exploration method, TMAP (tree-map), which exploits locality sensitive hashing, Kruskal's minimum-spanning-tree algorithm, and a multilevel multipole-based graph layout algorithm to represent large and high-dimensional data sets as a tree structure, which is readily understandable and explorable. Compared to other data visualization methods such as t-SNE or UMAP, TMAP increases the size of data sets that can be visualized due to its significantly lower memory requirements and running time and should find broad applicability in the age of big data. We exemplify TMAP in the area of cheminformatics with interactive maps for 1.16 million drug-like molecules from ChEMBL, 10.1 million small molecule fragments from FDB17, and 131 thousand 3D structures of biomolecules from the PDB Databank, and to visualize data from literature (GUTENBERG data set), cancer biology (PANSCAN data set) and particle physics (MiniBooNE data set). TMAP is available as a Python package. Installation, usage instructions and application examples can be found at http://tmap.gdb.tools.
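A minimal sketch of the tree-construction idea, assuming an exact k-nearest-neighbor graph in place of TMAP's LSH Forest and omitting the multilevel layout; the data are random stand-ins for molecular fingerprints.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.neighbors import kneighbors_graph

# Toy high-dimensional data standing in for molecular fingerprints
rng = np.random.default_rng(0)
X = rng.random((500, 256))

# Sparse k-NN graph (TMAP would use an LSH Forest here for scalability), then its MST
knn = kneighbors_graph(X, n_neighbors=10, mode="distance")
mst = minimum_spanning_tree(knn)
edges = np.transpose(mst.nonzero())          # tree edges to feed into a graph layout
print(f"{edges.shape[0]} edges connect {X.shape[0]} points")
```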