1.Vision-Language Pre-Training with Triple Contrastive Learning ⬇️
Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to its capability in maximizing the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores data potential within each modality, which may result in degraded representations. For instance, although CMA-based models are able to map image-text pairs close together in the embedding space, they fail to ensure that similar inputs from the same modality stay close by. This problem can get even worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average MI between local regions of image/text and their global summary. To the best of our knowledge, ours is the first work that takes into account local structure information for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieve the new state of the art on various common down-stream vision-language tasks such as image-text retrieval and visual question answering.
2.Self-Supervised Bulk Motion Artifact Removal in Optical Coherence Tomography Angiography ⬇️
Optical coherence tomography angiography (OCTA) is an important imaging modality in many bioengineering tasks. The image quality of OCTA, however, is often hurt by Bulk Motion Artifacts (BMA), which are due to micromotion of subjects and typically appear as bright stripes surrounded by blurred areas. State-of-the-art BMA handling solutions usually treat the problem as an image inpainting one with deep neural network algorithms. These solutions, however, require numerous training samples with nontrivial annotation. Nevertheless, this context-based inpainting model has limited correction capability because it discards the rich structural and appearance information carried in the BMA stripe region. To address these issues, in this paper we propose a self-supervised content-aware BMA recover model. First, the gradient-based structural information and appearance feature are extracted from the BMA area and injected into the model to capture more connectivity. Second, with easily collected defective masks, the model is trained in a self-supervised manner that only the clear areas are for training while the BMA areas for inference. With structural information and appearance feature from noisy image as references, our model could correct larger BMA and produce better visualizing result. Only 2D images with defective masks are involved so our method is more efficient. Experiments on OCTA of mouse cortex demonstrate that our model could correct most BMA with extremely large sizes and inconsistent intensities while existing methods fail.
3.On the Evaluation of RGB-D-based Categorical Pose and Shape Estimation ⬇️
Recently, various methods for 6D pose and shape estimation of objects have been proposed. Typically, these methods evaluate their pose estimation in terms of average precision, and reconstruction quality with chamfer distance. In this work we take a critical look at this predominant evaluation protocol including metrics and datasets. We propose a new set of metrics, contribute new annotations for the Redwood dataset and evaluate state-of-the-art methods in a fair comparison. We find that existing methods do not generalize well to unconstrained orientations, and are actually heavily biased towards objects being upright. We contribute an easy-to-use evaluation toolbox with well-defined metrics, method and dataset interfaces, which readily allows evaluation and comparison with various state-of-the-art approaches (see this https URL ).
4.VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning ⬇️
We propose a simple but powerful data-driven framework for solving highly challenging visual deep reinforcement learning (DRL) tasks. We analyze a number of major obstacles in taking a data-driven approach, and present a suite of design principles, training strategies, and critical insights about data-driven visual DRL. Our framework has three stages: in stage 1, we leverage non-RL datasets (e.g. ImageNet) to learn task-agnostic visual representations; in stage 2, we use offline RL data (e.g. a limited number of expert demonstrations) to convert the task-agnostic representations into more powerful task-specific representations; in stage 3, we fine-tune the agent with online RL. On a set of highly challenging hand manipulation tasks with sparse reward and realistic visual inputs, our framework learns 370%-1200% faster than the previous SOTA method while using an encoder that is 50 times smaller, fully demonstrating the potential of data-driven deep reinforcement learning.
5.AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation ⬇️
As a specific semantic segmentation task, aerial imagery segmentation has been widely employed in high spatial resolution (HSR) remote sensing images understanding. Besides common issues (e.g. large scale variation) faced by general semantic segmentation tasks, aerial imagery segmentation has some unique challenges, the most critical one among which lies in foreground-background imbalance. There have been some recent efforts that attempt to address this issue by proposing sophisticated neural network architectures, since they can be used to extract informative multi-scale feature representations and increase the discrimination of object boundaries. Nevertheless, many of them merely utilize those multi-scale representations in ad-hoc measures but disregard the fact that the semantic meaning of objects with various sizes could be better identified via receptive fields of diverse ranges. In this paper, we propose Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations generated by widely adopted neural network architectures. Particularly, a learnable module, called Adaptive Confidence Mechanism (ACM), is proposed to determine which scale of representation should be used for the segmentation of different objects. Comprehensive experiments show that AF$_2$ has significantly improved the accuracy on three widely used aerial benchmarks, as fast as the mainstream method.
6.Resurrecting Trust in Facial Recognition: Mitigating Backdoor Attacks in Face Recognition to Prevent Potential Privacy Breaches ⬇️
Biometric data, such as face images, are often associated with sensitive information (e.g medical, financial, personal government records). Hence, a data breach in a system storing such information can have devastating consequences. Deep learning is widely utilized for face recognition (FR); however, such models are vulnerable to backdoor attacks executed by malicious parties. Backdoor attacks cause a model to misclassify a particular class as a target class during recognition. This vulnerability can allow adversaries to gain access to highly sensitive data protected by biometric authentication measures or allow the malicious party to masquerade as an individual with higher system permissions. Such breaches pose a serious privacy threat. Previous methods integrate noise addition mechanisms into face recognition models to mitigate this issue and improve the robustness of classification against backdoor attacks. However, this can drastically affect model accuracy. We propose a novel and generalizable approach (named BA-BAM: Biometric Authentication - Backdoor Attack Mitigation), that aims to prevent backdoor attacks on face authentication deep learning models through transfer learning and selective image perturbation. The empirical evidence shows that BA-BAM is highly robust and incurs a maximal accuracy drop of 2.4%, while reducing the attack success rate to a maximum of 20%. Comparisons with existing approaches show that BA-BAM provides a more practical backdoor mitigation approach for face recognition.
7.Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion ⬇️
Recently, segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field, because of their superiority in detecting the text instances of arbitrary shapes and extreme aspect ratios, profiting from the pixel-level descriptions. However, the vast majority of the existing segmentation-based approaches are limited to their complex post-processing algorithms and the scale robustness of their segmentation models, where the post-processing algorithms are not only isolated to the model optimization but also time-consuming and the scale robustness is usually strengthened by fusing multi-scale feature maps directly. In this paper, we propose a Differentiable Binarization (DB) module that integrates the binarization process, one of the most important steps in the post-processing procedure, into a segmentation network. Optimized along with the proposed DB module, the segmentation network can produce more accurate results, which enhances the accuracy of text detection with a simple pipeline. Furthermore, an efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively. By incorporating the proposed DB and ASF with the segmentation network, our proposed scene text detector consistently achieves state-of-the-art results, in terms of both detection accuracy and speed, on five standard benchmarks.
8.VLAD-VSA: Cross-Domain Face Presentation Attack Detection with Vocabulary Separation and Adaptation ⬇️
For face presentation attack detection (PAD), most of the spoofing cues are subtle, local image patterns (e.g., local image distortion, 3D mask edge and cut photo edges). The representations of existing PAD works with simple global pooling method, however, lose the local feature discriminability. In this paper, the VLAD aggregation method is adopted to quantize local features with visual vocabulary locally partitioning the feature space, and hence preserve the local discriminability. We further propose the vocabulary separation and adaptation method to modify VLAD for cross-domain PADtask. The proposed vocabulary separation method divides vocabulary into domain-shared and domain-specific visual words to cope with the diversity of live and attack faces under the cross-domain scenario. The proposed vocabulary adaptation method imitates the maximization step of the k-means algorithm in the end-to-end training, which guarantees the visual words be close to the center of assigned local features and thus brings robust similarity measurement. We give illustrations and extensive experiments to demonstrate the effectiveness of VLAD with the proposed vocabulary separation and adaptation method on standard cross-domain PAD benchmarks. The codes are available at this https URL.
9.A Comprehensive Evaluation on Multi-channel Biometric Face Presentation Attack Detection ⬇️
The vulnerability against presentation attacks is a crucial problem undermining the wide-deployment of face recognition systems. Though presentation attack detection (PAD) systems try to address this problem, the lack of generalization and robustness continues to be a major concern. Several works have shown that using multi-channel PAD systems could alleviate this vulnerability and result in more robust systems. However, there is a wide selection of channels available for a PAD system such as RGB, Near Infrared, Shortwave Infrared, Depth, and Thermal sensors. Having a lot of sensors increases the cost of the system, and therefore an understanding of the performance of different sensors against a wide variety of attacks is necessary while selecting the modalities. In this work, we perform a comprehensive study to understand the effectiveness of various imaging modalities for PAD. The studies are performed on a multi-channel PAD dataset, collected with 14 different sensing modalities considering a wide range of 2D, 3D, and partial attacks. We used the multi-channel convolutional network-based architecture, which uses pixel-wise binary supervision. The model has been evaluated with different combinations of channels, and different image qualities on a variety of challenging known and unknown attack protocols. The results reveal interesting trends and can act as pointers for sensor selection for safety-critical presentation attack detection systems. The source codes and protocols to reproduce the results are made available publicly making it possible to extend this work to other architectures.
10.End-to-End High Accuracy License Plate Recognition Based on Depthwise Separable Convolution Networks ⬇️
Automatic license plate recognition plays a crucial role in modern transportation systems such as for traffic monitoring and vehicle violation detection. In real-world scenarios, license plate recognition still faces many challenges and is impaired by unpredictable interference such as weather or lighting conditions. Many machine learning based ALPR solutions have been proposed to solve such challenges in recent years. However, most are not convincing, either because their results are evaluated on small or simple datasets that lack diverse surroundings, or because they require powerful hardware to achieve a reasonable frames-per-second in real-world applications. In this paper, we propose a novel segmentation-free framework for license plate recognition and introduce NP-ALPR, a diverse and challenging dataset which resembles real-world scenarios. The proposed network model consists of the latest deep learning methods and state-of-the-art ideas, and benefits from a novel network architecture. It achieves higher accuracy with lower computational requirements than previous works. We evaluate the effectiveness of the proposed method on three different datasets and show a recognition accuracy of over 99% and over 70 fps, demonstrating that our method is not only robust but also computationally efficient.
11.A Self-Supervised Descriptor for Image Copy Detection ⬇️
Image copy detection is an important task for content moderation. We introduce SSCD, a model that builds on a recent self-supervised contrastive training objective. We adapt this method to the copy detection task by changing the architecture and training objective, including a pooling operator from the instance matching literature, and adapting contrastive learning to augmentations that combine images.
Our approach relies on an entropy regularization term, promoting consistent separation between descriptor vectors, and we demonstrate that this significantly improves copy detection accuracy. Our method produces a compact descriptor vector, suitable for real-world web scale applications. Statistical information from a background image distribution can be incorporated into the descriptor.
On the recent DISC2021 benchmark, SSCD is shown to outperform both baseline copy detection models and self-supervised architectures designed for image classification by huge margins, in all settings. For example, SSCD out-performs SimCLR descriptors by 48% absolute.
12.PointSCNet: Point Cloud Structure and Correlation Learning Based on Space Filling Curve-Guided Sampling ⬇️
Geometrical structures and the internal local region relationship, such as symmetry, regular array, junction, etc., are essential for understanding a 3D shape. This paper proposes a point cloud feature extraction network named PointSCNet, to capture the geometrical structure information and local region correlation information of a point cloud. The PointSCNet consists of three main modules: the space-filling curve-guided sampling module, the information fusion module, and the channel-spatial attention module. The space-filling curve-guided sampling module uses Z-order curve coding to sample points that contain geometrical correlation. The information fusion module uses a correlation tensor and a set of skip connections to fuse the structure and correlation information. The channel-spatial attention module enhances the representation of key points and crucial feature channels to refine the network. The proposed PointSCNet is evaluated on shape classification and part segmentation tasks. The experimental results demonstrate that the PointSCNet outperforms or is on par with state-of-the-art methods by learning the structure and correlation of point clouds effectively.
13.Rethinking the Zigzag Flattening for Image Reading ⬇️
Sequence ordering of word vector matters a lot to text reading, which has been proven in natural language processing (NLP). However, the rule of different sequence ordering in computer vision (CV) was not well explored, e.g., why the "zigzag" flattening (ZF) is commonly utilized as a default option to get the image patches ordering in vision transformers (ViTs). Notably, when decomposing multi-scale images, the ZF could not maintain the invariance of feature point positions. To this end, we investigate the Hilbert fractal flattening (HF) as another method for sequence ordering in CV and contrast it against ZF. The HF has proven to be superior to other curves in maintaining spatial locality, when performing multi-scale transformations of dimensional space. And it can be easily plugged into most deep neural networks (DNNs). Extensive experiments demonstrate that it can yield consistent and significant performance boosts for a variety of architectures. Finally, we hope that our studies spark further research about the flattening strategy of image reading.
14.Edge Data Based Trailer Inception Probabilistic Matrix Factorization for Context-Aware Movie Recommendation ⬇️
The rapid growth of edge data generated by mobile devices and applications deployed at the edge of the network has exacerbated the problem of information overload. As an effective way to alleviate information overload, recommender system can improve the quality of various services by adding application data generated by users on edge devices, such as visual and textual information, on the basis of sparse rating data. The visual information in the movie trailer is a significant part of the movie recommender system. However, due to the complexity of visual information extraction, data sparsity cannot be remarkably alleviated by merely using the rough visual features to improve the rating prediction accuracy. Fortunately, the convolutional neural network can be used to extract the visual features precisely. Therefore, the end-to-end neural image caption (NIC) model can be utilized to obtain the textual information describing the visual features of movie trailers. This paper proposes a trailer inception probabilistic matrix factorization model called Ti-PMF, which combines NIC, recurrent convolutional neural network, and probabilistic matrix factorization models as the rating prediction model. We implement the proposed Ti-PMF model with extensive experiments on three real-world datasets to validate its effectiveness. The experimental results illustrate that the proposed Ti-PMF outperforms the existing ones.
15.Offline Text-Independent Writer Identification based on word level data ⬇️
This paper proposes a novel scheme to identify the authorship of a document based on handwritten input word images of an individual. Our approach is text-independent and does not place any restrictions on the size of the input word images under consideration. To begin with, we employ the SIFT algorithm to extract multiple key points at various levels of abstraction (comprising allograph, character, or combination of characters). These key points are then passed through a trained CNN network to generate feature maps corresponding to a convolution layer. However, owing to the scale corresponding to the SIFT key points, the size of a generated feature map may differ. As an alleviation to this issue, the histogram of gradients is applied on the feature map to produce a fixed representation. Typically, in a CNN, the number of filters of each convolution block increase depending on the depth of the network. Thus, extracting histogram features for each of the convolution feature map increase the dimension as well as the computational load. To address this aspect, we use an entropy-based method to learn the weights of the feature maps of a particular CNN layer during the training phase of our algorithm. The efficacy of our proposed system has been demonstrated on two publicly available databases namely CVL and IAM. We empirically show that the results obtained are promising when compared with previous works.
16.Learning Multiple Explainable and Generalizable Cues for Face Anti-spoofing ⬇️
Although previous CNN based face anti-spoofing methods have achieved promising performance under intra-dataset testing, they suffer from poor generalization under cross-dataset testing. The main reason is that they learn the network with only binary supervision, which may learn arbitrary cues overfitting on the training dataset. To make the learned feature explainable and more generalizable, some researchers introduce facial depth and reflection map as the auxiliary supervision. However, many other generalizable cues are unexplored for face anti-spoofing, which limits their performance under cross-dataset testing. To this end, we propose a novel framework to learn multiple explainable and generalizable cues (MEGC) for face anti-spoofing. Specifically, inspired by the process of human decision, four mainly used cues by humans are introduced as auxiliary supervision including the boundary of spoof medium, moiré pattern, reflection artifacts and facial depth in addition to the binary supervision. To avoid extra labelling cost, corresponding synthetic methods are proposed to generate these auxiliary supervision maps. Extensive experiments on public datasets validate the effectiveness of these cues, and state-of-the-art performances are achieved by our proposed method.
17.Cell nuclei classification in histopathological images using hybrid OLConvNet ⬇️
Computer-aided histopathological image analysis for cancer detection is a major research challenge in the medical domain. Automatic detection and classification of nuclei for cancer diagnosis impose a lot of challenges in developing state of the art algorithms due to the heterogeneity of cell nuclei and data set variability. Recently, a multitude of classification algorithms has used complex deep learning models for their dataset. However, most of these methods are rigid and their architectural arrangement suffers from inflexibility and non-interpretability. In this research article, we have proposed a hybrid and flexible deep learning architecture OLConvNet that integrates the interpretability of traditional object-level features and generalization of deep learning features by using a shallower Convolutional Neural Network (CNN) named as
$CNN_{3L}$ .$CNN_{3L}$ reduces the training time by training fewer parameters and hence eliminating space constraints imposed by deeper algorithms. We used F1-score and multiclass Area Under the Curve (AUC) performance parameters to compare the results. To further strengthen the viability of our architectural approach, we tested our proposed methodology with state of the art deep learning architectures AlexNet, VGG16, VGG19, ResNet50, InceptionV3, and DenseNet121 as backbone networks. After a comprehensive analysis of classification results from all four architectures, we observed that our proposed model works well and perform better than contemporary complex algorithms.
18.Synthetic CT Skull Generation for Transcranial MR Imaging-Guided Focused Ultrasound Interventions with Conditional Adversarial Networks ⬇️
Transcranial MRI-guided focused ultrasound (TcMRgFUS) is a therapeutic ultrasound method that focuses sound through the skull to a small region noninvasively under MRI guidance. It is clinically approved to thermally ablate regions of the thalamus and is being explored for other therapies, such as blood brain barrier opening and neuromodulation. To accurately target ultrasound through the skull, the transmitted waves must constructively interfere at the target region. However, heterogeneity of the sound speed, density, and ultrasound attenuation in different individuals' skulls requires patient-specific estimates of these parameters for optimal treatment planning. CT imaging is currently the gold standard for estimating acoustic properties of an individual skull during clinical procedures, but CT imaging exposes patients to radiation and increases the overall number of imaging procedures required for therapy. A method to estimate acoustic parameters in the skull without the need for CT would be desirable. Here, we synthesized CT images from routinely acquired T1-weighted MRI by using a 3D patch-based conditional generative adversarial network and evaluated the performance of synthesized CT images for treatment planning with transcranial focused ultrasound. We compared the performance of synthetic CT to real CT images using Kranion and k-Wave acoustic simulation. Our work demonstrates the feasibility of replacing real CT with the MR-synthesized CT for TcMRgFUS planning.
19.A Smoothing and Thresholding Image Segmentation Framework with Weighted Anisotropic-Isotropic Total Variation ⬇️
In this paper, we propose a multi-stage image segmentation framework that incorporates a weighted difference of anisotropic and isotropic total variation (AITV). The segmentation framework generally consists of two stages: smoothing and thresholding, thus referred to as SaT. In the first stage, a smoothed image is obtained by an AITV-regularized Mumford-Shah (MS) model, which can be solved efficiently by the alternating direction method of multipliers (ADMM) with a closed-form solution of a proximal operator of the
$\ell_1 -\alpha \ell_2$ regularizer. Convergence of the ADMM algorithm is analyzed. In the second stage, we threshold the smoothed image by$k$ -means clustering to obtain the final segmentation result. Numerical experiments demonstrate that the proposed segmentation framework is versatile for both grayscale and color images, efficient in producing high-quality segmentation results within a few seconds, and robust to input images that are corrupted with noise, blur, or both. We compare the AITV method with its original convex and nonconvex TV$^p (0<p<1)$ counterparts, showcasing the qualitative and quantitative advantages of our proposed method.
20.ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond ⬇️
Vision transformers have shown great potential in various computer vision tasks owing to their strong capability to model long-range dependency using the self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance, which is instead learned implicitly from large-scale training data with longer training schedules. In this paper, we propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale invariance IB and can learn robust feature representation for objects at various scales. Moreover, in each transformer layer, ViTAE has a convolution block parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network. Consequently, it has the intrinsic locality IB and is able to learn local features and global dependencies collaboratively. The proposed two kinds of cells are stacked in both isotropic and multi-stage manners to formulate two families of ViTAE models, i.e., the vanilla ViTAE and ViTAEv2. Experiments on the ImageNet dataset as well as downstream tasks on the MS COCO, ADE20K, and AP10K datasets validate the superiority of our models over the baseline transformer models and concurrent works. Besides, we scale up our ViTAE model to 644M parameters and obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set, without using extra private data.
21.Simplified Learning of CAD Features Leveraging a Deep Residual Autoencoder ⬇️
In the domain of computer vision, deep residual neural networks like EfficientNet have set new standards in terms of robustness and accuracy. One key problem underlying the training of deep neural networks is the immanent lack of a sufficient amount of training data. The problem worsens especially if labels cannot be generated automatically, but have to be annotated manually. This challenge occurs for instance if expert knowledge related to 3D parts should be externalized based on example models. One way to reduce the necessary amount of labeled data may be the use of autoencoders, which can be learned in an unsupervised fashion without labeled data. In this work, we present a deep residual 3D autoencoder based on the EfficientNet architecture, intended for transfer learning tasks related to 3D CAD model assessment. For this purpose, we adopted EfficientNet to 3D problems like voxel models derived from a STEP file. Striving to reduce the amount of labeled 3D data required, the networks encoder can be utilized for transfer training.
22.Point Cloud Denoising via Momentum Ascent in Gradient Fields ⬇️
To achieve point cloud denoising, traditional methods heavily rely on geometric priors, and most learning-based approaches suffer from outliers and loss of details. Recently, the gradient-based method was proposed to estimate the gradient fields from the noisy point clouds using neural networks, and refine the position of each point according to the estimated gradient. However, the predicted gradient could fluctuate, leading to perturbed and unstable solutions, as well as a large inference time. To address these issues, we develop the momentum gradient ascent method that leverages the information of previous iterations in determining the trajectories of the points, thus improving the stability of the solution and reducing the inference time. Experiments demonstrate that the proposed method outperforms state-of-the-art methods with a variety of point clouds and noise levels.
23.Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution ⬇️
When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR
$\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
24.PCSCNet: Fast 3D Semantic Segmentation of LiDAR Point Cloud for Autonomous Car using Point Convolution and Sparse Convolution Network ⬇️
The autonomous car must recognize the driving environment quickly for safe driving. As the Light Detection And Range (LiDAR) sensor is widely used in the autonomous car, fast semantic segmentation of LiDAR point cloud, which is the point-wise classification of the point cloud within the sensor framerate, has attracted attention in recognition of the driving environment. Although the voxel and fusion-based semantic segmentation models are the state-of-the-art model in point cloud semantic segmentation recently, their real-time performance suffer from high computational load due to high voxel resolution. In this paper, we propose the fast voxel-based semantic segmentation model using Point Convolution and 3D Sparse Convolution (PCSCNet). The proposed model is designed to outperform at both high and low voxel resolution using point convolution-based feature extraction. Moreover, the proposed model accelerates the feature propagation using 3D sparse convolution after the feature extraction. The experimental results demonstrate that the proposed model outperforms the state-of-the-art real-time models in semantic segmentation of SemanticKITTI and nuScenes, and achieves the real-time performance in LiDAR point cloud inference.
25.DGAFF: Deep Genetic Algorithm Fitness Formation for EEG Bio-Signal Channel Selection ⬇️
Brain-computer interface systems aim to facilitate human-computer interactions in a great deal by direct translation of brain signals for computers. Recently, using many electrodes has caused better performance in these systems. However, increasing the number of recorded electrodes leads to additional time, hardware, and computational costs besides undesired complications of the recording process. Channel selection has been utilized to decrease data dimension and eliminate irrelevant channels while reducing the noise effects. Furthermore, the technique lowers the time and computational costs in real-time applications. We present a channel selection method, which combines a sequential search method with a genetic algorithm called Deep GA Fitness Formation (DGAFF). The proposed method accelerates the convergence of the genetic algorithm and increases the system's performance. The system evaluation is based on a lightweight deep neural network that automates the whole model training process. The proposed method outperforms other channel selection methods in classifying motor imagery on the utilized dataset.
26.Domain-Augmented Domain Adaptation ⬇️
Unsupervised domain adaptation (UDA) enables knowledge transfer from the labelled source domain to the unlabeled target domain by reducing the cross-domain discrepancy. However, most of the studies were based on direct adaptation from the source domain to the target domain and have suffered from large domain discrepancies. To overcome this challenge, in this paper, we propose the domain-augmented domain adaptation (DADA) to generate pseudo domains that have smaller discrepancies with the target domain, to enhance the knowledge transfer process by minimizing the discrepancy between the target domain and pseudo domains. Furthermore, we design a pseudo-labeling method for DADA by projecting representations from the target domain to multiple pseudo domains and taking the averaged predictions on the classification from the pseudo domains as the pseudo labels. We conduct extensive experiments with the state-of-the-art domain adaptation methods on four benchmark datasets: Office Home, Office-31, VisDA2017, and Digital datasets. The results demonstrate the superiority of our model.
27.Don't Touch What Matters: Task-Aware Lipschitz Data Augmentation for Visual Reinforcement Learning ⬇️
One of the key challenges in visual Reinforcement Learning (RL) is to learn policies that can generalize to unseen environments. Recently, data augmentation techniques aiming at enhancing data diversity have demonstrated proven performance in improving the generalization ability of learned policies. However, due to the sensitivity of RL training, naively applying data augmentation, which transforms each pixel in a task-agnostic manner, may suffer from instability and damage the sample efficiency, thus further exacerbating the generalization performance. At the heart of this phenomenon is the diverged action distribution and high-variance value estimation in the face of augmented images. To alleviate this issue, we propose Task-aware Lipschitz Data Augmentation (TLDA) for visual RL, which explicitly identifies the task-correlated pixels with large Lipschitz constants, and only augments the task-irrelevant pixels. To verify the effectiveness of TLDA, we conduct extensive experiments on DeepMind Control suite, CARLA and DeepMind Manipulation tasks, showing that TLDA improves both sample efficiency in training time and generalization in test time. It outperforms previous state-of-the-art methods across the 3 different visual control benchmarks.
28.Deep Feature based Cross-slide Registration ⬇️
Cross-slide image analysis provides additional information by analysing the expression of different biomarkers as compared to a single slide analysis. Slides stained with different biomarkers are analysed side by side which may reveal unknown relations between the different biomarkers. During the slide preparation, a tissue section may be placed at an arbitrary orientation as compared to other sections of the same tissue block. The problem is compounded by the fact that tissue contents are likely to change from one section to the next and there may be unique artefacts on some of the slides. This makes registration of each section to a reference section of the same tissue block an important pre-requisite task before any cross-slide analysis. We propose a deep feature based registration (DFBR) method which utilises data-driven features to estimate the rigid transformation. We adopted a multi-stage strategy for improving the quality of registration. We also developed a visualisation tool to view registered pairs of WSIs at different magnifications. With the help of this tool, one can apply a transformation on the fly without the need to generate transformed source WSI in a pyramidal form. We compared the performance of data-driven features with that of hand-crafted features on the COMET dataset. Our approach can align the images with low registration errors. Generally, the success of non-rigid registration is dependent on the quality of rigid registration. To evaluate the efficacy of the DFBR method, the first two steps of the ANHIR winner's framework are replaced with our DFBR to register challenge provided image pairs. The modified framework produce comparable results to that of challenge winning team.
29.LiDAR-guided Stereo Matching with a Spatial Consistency Constraint ⬇️
The complementary fusion of light detection and ranging (LiDAR) data and image data is a promising but challenging task for generating high-precision and high-density point clouds. This study proposes an innovative LiDAR-guided stereo matching approach called LiDAR-guided stereo matching (LGSM), which considers the spatial consistency represented by continuous disparity or depth changes in the homogeneous region of an image. The LGSM first detects the homogeneous pixels of each LiDAR projection point based on their color or intensity similarity. Next, we propose a riverbed enhancement function to optimize the cost volume of the LiDAR projection points and their homogeneous pixels to improve the matching robustness. Our formulation expands the constraint scopes of sparse LiDAR projection points with the guidance of image information to optimize the cost volume of pixels as much as possible. We applied LGSM to semi-global matching and AD-Census on both simulated and real datasets. When the percentage of LiDAR points in the simulated datasets was 0.16%, the matching accuracy of our method achieved a subpixel level, while that of the original stereo matching algorithm was 3.4 pixels. The experimental results show that LGSM is suitable for indoor, street, aerial, and satellite image datasets and provides good transferability across semi-global matching and AD-Census. Furthermore, the qualitative and quantitative evaluations demonstrate that LGSM is superior to two state-of-the-art optimizing cost volume methods, especially in reducing mismatches in difficult matching areas and refining the boundaries of objects.
30.Multiscale Crowd Counting and Localization By Multitask Point Supervision ⬇️
We propose a multitask approach for crowd counting and person localization in a unified framework. As the detection and localization tasks are well-correlated and can be jointly tackled, our model benefits from a multitask solution by learning multiscale representations of encoded crowd images, and subsequently fusing them. In contrast to the relatively more popular density-based methods, our model uses point supervision to allow for crowd locations to be accurately identified. We test our model on two popular crowd counting datasets, ShanghaiTech A and B, and demonstrate that our method achieves strong results on both counting and localization tasks, with MSE measures of 110.7 and 15.0 for crowd counting and AP measures of 0.71 and 0.75 for localization, on ShanghaiTech A and B respectively. Our detailed ablation experiments show the impact of our multiscale approach as well as the effectiveness of the fusion module embedded in our network. Our code is available at: this https URL.
31.Generative Target Update for Adaptive Siamese Tracking ⬇️
Siamese trackers perform similarity matching with templates (i.e., target models) to recursively localize objects within a search region. Several strategies have been proposed in the literature to update a template based on the tracker output, typically extracted from the target search region in the current frame, and thereby mitigate the effects of target drift. However, this may lead to corrupted templates, limiting the potential benefits of a template update strategy.
This paper proposes a model adaptation method for Siamese trackers that uses a generative model to produce a synthetic template from the object search regions of several previous frames, rather than directly using the tracker output. Since the search region encompasses the target, attention from the search region is used for robust model adaptation. In particular, our approach relies on an auto-encoder trained through adversarial learning to detect changes in a target object's appearance and predict a future target template, using a set of target templates localized from tracker outputs at previous frames. To prevent template corruption during the update, the proposed tracker also performs change detection using the generative model to suspend updates until the tracker stabilizes, and robust matching can resume through dynamic template fusion.
Extensive experiments conducted on VOT-16, VOT-17, OTB-50, and OTB-100 datasets highlight the effectiveness of our method, along with the impact of its key components. Results indicate that our proposed approach can outperform state-of-art trackers, and its overall robustness allows tracking for a longer time before failure.
32.SRL-SOA: Self-Representation Learning with Sparse 1D-Operational Autoencoder for Hyperspectral Image Band Selection ⬇️
The band selection in the hyperspectral image (HSI) data processing is an important task considering its effect on the computational complexity and accuracy. In this work, we propose a novel framework for the band selection problem: Self-Representation Learning (SRL) with Sparse 1D-Operational Autoencoder (SOA). The proposed SLR-SOA approach introduces a novel autoencoder model, SOA, that is designed to learn a representation domain where the data are sparsely represented. Moreover, the network composes of 1D-operational layers with the non-linear neuron model. Hence, the learning capability of neurons (filters) is greatly improved with shallow architectures. Using compact architectures is especially crucial in autoencoders as they tend to overfit easily because of their identity mapping objective. Overall, we show that the proposed SRL-SOA band selection approach outperforms the competing methods over two HSI data including Indian Pines and Salinas-A considering the achieved land cover classification accuracies. The software implementation of the SRL-SOA approach is shared publicly at this https URL.
33.Non-Deterministic Face Mask Removal Based On 3D Priors ⬇️
This paper presents a novel image inpainting framework for face mask removal. Although current methods have demonstrated their impressive ability in recovering damaged face images, they suffer from two main problems: the dependence on manually labeled missing regions and the deterministic result corresponding to each input. The proposed approach tackles these problems by integrating a multi-task 3D face reconstruction module with a face inpainting module. Given a masked face image, the former predicts a 3DMM-based reconstructed face together with a binary occlusion map, providing dense geometrical and textural priors that greatly facilitate the inpainting task of the latter. By gradually controlling the 3D shape parameters, our method generates high-quality dynamic inpainting results with different expressions and mouth movements. Qualitative and quantitative experiments verify the effectiveness of the proposed method.
34.Sparsity Winning Twice: Better Robust Generalization from More Efficient Training ⬇️
Recent studies demonstrate that deep networks, even robustified by the state-of-the-art adversarial training (AT), still suffer from large robust generalization gaps, in addition to the much more expensive training costs than standard training. In this paper, we investigate this intriguing problem from a new perspective, i.e., injecting appropriate forms of sparsity during adversarial training. We introduce two alternatives for sparse adversarial training: (i) static sparsity, by leveraging recent results from the lottery ticket hypothesis to identify critical sparse subnetworks arising from the early training; (ii) dynamic sparsity, by allowing the sparse subnetwork to adaptively adjust its connectivity pattern (while sticking to the same sparsity ratio) throughout training. We find both static and dynamic sparse methods to yield win-win: substantially shrinking the robust generalization gap and alleviating the robust overfitting, meanwhile significantly saving training and inference FLOPs. Extensive experiments validate our proposals with multiple network architectures on diverse datasets, including CIFAR-10/100 and Tiny-ImageNet. For example, our methods reduce robust generalization gap and overfitting by 34.44% and 4.02%, with comparable robust/standard accuracy boosts and 87.83%/87.82% training/inference FLOPs savings on CIFAR-100 with ResNet-18. Besides, our approaches can be organically combined with existing regularizers, establishing new state-of-the-art results in AT. Codes are available in this https URL.
35.Distortion-Aware Loop Filtering of Intra 360^o Video Coding with Equirectangular Projection ⬇️
In this paper, we propose a distortion-aware loop filtering model to improve the performance of intra coding for 360$^o$ videos projected via equirectangular projection (ERP) format. To enable the awareness of distortion, our proposed module analyzes content characteristics based on a coding unit (CU) partition mask and processes them through partial convolution to activate the specified area. The feature recalibration module, which leverages cascaded residual channel-wise attention blocks (RCABs) to adjust the inter-channel and intra-channel features automatically, is capable of adapting with different quality levels. The perceptual geometry optimization combining with weighted mean squared error (WMSE) and the perceptual loss guarantees both the local field of view (FoV) and global image reconstruction with high quality. Extensive experimental results show that our proposed scheme achieves significant bitrate savings compared with the anchor (HM + 360Lib), leading to 8.9%, 9.0%, 7.1% and 7.4% on average bit rate reductions in terms of PSNR, WPSNR, and PSNR of two viewports for luminance component of 360^o videos, respectively.
36.Pseudo Numerical Methods for Diffusion Models on Manifolds ⬇️
Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules. Our implementation is available at this https URL.
37.Dynamic Spatial Propagation Network for Depth Completion ⬇️
Image-guided depth completion aims to generate dense depth maps with sparse depth measurements and corresponding RGB images. Currently, spatial propagation networks (SPNs) are the most popular affinity-based methods in depth completion, but they still suffer from the representation limitation of the fixed affinity and the over smoothing during iterations. Our solution is to estimate independent affinity matrices in each SPN iteration, but it is over-parameterized and heavy calculation. This paper introduces an efficient model that learns the affinity among neighboring pixels with an attention-based, dynamic approach. Specifically, the Dynamic Spatial Propagation Network (DySPN) we proposed makes use of a non-linear propagation model (NLPM). It decouples the neighborhood into parts regarding to different distances and recursively generates independent attention maps to refine these parts into adaptive affinity matrices. Furthermore, we adopt a diffusion suppression (DS) operation so that the model converges at an early stage to prevent over-smoothing of dense depth. Finally, in order to decrease the computational cost required, we also introduce three variations that reduce the amount of neighbors and attentions needed while still retaining similar accuracy. In practice, our method requires less iteration to match the performance of other SPNs and yields better results overall. DySPN outperforms other state-of-the-art (SoTA) methods on KITTI Depth Completion (DC) evaluation by the time of submission and is able to yield SoTA performance in NYU Depth v2 dataset as well.
38.Visual Attention Network ⬇️
While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers and convolutional neural networks with a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc. Code is available at this https URL.
39.3DRM:Pair-wise relation module for 3D object detection ⬇️
Context has proven to be one of the most important factors in object layout reasoning for 3D scene understanding. Existing deep contextual models either learn holistic features for context encoding or rely on pre-defined scene templates for context modeling. We argue that scene understanding benefits from object relation reasoning, which is capable of mitigating the ambiguity of 3D object detections and thus helps locate and classify the 3D objects more accurately and robustly. To achieve this, we propose a novel 3D relation module (3DRM) which reasons about object relations at pair-wise levels. The 3DRM predicts the semantic and spatial relationships between objects and extracts the object-wise relation features. We demonstrate the effects of 3DRM by plugging it into proposal-based and voting-based 3D object detection pipelines, respectively. Extensive evaluations show the effectiveness and generalization of 3DRM on 3D object detection. Our source code is available at this https URL.
40.Towards a Unified Approach to Homography Estimation Using Image Features and Pixel Intensities ⬇️
The homography matrix is a key component in various vision-based robotic tasks. Traditionally, homography estimation algorithms are classified into feature- or intensity-based. The main advantages of the latter are their versatility, accuracy, and robustness to arbitrary illumination changes. On the other hand, they have a smaller domain of convergence than the feature-based solutions. Their combination is hence promising, but existing techniques only apply them sequentially. This paper proposes a new hybrid method that unifies both classes into a single nonlinear optimization procedure, applies the same minimization method, and uses the same homography parametrization and warping function. Experimental validation using a classical testing framework shows that the proposed unified approach has improved convergence properties compared to each individual class. These are also demonstrated in a visual tracking application. As a final contribution, our ready-to-use implementation of the algorithm is made publicly available to the research community.
41.ARM3D: Attention-based relation module for indoor 3D object detection ⬇️
Relation context has been proved to be useful for many challenging vision tasks. In the field of 3D object detection, previous methods have been taking the advantage of context encoding, graph embedding, or explicit relation reasoning to extract relation context. However, there exists inevitably redundant relation context due to noisy or low-quality proposals. In fact, invalid relation context usually indicates underlying scene misunderstanding and ambiguity, which may, on the contrary, reduce the performance in complex scenes. Inspired by recent attention mechanism like Transformer, we propose a novel 3D attention-based relation module (ARM3D). It encompasses object-aware relation reasoning to extract pair-wise relation contexts among qualified proposals and an attention module to distribute attention weights towards different relation contexts. In this way, ARM3D can take full advantage of the useful relation context and filter those less relevant or even confusing contexts, which mitigates the ambiguity in detection. We have evaluated the effectiveness of ARM3D by plugging it into several state-of-the-art 3D object detectors and showing more accurate and robust detection results. Extensive experiments show the capability and generalization of ARM3D on 3D object detection. Our source code is available at this https URL.
42.MANet: Improving Video Denoising with a Multi-Alignment Network ⬇️
In video denoising, the adjacent frames often provide very useful information, but accurate alignment is needed before such information can be harnassed. In this work, we present a multi-alignment network, which generates multiple flow proposals followed by attention-based averaging. It serves to mimics the non-local mechanism, suppressing noise by averaging multiple observations. Our approach can be applied to various state-of-the-art models that are based on flow estimation. Experiments on a large-scale video dataset demonstrate that our method improves the denoising baseline model by 0.2dB, and further reduces the parameters by 47% with model distillation.
43.MSSNet: Multi-Scale-Stage Network for Single Image Deblurring ⬇️
Most of traditional single image deblurring methods before deep learning adopt a coarse-to-fine scheme that estimates a sharp image at a coarse scale and progressively refines it at finer scales. While this scheme has also been adopted to several deep learning-based approaches, recently a number of single-scale approaches have been introduced showing superior performance to previous coarse-to-fine approaches both in quality and computation time, making the traditional coarse-to-fine scheme seemingly obsolete. In this paper, we revisit the coarse-to-fine scheme, and analyze defects of previous coarse-to-fine approaches that degrade their performance. Based on the analysis, we propose Multi-Scale-Stage Network (MSSNet), a novel deep learning-based approach to single image deblurring that adopts our remedies to the defects. Specifically, MSSNet adopts three novel technical components: stage configuration reflecting blur scales, an inter-scale information propagation scheme, and a pixel-shuffle-based multi-scale scheme. Our experiments show that MSSNet achieves the state-of-the-art performance in terms of quality, network size, and computation time.
44.Region-Based Semantic Factorization in GANs ⬇️
Despite the rapid advancement of semantic discovery in the latent space of Generative Adversarial Networks (GANs), existing approaches either are limited to finding global attributes or rely on a number of segmentation masks to identify local attributes. In this work, we present a highly efficient algorithm to factorize the latent semantics learned by GANs concerning an arbitrary image region. Concretely, we revisit the task of local manipulation with pre-trained GANs and formulate region-based semantic discovery as a dual optimization problem. Through an appropriately defined generalized Rayleigh quotient, we manage to solve such a problem without any annotations or training. Experimental results on various state-of-the-art GAN models demonstrate the effectiveness of our approach, as well as its superiority over prior arts regarding precise control, region robustness, speed of implementation, and simplicity of use.
45.An Unsupervised Attentive-Adversarial Learning Framework for Single Image Deraining ⬇️
Single image deraining has been an important topic in low-level computer vision tasks. The atmospheric veiling effect (which is generated by rain accumulation, similar to fog) usually appears with the rain. Most deep learning-based single image deraining methods mainly focus on rain streak removal by disregarding this effect, which leads to low-quality deraining performance. In addition, these methods are trained only on synthetic data, hence they do not take into account real-world rainy images. To address the above issues, we propose a novel unsupervised attentive-adversarial learning framework (UALF) for single image deraining that trains on both synthetic and real rainy images while simultaneously capturing both rain streaks and rain accumulation features. UALF consists of a Rain-fog2Clean (R2C) transformation block and a Clean2Rain-fog (C2R) transformation block. In R2C, to better characterize the rain-fog fusion feature and to achieve high-quality deraining performance, we employ an attention rain-fog feature extraction network (ARFE) to exploit the self-similarity of global and local rain-fog information by learning the spatial feature correlations. Moreover, to improve the transformation ability of C2R, we design a rain-fog feature decoupling and reorganization network (RFDR) by embedding a rainy image degradation model and a mixed discriminator to preserve richer texture details. Extensive experiments on benchmark rain-fog and rain datasets show that UALF outperforms state-of-the-art deraining methods. We also conduct defogging performance evaluation experiments to further demonstrate the effectiveness of UALF
46.Image-to-Graph Transformers for Chemical Structure Recognition ⬇️
For several decades, chemical knowledge has been published in written text, and there have been many attempts to make it accessible, for example, by transforming such natural language text to a structured format. Although the discovered chemical itself commonly represented in an image is the most important part, the correct recognition of the molecular structure from the image in literature still remains a hard problem since they are often abbreviated to reduce the complexity and drawn in many different styles. In this paper, we present a deep learning model to extract molecular structures from images. The proposed model is designed to transform the molecular image directly into the corresponding graph, which makes it capable of handling non-atomic symbols for abbreviations. Also, by end-to-end learning approach it can fully utilize many open image-molecule pair data from various sources, and hence it is more robust to image style variation than other tools. The experimental results show that the proposed model outperforms the existing models with 17.1 % and 12.8 % relative improvement for well-known benchmark datasets and large molecular images that we collected from literature, respectively.
47.Tripartite: Tackle Noisy Labels by a More Precise Partition ⬇️
Samples in large-scale datasets may be mislabeled due to various reasons, and Deep Neural Networks can easily over-fit to the noisy label data. To tackle this problem, the key point is to alleviate the harm of these noisy labels. Many existing methods try to divide training data into clean and noisy subsets in terms of loss values, and then process the noisy label data varied. One of the reasons hindering a better performance is the hard samples. As hard samples always have relatively large losses whether their labels are clean or noisy, these methods could not divide them precisely. Instead, we propose a Tripartite solution to partition training data more precisely into three subsets: hard, noisy, and clean. The partition criteria are based on the inconsistent predictions of two networks, and the inconsistency between the prediction of a network and the given label. To minimize the harm of noisy labels but maximize the value of noisy label data, we apply a low-weight learning on hard data and a self-supervised learning on noisy label data without using the given labels. Extensive experiments demonstrate that Tripartite can filter out noisy label data more precisely, and outperforms most state-of-the-art methods on five benchmark datasets, especially on real-world datasets.
48.Diversity aware image generation ⬇️
The machine learning generative algorithms such as GAN and VAE show impressive results in practice when constructing images similar to those in a training set. However, the generation of new images builds mainly on the understanding of the hidden structure of the training database followed by a mere sampling from a multi-dimensional normal variable. In particular each sample is independent from the other ones and can repeatedly propose same type of images. To cure this drawback we propose a kernel-based measure representation method that can produce new objects from a given target measure by approximating the measure as a whole and even staying away from objects already drawn from that distribution. This ensures a better variety of the produced images. The method is tested on some classic machine learning benchmarks.\end{abstract}
49.Priming Cross-Session Motor Imagery Classification with A Universal Deep Domain Adaptation Framework ⬇️
Motor imagery (MI) is a common brain computer interface (BCI) paradigm. EEG is non-stationary with low signal-to-noise, classifying motor imagery tasks of the same participant from different EEG recording sessions is generally challenging, as EEG data distribution may vary tremendously among different acquisition sessions. Although it is intuitive to consider the cross-session MI classification as a domain adaptation problem, the rationale and feasible approach is not elucidated. In this paper, we propose a Siamese deep domain adaptation (SDDA) framework for cross-session MI classification based on mathematical models in domain adaptation theory. The proposed framework can be easily applied to most existing artificial neural networks without altering the network structure, which facilitates our method with great flexibility and transferability. In the proposed framework, domain invariants were firstly constructed jointly with channel normalization and Euclidean alignment. Then, embedding features from source and target domain were mapped into the Reproducing Kernel Hilbert Space (RKHS) and aligned accordingly. A cosine-based center loss was also integrated into the framework to improve the generalizability of the SDDA. The proposed framework was validated with two classic and popular convolutional neural networks from BCI research field (EEGNet and ConvNet) in two MI-EEG public datasets (BCI Competition IV IIA, IIB). Compared to the vanilla EEGNet and ConvNet, the proposed SDDA framework was able to boost the MI classification accuracy by 15.2%, 10.2% respectively in IIA dataset, and 5.5%, 4.2% in IIB dataset. The final MI classification accuracy reached 82.01% in IIA dataset and 87.52% in IIB, which outperformed the state-of-the-art methods in the literature.
50.HDAM: Heuristic Difference Attention Module for Convolutional Neural Networks ⬇️
The attention mechanism is one of the most important priori knowledge to enhance convolutional neural networks. Most attention mechanisms are bound to the convolutional layer and use local or global contextual information to recalibrate the input. This is a popular attention strategy design method. Global contextual information helps the network to consider the overall distribution, while local contextual information is more general. The contextual information makes the network pay attention to the mean or maximum value of a particular receptive field. Different from the most attention mechanism, this article proposes a novel attention mechanism with the heuristic difference attention module, HDAM. HDAM's input recalibration is based on the difference between the local and global contextual information instead of the mean and maximum values. At the same time, to make different layers have a more suitable local receptive field size and increase the exibility of the local receptive field design, we use genetic algorithm to heuristically produce local receptive fields. First, HDAM extracts the mean value of the global and local receptive fields as the corresponding contextual information. Then the difference between the global and local contextual information is calculated. Finally HDAM uses this difference to recalibrate the input. In addition, we use the heuristic ability of genetic algorithm to search for the local receptive field size of each layer. Our experiments on CIFAR-10 and CIFAR-100 show that HDAM can use fewer parameters than other attention mechanisms to achieve higher accuracy. We implement HDAM with the Python library, Pytorch, and the code and models will be publicly available.
51.SODA: Site Object Detection dAtaset for Deep Learning in Construction ⬇️
Computer vision-based deep learning object detection algorithms have been developed sufficiently powerful to support the ability to recognize various objects. Although there are currently general datasets for object detection, there is still a lack of large-scale, open-source dataset for the construction industry, which limits the developments of object detection algorithms as they tend to be data-hungry. Therefore, this paper develops a new large-scale image dataset specifically collected and annotated for the construction site, called Site Object Detection dAtaset (SODA), which contains 15 kinds of object classes categorized by workers, materials, machines, and layout. Firstly, more than 20,000 images were collected from multiple construction sites in different site conditions, weather conditions, and construction phases, which covered different angles and perspectives. After careful screening and processing, 19,846 images including 286,201 objects were then obtained and annotated with labels in accordance with predefined categories. Statistical analysis shows that the developed dataset is advantageous in terms of diversity and volume. Further evaluation with two widely-adopted object detection algorithms based on deep learning (YOLO v3/ YOLO v4) also illustrates the feasibility of the dataset for typical construction scenarios, achieving a maximum mAP of 81.47%. In this manner, this research contributes a large-scale image dataset for the development of deep learning-based object detection methods in the construction industry and sets up a performance benchmark for further evaluation of corresponding algorithms in this area.
52.Holistic Attention-Fusion Adversarial Network for Single Image Defogging ⬇️
Adversarial learning-based image defogging methods have been extensively studied in computer vision due to their remarkable performance. However, most existing methods have limited defogging capabilities for real cases because they are trained on the paired clear and synthesized foggy images of the same scenes. In addition, they have limitations in preserving vivid color and rich textual details in defogging. To address these issues, we develop a novel generative adversarial network, called holistic attention-fusion adversarial network (HAAN), for single image defogging. HAAN consists of a Fog2Fogfree block and a Fogfree2Fog block. In each block, there are three learning-based modules, namely, fog removal, color-texture recovery, and fog synthetic, that are constrained each other to generate high quality images. HAAN is designed to exploit the self-similarity of texture and structure information by learning the holistic channel-spatial feature correlations between the foggy image with its several derived images. Moreover, in the fog synthetic module, we utilize the atmospheric scattering model to guide it to improve the generative quality by focusing on an atmospheric light optimization with a novel sky segmentation network. Extensive experiments on both synthetic and real-world datasets show that HAAN outperforms state-of-the-art defogging methods in terms of quantitative accuracy and subjective visual quality.
53.student dangerous behavior detection in school ⬇️
Video surveillance systems have been installed to ensure the student safety in schools. However, discovering dangerous behaviors, such as fighting and falling down, usually depends on untimely human observations. In this paper, we focus on detecting dangerous behaviors of students automatically, which faces numerous challenges, such as insufficient datasets, confusing postures, keyframes detection and prompt response. To address these challenges, we first build a danger behavior dataset with locations and labels from surveillance videos, and transform action recognition of long videos to an object detection task that avoids keyframes detection. Then, we propose a novel end-to-end dangerous behavior detection method, named DangerDet, that combines multi-scale body features and keypoints-based pose features. We could improve the accuracy of behavior classification due to the highly correlation between pose and behavior. On our dataset, DangerDet achieves 71.0% mAP with about 11 FPS. It keeps a better balance between the accuracy and time cost.
54.Going Deeper into Recognizing Actions in Dark Environments: A Comprehensive Benchmark Study ⬇️
While action recognition (AR) has gained large improvements with the introduction of large-scale video datasets and the development of deep neural networks, AR models robust to challenging environments in real-world scenarios are still under-explored. We focus on the task of action recognition in dark environments, which can be applied to fields such as surveillance and autonomous driving at night. Intuitively, current deep networks along with visual enhancement techniques should be able to handle AR in dark environments, however, it is observed that this is not always the case in practice. To dive deeper into exploring solutions for AR in dark environments, we launched the UG2+ Challenge Track 2 (UG2-2) in IEEE CVPR 2021, with a goal of evaluating and advancing the robustness of AR models in dark environments. The challenge builds and expands on top of a novel ARID dataset, the first dataset for the task of dark video AR, and guides models to tackle such a task in both fully and semi-supervised manners. Baseline results utilizing current AR models and enhancement methods are reported, justifying the challenging nature of this task with substantial room for improvements. Thanks to the active participation from the research community, notable advances have been made in participants' solutions, while analysis of these solutions helped better identify possible directions to tackle the challenge of AR in dark environments.
55.BP-Triplet Net for Unsupervised Domain Adaptation: A Bayesian Perspective ⬇️
Triplet loss, one of the deep metric learning (DML) methods, is to learn the embeddings where examples from the same class are closer than examples from different classes. Motivated by DML, we propose an effective BP-Triplet Loss for unsupervised domain adaption (UDA) from the perspective of Bayesian learning and we name the model as BP-Triplet Net. In previous metric learning based methods for UDA, sample pairs across domains are treated equally, which is not appropriate due to the domain bias. In our work, considering the different importance of pair-wise samples for both feature learning and domain alignment, we deduce our BP-Triplet loss for effective UDA from the perspective of Bayesian learning. Our BP-Triplet loss adjusts the weights of pair-wise samples in intra domain and inter domain. Especially, it can self attend to the hard pairs (including hard positive pair and hard negative pair). Together with the commonly used adversarial loss for domain alignment, the quality of target pseudo labels is progressively improved. Our method achieved low joint error of the ideal source and target hypothesis. The expected target error can then be upper bounded following Ben-David s theorem. Comprehensive evaluations on five benchmark datasets, handwritten digits, Office31, ImageCLEF-DA, Office-Home and VisDA-2017 demonstrate the effectiveness of the proposed approach for UDA.
56.C2N: Practical Generative Noise Modeling for Real-World Denoising ⬇️
Learning-based image denoising methods have been bounded to situations where well-aligned noisy and clean images are given, or samples are synthesized from predetermined noise models, e.g., Gaussian. While recent generative noise modeling methods aim to simulate the unknown distribution of real-world noise, several limitations still exist. In a practical scenario, a noise generator should learn to simulate the general and complex noise distribution without using paired noisy and clean images. However, since existing methods are constructed on the unrealistic assumption of real-world noise, they tend to generate implausible patterns and cannot express complicated noise maps. Therefore, we introduce a Clean-to-Noisy image generation framework, namely C2N, to imitate complex real-world noise without using any paired examples. We construct the noise generator in C2N accordingly with each component of real-world noise characteristics to express a wide range of noise accurately. Combined with our C2N, conventional denoising CNNs can be trained to outperform existing unsupervised methods on challenging real-world benchmarks by a large margin.
57.PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths ⬇️
Point cloud completion concerns to predict missing part for incomplete 3D shapes. A common strategy is to generate complete shape according to incomplete input. However, unordered nature of point clouds will degrade generation of high-quality 3D shapes, as detailed topology and structure of unordered points are hard to be captured during the generative process using an extracted latent code. We address this problem by formulating completion as point cloud deformation process. Specifically, we design a novel neural network, named PMP-Net++, to mimic behavior of an earth mover. It moves each point of incomplete input to obtain a complete point cloud, where total distance of point moving paths (PMPs) should be the shortest. Therefore, PMP-Net++ predicts unique PMP for each point according to constraint of point moving distances. The network learns a strict and unique correspondence on point-level, and thus improves quality of predicted complete shape. Moreover, since moving points heavily relies on per-point features learned by network, we further introduce a transformer-enhanced representation learning network, which significantly improves completion performance of PMP-Net++. We conduct comprehensive experiments in shape completion, and further explore application on point cloud up-sampling, which demonstrate non-trivial improvement of PMP-Net++ over state-of-the-art point cloud completion/up-sampling methods.
58.Highlighting Object Category Immunity for the Generalization of Human-Object Interaction Detection ⬇️
Human-Object Interaction (HOI) detection plays a core role in activity understanding. As a compositional learning problem (human-verb-object), studying its generalization matters. However, widely-used metric mean average precision (mAP) fails to model the compositional generalization well. Thus, we propose a novel metric, mPD (mean Performance Degradation), as a complementary of mAP to evaluate the performance gap among compositions of different objects and the same verb. Surprisingly, mPD reveals that previous methods usually generalize poorly. With mPD as a cue, we propose Object Category (OC) Immunity to boost HOI generalization. The idea is to prevent model from learning spurious object-verb correlations as a short-cut to over-fit the train set. To achieve OC-immunity, we propose an OC-immune network that decouples the inputs from OC, extracts OC-immune representations, and leverages uncertainty quantification to generalize to unseen objects. In both conventional and zero-shot experiments, our method achieves decent improvements. To fully evaluate the generalization, we design a new and more difficult benchmark, on which we present significant advantage. The code is available at this https URL.
59.SAGE: SLAM with Appearance and Geometry Prior for Endoscopy ⬇️
In endoscopy, many applications (e.g., surgical navigation) would benefit from a real-time method that can simultaneously track the endoscope and reconstruct the dense 3D geometry of the observed anatomy from a monocular endoscopic video. To this end, we develop a Simultaneous Localization and Mapping system by combining the learning-based appearance and optimizable geometry priors and factor graph optimization. The appearance and geometry priors are explicitly learned in an end-to-end differentiable training pipeline to master the task of pair-wise image alignment, one of the core components of the SLAM system. In our experiments, the proposed SLAM system is shown to robustly handle the challenges of texture scarceness and illumination variation that are commonly seen in endoscopy. The system generalizes well to unseen endoscopes and subjects and performs favorably compared with a state-of-the-art feature-based SLAM system. The code repository is available at this https URL.
60.Modern Augmented Reality: Applications, Trends, and Future Directions ⬇️
Augmented reality (AR) is one of the relatively old, yet trending areas in the intersection of computer vision and computer graphics with numerous applications in several areas, from gaming and entertainment, to education and healthcare. Although it has been around for nearly fifty years, it has seen a lot of interest by the research community in the recent years, mainly because of the huge success of deep learning models for various computer vision and AR applications, which made creating new generations of AR technologies possible. This work tries to provide an overview of modern augmented reality, from both application-level and technical perspective. We first give an overview of main AR applications, grouped into more than ten categories. We then give an overview of around 100 recent promising machine learning based works developed for AR systems, such as deep learning works for AR shopping (clothing, makeup), AR based image filters (such as Snapchat's lenses), AR animations, and more. In the end we discuss about some of the current challenges in AR domain, and the future directions in this area.
61.Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer ⬇️
Most existing point cloud completion methods suffered from discrete nature of point clouds and unstructured prediction of points in local regions, which makes it hard to reveal fine local geometric details. To resolve this issue, we propose SnowflakeNet with Snowflake Point Deconvolution (SPD) to generate the complete point clouds. SPD models the generation of complete point clouds as the snowflake-like growth of points, where the child points are progressively generated by splitting their parent points after each SPD. Our insight of revealing detailed geometry is to introduce skip-transformer in SPD to learn point splitting patterns which can fit local regions the best. Skip-transformer leverages attention mechanism to summarize the splitting patterns used in previous SPD layer to produce the splitting in current SPD layer. The locally compact and structured point clouds generated by SPD precisely reveal the structure characteristic of 3D shape in local patches, which enables us to predict highly detailed geometries. Moreover, since SPD is a general operation, which is not limited to completion, we further explore the applications of SPD on other generative tasks, including point cloud auto-encoding, generation, single image reconstruction and upsampling. Our experimental results outperform the state-of-the-art methods under widely used benchmarks.
62.Robotic Telekinesis: Learning a Robotic Hand Imitator by Watching Humans on Youtube ⬇️
We build a system that enables any human to control a robot hand and arm, simply by demonstrating motions with their own hand. The robot observes the human operator via a single RGB camera and imitates their actions in real-time. Human hands and robot hands differ in shape, size, and joint structure, and performing this translation from a single uncalibrated camera is a highly underconstrained problem. Moreover, the retargeted trajectories must effectively execute tasks on a physical robot, which requires them to be temporally smooth and free of self-collisions. Our key insight is that while paired human-robot correspondence data is expensive to collect, the internet contains a massive corpus of rich and diverse human hand videos. We leverage this data to train a system that understands human hands and retargets a human video stream into a robot hand-arm trajectory that is smooth, swift, safe, and semantically similar to the guiding demonstration. We demonstrate that it enables previously untrained people to teleoperate a robot on various dexterous manipulation tasks. Our low-cost, glove-free, marker-free remote teleoperation system makes robot teaching more accessible and we hope that it can aid robots that learn to act autonomously in the real world. Videos at this https URL
63.Malaria detection in Segmented Blood Cell using Convolutional Neural Networks and Canny Edge Detection ⬇️
We apply convolutional neural networks to identify between malaria infected and non-infected segmented cells from the thin blood smear slide images. We optimize our model to find over 95% accuracy in malaria cell detection. We also apply Canny image processing to reduce training file size while maintaining comparable accuracy (~ 94%).
64.MIST GAN: Modality Imputation Using Style Transfer for MRI ⬇️
MRI entails a great amount of cost, time and effort for the generation of all the modalities that are recommended for efficient diagnosis and treatment planning. Recent advancements in deep learning research show that generative models have achieved substantial improvement in the aspects of style transfer and image synthesis. In this work, we formulate generating the missing MR modality from existing MR modalities as an imputation problem using style transfer. With a multiple-to-one mapping, we model a network that accommodates domain specific styles in generating the target image. We analyse the style diversity both within and across MR modalities. Our model is tested on the BraTS'18 dataset and the results obtained are observed to be on par with the state-of-the-art in terms of visual metrics, SSIM and PSNR. After being evaluated by two expert radiologists, we show that our model is efficient, extendable, and suitable for clinical applications.
65.Reducing the Gibbs effect in multimodal medical imaging by the Fake Nodes Approach ⬇️
It is a common practice in multimodal medical imaging to undersample the anatomically-derived segmentation images to measure the mean activity of a co-acquired functional image. This practice avoids the resampling-related Gibbs effect that would occur in oversampling the functional image. As sides effect, waste of time and efforts are produced since the anatomical segmentation at full resolution is performed in many hours of computations or manual work. In this work we explain the commonly-used resampling methods and give errors bound in the cases of continuous and discontinuous signals. Then we propose a Fake Nodes scheme for image resampling designed to reduce the Gibbs effect when oversampling the functional image. This new approach is compared to the traditional counterpart in two significant experiments, both showing that Fake Nodes resampling gives smaller errors.
66.Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge ⬇️
Distributional semantic models capture word-level meaning that is useful in many natural language processing tasks and have even been shown to capture cognitive aspects of word meaning. The majority of these models are purely text based, even though the human sensory experience is much richer. In this paper we create visually grounded word embeddings by combining English text and images and compare them to popular text-based methods, to see if visual information allows our model to better capture cognitive aspects of word meaning. Our analysis shows that visually grounded embedding similarities are more predictive of the human reaction times in a large priming experiment than the purely text-based embeddings. The visually grounded embeddings also correlate well with human word similarity ratings. Importantly, in both experiments we show that the grounded embeddings account for a unique portion of explained variance, even when we include text-based embeddings trained on huge corpora. This shows that visual grounding allows our model to capture information that cannot be extracted using text as the only source of information.
67.A Review of Emerging Research Directions in Abstract Visual Reasoning ⬇️
Abstract Visual Reasoning (AVR) problems are commonly used to approximate human intelligence. They test the ability of applying previously gained knowledge, experience and skills in a completely new setting, which makes them particularly well-suited for this task. Recently, the AVR problems have become popular as a proxy to study machine intelligence, which has led to emergence of new distinct types of problems and multiple benchmark sets. In this work we review this emerging AVR research and propose a taxonomy to categorise the AVR tasks along 5 dimensions: input shapes, hidden rules, target task, cognitive function, and main challenge. The perspective taken in this survey allows to characterise AVR problems with respect to their shared and distinct properties, provides a unified view on the existing approaches for solving AVR tasks, shows how the AVR problems relate to practical applications, and outlines promising directions for future work. One of them refers to the observation that in the machine learning literature different tasks are considered in isolation, which is in the stark contrast with the way the AVR tasks are used to measure human intelligence, where multiple types of problems are combined within a single IQ test.
68.ALGAN: Anomaly Detection by Generating Pseudo Anomalous Data via Latent Variables ⬇️
In many anomaly detection tasks, where anomalous data rarely appear and are difficult to collect, training with only normal data is important. Although it is possible to manually create anomalous data using prior knowledge, they may be subject to user bias. In this paper, we propose an Anomalous Latent variable Generative Adversarial Network (ALGAN) in which the GAN generator produces pseudo-anomalous data as well as fake-normal data, whereas the discriminator is trained to distinguish between normal and pseudo-anomalous data. This differs from the standard GAN discriminator, which specializes in classifying two similar classes. The training dataset contains only normal data as anomalous states are introduced in the latent variable and input them into the generator to produce diverse pseudo-anomalous data. We compared the performance of ALGAN with other existing methods using the MVTec-AD, Magnetic Tile Defects, and COIL-100 datasets. The experimental results showed that the proposed ALGAN exhibited an AUROC comparable to state-of-the-art methods while achieving a much faster prediction time.
69.VRConvMF: Visual Recurrent Convolutional Matrix Factorization for Movie Recommendation ⬇️
Sparsity of user-to-item rating data becomes one of challenging issues in the recommender systems, which severely deteriorates the recommendation performance. Fortunately, context-aware recommender systems can alleviate the sparsity problem by making use of some auxiliary information, such as the information of both the users and items. In particular, the visual information of items, such as the movie poster, can be considered as the supplement for item description documents, which helps to obtain more item features. In this paper, we focus on movie recommender system and propose a probabilistic matrix factorization based recommendation scheme called visual recurrent convolutional matrix factorization (VRConvMF), which utilizes the textual and multi-level visual features extracted from the descriptive texts and posters respectively. We implement the proposed VRConvMF and conduct extensive experiments on three commonly used real world datasets to validate its effectiveness. The experimental results illustrate that the proposed VRConvMF outperforms the existing schemes.
70.Efficient Cross-Modal Retrieval via Deep Binary Hashing and Quantization ⬇️
Cross-modal retrieval aims to search for data with similar semantic meanings across different content modalities. However, cross-modal retrieval requires huge amounts of storage and retrieval time since it needs to process data in multiple modalities. Existing works focused on learning single-source compact features such as binary hash codes that preserve similarities between different modalities. In this work, we propose a jointly learned deep hashing and quantization network (HQ) for cross-modal retrieval. We simultaneously learn binary hash codes and quantization codes to preserve semantic information in multiple modalities by an end-to-end deep learning architecture. At the retrieval step, binary hashing is used to retrieve a subset of items from the search space, then quantization is used to re-rank the retrieved items. We theoretically and empirically show that this two-stage retrieval approach provides faster retrieval results while preserving accuracy. Experimental results on the NUS-WIDE, MIR-Flickr, and Amazon datasets demonstrate that HQ achieves boosts of more than 7% in precision compared to supervised neural network-based compact coding models.
71.Learning Bayesian Sparse Networks with Full Experience Replay for Continual Learning ⬇️
Continual Learning (CL) methods aim to enable machine learning models to learn new tasks without catastrophic forgetting of those that have been previously mastered. Existing CL approaches often keep a buffer of previously-seen samples, perform knowledge distillation, or use regularization techniques towards this goal. Despite their performance, they still suffer from interference across tasks which leads to catastrophic forgetting. To ameliorate this problem, we propose to only activate and select sparse neurons for learning current and past tasks at any stage. More parameters space and model capacity can thus be reserved for the future tasks. This minimizes the interference between parameters for different tasks. To do so, we propose a Sparse neural Network for Continual Learning (SNCL), which employs variational Bayesian sparsity priors on the activations of the neurons in all layers. Full Experience Replay (FER) provides effective supervision in learning the sparse activations of the neurons in different layers. A loss-aware reservoir-sampling strategy is developed to maintain the memory buffer. The proposed method is agnostic as to the network structures and the task boundaries. Experiments on different datasets show that our approach achieves state-of-the-art performance for mitigating forgetting.
72.OG-SGG: Ontology-Guided Scene Graph Generation. A Case Study in Transfer Learning for Telepresence Robotics ⬇️
Scene graph generation from images is a task of great interest to applications such as robotics, because graphs are the main way to represent knowledge about the world and regulate human-robot interactions in tasks such as Visual Question Answering (VQA). Unfortunately, its corresponding area of machine learning is still relatively in its infancy, and the solutions currently offered do not specialize well in concrete usage scenarios. Specifically, they do not take existing "expert" knowledge about the domain world into account; and that might indeed be necessary in order to provide the level of reliability demanded by the use case scenarios. In this paper, we propose an initial approximation to a framework called Ontology-Guided Scene Graph Generation (OG-SGG), that can improve the performance of an existing machine learning based scene graph generator using prior knowledge supplied in the form of an ontology; and we present results evaluated on a specific scenario founded in telepresence robotics.
73.OSegNet: Operational Segmentation Network for COVID-19 Detection using Chest X-ray Images ⬇️
Coronavirus disease 2019 (COVID-19) has been diagnosed automatically using Machine Learning algorithms over chest X-ray (CXR) images. However, most of the earlier studies used Deep Learning models over scarce datasets bearing the risk of overfitting. Additionally, previous studies have revealed the fact that deep networks are not reliable for classification since their decisions may originate from irrelevant areas on the CXRs. Therefore, in this study, we propose Operational Segmentation Network (OSegNet) that performs detection by segmenting COVID-19 pneumonia for a reliable diagnosis. To address the data scarcity encountered in training and especially in evaluation, this study extends the largest COVID-19 CXR dataset: QaTa-COV19 with 121,378 CXRs including 9258 COVID-19 samples with their corresponding ground-truth segmentation masks that are publicly shared with the research community. Consequently, OSegNet has achieved a detection performance with the highest accuracy of 99.65% among the state-of-the-art deep models with 98.09% precision.
74.Diffusion Causal Models for Counterfactual Estimation ⬇️
We consider the task of counterfactual estimation from observational imaging data given a known causal structure. In particular, quantifying the causal effect of interventions for high-dimensional data with neural networks remains an open challenge. Herein we propose Diff-SCM, a deep structural causal model that builds on recent advances of generative energy-based models. In our setting, inference is performed by iteratively sampling gradients of the marginal and conditional distributions entailed by the causal model. Counterfactual estimation is achieved by firstly inferring latent variables with deterministic forward diffusion, then intervening on a reverse diffusion process using the gradients of an anti-causal predictor w.r.t the input. Furthermore, we propose a metric for evaluating the generated counterfactuals. We find that Diff-SCM produces more realistic and minimal counterfactuals than baselines on MNIST data and can also be applied to ImageNet data. Code is available this https URL.
75.Multi-Task Conditional Imitation Learning for Autonomous Navigation at Crowded Intersections ⬇️
In recent years, great efforts have been devoted to deep imitation learning for autonomous driving control, where raw sensory inputs are directly mapped to control actions. However, navigating through densely populated intersections remains a challenging task due to uncertainty caused by uncertain traffic participants. We focus on autonomous navigation at crowded intersections that require interaction with pedestrians. A multi-task conditional imitation learning framework is proposed to adapt both lateral and longitudinal control tasks for safe and efficient interaction. A new benchmark called IntersectNav is developed and human demonstrations are provided. Empirical results show that the proposed method can achieve a success rate gain of up to 30% compared to the state-of-the-art.
76.Transferring Adversarial Robustness Through Robust Representation Matching ⬇️
With the widespread use of machine learning, concerns over its security and reliability have become prevalent. As such, many have developed defenses to harden neural networks against adversarial examples, imperceptibly perturbed inputs that are reliably misclassified. Adversarial training in which adversarial examples are generated and used during training is one of the few known defenses able to reliably withstand such attacks against neural networks. However, adversarial training imposes a significant training overhead and scales poorly with model complexity and input dimension. In this paper, we propose Robust Representation Matching (RRM), a low-cost method to transfer the robustness of an adversarially trained model to a new model being trained for the same task irrespective of architectural differences. Inspired by student-teacher learning, our method introduces a novel training loss that encourages the student to learn the teacher's robust representations. Compared to prior works, RRM is superior with respect to both model performance and adversarial training time. On CIFAR-10, RRM trains a robust model
$\sim 1.8\times$ faster than the state-of-the-art. Furthermore, RRM remains effective on higher-dimensional datasets. On Restricted-ImageNet, RRM trains a ResNet50 model$\sim 18\times$ faster than standard adversarial training.
77.Outlier-based Autism Detection using Longitudinal Structural MRI ⬇️
Diagnosis of Autism Spectrum Disorder (ASD) using clinical evaluation (cognitive tests) is challenging due to wide variations amongst individuals. Since no effective treatment exists, prompt and reliable ASD diagnosis can enable the effective preparation of treatment regimens. This paper proposes structural Magnetic Resonance Imaging (sMRI)-based ASD diagnosis via an outlier detection approach. To learn Spatio-temporal patterns in structural brain connectivity, a Generative Adversarial Network (GAN) is trained exclusively with sMRI scans of healthy subjects. Given a stack of three adjacent slices as input, the GAN generator reconstructs the next three adjacent slices; the GAN discriminator then identifies ASD sMRI scan reconstructions as outliers. This model is compared against two other baselines -- a simpler UNet and a sophisticated Self-Attention GAN. Axial, Coronal, and Sagittal sMRI slices from the multi-site ABIDE II dataset are used for evaluation. Extensive experiments reveal that our ASD detection framework performs comparably with the state-of-the-art with far fewer training data. Furthermore, longitudinal data (two scans per subject over time) achieve 17-28% higher accuracy than cross-sectional data (one scan per subject). Among other findings, metrics employed for model training as well as reconstruction loss computation impact detection performance, and the coronal modality is found to best encode structural information for ASD detection.
78.Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations ⬇️
There have been many attempts to build multimodal dialog systems that can respond to a question about given audio-visual information, and the representative task for such systems is the Audio Visual Scene-Aware Dialog (AVSD). Most conventional AVSD models adopt the Convolutional Neural Network (CNN)-based video feature extractor to understand visual information. While a CNN tends to obtain both temporally and spatially local information, global information is also crucial for boosting video understanding because AVSD requires long-term temporal visual dependency and whole visual information. In this study, we apply the Transformer-based video feature that can capture both temporally and spatially global representations more efficiently than the CNN-based feature. Our AVSD model with its Transformer-based feature attains higher objective performance scores for answer generation. In addition, our model achieves a subjective score close to that of human answers in DSTC10. We observed that the Transformer-based visual feature is beneficial for the AVSD task because our model tends to correctly answer the questions that need a temporally and spatially broad range of visual information.
79.Deconstructing Distributions: A Pointwise Framework of Learning ⬇️
In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a
$\textit{single input point}$ . Specifically, we study a point's$\textit{profile}$ : the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even$\textit{negative}$ correlation: cases where improving overall model accuracy actually$\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is$\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021)
80.A Novel Framework for Brain Tumor Detection Based on Convolutional Variational Generative Models ⬇️
Brain tumor detection can make the difference between life and death. Recently, deep learning-based brain tumor detection techniques have gained attention due to their higher performance. However, obtaining the expected performance of such deep learning-based systems requires large amounts of classified images to train the deep models. Obtaining such data is usually boring, time-consuming, and can easily be exposed to human mistakes which hinder the utilization of such deep learning approaches. This paper introduces a novel framework for brain tumor detection and classification. The basic idea is to generate a large synthetic MRI images dataset that reflects the typical pattern of the brain MRI images from a small class-unbalanced collected dataset. The resulted dataset is then used for training a deep model for detection and classification. Specifically, we employ two types of deep models. The first model is a generative model to capture the distribution of the important features in a set of small class-unbalanced brain MRI images. Then by using this distribution, the generative model can synthesize any number of brain MRI images for each class. Hence, the system can automatically convert a small unbalanced dataset to a larger balanced one. The second model is the classifier that is trained using the large balanced dataset to detect brain tumors in MRI images. The proposed framework acquires an overall detection accuracy of 96.88% which highlights the promise of the proposed framework as an accurate low-overhead brain tumor detection system.
81.Alternative design of DeepPDNet in the context of image restoration ⬇️
This work designs an image restoration deep network relying on unfolded Chambolle-Pock primal-dual iterations. Each layer of our network is built from Chambolle-Pock iterations when specified for minimizing a sum of a
$\ell_2$ -norm data-term and an analysis sparse prior. The parameters of our network are the step-sizes of the Chambolle-Pock scheme and the linear operator involved in sparsity-based penalization, including implicitly the regularization parameter. A backpropagation procedure is fully described. Preliminary experiments illustrate the good behavior of such a deep primal-dual network in the context of image restoration on BSD68 database.
82.Image quality assessment by overlapping task-specific and task-agnostic measures: application to prostate multiparametric MR images for cancer segmentation ⬇️
Image quality assessment (IQA) in medical imaging can be used to ensure that downstream clinical tasks can be reliably performed. Quantifying the impact of an image on the specific target tasks, also named as task amenability, is needed. A task-specific IQA has recently been proposed to learn an image-amenability-predicting controller simultaneously with a target task predictor. This allows for the trained IQA controller to measure the impact an image has on the target task performance, when this task is performed using the predictor, e.g. segmentation and classification neural networks in modern clinical applications. In this work, we propose an extension to this task-specific IQA approach, by adding a task-agnostic IQA based on auto-encoding as the target task. Analysing the intersection between low-quality images, deemed by both the task-specific and task-agnostic IQA, may help to differentiate the underpinning factors that caused the poor target task performance. For example, common imaging artefacts may not adversely affect the target task, which would lead to a low task-agnostic quality and a high task-specific quality, whilst individual cases considered clinically challenging, which can not be improved by better imaging equipment or protocols, is likely to result in a high task-agnostic quality but a low task-specific quality. We first describe a flexible reward shaping strategy which allows for the adjustment of weighting between task-agnostic and task-specific quality scoring. Furthermore, we evaluate the proposed algorithm using a clinically challenging target task of prostate tumour segmentation on multiparametric magnetic resonance (mpMR) images, from 850 patients. The proposed reward shaping strategy, with appropriately weighted task-specific and task-agnostic qualities, successfully identified samples that need re-acquisition due to defected imaging process.
83.Clustering by the Probability Distributions from Extreme Value Theory ⬇️
Clustering is an essential task to unsupervised learning. It tries to automatically separate instances into coherent subsets. As one of the most well-known clustering algorithms, k-means assigns sample points at the boundary to a unique cluster, while it does not utilize the information of sample distribution or density. Comparably, it would potentially be more beneficial to consider the probability of each sample in a possible cluster. To this end, this paper generalizes k-means to model the distribution of clusters. Our novel clustering algorithm thus models the distributions of distances to centroids over a threshold by Generalized Pareto Distribution (GPD) in Extreme Value Theory (EVT). Notably, we propose the concept of centroid margin distance, use GPD to establish a probability model for each cluster, and perform a clustering algorithm based on the covering probability function derived from GPD. Such a GPD k-means thus enables the clustering algorithm from the probabilistic perspective. Correspondingly, we also introduce a naive baseline, dubbed as Generalized Extreme Value (GEV) k-means. GEV fits the distribution of the block maxima. In contrast, the GPD fits the distribution of distance to the centroid exceeding a sufficiently large threshold, leading to a more stable performance of GPD k-means. Notably, GEV k-means can also estimate cluster structure and thus perform reasonably well over classical k-means. Thus, extensive experiments on synthetic datasets and real datasets demonstrate that GPD k-means outperforms competitors. The github codes are released in this https URL.
84.RDP-Net: Region Detail Preserving Network for Change Detection ⬇️
Change detection (CD) is an essential earth observation technique. It captures the dynamic information of land objects. With the rise of deep learning, neural networks (NN) have shown great potential in CD. However, current NN models introduce backbone architectures that lose the detail information during learning. Moreover, current NN models are heavy in parameters, which prevents their deployment on edge devices such as drones. In this work, we tackle this issue by proposing RDP-Net: a region detail preserving network for CD. We propose an efficient training strategy that quantifies the importance of individual samples during the warmup period of NN training. Then, we perform non-uniform sampling based on the importance score so that the NN could learn detail information from easy to hard. Next, we propose an effective edge loss that improves the network's attention on details such as boundaries and small regions. As a result, we provide a NN model that achieves the state-of-the-art empirical performance in CD with only 1.70M parameters. We hope our RDP-Net would benefit the practical CD applications on compact devices and could inspire more people to bring change detection to a new level with the efficient training strategy.
85.The Loop Game: Quality Assessment and Optimization for Low-Light Image Enhancement ⬇️
There is an increasing consensus that the design and optimization of low light image enhancement methods need to be fully driven by perceptual quality. With numerous approaches proposed to enhance low-light images, much less work has been dedicated to quality assessment and quality optimization of low-light enhancement. In this paper, to close the gap between enhancement and assessment, we propose a loop enhancement framework that produces a clear picture of how the enhancement of low-light images could be optimized towards better visual quality. In particular, we create a large-scale database for QUality assessment Of The Enhanced LOw-Light Image (QUOTE-LOL), which serves as the foundation in studying and developing objective quality assessment measures. The objective quality assessment measure plays a critical bridging role between visual quality and enhancement and is further incorporated in the optimization in learning the enhancement model towards perceptual optimally. Finally, we iteratively perform the enhancement and optimization tasks, enhancing the low-light images continuously. The superiority of the proposed scheme is validated based on various low-light scenes. The database as well as the code will be available.
86.Overparametrization improves robustness against adversarial attacks: A replication study ⬇️
Overparametrization has become a de facto standard in machine learning. Despite numerous efforts, our understanding of how and where overparametrization helps model accuracy and robustness is still limited. To this end, here we conduct an empirical investigation to systemically study and replicate previous findings in this area, in particular the study by Madry et al. Together with this study, our findings support the "universal law of robustness" recently proposed by Bubeck et al. We argue that while critical for robust perception, overparametrization may not be enough to achieve full robustness and smarter architectures e.g. the ones implemented by the human visual cortex) seem inevitable.
87.Statistical and Topological Summaries Aid Disease Detection for Segmented Retinal Vascular Images ⬇️
Disease complications can alter vascular network morphology and disrupt tissue functioning. Diabetic retinopathy, for example, is a complication of type 1 and 2 diabetus mellitus that can cause blindness. Microvascular diseases are assessed by visual inspection of retinal images, but this can be challenging when diseases exhibit silent symptoms or patients cannot attend in-person meetings. We examine the performance of machine learning algorithms in detecting microvascular disease when trained on either statistical or topological summaries of segmented retinal vascular images. We apply our methods to four publicly-available datasets and find that the fractal dimension performs best for high resolution images. By contrast, we find that topological descriptor vectors quantifying the number of loops in the data achieve the highest accuracy for low resolution images. Further analysis, using the topological approach, reveals that microvascular disease may alter morphology by reducing the number of loops in the retinal vasculature. Our work provides preliminary guidelines on which methods are most appropriate for assessing disease in high and low resolution images. In the longer term, these methods could be incorporated into automated disease assessment tools.
88.Punctuation Restoration ⬇️
Given the increasing number of livestreaming videos, automatic speech recognition and post-processing for livestreaming video transcripts are crucial for efficient data management as well as knowledge mining. A key step in this process is punctuation restoration which restores fundamental text structures such as phrase and sentence boundaries from the video transcripts. This work presents a new human-annotated corpus, called BehancePR, for punctuation restoration in livestreaming video transcripts. Our experiments on BehancePR demonstrate the challenges of punctuation restoration for this domain. Furthermore, we show that popular natural language processing toolkits are incapable of detecting sentence boundary on non-punctuated transcripts of livestreaming videos, calling for more research effort to develop robust models for this area.
89.Echofilter: A Deep Learning Segmentation Model Improves the Automation, Standardization, and Timeliness for Post-Processing Echosounder Data in Tidal Energy Streams ⬇️
Understanding the abundance and distribution of fish in tidal energy streams is important for assessing the risk presented by the introduction of tidal energy devices into the habitat. However, the impressive tidal currents that make sites favorable for tidal energy development are often highly turbulent and entrain air into the water, complicating the interpretation of echosounder data. The portion of the water column contaminated by returns from entrained air must be excluded from data used for biological analyses. Application of a single algorithm to identify the depth-of-penetration of entrained-air is insufficient for a boundary that is discontinuous, depth-dynamic, porous, and widely variable across the tidal flow speeds which can range from 0 to 5m/s. Using a case study at a tidal energy demonstration site in the Bay of Fundy, we describe the development and application of deep learning models that produce a pronounced, consistent, substantial, and measurable improvement of the automated detection of the extent to which entrained-air has penetrated the water column.
Our model, Echofilter, was highly responsive to the dynamic range of turbulence conditions and sensitive to the fine-scale nuances in the boundary position, producing an entrained-air boundary line with an average error of 0.32m on mobile downfacing and 0.5-1.0m on stationary upfacing data. The model's annotations had a high level of agreement with the human segmentation (mobile downfacing Jaccard index: 98.8%; stationary upfacing: 93-95%). This resulted in a 50% reduction in the time required for manual edits compared to the time required to manually edit the line placed by currently available algorithms. Because of the improved initial automated placement, the implementation of the models generated a marked increase in the standardization and repeatability of line placement.
90.A Lightweight Dual-Domain Attention Framework for Sparse-View CT Reconstruction ⬇️
Computed Tomography (CT) plays an essential role in clinical diagnosis. Due to the adverse effects of radiation on patients, the radiation dose is expected to be reduced as low as possible. Sparse sampling is an effective way, but it will lead to severe artifacts on the reconstructed CT image, thus sparse-view CT image reconstruction has been a prevailing and challenging research area. With the popularity of mobile devices, the requirements for lightweight and real-time networks are increasing rapidly. In this paper, we design a novel lightweight network called CAGAN, and propose a dual-domain reconstruction pipeline for parallel beam sparse-view CT. CAGAN is an adversarial auto-encoder, combining the Coordinate Attention unit, which preserves the spatial information of features. Also, the application of Shuffle Blocks reduces the parameters by a quarter without sacrificing its performance. In the Radon domain, the CAGAN learns the mapping between the interpolated data and fringe-free projection data. After the restored Radon data is reconstructed to an image, the image is sent into the second CAGAN trained for recovering the details, so that a high-quality image is obtained. Experiments indicate that the CAGAN strikes an excellent balance between model complexity and performance, and our pipeline outperforms the DD-Net and the DuDoNet.
91.Bit-wise Training of Neural Network Weights ⬇️
We introduce an algorithm where the individual bits representing the weights of a neural network are learned. This method allows training weights with integer values on arbitrary bit-depths and naturally uncovers sparse networks, without additional constraints or regularization techniques. We show better results than the standard training technique with fully connected networks and similar performance as compared to standard training for convolutional and residual networks. By training bits in a selective manner we found that the biggest contribution to achieving high accuracy is given by the first three most significant bits, while the rest provide an intrinsic regularization. As a consequence more than 90% of a network can be used to store arbitrary codes without affecting its accuracy. These codes may be random noise, binary files or even the weights of previously trained networks.
92.SPNet: A novel deep neural network for retinal vessel segmentation based on shared decoder and pyramid-like loss ⬇️
Segmentation of retinal vessel images is critical to the diagnosis of retinopathy. Recently, convolutional neural networks have shown significant ability to extract the blood vessel structure. However, it remains challenging to refined segmentation for the capillaries and the edges of retinal vessels due to thickness inconsistencies and blurry boundaries. In this paper, we propose a novel deep neural network for retinal vessel segmentation based on shared decoder and pyramid-like loss (SPNet) to address the above problems. Specifically, we introduce a decoder-sharing mechanism to capture multi-scale semantic information, where feature maps at diverse scales are decoded through a sequence of weight-sharing decoder modules. Also, to strengthen characterization on the capillaries and the edges of blood vessels, we define a residual pyramid architecture which decomposes the spatial information in the decoding phase. A pyramid-like loss function is designed to compensate possible segmentation errors progressively. Experimental results on public benchmarks show that the proposed method outperforms the backbone network and the state-of-the-art methods, especially in the regions of the capillaries and the vessel contours. In addition, performances on cross-datasets verify that SPNet shows stronger generalization ability.
93.Learning Representations Robust to Group Shifts and Adversarial Examples ⬇️
Despite the high performance achieved by deep neural networks on various tasks, extensive studies have demonstrated that small tweaks in the input could fail the model predictions. This issue of deep neural networks has led to a number of methods to improve model robustness, including adversarial training and distributionally robust optimization. Though both of these two methods are geared towards learning robust models, they have essentially different motivations: adversarial training attempts to train deep neural networks against perturbations, while distributional robust optimization aims at improving model performance on the most difficult "uncertain distributions". In this work, we propose an algorithm that combines adversarial training and group distribution robust optimization to improve robust representation learning. Experiments on three image benchmark datasets illustrate that the proposed method achieves superior results on robust metrics without sacrificing much of the standard measures.
94.A Molecular Prior Distribution for Bayesian Inference Based on Wilson Statistics ⬇️
Background and Objective: Wilson statistics describe well the power spectrum of proteins at high frequencies. Therefore, it has found several applications in structural biology, e.g., it is the basis for sharpening steps used in cryogenic electron microscopy (cryo-EM). A recent paper gave the first rigorous proof of Wilson statistics based on a formalism of Wilson's original argument. This new analysis also leads to statistical estimates of the scattering potential of proteins that reveal a correlation between neighboring Fourier coefficients. Here we exploit these estimates to craft a novel prior that can be used for Bayesian inference of molecular structures. Methods: We describe the properties of the prior and the computation of its hyperparameters. We then evaluate the prior on two synthetic linear inverse problems, and compare against a popular prior in cryo-EM reconstruction at a range of SNRs. Results: We show that the new prior effectively suppresses noise and fills-in low SNR regions in the spectral domain. Furthermore, it improves the resolution of estimates on the problems considered for a wide range of SNR and produces Fourier Shell Correlation curves that are insensitive to masking effects. Conclusions: We analyze the assumptions in the model, discuss relations to other regularization strategies, and postulate on potential implications for structure determination in cryo-EM.