Skip to content
This repository has been archived by the owner on Apr 21, 2024. It is now read-only.

Latest commit

 

History

History
131 lines (131 loc) · 91.8 KB

20220126.md

File metadata and controls

131 lines (131 loc) · 91.8 KB

ArXiv cs.CV --Wed, 26 Jan 2022

1.MonarchNet: Differentiating Monarch Butterflies from Butterflies Species with Similar Phenotypes ⬇️

In recent years, the monarch butterfly's iconic migration patterns have come under threat from a number of factors, from climate change to pesticide use. To track trends in their populations, scientists as well as citizen scientists must identify individuals accurately. This is uniquely key for the study of monarch butterflies because there exist other species of butterfly, such as viceroy butterflies, that are "look-alikes" (coined by the Convention on International Trade in Endangered Species of Wild Fauna and Flora), having similar phenotypes. To tackle this problem and to aid in more efficient identification, we present MonarchNet, the first comprehensive dataset consisting of butterfly imagery for monarchs and five look-alike species. We train a baseline deep-learning classification model to serve as a tool for differentiating monarch butterflies and its various look-alikes. We seek to contribute to the study of biodiversity and butterfly ecology by providing a novel method for computational classification of these particular butterfly species. The ultimate aim is to help scientists track monarch butterfly population and migration trends in the most precise and efficient manner possible.

2.Interpretability in Convolutional Neural Networks for Building Damage Classification in Satellite Imagery ⬇️

Natural disasters ravage the world's cities, valleys, and shores on a regular basis. Deploying precise and efficient computational mechanisms for assessing infrastructure damage is essential to channel resources and minimize the loss of life. Using a dataset that includes labeled pre- and post- disaster satellite imagery, we take a machine learning-based remote sensing approach and train multiple convolutional neural networks (CNNs) to assess building damage on a per-building basis. We present a novel methodology of interpretable deep learning that seeks to explicitly investigate the most useful modalities of information in the training data to create an accurate classification model. We also investigate which loss functions best optimize these models. Our findings include that ordinal-cross entropy loss is the most optimal criterion for optimization to use and that including the type of disaster that caused the damage in combination with pre- and post-disaster training data most accurately predicts the level of damage caused. Further, we make progress in the qualitative representation of which parts of the images that the model is using to predict damage levels, through gradient-weighted class activation mapping (Grad-CAM). Our research seeks to computationally contribute to aiding in this ongoing and growing humanitarian crisis, heightened by anthropogenic climate change.

3.Blind Image Deblurring: a Review ⬇️

This is a review on blind image deblurring. First, we formulate the blind image deblurring problem and explain why it is challenging. Next, we bring some psychological and cognitive studies on the way our human vision system deblurs. Then, relying on several previous reviews, we discuss the topic of metrics and datasets, which is non-trivial to blind deblurring. Finally, we introduce some typical optimization-based methods and learning-based methods.

4.A Review of Deep Learning Based Image Super-resolution Techniques ⬇️

Image super-resolution technology is the process of obtaining high-resolution images from one or more low-resolution images. With the development of deep learning, image super-resolution technology based on deep learning method is emerging. This paper reviews the research progress of the application of depth learning method in the field of image super-resolution, introduces this kind of super-resolution work from several aspects, and looks forward to the further application of depth learning method in the field of image super-resolution. By collecting and counting the relevant literature on the application of depth learning in the field of image super-resolution, we preliminarily summarizes the application results of depth learning method in the field of image super-resolution, and reports the latest progress of image super-resolution technology based on depth learning method.

5.Adaptive Activation-based Structured Pruning ⬇️

Pruning is a promising approach to compress complex deep learning models in order to deploy them on resource-constrained edge devices. However, many existing pruning solutions are based on unstructured pruning, which yield models that cannot efficiently run on commodity hardware, and require users to manually explore and tune the pruning process, which is time consuming and often leads to sub-optimal results. To address these limitations, this paper presents an adaptive, activation-based, structured pruning approach to automatically and efficiently generate small, accurate, and hardware-efficient models that meet user requirements. First, it proposes iterative structured pruning using activation-based attention feature maps to effectively identify and prune unimportant filters. Then, it proposes adaptive pruning policies for automatically meeting the pruning objectives of accuracy-critical, memory-constrained, and latency-sensitive tasks. A comprehensive evaluation shows that the proposed method can substantially outperform the state-of-the-art structured pruning works on CIFAR-10 and ImageNet datasets. For example, on ResNet-56 with CIFAR-10, without any accuracy drop, our method achieves the largest parameter reduction (79.11%), outperforming the related works by 22.81% to 66.07%, and the largest FLOPs reduction (70.13%), outperforming the related works by 14.13% to 26.53%.

6.Sphere2Vec: Multi-Scale Representation Learning over a Spherical Surface for Geospatial Predictions ⬇️

Generating learning-friendly representations for points in a 2D space is a fundamental and long-standing problem in machine learning. Recently, multi-scale encoding schemes (such as Space2Vec) were proposed to directly encode any point in 2D space as a high-dimensional vector, and has been successfully applied to various (geo)spatial prediction tasks. However, a map projection distortion problem rises when applying location encoding models to large-scale real-world GPS coordinate datasets (e.g., species images taken all over the world) - all current location encoding models are designed for encoding points in a 2D (Euclidean) space but not on a spherical surface, e.g., earth surface. To solve this problem, we propose a multi-scale location encoding model called Sphere2V ec which directly encodes point coordinates on a spherical surface while avoiding the mapprojection distortion problem. We provide theoretical proof that the Sphere2Vec encoding preserves the spherical surface distance between any two points. We also developed a unified view of distance-reserving encoding on spheres based on the Double Fourier Sphere (DFS). We apply Sphere2V ec to the geo-aware image classification task. Our analysis shows that Sphere2V ec outperforms other 2D space location encoder models especially on the polar regions and data-sparse areas for image classification tasks because of its nature for spherical surface distance preservation.

7.How Low Can We Go? Pixel Annotation for Semantic Segmentation ⬇️

How many labeled pixels are needed to segment an image, without any prior knowledge? We conduct an experiment to answer this question.
In our experiment, an Oracle is using Active Learning to train a network from scratch. The Oracle has access to the entire label map of the image, but the goal is to reveal as little pixel labels to the network as possible. We find that, on average, the Oracle needs to reveal (i.e., annotate) less than $0.1%$ of the pixels in order to train a network. The network can then label all pixels in the image at an accuracy of more than $98%$.
Based on this single-image-annotation experiment, we design an experiment to quickly annotate an entire data set. In the data set level experiment the Oracle trains a new network for each image from scratch. The network can then be used to create pseudo-labels, which are the network predicted labels of the unlabeled pixels, for the entire image. Only then, a data set level network is trained from scratch on all the pseudo-labeled images at once.
We repeat both image level and data set level experiments on two, very different, real-world data sets, and find that it is possible to reach the performance of a fully annotated data set using a fraction of the annotation cost.

8.Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition ⬇️

Audio-visual automatic speech recognition (AV-ASR) extends the speech recognition by introducing the video modality. In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. 3D version of VGG). Recently, image transformer networks arXiv:2010.11929 demonstrated the ability to extract rich visual features for the image classification task. In this work, we propose to replace the 3D convolution with a video transformer video feature extractor. We train our baselines and the proposed model on a large scale corpus of the YouTube videos. Then we evaluate the performance on a labeled subset of YouTube as well as on the public corpus LRS3-TED. Our best model video-only model achieves the performance of 34.9% WER on YTDEV18 and 19.3% on LRS3-TED which is a 10% and 9% relative improvements over the convolutional baseline. We achieve the state of the art performance of the audio-visual recognition on the LRS3-TED after fine-tuning our model (1.6% WER).

9.Main Product Detection with Graph Networks for Fashion ⬇️

Computer vision has established a foothold in the online fashion retail industry. Main product detection is a crucial step of vision-based fashion product feed parsing pipelines, focused in identifying the bounding boxes that contain the product being sold in the gallery of images of the product page. The current state-of-the-art approach does not leverage the relations between regions in the image, and treats images of the same product independently, therefore not fully exploiting visual and product contextual information. In this paper we propose a model that incorporates Graph Convolutional Networks (GCN) that jointly represent all detected bounding boxes in the gallery as nodes. We show that the proposed method is better than the state-of-the-art, especially, when we consider the scenario where title-input is missing at inference time and for cross-dataset evaluation, our method outperforms previous approaches by a large margin.

10.Rayleigh EigenDirections (REDs): GAN latent space traversals for multidimensional features ⬇️

We present a method for finding paths in a deep generative model's latent space that can maximally vary one set of image features while holding others constant. Crucially, unlike past traversal approaches, ours can manipulate multidimensional features of an image such as facial identity and pixels within a specified region. Our method is principled and conceptually simple: optimal traversal directions are chosen by maximizing differential changes to one feature set such that changes to another set are negligible. We show that this problem is nearly equivalent to one of Rayleigh quotient maximization, and provide a closed-form solution to it based on solving a generalized eigenvalue equation. We use repeated computations of the corresponding optimal directions, which we call Rayleigh EigenDirections (REDs), to generate appropriately curved paths in latent space. We empirically evaluate our method using StyleGAN2 on two image domains: faces and living rooms. We show that our method is capable of controlling various multidimensional features out of the scope of previous latent space traversal methods: face identity, spatial frequency bands, pixels within a region, and the appearance and position of an object. Our work suggests that a wealth of opportunities lies in the local analysis of the geometry and semantics of latent spaces.

11.Comparison of Evaluation Metrics for Landmark Detection in CMR Images ⬇️

Cardiac Magnetic Resonance (CMR) images are widely used for cardiac diagnosis and ventricular assessment. Extracting specific landmarks like the right ventricular insertion points is of importance for spatial alignment and 3D modeling. The automatic detection of such landmarks has been tackled by multiple groups using Deep Learning, but relatively little attention has been paid to the failure cases of evaluation metrics in this field. In this work, we extended the public ACDC dataset with additional labels of the right ventricular insertion points and compare different variants of a heatmap-based landmark detection pipeline. In this comparison, we demonstrate very likely pitfalls of apparently simple detection and localisation metrics which highlights the importance of a clear detection strategy and the definition of an upper limit for localisation-based metrics. Our preliminary results indicate that a combination of different metrics is necessary, as they yield different winners for method comparison. Additionally, they highlight the need of a comprehensive metric description and evaluation standardisation, especially for the error cases where no metrics could be computed or where no lower/upper boundary of a metric exists. Code and labels: this https URL

12.Towards Cross-Disaster Building Damage Assessment with Graph Convolutional Networks ⬇️

In the aftermath of disasters, building damage maps are obtained using change detection to plan rescue operations. Current convolutional neural network approaches do not consider the similarities between neighboring buildings for predicting the damage. We present a novel graph-based building damage detection solution to capture these relationships. Our proposed model architecture learns from both local and neighborhood features to predict building damage. Specifically, we adopt the sample and aggregate graph convolution strategy to learn aggregation functions that generalize to unseen graphs which is essential for alleviating the time needed to obtain predictions for new disasters. Our experiments on the xBD dataset and comparisons with a classical convolutional neural network reveal that while our approach is handicapped by class imbalance, it presents a promising and distinct advantage when it comes to cross-disaster generalization.

13.Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition ⬇️

We address the problem of capturing temporal information for video classification in 2D networks, without increasing computational cost. Existing approaches focus on modifying the architecture of 2D networks (e.g. by including filters in the temporal dimension to turn them into 3D networks, or using optical flow, etc.), which increases computation cost. Instead, we propose a novel sampling strategy, where we re-order the channels of the input video, to capture short-term frame-to-frame changes. We observe that without bells and whistles, the proposed sampling strategy improves performance on multiple architectures (e.g. TSN, TRN, and TSM) and datasets (CATER, Something-Something-V1 and V2), up to 24% over the baseline of using the standard video input. In addition, our sampling strategies do not require training from scratch and do not increase the computational cost of training and testing. Given the generality of the results and the flexibility of the approach, we hope this can be widely useful to the video understanding community. Code is available at this https URL.

14.BLDNet: A Semi-supervised Change Detection Building Damage Framework using Graph Convolutional Networks and Urban Domain Knowledge ⬇️

Change detection is instrumental to localize damage and understand destruction in disaster informatics. While convolutional neural networks are at the core of recent change detection solutions, we present in this work, BLDNet, a novel graph formulation for building damage change detection and enable learning relationships and representations from both local patterns and non-stationary neighborhoods. More specifically, we use graph convolutional networks to efficiently learn these features in a semi-supervised framework with few annotated data. Additionally, BLDNet formulation allows for the injection of additional contextual building meta-features. We train and benchmark on the xBD dataset to validate the effectiveness of our approach. We also demonstrate on urban data from the 2020 Beirut Port Explosion that performance is improved by incorporating domain knowledge building meta-features.

15.Winograd Convolution for Deep Neural Networks: Efficient Point Selection ⬇️

Convolutional neural networks (CNNs) have dramatically improved the accuracy of tasks such as object recognition, image segmentation and interactive speech systems. CNNs require large amounts of computing resources because ofcomputationally intensive convolution layers. Fast convolution algorithms such as Winograd convolution can greatly reduce the computational cost of these layers at a cost of poor numeric properties, such that greater savings in computation exponentially increase floating point errors.
A defining feature of each Winograd convolution algorithm is a set of real-value points where polynomials are sampled. The choice of points impacts the numeric accuracy of the algorithm, but the optimal set of points for small convolutions remains unknown. Existing work considers only small integers and simple fractions as candidate points. In this work, we propose a novel approach to point selection using points of the form {-1/c , -c, c, 1/c } using the full range of real-valued numbers for c. We show that groups of this form cause cancellations in the Winograd transform matrices that reduce numeric error. We find empirically that the error for different values of c forms a rough curve across the range of real-value numbers helping to localize the values of c that reduce error and that lower errors can be achieved with non-obvious real-valued evaluation points instead of integers or simple fractions. We study a range of sizes for small convolutions and achieve reduction in error ranging from 2% to around 59% for both 1D and 2D convolution. Furthermore, we identify patterns in cases when we select a subset of our proposed points which will always lead to a lower error. Finally we implement a complete Winograd convolution layer and use it to run deep convolution neural networks on real datasets and show that our proposed points reduce error, ranging from 22% to 63%.

16.ADAPT: An Open-Source sUAS Payload for Real-Time Disaster Prediction and Response with AI ⬇️

Small unmanned aircraft systems (sUAS) are becoming prominent components of many humanitarian assistance and disaster response (HADR) operations. Pairing sUAS with onboard artificial intelligence (AI) substantially extends their utility in covering larger areas with fewer support personnel. A variety of missions, such as search and rescue, assessing structural damage, and monitoring forest fires, floods, and chemical spills, can be supported simply by deploying the appropriate AI models. However, adoption by resource-constrained groups, such as local municipalities, regulatory agencies, and researchers, has been hampered by the lack of a cost-effective, readily-accessible baseline platform that can be adapted to their unique missions. To fill this gap, we have developed the free and open-source ADAPT multi-mission payload for deploying real-time AI and computer vision onboard a sUAS during local and beyond-line-of-site missions. We have emphasized a modular design with low-cost, readily-available components, open-source software, and thorough documentation (this https URL). The system integrates an inertial navigation system, high-resolution color camera, computer, and wireless downlink to process imagery and broadcast georegistered analytics back to a ground station. Our goal is to make it easy for the HADR community to build their own copies of the ADAPT payload and leverage the thousands of hours of engineering we have devoted to developing and testing. In this paper, we detail the development and testing of the ADAPT payload. We demonstrate the example mission of real-time, in-flight ice segmentation to monitor river ice state and provide timely predictions of catastrophic flooding events. We deploy a novel active learning workflow to annotate river ice imagery, train a real-time deep neural network for ice segmentation, and demonstrate operation in the field.

17.ShapeFormer: Transformer-based Shape Completion via Sparse Representation ⬇️

We present ShapeFormer, a transformer-based network that produces a distribution of object completions, conditioned on incomplete, and possibly noisy, point clouds. The resultant distribution can then be sampled to generate likely completions, each exhibiting plausible shape details while being faithful to the input. To facilitate the use of transformers for 3D, we introduce a compact 3D representation, vector quantized deep implicit function, that utilizes spatial sparsity to represent a close approximation of a 3D shape by a short sequence of discrete variables. Experiments demonstrate that ShapeFormer outperforms prior art for shape completion from ambiguous partial inputs in terms of both completion quality and diversity. We also show that our approach effectively handles a variety of shape types, incomplete patterns, and real-world scans.

18.City3D: Large-scale Urban Reconstruction from Airborne Point Clouds ⬇️

We present a fully automatic approach for reconstructing compact 3D building models from large-scale airborne point clouds. A major challenge of urban reconstruction from airborne point clouds lies in that the vertical walls are typically missing. Based on the observation that urban buildings typically consist of planar roofs connected with vertical walls to the ground, we propose an approach to infer the vertical walls directly from the data. With the planar segments of both roofs and walls, we hypothesize the faces of the building surface, and the final model is obtained by using an extended hypothesis-and-selection-based polygonal surface reconstruction framework. Specifically, we introduce a new energy term to encourage roof preferences and two additional hard constraints into the optimization step to ensure correct topology and enhance detail recovery. Experiments on various large-scale airborne point clouds have demonstrated that the method is superior to the state-of-the-art methods in terms of reconstruction accuracy and robustness. In addition, we have generated a new dataset with our method consisting of the point clouds and 3D models of 20k real-world buildings. We believe this dataset can stimulate research in urban reconstruction from airborne point clouds and the use of 3D city models in urban applications.

19.Convolutional Xformers for Vision ⬇️

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reason for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce its GPU usage. Inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for class token and positional embeddings used by the ViTs. We also propose a new training method where we use two different optimizers during different phases of training and show that it improves the top-1 image classification accuracy across different architectures. CXV outperforms other architectures, token mixers (e.g. ConvMixer, FNet and MLP Mixer), transformer models (e.g. ViT, CCT, CvT and hybrid Xformers), and ResNets for image classification in scenarios with limited data and GPU resources (cores, RAM, power).

20.DocEnTr: An End-to-End Document Image Enhancement Transformer ⬇️

Document images can be affected by many degradation scenarios, which cause recognition and processing difficulties. In this age of digitization, it is important to denoise them for proper usage. To address this challenge, we present a new encoder-decoder architecture based on vision transformers to enhance both machine-printed and handwritten document images, in an end-to-end fashion. The encoder operates directly on the pixel patches with their positional information without the use of any convolutional layers, while the decoder reconstructs a clean image from the encoded patches. Conducted experiments show a superiority of the proposed model compared to the state-of the-art methods on several DIBCO benchmarks. Code and models will be publicly available at: \url{this https URL}.

21.BERTHA: Video Captioning Evaluation Via Transfer-Learned Human Assessment ⬇️

Evaluating video captioning systems is a challenging task as there are multiple factors to consider; for instance: the fluency of the caption, multiple actions happening in a single scene, and the human bias of what is considered important. Most metrics try to measure how similar the system generated captions are to a single or a set of human-annotated captions. This paper presents a new method based on a deep learning model to evaluate these systems. The model is based on BERT, which is a language model that has been shown to work well in multiple NLP tasks. The aim is for the model to learn to perform an evaluation similar to that of a human. To do so, we use a dataset that contains human evaluations of system generated captions. The dataset consists of the human judgments of the captions produce by the system participating in various years of the TRECVid video to text task. These annotations will be made publicly available. BERTHA obtain favourable results, outperforming the commonly used metrics in some setups.

22.Feature Diversity Learning with Sample Dropout for Unsupervised Domain Adaptive Person Re-identification ⬇️

Clustering-based approach has proved effective in dealing with unsupervised domain adaptive person re-identification (ReID) tasks. However, existing works along this approach still suffer from noisy pseudo labels and the unreliable generalization ability during the whole training process. To solve these problems, this paper proposes a new approach to learn the feature representation with better generalization ability through limiting noisy pseudo labels. At first, we propose a Sample Dropout (SD) method to prevent the training of the model from falling into the vicious circle caused by samples that are frequently assigned with noisy pseudo labels. In addition, we put forward a brand-new method referred as to Feature Diversity Learning (FDL) under the classic mutual-teaching architecture, which can significantly improve the generalization ability of the feature representation on the target domain. Experimental results show that our proposed FDL-SD achieves the state-of-the-art performance on multiple benchmark datasets.

23.Universal Generative Modeling for Calibration-free Parallel Mr Imaging ⬇️

The integration of compressed sensing and parallel imaging (CS-PI) provides a robust mechanism for accelerating MRI acquisitions. However, most such strategies require the explicit formation of either coil sensitivity profiles or a cross-coil correlation operator, and as a result reconstruction corresponds to solving a challenging bilinear optimization problem. In this work, we present an unsupervised deep learning framework for calibration-free parallel MRI, coined universal generative modeling for parallel imaging (UGM-PI). More precisely, we make use of the merits of both wavelet transform and the adaptive iteration strategy in a unified framework. We train a powerful noise conditional score network by forming wavelet tensor as the network input at the training phase. Experimental results on both physical phantom and in vivo datasets implied that the proposed method is comparable and even superior to state-of-the-art CS-PI reconstruction approaches.

24.Zero-Shot Sketch Based Image Retrieval using Graph Transformer ⬇️

The performance of a zero-shot sketch-based image retrieval (ZS-SBIR) task is primarily affected by two challenges. The substantial domain gap between image and sketch features needs to be bridged, while at the same time the side information has to be chosen tactfully. Existing literature has shown that varying the semantic side information greatly affects the performance of ZS-SBIR. To this end, we propose a novel graph transformer based zero-shot sketch-based image retrieval (GTZSR) framework for solving ZS-SBIR tasks which uses a novel graph transformer to preserve the topology of the classes in the semantic space and propagates the context-graph of the classes within the embedding features of the visual space. To bridge the domain gap between the visual features, we propose minimizing the Wasserstein distance between images and sketches in a learned domain-shared space. We also propose a novel compatibility loss that further aligns the two visual domains by bridging the domain gap of one class with respect to the domain gap of all other classes in the training set. Experimental results obtained on the extended Sketchy, TU-Berlin, and QuickDraw datasets exhibit sharp improvements over the existing state-of-the-art methods in both ZS-SBIR and generalized ZS-SBIR.

25.Estimating the Direction and Radius of Pipe from GPR Image by Ellipse Inversion Model ⬇️

Ground Penetrating Radar (GPR) is widely used as a non-destructive approach to estimate buried utilities. When the GPR's detecting direction is perpendicular to a pipeline, a hyperbolic characteristic would be formed on the GPR B-scan image. However, in real-world applications, the direction of pipelines on the existing pipeline map could be inaccurate, and it is hard to ensure the moving direction of GPR to be actually perpendicular to underground pipelines. In this paper, a novel model is proposed to estimate the direction and radius of pipeline and revise the existing pipeline map from GPR B-scan images. The model consists of two parts: GPR B-scan image processing and Ellipse Iterative Inversion Algorithm (EIIA). Firstly, the GPR B-scan image is processed with downward-opening point set extracted. The obtained point set is then iteratively inverted to the elliptical cross section of the buried pipeline, which is caused by the angle between the GPR's detecting direction and the pipeline's direction. By minimizing the sum of the algebraic distances from the extracted point set to the inverted ellipse, the most likely pipeline's direction and radius are determined. Experiments on real-world datasets are conducted, and the results demonstrate the effectiveness of the method.

26.Pre-Trained Language Transformers are Universal Image Classifiers ⬇️

Facial images disclose many hidden personal traits such as age, gender, race, health, emotion, and psychology. Understanding these traits will help to classify the people in different attributes. In this paper, we have presented a novel method for classifying images using a pretrained transformer model. We apply the pretrained transformer for the binary classification of facial images in criminal and non-criminal classes. The pretrained transformer of GPT-2 is trained to generate text and then fine-tuned to classify facial images. During the finetuning process with images, most of the layers of GT-2 are frozen during backpropagation and the model is frozen pretrained transformer (FPT). The FPT acts as a universal image classifier, and this paper shows the application of FPT on facial images. We also use our FPT on encrypted images for classification. Our FPT shows high accuracy on both raw facial images and encrypted images. We hypothesize the meta-learning capacity FPT gained because of its large size and trained on a large size with theory and experiments. The GPT-2 trained to generate a single word token at a time, through the autoregressive process, forced to heavy-tail distribution. Then the FPT uses the heavy-tail property as its meta-learning capacity for classifying images. Our work shows one way to avoid bias during the machine classification of images.The FPT encodes worldly knowledge because of the pretraining of one text, which it uses during the classification. The statistical error of classification is reduced because of the added context gained from the text.Our paper shows the ethical dimension of using encrypted data for classification.Criminal images are sensitive to share across the boundary but encrypted largely evades ethical concern.FPT showing good classification accuracy on encrypted images shows promise for further research on privacy-preserving machine learning.

27.RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals ⬇️

Human silhouette segmentation, which is originally defined in computer vision, has achieved promising results for understanding human activities. However, the physical limitation makes existing systems based on optical cameras suffer from severe performance degradation under low illumination, smoke, and/or opaque obstruction conditions. To overcome such limitations, in this paper, we propose to utilize the radio signals, which can traverse obstacles and are unaffected by the lighting conditions to achieve silhouette segmentation. The proposed RFMask framework is composed of three modules. It first transforms RF signals captured by millimeter wave radar on two planes into spatial domain and suppress interference with the signal processing module. Then, it locates human reflections on RF frames and extract features from surrounding signals with human detection module. Finally, the extracted features from RF frames are aggregated with an attention based mask generation module. To verify our proposed framework, we collect a dataset containing 804,760 radio frames and 402,380 camera frames with human activities under various scenes. Experimental results show that the proposed framework can achieve impressive human silhouette segmentation even under the challenging scenarios(such as low light and occlusion scenarios) where traditional optical-camera-based methods fail. To the best of our knowledge, this is the first investigation towards segmenting human silhouette based on millimeter wave signals. We hope that our work can serve as a baseline and inspire further research that perform vision tasks with radio signals. The dataset and codes will be made in public.

28.Explore and Match: End-to-End Video Grounding with Transformer ⬇️

We present a new paradigm named explore-and-match for video grounding, which aims to seamlessly unify two streams of video grounding methods: proposal-based and proposal-free. To achieve this goal, we formulate video grounding as a set prediction problem and design an end-to-end trainable Video Grounding Transformer (VidGTR) that can utilize the architectural strengths of rich contextualization and parallel decoding for set prediction. The overall training is balanced by two key losses that play different roles, namely span localization loss and set guidance loss. These two losses force each proposal to regress the target timespan and identify the target query. Throughout the training, VidGTR first explores the search space to diversify the initial proposals and then matches the proposals to the corresponding targets to fit them in a fine-grained manner. The explore-and-match scheme successfully combines the strengths of two complementary methods, without encoding prior knowledge into the pipeline. As a result, VidGTR sets new state-of-the-art results on two video grounding benchmarks with double the inference speed.

29.Semantically Video Coding: Instill Static-Dynamic Clues into Structured Bitstream for AI Tasks ⬇️

Traditional media coding schemes typically encode image/video into a semantic-unknown binary stream, which fails to directly support downstream intelligent tasks at the bitstream level. Semantically Structured Image Coding (SSIC) framework makes the first attempt to enable decoding-free or partial-decoding image intelligent task analysis via a Semantically Structured Bitstream (SSB). However, the SSIC only considers image coding and its generated SSB only contains the static object information. In this paper, we extend the idea of semantically structured coding from video coding perspective and propose an advanced Semantically Structured Video Coding (SSVC) framework to support heterogeneous intelligent applications. Video signals contain more rich dynamic motion information and exist more redundancy due to the similarity between adjacent frames. Thus, we present a reformulation of semantically structured bitstream (SSB) in SSVC which contains both static object characteristics and dynamic motion clues. Specifically, we introduce optical flow to encode continuous motion information and reduce cross-frame redundancy via a predictive coding architecture, then the optical flow and residual information are reorganized into SSB, which enables the proposed SSVC could better adaptively support video-based downstream intelligent applications. Extensive experiments demonstrate that the proposed SSVC framework could directly support multiple intelligent tasks just depending on a partially decoded bitstream. This avoids the full bitstream decompression and thus significantly saves bitrate/bandwidth consumption for intelligent analytics. We verify this point on the tasks of image object detection, pose estimation, video action recognition, video object segmentation, etc.

30.Unsupervised Image Fusion Method based on Feature Mutual Mapping ⬇️

Deep learning-based image fusion approaches have obtained wide attention in recent years, achieving promising performance in terms of visual perception. However, the fusion module in the current deep learning-based methods suffers from two limitations, \textit{i.e.}, manually designed fusion function, and input-independent network learning. In this paper, we propose an unsupervised adaptive image fusion method to address the above issues. We propose a feature mutual mapping fusion module and dual-branch multi-scale autoencoder. More specifically, we construct a global map to measure the connections of pixels between the input source images. % The found mapping relationship guides the image fusion. Besides, we design a dual-branch multi-scale network through sampling transformation to extract discriminative image features. We further enrich feature representations of different scales through feature aggregation in the decoding process. Finally, we propose a modified loss function to train the network with efficient convergence property. Through sufficient training on infrared and visible image data sets, our method also shows excellent generalized performance in multi-focus and medical image fusion. Our method achieves superior performance in both visual perception and objective evaluation. Experiments prove that the performance of our proposed method on a variety of image fusion tasks surpasses other state-of-the-art methods, proving the effectiveness and versatility of our approach.

31.TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network ⬇️

The end-to-end image fusion framework has achieved promising performance, with dedicated convolutional networks aggregating the multi-modal local appearance. However, long-range dependencies are directly neglected in existing CNN fusion approaches, impeding balancing the entire image-level perception for complex scenario fusion. In this paper, therefore, we propose an infrared and visible image fusion algorithm based on a lightweight transformer module and adversarial learning. Inspired by the global interaction power, we use the transformer technique to learn the effective global fusion relations. In particular, shallow features extracted by CNN are interacted in the proposed transformer fusion module to refine the fusion relationship within the spatial scope and across channels simultaneously. Besides, adversarial learning is designed in the training process to improve the output discrimination via imposing competitive consistency from the inputs, reflecting the specific characteristics in infrared and visible images. The experimental performance demonstrates the effectiveness of the proposed modules, with superior improvement against the state-of-the-art, generalising a novel paradigm via transformer and adversarial learning in the fusion task.

32.MSNet: A Deep Multi-scale Submanifold Network for Visual Classification ⬇️

The Symmetric Positive Definite (SPD) matrix has received wide attention as a tool for visual data representation in computer vision. Although there are many different attempts to develop effective deep architectures for data processing on the Riemannian manifold of SPD matrices, a very few solutions explicitly mine the local geometrical information in deep SPD feature representations. While CNNs have demonstrated the potential of hierarchical local pattern extraction even for SPD represented data, we argue that it is of utmost importance to ensure the preservation of local geometric information in the SPD networks. Accordingly, in this work we propose an SPD network designed with this objective in mind. In particular, we propose an architecture, referred to as MSNet, which fuses geometrical multi-scale information. We first analyse the convolution operator commonly used for mapping the local information in Euclidean deep networks from the perspective of a higher level of abstraction afforded by the Category Theory. Based on this analysis, we postulate a submanifold selection principle to guide the design of our MSNet. In particular, we use it to design a submanifold fusion block to take advantage of the rich local geometry encoded in the network layers. The experiments involving multiple visual tasks show that our algorithm outperforms most Riemannian SOTA competitors.

33.SURDS: Self-Supervised Attention-guided Reconstruction and Dual Triplet Loss for Writer Independent Offline Signature Verification ⬇️

Offline Signature Verification (OSV) is a fundamental biometric task across various forensic, commercial and legal applications. The underlying task at hand is to carefully model fine-grained features of the signatures to distinguish between genuine and forged ones, which differ only in minute deformities. This makes OSV more challenging compared to other verification problems. In this work, we propose a two-stage deep learning framework that leverages self-supervised representation learning as well as metric learning for writer-independent OSV. First, we train an image reconstruction network using an encoder-decoder architecture that is augmented by a 2D spatial attention mechanism using signature image patches. Next, the trained encoder backbone is fine-tuned with a projector head using a supervised metric learning framework, whose objective is to optimize a novel dual triplet loss by sampling negative samples from both within the same writer class as well as from other writers. The intuition behind this is to ensure that a signature sample lies closer to its positive counterpart compared to negative samples from both intra-writer and cross-writer sets. This results in robust discriminative learning of the embedding space. To the best of our knowledge, this is the first work of using self-supervised learning frameworks for OSV. The proposed two-stage framework has been evaluated on two publicly available offline signature datasets and compared with various state-of-the-art methods. It is noted that the proposed method provided promising results outperforming several existing pieces of work.

34.A Hybrid Quantum-Classical Algorithm for Robust Fitting ⬇️

Fitting geometric models onto outlier contaminated data is provably intractable. Many computer vision systems rely on random sampling heuristics to solve robust fitting, which do not provide optimality guarantees and error bounds. It is therefore critical to develop novel approaches that can bridge the gap between exact solutions that are costly, and fast heuristics that offer no quality assurances. In this paper, we propose a hybrid quantum-classical algorithm for robust fitting. Our core contribution is a novel robust fitting formulation that solves a sequence of integer programs and terminates with a global solution or an error bound. The combinatorial subproblems are amenable to a quantum annealer, which helps to tighten the bound efficiently. While our usage of quantum computing does not surmount the fundamental intractability of robust fitting, by providing error bounds our algorithm is a practical improvement over randomised heuristics. Moreover, our work represents a concrete application of quantum computing in computer vision. We present results obtained using an actual quantum computer (D-Wave Advantage) and via simulation. Source code: this https URL

35.ARPD: Anchor-free Rotation-aware People Detection using Topview Fisheye Camera ⬇️

People detection in top-view, fish-eye images is challenging as people in fish-eye images often appear in arbitrary directions and are distorted differently. Due to this unique radial geometry, axis-aligned people detectors often work poorly on fish-eye frames. Recent works account for this variability by modifying existing anchor-based detectors or relying on complex pre/post-processing. Anchor-based methods spread a set of pre-defined bounding boxes on the input image, most of which are invalid. In addition to being inefficient, this approach could lead to a significant imbalance between the positive and negative anchor boxes. In this work, we propose ARPD, a single-stage anchor-free fully convolutional network to detect arbitrarily rotated people in fish-eye images. Our network uses keypoint estimation to find the center point of each object and regress the object's other properties directly. To capture the various orientation of people in fish-eye cameras, in addition to the center and size, ARPD also predicts the angle of each bounding box. We also propose a periodic loss function that accounts for angle periodicity and relieves the difficulty of learning small-angle oscillations. Experimental results show that our method competes favorably with state-of-the-art algorithms while running significantly faster.

36.A Classical Approach to Handcrafted Feature Extraction Techniques for Bangla Handwritten Digit Recognition ⬇️

Bangla Handwritten Digit recognition is a significant step forward in the development of Bangla OCR. However, intricate shape, structural likeness and distinctive composition style of Bangla digits makes it relatively challenging to distinguish. Thus, in this paper, we benchmarked four rigorous classifiers to recognize Bangla Handwritten Digit: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and Gradient-Boosted Decision Trees (GBDT) based on three handcrafted feature extraction techniques: Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP), and Gabor filter on four publicly available Bangla handwriting digits datasets: NumtaDB, CMARTdb, Ekush and BDRW. Here, handcrafted feature extraction methods are used to extract features from the dataset image, which are then utilized to train machine learning classifiers to identify Bangla handwritten digits. We further fine-tuned the hyperparameters of the classification algorithms in order to acquire the finest Bangla handwritten digits recognition performance from these algorithms, and among all the models we employed, the HOG features combined with SVM model (HOG+SVM) attained the best performance metrics across all datasets. The recognition accuracy of the HOG+SVM method on the NumtaDB, CMARTdb, Ekush and BDRW datasets reached 93.32%, 98.08%, 95.68% and 89.68%, respectively as well as we compared the model performance with recent state-of-art methods.

37.Revisiting L1 Loss in Super-Resolution: A Probabilistic View and Beyond ⬇️

Super-resolution as an ill-posed problem has many high-resolution candidates for a low-resolution input. However, the popular $\ell_1$ loss used to best fit the given HR image fails to consider this fundamental property of non-uniqueness in image restoration. In this work, we fix the missing piece in $\ell_1$ loss by formulating super-resolution with neural networks as a probabilistic model. It shows that $\ell_1$ loss is equivalent to a degraded likelihood function that removes the randomness from the learning process. By introducing a data-adaptive random variable, we present a new objective function that aims at minimizing the expectation of the reconstruction error over all plausible solutions. The experimental results show consistent improvements on mainstream architectures, with no extra parameter or computing cost at inference time.

38.Real-time automatic polyp detection in colonoscopy using feature enhancement module and spatiotemporal similarity correlation unit ⬇️

Automatic detection of polyps is challenging because different polyps vary greatly, while the changes between polyps and their analogues are small. The state-of-the-art methods are based on convolutional neural networks (CNNs). However, they may fail due to lack of training data, resulting in high rates of missed detection and false positives (FPs). In order to solve these problems, our method combines the two-dimensional (2-D) CNN-based real-time object detector network with spatiotemporal information. Firstly, we use a 2-D detector network to detect static images and frames, and based on the detector network, we propose two feature enhancement modules-the FP Relearning Module (FPRM) to make the detector network learning more about the features of FPs for higher precision, and the Image Style Transfer Module (ISTM) to enhance the features of polyps for sensitivity improvement. In video detection, we integrate spatiotemporal information, which uses Structural Similarity (SSIM) to measure the similarity between video frames. Finally, we propose the Inter-frame Similarity Correlation Unit (ISCU) to combine the results obtained by the detector network and frame similarity to make the final decision. We verify our method on both private databases and publicly available databases. Experimental results show that these modules and units provide a performance improvement compared with the baseline method. Comparison with the state-of-the-art methods shows that the proposed method outperforms the existing ones which can meet real-time constraints. It's demonstrated that our method provides a performance improvement in sensitivity, precision and specificity, and has great potential to be applied in clinical colonoscopy.

39.Splatting-based Synthesis for Video Frame Interpolation ⬇️

Frame interpolation is an essential video processing technique that adjusts the temporal resolution of an image sequence. An effective approach to perform frame interpolation is based on splatting, also known as forward warping. Specifically, splatting can be used to warp the input images to an arbitrary temporal location based on an optical flow estimate. A synthesis network, also sometimes referred to as refinement network, can then be used to generate the output frame from the warped images. In doing so, it is common to not only warp the images but also various feature representations which provide rich contextual cues to the synthesis network. However, while this approach has been shown to work well and enables arbitrary-time interpolation due to using splatting, the involved synthesis network is prohibitively slow. In contrast, we propose to solely rely on splatting to synthesize the output without any subsequent refinement. This splatting-based synthesis is much faster than similar approaches, especially for multi-frame interpolation, while enabling new state-of-the-art results at high resolutions.

40.ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals ⬇️

Recently, there has been a surge of significant interest on application of Deep Learning (DL) models to autonomously perform hand gesture recognition using surface Electromyogram (sEMG) signals. DL models are, however, mainly designed to be applied on sparse sEMG signals. Furthermore, due to their complex structure, typically, we are faced with memory constraints; require large training times and a large number of training samples, and; there is the need to resort to data augmentation and/or transfer learning. In this paper, for the first time (to the best of our knowledge), we investigate and design a Vision Transformer (ViT) based architecture to perform hand gesture recognition from High Density (HD-sEMG) signals. Intuitively speaking, we capitalize on the recent breakthrough role of the transformer architecture in tackling different complex problems together with its potential for employing more input parallelization via its attention mechanism. The proposed Vision Transformer-based Hand Gesture Recognition (ViT-HGR) framework can overcome the aforementioned training time problems and can accurately classify a large number of hand gestures from scratch without any need for data augmentation and/or transfer learning. The efficiency of the proposed ViT-HGR framework is evaluated using a recently-released HD-sEMG dataset consisting of 65 isometric hand gestures. Our experiments with 64-sample (31.25 ms) window size yield average test accuracy of 84.62 +/- 3.07%, where only 78, 210 number of parameters is utilized. The compact structure of the proposed ViT-based ViT-HGR framework (i.e., having significantly reduced number of trainable parameters) shows great potentials for its practical application for prosthetic control.

41.Are Commercial Face Detection Models as Biased as Academic Models? ⬇️

As facial recognition systems are deployed more widely, scholars and activists have studied their biases and harms. Audits are commonly used to accomplish this and compare the algorithmic facial recognition systems' performance against datasets with various metadata labels about the subjects of the images. Seminal works have found discrepancies in performance by gender expression, age, perceived race, skin type, etc. These studies and audits often examine algorithms which fall into two categories: academic models or commercial models. We present a detailed comparison between academic and commercial face detection systems, specifically examining robustness to noise. We find that state-of-the-art academic face detection models exhibit demographic disparities in their noise robustness, specifically by having statistically significant decreased performance on older individuals and those who present their gender in a masculine manner. When we compare the size of these disparities to that of commercial models, we conclude that commercial models - in contrast to their relatively larger development budget and industry-level fairness commitments - are always as biased or more biased than an academic model.

42.Self-supervised Point Cloud Registration with Deep Versatile Descriptors ⬇️

Recent years have witnessed an increasing trend toward solving point cloud registration problems with various deep learning-based algorithms. Compared to supervised/semi-supervised registration methods, unsupervised methods require no human annotations. However, unsupervised methods mainly depend on the global descriptors, which ignore the high-level representations of local geometries. In this paper, we propose a self-supervised registration scheme with a novel Deep Versatile Descriptors (DVD), jointly considering global representations and local representations. The DVD is motivated by a key observation that the local distinctive geometric structures of the point cloud by two subset points can be employed to enhance the representation ability of the feature extraction module. Furthermore, we utilize two additional tasks (reconstruction and normal estimation) to enhance the transformation awareness of the proposed DVDs. Lastly, we conduct extensive experiments on synthetic and real-world datasets, demonstrating that our method achieves state-of-the-art performance against competing methods over a wide range of experimental settings.

43.PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning ⬇️

State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of where to look?' for an object and how to navigate to (x, y)?'. Our key insight is that `where to look?' can be treated purely as a perception problem, and learned without environment interactions. To address this, we propose a network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network using supervised learning on a passive dataset of top-down semantic maps, and integrate it into a modular framework to perform ObjectGoal navigation. Experiments on Gibson and Matterport3D demonstrate that our method achieves the state-of-the-art for ObjectGoal navigation while incurring up to 1,600x less computational cost for training.

44.Automatic Recognition and Digital Documentation of Cultural Heritage Hemispherical Domes using Images ⬇️

Advancements in optical metrology has enabled documentation of dense 3D point clouds of cultural heritage sites. For large scale and continuous digital documentation, processing of dense 3D point clouds becomes computationally cumbersome, and often requires additional hardware for data management, increasing the time cost, and complexity of projects. To this end, this manuscript presents an original approach to generate fast and reliable semantic digital models of heritage hemispherical domes using only two images. New closed formulations were derived to establish the relationships between spheres and their projected ellipses onto images, which fostered the development of a new automatic framework for as-built generation of spheres. The effectiveness of the proposed method was evaluated under both laboratory and real-world datasets. The results revealed that the proposed method achieved as-built modeling accuracy of around 6mm, while improving the computation time by a factor of 7, when compared to established point cloud processing methods.

45.The Vehicle Trajectory Prediction Based on ResNet and EfficientNet Model ⬇️

At present, a major challenge for the application of automatic driving technology is the accurate prediction of vehicle trajectory. With the vigorous development of computer technology and the emergence of convolution depth neural network, the accuracy of prediction results has been improved. But, the depth, width of the network and image resolution are still important reasons that restrict the accuracy of the model and the prediction results. The main innovation of this paper is the combination of RESNET network and efficient net network, which not only greatly increases the network depth, but also comprehensively changes the choice of network width and image resolution, so as to make the model performance better, but also save computing resources as much as possible. The experimental results also show that our proposed model obtains the optimal prediction results. Specifically, the loss value of our method is separately 4 less and 2.1 less than that of resnet and efficientnet method.

46.ImpliCity: City Modeling from Satellite Images with Deep Implicit Occupancy Fields ⬇️

High-resolution optical satellite sensors, in combination with dense stereo algorithms, have made it possible to reconstruct 3D city models from space. However, the resulting models are, in practice, rather noisy, and they tend to miss small geometric features that are clearly visible in the images. We argue that one reason for the limited DSM quality may be a too early, heuristic reduction of the triangulated 3D point cloud to an explicit height field or surface mesh. To make full use of the point cloud and the underlying images, we introduce ImpliCity, a neural representation of the 3D scene as an implicit, continuous occupancy field, driven by learned embeddings of the point cloud and a stereo pair of ortho-photos. We show that this representation enables the extraction of high-quality DSMs: with image resolution 0.5$,$m, ImpliCity reaches a median height error of $\approx,$0.7$,$m and outperforms competing methods, especially w.r.t. building reconstruction, featuring intricate roof details, smooth surfaces, and straight, regular outlines.

47.Attacks and Defenses for Free-Riders in Multi-Discriminator GAN ⬇️

Generative Adversarial Networks (GANs) are increasingly adopted by the industry to synthesize realistic images. Due to data not being centrally available, Multi-Discriminator (MD)-GANs training framework employs multiple discriminators that have direct access to the real data. Distributedly training a joint GAN model entails the risk of free-riders, i.e., participants that aim to benefit from the common model while only pretending to participate in the training process. In this paper, we conduct the first characterization study of the impact of free-riders on MD-GAN. Based on two production prototypes of MD-GAN, we find that free-riders drastically reduce the ability of MD-GANs to produce images that are indistinguishable from real data, i.e., they increase the FID score -- the standard measure to assess the quality of generated images. To mitigate the model degradation, we propose a defense strategy against free-riders in MD-GAN, termed DFG. DFG distinguishes free-riders and benign participants through periodic probing and clustering of discriminators' responses based on a reference response of free-riders, which then allows the generator to exclude the detected free-riders from the training. Furthermore, we extend our defense, termed DFG+, to enable discriminators to filter out free-riders at the variant of MD-GAN that allows peer exchanges of discriminators networks. Extensive evaluation on various scenarios of free-riders, MD-GAN architecture, and three datasets show that our defenses effectively detect free-riders. With 1 to 5 free-riders, DFG and DFG+ averagely decreases FID by 5.22% to 11.53% for CIFAR10 and 5.79% to 13.22% for CIFAR100 in comparison to an attack without defense. In a shell, the proposed DFG(+) can effectively defend against free-riders without affecting benign clients at a negligible computation overhead.

48.What is the cost of adding a constraint in linear least squares? ⬇️

Although the theory of constrained least squares (CLS) estimation is well known, it is usually applied with the view that the constraints to be imposed are unavoidable. However, there are cases in which constraints are optional. For example, in camera color calibration, one of several possible color processing systems is obtained if a constraint on the row sums of a desired color correction matrix is imposed; in this example, it is not clear a priori whether imposing the constraint leads to better system performance. In this paper, we derive an exact expression connecting the constraint to the increase in fitting error obtained from imposing it. As another contribution, we show how to determine projection matrices that separate the measured data into two components: the first component drives up the fitting error due to imposing a constraint, and the second component is unaffected by the constraint. We demonstrate the use of these results in the color calibration problem.

49.Do Smart Glasses Dream of Sentimental Visions? Deep Emotionship Analysis for Eyewear Devices ⬇️

Emotion recognition in smart eyewear devices is highly valuable but challenging. One key limitation of previous works is that the expression-related information like facial or eye images is considered as the only emotional evidence. However, emotional status is not isolated; it is tightly associated with people's visual perceptions, especially those sentimental ones. However, little work has examined such associations to better illustrate the cause of different emotions. In this paper, we study the emotionship analysis problem in eyewear systems, an ambitious task that requires not only classifying the user's emotions but also semantically understanding the potential cause of such emotions. To this end, we devise EMOShip, a deep-learning-based eyewear system that can automatically detect the wearer's emotional status and simultaneously analyze its associations with semantic-level visual perceptions. Experimental studies with 20 participants demonstrate that, thanks to the emotionship awareness, EMOShip not only achieves superior emotion recognition accuracy over existing methods (80.2% vs. 69.4%), but also provides a valuable understanding of the cause of emotions. Pilot studies with 20 participants further motivate the potential use of EMOShip to empower emotion-aware applications, such as emotionship self-reflection and emotionship life-logging.

50.Initial Investigations Towards Non-invasive Monitoring of Chronic Wound Healing Using Deep Learning and Ultrasound Imaging ⬇️

Chronic wounds including diabetic and arterial/venous insufficiency injuries have become a major burden for healthcare systems worldwide. Demographic changes suggest that wound care will play an even bigger role in the coming decades. Predicting and monitoring response to therapy in wound care is currently largely based on visual inspection with little information on the underlying tissue. Thus, there is an urgent unmet need for innovative approaches that facilitate personalized diagnostics and treatments at the point-of-care. It has been recently shown that ultrasound imaging can monitor response to therapy in wound care, but this work required onerous manual image annotations. In this study, we present initial results of a deep learning-based automatic segmentation of cross-sectional wound size in ultrasound images and identify requirements and challenges for future research on this application. Evaluation of the segmentation results underscores the potential of the proposed deep learning approach to complement non-invasive imaging with Dice scores of 0.34 (U-Net, FCN) and 0.27 (ResNet-U-Net) but also highlights the need for improving robustness further. We conclude that deep learning-supported analysis of non-invasive ultrasound images is a promising area of research to automatically extract cross-sectional wound size and depth information with potential value in monitoring response to therapy.

51.AggMatch: Aggregating Pseudo Labels for Semi-Supervised Learning ⬇️

Semi-supervised learning (SSL) has recently proven to be an effective paradigm for leveraging a huge amount of unlabeled data while mitigating the reliance on large labeled data. Conventional methods focused on extracting a pseudo label from individual unlabeled data sample and thus they mostly struggled to handle inaccurate or noisy pseudo labels, which degenerate performance.
In this paper, we address this limitation with a novel SSL framework for aggregating pseudo labels, called AggMatch, which refines initial pseudo labels by using different confident instances. Specifically, we introduce an aggregation module for consistency regularization framework that aggregates the initial pseudo labels based on the similarity between the instances. To enlarge the aggregation candidates beyond the mini-batch, we present a class-balanced confidence-aware queue built with the momentum model, encouraging to provide more stable and consistent aggregation. We also propose a novel uncertainty-based confidence measure for the pseudo label by considering the consensus among multiple hypotheses with different subsets of the queue. We conduct experiments to demonstrate the effectiveness of AggMatch over the latest methods on standard benchmarks and provide extensive analyses.

52.Plaque segmentation via masking of the artery wall ⬇️

The presence of plaques in the coronary arteries are a major risk to the patients' life. In particular, non-calcified plaques pose a great challenge, as they are harder to detect and more likely to rupture than calcified plaques. While current deep learning techniques allow precise segmentation of regular images, the performance in medical images is still low, caused mostly by blurriness and ambiguous voxel intensities of unrelated parts that fall on the same range. In this paper, we propose a novel methodology for segmenting calcified and non-calcified plaques in CCTA-CPR scans of coronary arteries. The input slices are masked so only the voxels within the wall vessel are considered for segmentation. We also provide an exhaustive evaluation by applying different types of masks, in order to validate the potential of vessel masking for plaque segmentation. Our methodology results in a prominent boost in segmentation performance, in both quantitative and qualitative evaluation, achieving accurate plaque shapes even for the challenging non-calcified plaques. We believe our findings can lead the future research for high-performance plaque segmentation.

53.Resource-efficient Deep Neural Networks for Automotive Radar Interference Mitigation ⬇️

Radar sensors are crucial for environment perception of driver assistance systems as well as autonomous vehicles. With a rising number of radar sensors and the so far unregulated automotive radar frequency band, mutual interference is inevitable and must be dealt with. Algorithms and models operating on radar data are required to run the early processing steps on specialized radar sensor hardware. This specialized hardware typically has strict resource-constraints, i.e. a low memory capacity and low computational power. Convolutional Neural Network (CNN)-based approaches for denoising and interference mitigation yield promising results for radar processing in terms of performance. Regarding resource-constraints, however, CNNs typically exceed the hardware's capacities by far.
In this paper we investigate quantization techniques for CNN-based denoising and interference mitigation of radar signals. We analyze the quantization of (i) weights and (ii) activations of different CNN-based model architectures. This quantization results in reduced memory requirements for model storage and during inference. We compare models with fixed and learned bit-widths and contrast two different methodologies for training quantized CNNs, i.e. the straight-through gradient estimator and training distributions over discrete weights. We illustrate the importance of structurally small real-valued base models for quantization and show that learned bit-widths yield the smallest models. We achieve a memory reduction of around 80% compared to the real-valued baseline. Due to practical reasons, however, we recommend the use of 8 bits for weights and activations, which results in models that require only 0.2 megabytes of memory.

54.A Multi-modal Fusion Framework Based on Multi-task Correlation Learning for Cancer Prognosis Prediction ⬇️

Morphological attributes from histopathological images and molecular profiles from genomic data are important information to drive diagnosis, prognosis, and therapy of cancers. By integrating these heterogeneous but complementary data, many multi-modal methods are proposed to study the complex mechanisms of cancers, and most of them achieve comparable or better results from previous single-modal methods. However, these multi-modal methods are restricted to a single task (e.g., survival analysis or grade classification), and thus neglect the correlation between different tasks. In this study, we present a multi-modal fusion framework based on multi-task correlation learning (MultiCoFusion) for survival analysis and cancer grade classification, which combines the power of multiple modalities and multiple tasks. Specifically, a pre-trained ResNet-152 and a sparse graph convolutional network (SGCN) are used to learn the representations of histopathological images and mRNA expression data respectively. Then these representations are fused by a fully connected neural network (FCNN), which is also a multi-task shared network. Finally, the results of survival analysis and cancer grade classification output simultaneously. The framework is trained by an alternate scheme. We systematically evaluate our framework using glioma datasets from The Cancer Genome Atlas (TCGA). Results demonstrate that MultiCoFusion learns better representations than traditional feature extraction methods. With the help of multi-task alternating learning, even simple multi-modal concatenation can achieve better performance than other deep learning and traditional methods. Multi-task learning can improve the performance of multiple tasks not just one of them, and it is effective in both single-modal and multi-modal data.

55.Ultra Low-Parameter Denoising: Trainable Bilateral Filter Layers in Computed Tomography ⬇️

Computed tomography is widely used as an imaging tool to visualize three-dimensional structures with expressive bone-soft tissue contrast. However, CT resolution and radiation dose are tightly entangled, highlighting the importance of low-dose CT combined with sophisticated denoising algorithms. Most data-driven denoising techniques are based on deep neural networks and, therefore, contain hundreds of thousands of trainable parameters, making them incomprehensible and prone to prediction failures. Developing understandable and robust denoising algorithms achieving state-of-the-art performance helps to minimize radiation dose while maintaining data integrity. This work presents an open-source CT denoising framework based on the idea of bilateral filtering. We propose a bilateral filter that can be incorporated into a deep learning pipeline and optimized in a purely data-driven way by calculating the gradient flow toward its hyperparameters and its input. Denoising in pure image-to-image pipelines and across different domains such as raw detector data and reconstructed volume, using a differentiable backprojection layer, is demonstrated. Although only using three spatial parameters and one range parameter per filter layer, the proposed denoising pipelines can compete with deep state-of-the-art denoising architectures with several hundred thousand parameters. Competitive denoising performance is achieved on x-ray microscope bone data (0.7053 and 33.10) and the 2016 Low Dose CT Grand Challenge dataset (0.9674 and 43.07) in terms of SSIM and PSNR. Due to the extremely low number of trainable parameters with well-defined effect, prediction reliance and data integrity is guaranteed at any time in the proposed pipelines, in contrast to most other deep learning-based denoising architectures.

56.Addressing the Intra-class Mode Collapse Problem using Adaptive Input Image Normalization in GAN-based X-ray Images ⬇️

Biomedical image datasets can be imbalanced due to the rarity of targeted diseases. Generative Adversarial Networks play a key role in addressing this imbalance by enabling the generation of synthetic images to augment and balance datasets. It is important to generate synthetic images that incorporate a diverse range of features such that they accurately represent the distribution of features present in the training imagery. Furthermore, the absence of diverse features in synthetic images can degrade the performance of machine learning classifiers. The mode collapse problem can impact a Generative Adversarial Network's capacity to generate diversified images. The mode collapse comes in two varieties; intra-class and inter-class. In this paper, the intra-class mode collapse problem is investigated, and its subsequent impact on the diversity of synthetic X-ray images is evaluated. This work contributes an empirical demonstration of the benefits of integrating the adaptive input-image normalization for the Deep Convolutional GAN to alleviate the intra-class mode collapse problem. Results demonstrate that the DCGAN with adaptive input-image normalization outperforms DCGAN with un-normalized X-ray images as evident by the superior diversity scores.

57.Mutual information neural estimation for unsupervised multi-modal registration of brain images ⬇️

Many applications in image-guided surgery and therapy require fast and reliable non-linear, multi-modal image registration. Recently proposed unsupervised deep learning-based registration methods have demonstrated superior performance compared to iterative methods in just a fraction of the time. Most of the learning-based methods have focused on mono-modal image registration. The extension to multi-modal registration depends on the use of an appropriate similarity function, such as the mutual information (MI). We propose guiding the training of a deep learning-based registration method with MI estimation between an image-pair in an end-to-end trainable network. Our results show that a small, 2-layer network produces competitive results in both mono- and multimodal registration, with sub-second run-times. Comparisons to both iterative and deep learning-based methods show that our MI-based method produces topologically and qualitatively superior results with an extremely low rate of non-diffeomorphic transformations. Real-time clinical application will benefit from a better visual matching of anatomical structures and less registration failures/outliers.

58.S2MS: Self-Supervised Learning Driven Multi-Spectral CT Image Enhancement ⬇️

Photon counting spectral CT (PCCT) can produce reconstructed attenuation maps in different energy channels, reflecting energy properties of the scanned object. Due to the limited photon numbers and the non-ideal detector response of each energy channel, the reconstructed images usually contain much noise. With the development of Deep Learning (DL) technique, different kinds of DL-based models have been proposed for noise reduction. However, most of the models require clean data set as the training labels, which are not always available in medical imaging field. Inspiring by the similarities of each channel's reconstructed image, we proposed a self-supervised learning based PCCT image enhancement framework via multi-spectral channels (S2MS). In S2MS framework, both the input and output labels are noisy images. Specifically, one single channel image was used as output while images of other single channels and channel-sum image were used as input to train the network, which can fully use the spectral data information without extra cost. The simulation results based on the AAPM Low-dose CT Challenge database showed that the proposed S2MS model can suppress the noise and preserve details more effectively in comparison with the traditional DL models, which has potential to improve the image quality of PCCT in clinical applications.

59.Combining Commonsense Reasoning and Knowledge Acquisition to Guide Deep Learning in Robotics ⬇️

Algorithms based on deep network models are being used for many pattern recognition and decision-making tasks in robotics and AI. Training these models requires a large labeled dataset and considerable computational resources, which are not readily available in many domains. Also, it is difficult to explore the internal representations and reasoning mechanisms of these models. As a step towards addressing the underlying knowledge representation, reasoning, and learning challenges, the architecture described in this paper draws inspiration from research in cognitive systems. As a motivating example, we consider an assistive robot trying to reduce clutter in any given scene by reasoning about the occlusion of objects and stability of object configurations in an image of the scene. In this context, our architecture incrementally learns and revises a grounding of the spatial relations between objects and uses this grounding to extract spatial information from input images. Non-monotonic logical reasoning with this information and incomplete commonsense domain knowledge is used to make decisions about stability and occlusion. For images that cannot be processed by such reasoning, regions relevant to the tasks at hand are automatically identified and used to train deep network models to make the desired decisions. Image regions used to train the deep networks are also used to incrementally acquire previously unknown state constraints that are merged with the existing knowledge for subsequent reasoning. Experimental evaluation performed using simulated and real-world images indicates that in comparison with baselines based just on deep networks, our architecture improves reliability of decision making and reduces the effort involved in training data-driven deep network models.

60.Dense Pixel-Labeling for Reverse-Transfer and Diagnostic Learning on Lung Ultrasound for COVID-19 and Pneumonia Detection ⬇️

We propose using a pre-trained segmentation model to perform diagnostic classification in order to achieve better generalization and interpretability, terming the technique reverse-transfer learning. We present an architecture to convert segmentation models to classification models. We compare and contrast dense vs sparse segmentation labeling and study its impact on diagnostic classification. We compare the performance of U-Net trained with dense and sparse labels to segment A-lines, B-lines, and Pleural lines on a custom dataset of lung ultrasound scans from 4 patients. Our experiments show that dense labels help reduce false positive detection. We study the classification capability of the dense and sparse trained U-Net and contrast it with a non-pretrained U-Net, to detect and differentiate COVID-19 and Pneumonia on a large ultrasound dataset of about 40k curvilinear and linear probe images. Our segmentation-based models perform better classification when using pretrained segmentation weights, with the dense-label pretrained U-Net performing the best.

61.Neural Manifold Clustering and Embedding ⬇️

Given a union of non-linear manifolds, non-linear subspace clustering or manifold clustering aims to cluster data points based on manifold structures and also learn to parameterize each manifold as a linear subspace in a feature space. Deep neural networks have the potential to achieve this goal under highly non-linear settings given their large capacity and flexibility. We argue that achieving manifold clustering with neural networks requires two essential ingredients: a domain-specific constraint that ensures the identification of the manifolds, and a learning algorithm for embedding each manifold to a linear subspace in the feature space. This work shows that many constraints can be implemented by data augmentation. For subspace feature learning, Maximum Coding Rate Reduction (MCR$^2$) objective can be used. Putting them together yields {\em Neural Manifold Clustering and Embedding} (NMCE), a novel method for general purpose manifold clustering, which significantly outperforms autoencoder-based deep subspace clustering. Further, on more challenging natural image datasets, NMCE can also outperform other algorithms specifically designed for clustering. Qualitatively, we demonstrate that NMCE learns a meaningful and interpretable feature space. As the formulation of NMCE is closely related to several important Self-supervised learning (SSL) methods, we believe this work can help us build a deeper understanding on SSL representation learning.

62.COVID-19 Detection Using CT Image Based On YOLOv5 Network ⬇️

Computer aided diagnosis (CAD) increases diagnosis efficiency, helping doctors providing a quick and confident diagnosis, it has played an important role in the treatment of COVID19. In our task, we solve the problem about abnormality detection and classification. The dataset provided by Kaggle platform and we choose YOLOv5 as our model. We introduce some methods on objective detection in the related work section, the objection detection can be divided into two streams: onestage and two stage. The representational model are Faster RCNN and YOLO series. Then we describe the YOLOv5 model in the detail. Compared Experiments and results are shown in section IV. We choose mean average precision (mAP) as our experiments' metrics, and the higher (mean) mAP is, the better result the model will gain. [email protected] of our YOLOv5s is 0.623 which is 0.157 and 0.101 higher than Faster RCNN and EfficientDet respectively.

63.A Deep Learning Approach for the Detection of COVID-19 from Chest X-Ray Images using Convolutional Neural Networks ⬇️

The COVID-19 (coronavirus) is an ongoing pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus was first identified in mid-December 2019 in the Hubei province of Wuhan, China and by now has spread throughout the planet with more than 75.5 million confirmed cases and more than 1.67 million deaths. With limited number of COVID-19 test kits available in medical facilities, it is important to develop and implement an automatic detection system as an alternative diagnosis option for COVID-19 detection that can used on a commercial scale. Chest X-ray is the first imaging technique that plays an important role in the diagnosis of COVID-19 disease. Computer vision and deep learning techniques can help in determining COVID-19 virus with Chest X-ray Images. Due to the high availability of large-scale annotated image datasets, great success has been achieved using convolutional neural network for image analysis and classification. In this research, we have proposed a deep convolutional neural network trained on five open access datasets with binary output: Normal and Covid. The performance of the model is compared with four pre-trained convolutional neural network-based models (COVID-Net, ResNet18, ResNet and MobileNet-V2) and it has been seen that the proposed model provides better accuracy on the validation set as compared to the other four pre-trained models. This research work provides promising results which can be further improvise and implement on a commercial scale.

64.Euclidean and Affine Curve Reconstruction ⬇️

We consider practical aspects of reconstructing planar curves with prescribed Euclidean or affine curvatures. These curvatures are invariant under the special Euclidean group and the equi-affine groups, respectively, and play an important role in computer vision and shape analysis. We discuss and implement algorithms for such reconstruction, and give estimates on how close reconstructed curves are relative to the closeness of their curvatures in appropriate metrics. Several illustrative examples are provided.

65.AutoMC: Automated Model Compression based on Domain Knowledge and Progressive search strategy ⬇️

Model compression methods can reduce model complexity on the premise of maintaining acceptable performance, and thus promote the application of deep neural networks under resource constrained environments. Despite their great success, the selection of suitable compression methods and design of details of the compression scheme are difficult, requiring lots of domain knowledge as support, which is not friendly to non-expert users. To make more users easily access to the model compression scheme that best meet their needs, in this paper, we propose AutoMC, an effective automatic tool for model compression. AutoMC builds the domain knowledge on model compression to deeply understand the characteristics and advantages of each compression method under different settings. In addition, it presents a progressive search strategy to efficiently explore pareto optimal compression scheme according to the learned prior knowledge combined with the historical evaluation information. Extensive experimental results show that AutoMC can provide satisfying compression schemes within short time, demonstrating the effectiveness of AutoMC.