1.Sim2real transfer learning for 3D pose estimation: motion to the rescue ⬇️
Simulation is an anonymous, low-bias source of data where annotation can often be done automatically; however, for some tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human pose estimation is a particularly interesting example of this sim2real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person's motion, notably as optical flow and the motion of 2D keypoints. Therefore, our results suggest that motion can be a simple way to bridge a sim2real gap when video is available. We evaluate on the 3D Poses in the Wild dataset, the most challenging modern standard of 3D pose estimation, where we show full 3D mesh recovery that is on par with state-of-the-art methods trained on real 3D sequences, despite training only on synthetic humans from the SURREAL dataset.
2.Multi-Instance Multi-Scale CNN for Medical Image Classification ⬇️
Deep learning for medical image classification faces three major challenges: 1) the number of annotated medical images for training are usually small; 2) regions of interest (ROIs) are relatively small with unclear boundaries in the whole medical images, and may appear in arbitrary positions across the x,y (and also z in 3D images) dimensions. However often only labels of the whole images are annotated, and localized ROIs are unavailable; and 3) ROIs in medical images often appear in varying sizes (scales). We approach these three challenges with a Multi-Instance Multi-Scale (MIMS) CNN: 1) We propose a multi-scale convolutional layer, which extracts patterns of different receptive fields with a shared set of convolutional kernels, so that scale-invariant patterns are captured by this compact set of kernels. As this layer contains only a small number of parameters, training on small datasets becomes feasible; 2) We propose a "top-k pooling"" to aggregate the feature maps in varying scales from multiple spatial dimensions, allowing the model to be trained using weak annotations within the multiple instance learning (MIL) framework. Our method is shown to perform well on three classification tasks involving two 3D and two 2D medical image datasets.
3.Guided Image Generation with Conditional Invertible Neural Networks ⬇️
In this work, we address the task of natural image generation guided by a conditioning input. We introduce a new architecture called conditional invertible neural network (cINN). The cINN combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the conditioning input into useful features. All parameters of the cINN are jointly optimized with a stable, maximum likelihood-based training procedure. By construction, the cINN does not experience mode collapse and generates diverse samples, in contrast to e.g. cGANs. At the same time our model produces sharp images since no reconstruction loss is required, in contrast to e.g. VAEs. We demonstrate these properties for the tasks of MNIST digit generation and image colorization. Furthermore, we take advantage of our bi-directional cINN architecture to explore and manipulate emergent properties of the latent space, such as changing the image style in an intuitive way.
4.Believe It or Not, We Know What You Are Looking at! ⬇️
By borrowing the wisdom of human in gaze following, we propose a two-stage solution for gaze point prediction of the target persons in a scene. Specifically, in the first stage, both head image and its position are fed into a gaze direction pathway to predict the gaze direction, and then multi-scale gaze direction fields are generated to characterize the distribution of gaze points without considering the scene contents. In the second stage, the multi-scale gaze direction fields are concatenated with the image contents and fed into a heatmap pathway for heatmap regression. There are two merits for our two-stage solution based gaze following: i) our solution mimics the behavior of human in gaze following, therefore it is more psychological plausible; ii) besides using heatmap to supervise the output of our network, we can also leverage gaze direction to facilitate the training of gaze direction pathway, therefore our network can be more robustly trained. Considering that existing gaze following dataset is annotated by the third-view persons, we build a video gaze following dataset, where the ground truth is annotated by the observers in the videos. Therefore it is more reliable. The evaluation with such a dataset reflects the capacity of different methods in real scenarios better. Extensive experiments on both datasets show that our method significantly outperforms existing methods, which validates the effectiveness of our solution for gaze following. Our dataset and codes are released in this https URL.
5.Deep Saliency Models : The Quest For The Loss Function ⬇️
Recent advances in deep learning have pushed the performances of visual saliency models way further than it has ever been. Numerous models in the literature present new ways to design neural networks, to arrange gaze pattern data, or to extract as much high and low-level image features as possible in order to create the best saliency representation. However, one key part of a typical deep learning model is often neglected: the choice of the loss function.
In this work, we explore some of the most popular loss functions that are used in deep saliency models. We demonstrate that on a fixed network architecture, modifying the loss function can significantly improve (or depreciate) the results, hence emphasizing the importance of the choice of the loss function when designing a model. We also introduce new loss functions that have never been used for saliency prediction to our knowledge. And finally, we show that a linear combination of several well-chosen loss functions leads to significant improvements in performances on different datasets as well as on a different network architecture, hence demonstrating the robustness of a combined metric.
6.Edge-Aware Deep Image Deblurring ⬇️
Image deblurring is a fundamental and challenging low-level vision problem. Previous vision research indicates that edge structure in natural scenes is one of the most important factors to estimate the abilities of human visual perception. In this paper, we resort to human visual demands of sharp edges and propose a two-phase edge-aware deep network to improve deep image deblurring. An edge detection convolutional subnet is designed in the first phase and a residual fully convolutional deblur subnet is then used for generating deblur results. The introduction of the edge-aware network enables our model with the specific capacity of enhancing images with sharp edges. We successfully apply our framework on standard benchmarks and promising results are achieved by our proposed deblur model.
7.FPCNet: Fast Pavement Crack Detection Network Based on Encoder-Decoder Architecture ⬇️
Timely, accurate and automatic detection of pavement cracks is necessary for making cost-effective decisions concerning road maintenance. Conventional crack detection algorithms focus on the design of single or multiple crack features and classifiers. However, complicated topological structures, varying degrees of damage and oil stains make the design of crack features difficult. In addition, the contextual information around a crack is not investigated extensively in the design process. Accordingly, these design features have limited discriminative adaptability and cannot fuse effectively with the classifiers. To solve these problems, this paper proposes a deep learning network for pavement crack detection. Using the Encoder-Decoder structure, crack characteristics with multiple contexts are automatically learned, and end-to-end crack detection is achieved. Specifically, we first propose the Multi-Dilation (MD) module, which can synthesize the crack features of multiple context sizes via dilated convolution with multiple rates. The crack MD features obtained in this module can describe cracks of different widths and topologies. Next, we propose the SE-Upsampling (SEU) module, which uses the Squeeze-and-Excitation learning operation to optimize the MD features. Finally, the above two modules are integrated to develop the fast crack detection network, namely, FPCNet. This network continuously optimizes the MD features step-by-step to realize fast pixel-level crack detection. Experiments are conducted on challenging public CFD datasets and G45 crack datasets involving various crack types under different shooting conditions. The distinct performance and speed improvements over all the datasets demonstrate that the proposed method outperforms other state-of-the-art crack detection methods.
8.Searching for Apparel Products from Images in the Wild ⬇️
In this age of social media, people often look at what others are wearing. In particular, Instagram and Twitter influencers often provide images of themselves wearing different outfits and their followers are often inspired to buy similar clothes.We propose a system to automatically find the closest visually similar clothes in the online Catalog (street-to-shop searching). The problem is challenging since the original images are taken under different pose and lighting conditions. The system initially localizes high-level descriptive regions (top, bottom, wristwear. . . ) using multiple CNN detectors such as YOLO and SSD that are trained specifically for apparel domain. It then classifies these regions into more specific regions such as t-shirts, tunic or dresses. Finally, a feature embedding learned using a multi-task function is recovered for every item and then compared with corresponding items in the online Catalog database and ranked according to distance. We validate our approach component-wise using benchmark datasets and end-to-end using human evaluation.
9.RFBTD: RFB Text Detector ⬇️
Text detection plays a critical role in the whole procedure of textual information extraction and understanding. On a high note, recent years have seen a surge in the high recall text detectors in scene text images, however text boxes for individual words is still a challenging when dense text is present in the scene. In this work, we propose an elegant solution that promotes prediction of words or text lines of arbitrary orientations and directions, providing emphasis on individual words. We also investigate the effects of Receptive Field Blocks(RFB) and its impact in receptive fields for text segments. Experiments were done on the ICDAR2015 and achieves an F-score of 47.09 at 720p
10.Video Crowd Counting via Dynamic Temporal Modeling ⬇️
Crowd counting aims to count the number of instantaneous people in a crowded space, which plays an increasingly important role in the field of public safety. More and more researchers have already proposed many promising solutions to the crowd counting task on the image. With the continuous extension of the application of crowd counting, how to apply the technique to video content has become an urgent problem. At present, although researchers have collected and labeled some video clips, less attention has been drawn to the spatiotemporal characteristics of videos. In order to solve this problem, this paper proposes a novel framework based on dynamic temporal modeling of the relationship between video frames. We model the relationship between adjacent features by constructing a set of dilated residual blocks for crowd counting task, with each phase having an expanded set of time convolutions to generate an initial prediction which is then improved by the next prediction. We extract features from the density map as we find the adjacent density maps share more similar information than original video frames. We also propose a smaller basic network structure to balance the computational cost with a good feature representation. We conduct experiments using the proposed framework on five crowd counting datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
11.Seeing Under the Cover: A Physics Guided Learning Approach for In-Bed Pose Estimation ⬇️
Human in-bed pose estimation has huge practical values in medical and healthcare applications yet still mainly relies on expensive pressure mapping (PM) solutions. In this paper, we introduce our novel physics inspired vision-based approach that addresses the challenging issues associated with the in-bed pose estimation problem including monitoring a fully covered person in complete darkness. We reformulated this problem using our proposed Under the Cover Imaging via Thermal Diffusion (UCITD) method to capture the high resolution pose information of the body even when it is fully covered by using a long wavelength IR technique. We proposed a physical hyperparameter concept through which we achieved high quality groundtruth pose labels in different modalities. A fully annotated in-bed pose dataset called Simultaneously-collected multimodal Lying Pose (SLP) is also formed/released with the same order of magnitude as most existing large-scale human pose datasets to support complex models' training and evaluation. A network trained from scratch on it and tested on two diverse settings, one in a living room and the other in a hospital room showed pose estimation performance of 99.5% and 95.7% in PCK0.2 standard, respectively. Moreover, in a multi-factor comparison with a state-of-the art in-bed pose monitoring solution based on PM, our solution showed significant superiority in all practical aspects by being 60 times cheaper, 300 times smaller, while having higher pose recognition granularity and accuracy.
12.Slim-CNN: A Light-Weight CNN for Face Attribute Prediction ⬇️
We introduce a computationally-efficient CNN micro-architecture Slim Module to design a lightweight deep neural network Slim-Net for face attribute prediction. Slim Modules are constructed by assembling depthwise separable convolutions with pointwise convolution to produce a computationally efficient module. The problem of facial attribute prediction is challenging because of the large variations in pose, background, illumination, and dataset imbalance. We stack these Slim Modules to devise a compact CNN which still maintains very high accuracy. Additionally, the neural network has a very low memory footprint which makes it suitable for mobile and embedded applications. Experiments on the CelebA dataset show that Slim-Net achieves an accuracy of 91.24% with at least 25 times fewer parameters than comparably performing methods, which reduces the memory storage requirement of Slim-net by at least 87%.
13.Analyzing the Cross-Sensor Portability of Neural Network Architectures for LiDAR-based Semantic Labeling ⬇️
State-of-the-art approaches for the semantic labeling of LiDAR point clouds heavily rely on the use of deep Convolutional Neural Networks (CNNs). However, transferring network architectures across different LiDAR sensor types represents a significant challenge, especially due to sensor specific design choices with regard to network architecture as well as data representation. In this paper we propose a new CNN architecture for the point-wise semantic labeling of LiDAR data which achieves state-of-the-art results while increasing portability across sensor types. This represents a significant advantage given the fast-paced development of LiDAR hardware technology. We perform a thorough quantitative cross-sensor analysis of semantic labeling performance in comparison to a state-of-the-art reference method. Our evaluation shows that the proposed architecture is indeed highly portable, yielding an improvement of 10 percentage points in the Intersection-over-Union (IoU) score when compared to the reference approach. Further, the results indicate that the proposed network architecture can provide an efficient way for the automated generation of large-scale training data for novel LiDAR sensor types without the need for extensive manual annotation or multi-modal label transfer.
14.The Indirect Convolution Algorithm ⬇️
Deep learning frameworks commonly implement convolution operators with GEMM-based algorithms. In these algorithms, convolution is implemented on top of matrix-matrix multiplication (GEMM) functions, provided by highly optimized BLAS libraries. Convolutions with 1x1 kernels can be directly represented as a GEMM call, but convolutions with larger kernels require a special memory layout transformation - im2col or im2row - to fit into GEMM interface.
The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of im2col transformation. In contrast to GEMM-based algorithms, the Indirect Convolution does not reshuffle the data to fit into the GEMM primitive but introduces an indirection buffer - a buffer of pointers to the start of each row of image pixels. This broadens the application of our modified GEMM function to convolutions with arbitrary kernel size, padding, stride, and dilation.
The Indirect Convolution algorithm reduces memory overhead proportionally to the number of input channels and outperforms the GEMM-based algorithm by up to 62% on convolution parameters which involve im2col transformations in GEMM-based algorithms. This, however, comes at cost of minor performance reduction on 1x1 stride-1 convolutions.
15.LumièreNet: Lecture Video Synthesis from Audio ⬇️
We present LumièreNet, a simple, modular, and completely deep-learning based architecture that synthesizes, high quality, full-pose headshot lecture videos from instructor's new audio narration of any length. Unlike prior works, LumièreNet is entirely composed of trainable neural network modules to learn mapping functions from the audio to video through (intermediate) estimated pose-based compact and abstract latent codes. Our video demos are available at [22] and [23].
16.Non-structured DNN Weight Pruning Considered Harmful ⬇️
Large deep neural network (DNN) models pose the key challenge to energy efficiency due to the significantly higher energy consumption of off-chip DRAM accesses than arithmetic or SRAM operations. It motivates the intensive research on model compression with two main approaches. Weight pruning leverages the redundancy in the number of weights and can be performed in a non-structured, which has higher flexibility and pruning rate but incurs index accesses due to irregular weights, or structured manner, which preserves the full matrix structure with lower pruning rate. Weight quantization leverages the redundancy in the number of bits in weights. Compared to pruning, quantization is much more hardware-friendly, and has become a "must-do" step for FPGA and ASIC implementations. This paper provides a definitive answer to the question for the first time. First, we build ADMM-NN-S by extending and enhancing ADMM-NN, a recently proposed joint weight pruning and quantization framework. Second, we develop a methodology for fair and fundamental comparison of non-structured and structured pruning in terms of both storage and computation efficiency. Our results show that ADMM-NN-S consistently outperforms the prior art: (i) it achieves 348x, 36x, and 8x overall weight pruning on LeNet-5, AlexNet, and ResNet-50, respectively, with (almost) zero accuracy loss; (ii) we demonstrate the first fully binarized (for all layers) DNNs can be lossless in accuracy in many cases. These results provide a strong baseline and credibility of our study. Based on the proposed comparison framework, with the same accuracy and quantization, the results show that non-structrued pruning is not competitive in terms of both storage and computation efficiency. Thus, we conclude that non-structured pruning is considered harmful. We urge the community not to continue the DNN inference acceleration for non-structured sparsity.
17.DeepMRSeg: A convolutional deep neural network for anatomy and abnormality segmentation on MR images ⬇️
Segmentation has been a major task in neuroimaging. A large number of automated methods have been developed for segmenting healthy and diseased brain tissues. In recent years, deep learning techniques have attracted a lot of attention as a result of their high accuracy in different segmentation problems. We present a new deep learning based segmentation method, DeepMRSeg, that can be applied in a generic way to a variety of segmentation tasks. The proposed architecture combines recent advances in the field of biomedical image segmentation and computer vision. We use a modified UNet architecture that takes advantage of multiple convolution filter sizes to achieve multi-scale feature extraction adaptive to the desired segmentation task. Importantly, our method operates on minimally processed raw MRI scan. We validated our method on a wide range of segmentation tasks, including white matter lesion segmentation, segmentation of deep brain structures and hippocampus segmentation. We provide code and pre-trained models to allow researchers apply our method on their own datasets.
18.A comprehensive evaluation of full-reference image quality assessment algorithms on KADID-10k ⬇️
Significant progress has been made in the past decade for full-reference image quality assessment (FR-IQA). However, new large scale image quality databases have been released for evaluating image quality assessment algorithms. In this study, our goal is to give a comprehensive evaluation of state-of-the-art FR-IQA metrics using the recently published KADID-10k database which is largest available one at the moment. Our evaluation results and the associated discussions is very helpful to obtain a clear understanding about the status of state-of-the-art FR-IQA metrics.
19.Neural Image Captioning ⬇️
In recent years, the biggest advances in major Computer Vision tasks, such as object recognition, handwritten-digit identification, facial recognition, and many others., have all come through the use of Convolutional Neural Networks (CNNs). Similarly, in the domain of Natural Language Processing, Recurrent Neural Networks (RNNs), and Long Short Term Memory networks (LSTMs) in particular, have been crucial to some of the biggest breakthroughs in performance for tasks such as machine translation, part-of-speech tagging, sentiment analysis, and many others. These individual advances have greatly benefited tasks even at the intersection of NLP and Computer Vision, and inspired by this success, we studied some existing neural image captioning models that have proven to work well. In this work, we study some existing captioning models that provide near state-of-the-art performances, and try to enhance one such model. We also present a simple image captioning model that makes use of a CNN, an LSTM, and the beam search1 algorithm, and study its performance based on various qualitative and quantitative metrics.