Project

Vector Quantization with Self-Attention for Quality-Independent Representation Learning

  • Zhou Yang, Weisheng Dong, Xin Li, Mengluan Huang, Yulin Sun and Guangming Shi
  • Figure 1. The overall architecture of our proposed method. Each input mini-batch contains both clean and corrupted images. Features extracted from the backbone network are quantized by the codebook module. Then z and \hat{z} are concatenated and pooled. After being enhanced by the SA module, the features are fed into the head network to produce the final output.

    Abstract

      Recently, the robustness of deep neural networks has drawn extensive attention due to the potential distribution shift between training and testing data (e.g., deep models trained on high-quality images are sensitive to corruption during testing). Many researchers attempt to make models learn invariant representations from multiple corrupted datasets through data augmentation or image-pair-based feature distillation to improve robustness. Inspired by sparse representation in image restoration, we opt to address this issue by learning an image-quality-independent feature representation in a simple plug-and-play manner, that is, by introducing discrete vector quantization (VQ) to remove redundancy in recognition models. Specifically, we first add a codebook module to the network to quantize deep features. Then we concatenate them and design a self-attention module to enhance the representation. During training, we enforce the quantization of features from clean and corrupted images into the same discrete embedding space so that an invariant quality-independent feature representation can be learned to improve the recognition robustness of low-quality images. Qualitative and quantitative experimental results show that our method achieves this goal effectively, leading to a new state-of-the-art result of 43.1% mCE on ImageNet-C with ResNet50 as the backbone.
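
      The pipeline described in the abstract and Figure 1 can be illustrated with the minimal PyTorch-style sketch below. It is only a sketch under stated assumptions: the module names (Codebook, SelfAttention, QualityIndependentNet), feature dimensions, codebook size, and the straight-through estimator are illustrative choices, not the authors' released implementation, and the training losses (e.g., VQ commitment terms) are omitted.

import torch
import torch.nn as nn


class Codebook(nn.Module):
    """Quantize each feature vector to its nearest codeword (discrete VQ).
    Codebook size and dimension are assumed values for illustration."""

    def __init__(self, num_codes=512, dim=2048):
        super().__init__()
        self.embedding = nn.Embedding(num_codes, dim)

    def forward(self, z):                                    # z: (B, dim)
        dist = torch.cdist(z, self.embedding.weight)         # (B, num_codes)
        idx = dist.argmin(dim=1)                             # nearest codeword index
        z_q = self.embedding(idx)                            # quantized features \hat{z}
        # Straight-through estimator so gradients still reach the backbone.
        return z + (z_q - z).detach()


class SelfAttention(nn.Module):
    """Single-head self-attention used to enhance the concatenated features."""

    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, N, dim)
        attn = torch.softmax(
            self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        return attn @ self.v(x)


class QualityIndependentNet(nn.Module):
    """Backbone -> codebook quantization -> concat(z, \hat{z}) -> SA module -> head."""

    def __init__(self, backbone, num_classes=1000, dim=2048):
        super().__init__()
        self.backbone = backbone                 # e.g., a ResNet50 feature extractor
        self.codebook = Codebook(dim=dim)
        self.sa = SelfAttention(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x mixes clean and corrupted images
        z = self.backbone(x).flatten(1)          # (B, dim) pooled deep features
        z_hat = self.codebook(z)                 # quantized features from the shared codebook
        feats = torch.stack([z, z_hat], dim=1)   # combine z and \hat{z}: (B, 2, dim)
        feats = self.sa(feats).mean(dim=1)       # enhance with self-attention, then pool
        return self.head(feats)                  # final classification output

      Because clean and corrupted features in a mini-batch are quantized against the same codebook, nearby inputs of different quality tend to map to the same codewords, which is what encourages the quality-independent representation.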

    Paper & Code & Demo

    Experimental Results

      Figure 2. Detailed top-1 accuracy of the different methods for each corruption type on the ImageNet-C benchmark.

    Result Visualization

      Figure 3. Class-wise feature distribution. Symbols of the same color share the same label, while markedly different colors indicate features of different labels. Dots denote features of clean images, and triangles denote features of degraded images.

      Figure 4. Grad-CAM maps of different models on defocus-blur images. (a) The original clean images. (b) Maps of the vanilla ResNet50 model on clean images. (c) and (d) Maps of QualNet50 and our proposed method on defocus-blur images at severity level 3. The results show that our method can still focus on the salient object region without being seriously affected by the corruption.

    Citation

    @inproceedings{yang2023vector,
     title={Vector Quantization with Self-Attention for Quality-Independent Representation Learning},
     author={Yang, Zhou and Dong, Weisheng and Li, Xin and Huang, Mengluan and Sun, Yulin and Shi, Guangming},
     booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
     year={2023},
    }

    Contact

    Zhou Yang, Email: yang_zhou@stu.xidian.edu.cn
    Mengluan Huang, Email: mlhuang@stu.xidian.edu.cn
    Weisheng Dong, Email: wsdong@mail.xidian.edu.cn
    Xin Li, Email: xin.li@mail.wvu.edu
    Yulin Sun, Email: daitusun@gmail.com
    Guangming Shi, Email: gmshi@xidian.edu.cn