# [Paper] Backprop: Visualising Image Classification Models and Saliency Maps (Weakly Supervised Object Localization)

**Weakly Supervised Object Localization (WSOL) Using AlexNet**

In this story, **Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps (Backprop)**, by the Visual Geometry Group, University of Oxford, is briefly presented. As you may already know, this is a paper from the famous VGG research group. It is called Backprop here because later papers refer to it by that name.

**Weakly supervised object localization (WSOL)** aims to find the bounding box of the main object in an image using only the image-level label, without any bounding box annotation.

In this paper:

- **Two visualizing methods** are proposed: a **gradient-based** class model visualisation and an image-specific class **saliency** map.
- For the saliency-based method, **GraphCut** is utilized for **weakly supervised object localization (WSOL)**.

This is a paper in **2014 ICLR Workshop** with over **2200 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Gradient-Based Class Model Visualisation**
2. **Image-Specific Class Saliency Visualisation**
3. **Weakly Supervised Object Localization (WSOL)**

# 1. Gradient-Based **Class Model** Visualisation

- An **AlexNet-like CNN is used**: conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000, where conv*N* denotes a convolutional layer with *N* filters, and full*M* denotes a fully-connected layer with *M* outputs.
- Let *Sc*(*I*) be the **score of the class** *c*, computed by the classification layer of the ConvNet for an image *I*. We would like to **find an L2-regularised image such that the score** *Sc* **is high**:

$$\arg\max_I \; S_c(I) - \lambda \|I\|_2^2$$

- where *λ* is the regularization parameter. A locally-optimal *I* can be found by the back-propagation method.
- **The (unnormalised) class scores** *Sc*, taken before the soft-max layer, are used rather than the class posteriors returned by the soft-max layer.
- The optimization is performed with respect to the input image, using the zero image as initialization, and the training set mean image is then added to the result.
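The optimisation described above can be sketched as plain gradient ascent on the input. This is a minimal sketch and not the paper's implementation: the tiny tanh network (`W1`, `w2`), the step size `lr`, the number of steps and the value of `lam` are all hypothetical stand-ins for a trained ConvNet and its settings. The final step of adding the training set mean image is omitted because there is no training set here.

```python
import numpy as np

def class_score_and_grad(image, W1, w2):
    """Toy class score S_c(I) = w2 . tanh(W1 @ I) and its gradient w.r.t. I."""
    hidden = np.tanh(W1 @ image)
    score = float(w2 @ hidden)
    # Chain rule: dS/dI = W1^T (w2 * (1 - tanh^2(W1 I)))
    grad = W1.T @ (w2 * (1.0 - hidden ** 2))
    return score, grad

def visualise_class(W1, w2, lam=0.1, lr=0.005, steps=500):
    """Gradient ascent on S_c(I) - lam * ||I||_2^2, starting from the zero image."""
    image = np.zeros(W1.shape[1])
    for _ in range(steps):
        _, grad = class_score_and_grad(image, W1, w2)
        # Gradient of the L2 penalty lam * ||I||^2 is 2 * lam * I
        image += lr * (grad - 2.0 * lam * image)
    return image

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16))   # hypothetical hidden-layer weights
w2 = rng.normal(size=8)         # hypothetical output weights for class c
best = visualise_class(W1, w2)

# Regularised objective at the result; it is exactly 0 at the zero image.
objective = class_score_and_grad(best, W1, w2)[0] - 0.1 * float(best @ best)
```

With a real ConvNet, the hand-written gradient would be replaced by one back-propagation pass per step, with the weights kept fixed and only the image updated.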

# 2. Image-Specific Class Saliency Visualisation

- Consider the linear score model for the class *c*:

$$S_c(I) = w_c^T I + b_c$$

- It is easy to see that **the magnitude of the elements of** *w* **defines the importance of the corresponding pixels of** *I* **for the class** *c*.
- In the case of deep ConvNets, the class score *Sc*(*I*) is a highly non-linear function of *I*. However, given an image *I*0, we can approximate *Sc*(*I*) with a linear function in the neighbourhood of *I*0 by computing the first-order Taylor expansion:

$$S_c(I) \approx w^T I + b$$

- where *w* is the derivative of *Sc* with respect to the image *I* at the point (image) *I*0:

$$w = \left. \frac{\partial S_c}{\partial I} \right|_{I_0}$$

- Another interpretation is that **the magnitude of the derivative indicates which pixels need to be changed the least to affect the class score the most.**
- One can expect that **such pixels correspond to the object location in the image**.
- The saliency map is *Mij* = |*w*_*h*(*i*,*j*)|, where *h*(*i*,*j*) is the index of the element of *w* corresponding to the image pixel in the *i*-th row and *j*-th column.
- It is important to note that **the saliency maps are extracted using a classification ConvNet trained on the image labels, so no additional annotation is required** (such as object bounding boxes or segmentation masks).
- The **computation** of the image-specific saliency map for a single class is **extremely quick**, since it **only requires a single back-propagation pass**.
- The figures above show some examples. The class predictions are computed on 10 cropped and reflected sub-images, so 10 saliency maps are computed on these 10 sub-images and then averaged.
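The saliency computation can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the toy tanh network (`W1`, `w2`) is a hypothetical stand-in for the trained ConvNet, and its gradient is written by hand instead of using a framework's backward pass. Taking the maximum magnitude over colour channels for an RGB input is also an assumption here, since the text above only gives the single-channel form *Mij* = |*w*_*h*(*i*,*j*)|.

```python
import numpy as np

def score_gradient(image, W1, w2):
    """Gradient w of the toy score S_c(I) = w2 . tanh(W1 @ vec(I)) w.r.t. I."""
    hidden = np.tanh(W1 @ image.ravel())
    grad = W1.T @ (w2 * (1.0 - hidden ** 2))
    return grad.reshape(image.shape)

def saliency_map(image, W1, w2):
    """M_ij = max over colour channels of |w| at pixel (i, j)."""
    w = score_gradient(image, W1, w2)   # plays the role of one backward pass
    return np.abs(w).max(axis=-1)

rng = np.random.default_rng(0)
height, width, channels = 4, 5, 3
image = rng.normal(size=(height, width, channels))        # hypothetical input I_0
W1 = 0.1 * rng.normal(size=(6, height * width * channels))
w2 = rng.normal(size=6)

M = saliency_map(image, W1, w2)   # one non-negative value per pixel
```

Averaging over the 10 sub-images mentioned above would amount to calling `saliency_map` on each crop, mapping the results back to image coordinates and taking the mean.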

# 3. Weakly Supervised Object Localization (WSOL)

## 3.1. Segmentation Using GraphCut

- Given an image and the corresponding class saliency map, we compute the object segmentation mask using the **GraphCut colour segmentation**.

Conceptually, given seeds, GraphCut segments the image based on colour. In this paper, the seeds are provided by the saliency map.

- (GraphCut is another big research topic. If interested, please read the paper about GraphCut: “Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D Images” in 2001 ICCV, which has over 5000 citations.)
- Foreground and background colour models were set to be the Gaussian Mixture Models. The foreground model was estimated from the pixels with the saliency higher than a threshold, set to the 95% quantile of the saliency distribution in the image; the background model was estimated from the pixels with the saliency smaller than the 30% quantile.

Once the image pixel labelling into foreground and background is computed, the object segmentation mask is set to the largest connected component of the foreground pixels.
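The two steps around GraphCut, picking seeds from the saliency quantiles and keeping the largest connected foreground component, can be sketched as follows. GraphCut itself and the Gaussian Mixture Model fitting are not implemented here. The function names and the choice of 4-connectivity are assumptions made for illustration, since the connectivity is not specified above.

```python
import numpy as np
from collections import deque

def saliency_seeds(saliency, fg_q=0.95, bg_q=0.30):
    """Foreground seeds above the 95% quantile, background seeds below the 30% quantile."""
    fg_thr = np.quantile(saliency, fg_q)
    bg_thr = np.quantile(saliency, bg_q)
    return saliency > fg_thr, saliency < bg_thr

def largest_component(mask):
    """Largest 4-connected component of a boolean mask, found by breadth-first search."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    best = []
    for i in range(h):
        for j in range(w):
            if not mask[i, j] or seen[i, j]:
                continue
            seen[i, j] = True
            queue, component = deque([(i, j)]), []
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] and not seen[nr, nc]:
                        seen[nr, nc] = True
                        queue.append((nr, nc))
            if len(component) > len(best):
                best = component
    out = np.zeros((h, w), dtype=bool)
    for r, c in best:
        out[r, c] = True
    return out

# Toy saliency map with values 0..99, so the quantile thresholds are easy to check.
saliency = np.arange(100, dtype=float).reshape(10, 10)
fg_seeds, bg_seeds = saliency_seeds(saliency)

# Toy foreground labelling with a 4-pixel blob and a 2-pixel blob.
foreground = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
], dtype=bool)
object_mask = largest_component(foreground)
```

In a full pipeline, the seed masks would initialise the foreground and background colour models, and `largest_component` would be applied to the foreground labelling that the segmentation returns.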

## 3.2. ILSVRC-2013 Localisation Challenge

- The above object localisation method was entered into the ILSVRC-2013 localisation challenge.
- Since the challenge requires object bounding boxes to be reported, the bounding boxes are computed from the object segmentation masks.
- The procedure was repeated for each of the top-5 predicted classes.
- The method achieved **46.4% top-5 error on the test set of ILSVRC-2013**.
- It should be noted that the method is **weakly supervised** (unlike the challenge winner with 29.9% error), and **the object localisation task was not taken into account during training**.
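Turning a segmentation mask into a bounding box can be as simple as taking the extent of the foreground pixels. The sketch below assumes a tight axis-aligned box with inclusive (row, column) coordinates. That convention and the helper name `mask_to_bbox` are illustrative choices, not details taken from the paper.

```python
import numpy as np

def mask_to_bbox(mask):
    """Tight box (row_min, col_min, row_max, col_max), inclusive; None for an empty mask."""
    rows = np.flatnonzero(mask.any(axis=1))   # rows containing foreground
    cols = np.flatnonzero(mask.any(axis=0))   # columns containing foreground
    if rows.size == 0:
        return None
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])

# Toy mask: a 3x4 rectangle of foreground pixels inside a 6x8 image.
mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:7] = True
bbox = mask_to_bbox(mask)
```

For the challenge setting described above, this would be run once per predicted class, on the mask obtained from that class's saliency map.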

## Reference

[2014 ICLR Workshop] [Backprop]

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
