Sanity Checks on XAI with Captum, INNvestigate, and TorchRay
Is AI lying to you? (Part 2)
Assessing the scope and quality of results provided by eXplainable AI
Sanity Checks
In the previous article, we gave a brief introduction to the whys and hows of Explainable Artificial Intelligence. Here we are going to look at the results of some Sanity Checks experiments.
The experiments conducted here aim to answer the following question: what assures us that the explanation provided by a method reliably reflects what the network has learned in order to make that decision?
Specifically, we want to assess the sensitivity of explanation methods to model parameters: if a method really highlights the most important regions of the input, then randomly reinitializing the parameters of the last layer of the network, which changes the output, should also change the explanation.
Surprisingly, some of the methods proposed in the literature are model-independent and therefore fail the randomization test. Because they provide the same explanation even after the model parameters have been randomized, such methods are inadequate to explain the network's prediction faithfully.
The reliability of explanation methods is crucial in tasks where visual inspection of the results is not easy or where the cost of an incorrect attribution is high. The analysis conducted here aims to provide useful insights for developing better and more reliable visualization methods for deep neural networks, so as to gain even the most skeptical users' trust.
Tools
Given the attention deep learning applications are currently receiving, several software tools have been developed to facilitate model interpretation.
[Alber et al., 2019] presented the INNvestigate library, an interpretability tool based on Keras, which provides a common interface and out-of-the-box implementations of many analysis methods. All the works published so far on the analysis of image classifiers use the INNvestigate interface.
However, a prominent and very recent library is Captum, an extension of the PyTorch deep learning package, announced by Facebook at the PyTorch Developer Conference in 2019 and released this year (2020).
Captum provides support for most of the feature attribution techniques described in this work. To date, it is the only library that contains implementations of methods approximating Shapley values, such as DeepLiftShap and GradientShap.
For this reason, in our experiments we chose to use the Captum interface for all explanation methods, except for the LRP-αβ variants and the RAP method, for which we used the code released by [Nam et al., 2020].
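As a quick illustration of what using Captum looks like, here is a minimal sketch of attributing a prediction with GradientShap. The pretrained VGG16, the random input tensor, and the target class index are placeholders, not the model and data used in our experiments.

```python
import torch
import torchvision.models as models
from captum.attr import GradientShap

# Placeholder model, input, and target class (not the article's setup).
model = models.vgg16(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)   # stand-in for a preprocessed image batch
target = 954                     # ImageNet index for "banana", as an example

# GradientShap samples baselines from a reference distribution;
# here we simply provide an all-zeros and an all-ones image.
baselines = torch.cat([torch.zeros_like(x), torch.ones_like(x)])

attributions = GradientShap(model).attribute(
    x, baselines=baselines, n_samples=20, target=target
)
print(attributions.shape)        # same shape as the input: (1, 3, 224, 224)
```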
To confirm our results, we also tested the algorithms implemented by another PyTorch interpretability library, TorchRay, recently released by [Fong et al., 2019] to provide researchers and developers with an easy way to understand which features contribute to a model's output. The results obtained with both libraries are consistent with the literature.
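For comparison, this is roughly what a TorchRay call looks like, adapted from the example in the library's documentation; the demo model, image, and layer name come from TorchRay's own benchmark helpers.

```python
from torchray.attribution.grad_cam import grad_cam
from torchray.benchmark import get_example_data, plot_example

# Obtain a demo model, image, and category from TorchRay's helpers.
model, x, category_id, _ = get_example_data()

# Run Grad-CAM on the chosen convolutional layer.
saliency = grad_cam(model, x, category_id, saliency_layer='features.29')

# Plot the input next to the resulting saliency map.
plot_example(x, saliency, 'grad-cam backprop', category_id)
```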
First Experiment: Randomization
This first sanity check assesses the sensitivity of explanation methods to model parameters by answering the following question:
if the parameters of the model are randomized and therefore the network output changes, do the saliency maps change too?
If the method really highlights the most important input regions, then randomly reinitializing the parameters of the last fully connected layer should change the explanation.
In our experiment, we destroyed the learned parameters in cascade, starting from the top layer (the last fully connected) and moving down to the bottom layer (the first convolutional). At each step, we produced an explanation with respect to the network's weights at that moment.
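As a rough sketch of this cascading procedure, the snippet below reinitializes the parameterized layers from the top down and recomputes an explanation at each step. A pretrained VGG16 and Captum's Saliency are used purely as stand-ins for the network and methods of our experiments; the input tensor and target class are placeholders as well.

```python
import copy
import torch
import torchvision.models as models
from captum.attr import Saliency

# Placeholder model, input, and target class (stand-ins, not the article's setup).
model = models.vgg16(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)
target = 954

# Explanation of the fully trained network, used later as the reference.
reference_attr = Saliency(model).attribute(x, target=target).detach()

# Copy the model and list its parameterized layers from the last fully
# connected layer down to the first convolutional one.
randomized = copy.deepcopy(model)
layers_top_down = [m for m in randomized.modules()
                   if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d))][::-1]

explanations = []
for layer in layers_top_down:
    layer.reset_parameters()  # destroy the learned weights (cascading:
                              # previously randomized layers stay randomized)
    attr = Saliency(randomized).attribute(x, target=target).detach()
    explanations.append(attr)
```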
In the image below, we can see some visual examples of these results. In the first column, we have the explanation computed on the fully trained network; in the other columns, the explanation was obtained after the randomization of the indicated layer.
We can see that the explanations produced by some algorithms, like Saliency Map or Input*Gradient, become less clear when we destroy the network; in other words, those algorithms pass our test. DeconvNet reaches this result only when we destroy the four convolutional layers. Lastly, Guided Backprop clearly does not pass the test.
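For reference, all four of these gradient-based methods are available in Captum. A minimal sketch, reusing the placeholder model, x, and target introduced above:

```python
from captum.attr import Saliency, InputXGradient, Deconvolution, GuidedBackprop

# The four methods compared above, instantiated on the placeholder model.
methods = {
    'Saliency Map': Saliency(model),
    'Input*Gradient': InputXGradient(model),
    'DeconvNet': Deconvolution(model),
    'Guided Backprop': GuidedBackprop(model),
}

# One attribution map per method for the same input and target class.
attributions = {name: m.attribute(x, target=target).detach()
                for name, m in methods.items()}
```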
To obtain a more concrete numerical result than a visual impression, we computed the Structural Similarity Index (SSIM). This metric allowed us to assess the similarity between two explanations.
In particular, we scaled all saliency maps to lie in [0, 1], and we compared the explanation obtained with the trained network to the one obtained from a network with randomized weights, one layer at a time, for all layers.
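A sketch of this comparison, assuming scikit-image is available and reusing reference_attr and explanations from the cascading-randomization snippet above:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def to_map(saliency):
    """Collapse a (1, C, H, W) attribution to a 2-D map scaled to [0, 1]."""
    s = saliency.squeeze().cpu().numpy()
    if s.ndim == 3:                       # sum absolute values over channels
        s = np.abs(s).sum(axis=0)
    s = s - s.min()
    return s / (s.max() + 1e-12)

reference = to_map(reference_attr)
ssim_per_layer = [ssim(reference, to_map(e), data_range=1.0)
                  for e in explanations]
```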
We set a seed to make the analysis reproducible. However, since some methods are sensitive to the particular random weights drawn, we decided to consider 5 different seeds and then take the mean.
In the following plot, the line represents the mean SSIM obtained from cascading randomization, while the shaded area shows the standard deviation interval.
A method fails the test if the SSIM remains equal or close to 1 already after randomizing the last layer, fc3.
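Aggregating over seeds and applying the failure criterion could look like the following sketch; all_runs is a placeholder for the per-layer SSIM lists collected by rerunning the snippet above under different seeds, and the 0.9 threshold is purely illustrative.

```python
import numpy as np

# Placeholder: one list of per-layer SSIM values for each of the 5 seeds.
ssim_per_seed = np.array(all_runs)        # shape: (n_seeds, n_layers)

mean_ssim = ssim_per_seed.mean(axis=0)    # line in the plot
std_ssim = ssim_per_seed.std(axis=0)      # shaded area in the plot

# The first entry corresponds to randomizing only the last layer (fc3):
# if it stays close to 1, the explanation ignored the change and the method fails.
fails_randomization_test = mean_ssim[0] > 0.9   # illustrative threshold only
```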
Second Experiment: Class Sensitivity
The performance of an explanation method on a class-discrimination task has also been used to assess its quality.
In fact, to be a good visualization method and produce clean, visually human-interpretable results, it is essential to produce discriminative visualizations for the class of interest. Specifically, when an image contains objects labeled with several different classes, the method should highlight only the object belonging to the class of interest.
Just think of how important it is, in clinical radiology practice, that the method highlights exactly the region where a tumor is present.
In many clinical scenarios, a clinical diagnosis is not a trivial task and is prone to interpretation errors.
Let us consider a clinical case presenting both a malignant and a benign tumor in the liver. The trained network predicts the malignant tumor with 60% confidence and the benign one with 30%.
Obviously, before communicating the diagnosis to the patient, the doctor wants to make sure that the network has correctly identified the affected area and, above all, has learned to distinguish a malignant tumor from a benign one. For this purpose, the doctor applies one of the explanation methods.
The method is required to highlight only the region strictly affected by the malignant tumor, completely ignoring the benign one. If the method also highlights the area of the benign tumor, the interpretation of the prediction becomes more complicated and the risk of a misdiagnosis increases.
To measure whether our methods help distinguish between classes, we selected an image from the Tiny ImageNet validation set that contains exactly two annotated categories, and we created visualizations for each of them. The image depicts a fruit basket in which a banana and an orange are visible. For the sake of clarity, we again treat the four groups of methods separately.
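A sketch of how such class-specific maps can be produced with Captum's Grad-CAM implementation is shown below. It reuses the placeholder VGG16 and input from the earlier snippets, and the ImageNet indices for "banana" and "orange" stand in for the Tiny ImageNet labels we actually used.

```python
from captum.attr import LayerGradCam, LayerAttribution

# Grad-CAM on the last convolutional layer of the placeholder VGG16.
gradcam = LayerGradCam(model, model.features[28])

banana_attr = gradcam.attribute(x, target=954)   # ImageNet "banana"
orange_attr = gradcam.attribute(x, target=950)   # ImageNet "orange"

# Upsample the coarse layer maps to the input resolution for comparison.
banana_map = LayerAttribution.interpolate(banana_attr, x.shape[2:])
orange_map = LayerAttribution.interpolate(orange_attr, x.shape[2:])
```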
We can see that only the Grad-CAM algorithm (and partially Guided Grad-CAM) respects the class-sensitivity property: its explanation for the banana class highlights exactly the region where the banana, but not the orange, is present in the first figure, and vice versa in the second one.
The following are the computed SSIM values between the banana and the orange explanations:
Conclusions and Future Research Directions
Although explanation methods are hard to evaluate empirically, because it is difficult to distinguish errors of the model from errors of the attribution method, determining how explanation methods fail is an important stepping stone to understanding where and how we should use these methods.
A reliable explanation of a prediction reveals why a classifier makes a certain prediction, and it helps users accept or reject the prediction with greater confidence.
Our analysis revealed that most of the attribution methods in the literature have theoretical properties contrary to this goal. Invariance under model randomization gave us a concrete way to rule out a method's adequacy for certain tasks and to unmask its weaknesses.
However, explanations that do not depend on model parameters or training data might still depend on the model architecture and provide useful information about the priors incorporated in that architecture.
We see several promising directions for future work. One possibility is to understand how to increase negative relevance scores in methods that backpropagate only positive ones. In fact, negative scores seem crucial to avoid convergence to a rank-1 matrix, which is responsible for the model insensitivity, and to increase class discriminativeness.
Furthermore, we could expand this analysis to other explanation algorithms, such as model-agnostic, perturbation-based methods, and perform other tests, like label randomization, useful for investigating the trustworthiness of XAI.
We hope that our work can guide researchers in assessing the scope of model explanation methods and designing new model explanations along these lines.