Kuldeep Gurjar, Surjeet Kumar, Arnav Bhavsar, Kotiba Hamad, Yang-Sae Moon and Dae Ho Yoon

An Explainable Deep Learning-Based Classification Method for Facial Image Quality Assessment

Abstract: Considering factors such as illumination, camera quality variations, and background-specific variations, identifying a face using a smartphone-based facial image capture application is challenging. Face image quality assessment refers to the process of taking a face image as input and producing some form of "quality" estimate as an output. Typically, quality assessment techniques use deep learning methods to categorize images. However, deep learning models are treated as black boxes, which raises questions about their trustworthiness. Several explainability techniques have gained importance in building this trust. Explainability techniques provide visual evidence of the active regions within an image on which a deep learning model bases its prediction. Here, we developed a technique for the reliable prediction of facial images before medical analysis and security operations. A combination of gradient-weighted class activation mapping and local interpretable model-agnostic explanations was used to explain the model. This approach has been implemented in the preselection of facial images for skin feature extraction, which is important in critical medical science applications. We demonstrate that the combined explanations provide better visual explanations for the model, where both the saliency-map-based and perturbation-based explainability techniques verify the predictions.

Keywords: Explainable Deep Learning, Face Image Quality Assessment, Image Classification, MobileNet, Transfer Learning

1. Introduction

Image classification, one of the most fundamental areas of computer vision, is the process of categorizing image data into one of many specified classes. It is the basis for other computer vision tasks such as localization, detection, and segmentation. The categorization of face images is a typical application of image classification, and the identification of applicable or non-applicable image capture conditions is critical in facial recognition applications such as security checks and skin analysis. In addition, in medical image capture applications such as ultraviolet (UV) light-based cancer detection, the user must provide a clear image of the face, with nothing covering the face, and keep their eyes closed to avoid harm from the UV light. Therefore, to enforce standard image capture guidelines, images with eyeglasses, covered faces, unknown objects, or unclear (blurred) faces need to be rejected at the time of capture. The application areas for facial image capture, such as security checks and medical skin analysis, are growing significantly, and maintaining quality is a challenge wherever high-quality facial images are required. With the improved quality of smartphone cameras, facial image-based applications are moving towards mobile-phone-based image capture. However, the most common issues with the quality of facial images are blur, eyeglasses, a covered face, and unknown objects. The existing deep convolutional neural network (CNN)-based facial image capture methods used in the abovementioned application areas lack a thorough understanding of the models, which makes their predictions about facial features doubtful. To use these models in critical application areas such as healthcare and security checks, we need to have more trust in them.
In this study, we developed a model for selecting appropriate facial images for medical analysis and security operations. The model helps capture an acceptable image for analysis so that a high-accuracy output can be generated. Given the challenges in identifying faces using smartphone-based facial image capture applications, this research has important implications for applications such as security checks and medical skin analysis, where high-quality facial images are essential for accurate analysis. Conventionally, the use of deep learning methods for quality assessment has been limited by the black-box nature of these models, which raises concerns regarding the reliability of their predictions. To address this, we developed a new approach that combines gradient-weighted class activation mapping (GradCAM) and local interpretable model-agnostic explanations (LIME) to provide visual evidence of the active regions within the image on which the deep learning model bases its prediction. The combination of these techniques mitigates the drawbacks associated with each of them individually and provides improved visualizations for interpretability. This approach was implemented in a preselection system for facial images that selects clear and applicable images for further processing. To build a trustworthy model, we applied the two explainability techniques, GradCAM and LIME, as well as a new technique that combines them to provide a better explanation of the model. To demonstrate the applicability of our work, we present a workflow for the preselection of images in medical skin analysis, where UV light-based images are used to detect melanoma in the skin. In addition, dermatologists use UV images for skin analysis from a cosmetic point of view. A crucial aspect of UV light is that it is harmful to the eyes. Therefore, a preselection system must verify that a clear facial image is present without eyeglasses or anything else covering the face. Fig. 1 shows the workflow of the proposed system. Here, a transfer-learned model is used to initially classify the image into five categories (blur, eyeglasses, covered face, unknown object, and applicable good image). If the image contains a clear face, further processing is permitted. After this prediction, an applicable image goes through another check, in which closed and open eyes are detected. We implemented traditional image-processing techniques to verify whether the user's eyes were closed or open. When it is confirmed that a clear face with closed eyes is detected, UV light imaging can be performed for skin analysis.

1.1 Application and Research-based Contributions

· Facial image quality assessment and removal of poor-quality images.
· Making face images uniform in terms of quality and format.
· A novel approach to deep learning model explainability based on a combination of GradCAM and LIME.

1.2 Related Work

CNNs have shown great potential in image classification since AlexNet [1] won the ImageNet Challenge [2]. A general trend has been to introduce deeper networks [3–6]. Over time, these models started to take up more storage space and became computationally costly. Owing to the increasing interest in building small and cost-effective neural networks [7,8], a new class of models called MobileNet was designed [9]. These models are ideal for mobile and embedded applications, offering relatively high accuracy for light and computationally efficient models. The application domain for image classification is wide and contributes to various stages of human life.
From image-based healthcare applications to remote sensing and the automotive industry, there have been several research projects, and we list some related works from the application domain point of view. In healthcare, much work has been conducted with the help of image classification [10–13]. We focused on skin analysis, and many similar studies have been conducted in the past few years [14–20]. Anggo and Arapu [21] proposed a face recognition method using principal component analysis (PCA) and Fisher's linear discriminant (FLD). Winarno et al. [22] proposed a face recognition-based attendance system using a hybrid feature extraction method. Min et al. [23] proposed a face recognition method using the PCA method. Priadana and Habibi [24] proposed face detection using the Haar cascades method. Cao et al. [25] proposed a beauty prediction method using a residual-in-residual (RIR) structure in the neural network. As previously mentioned, deep learning has made a significant contribution to many industries. Now that almost all technological advancements depend on artificial intelligence (AI) in general and deep learning in particular, explainability and an understanding of their decisions and internal processes are becoming important. Recently, explainable AI has become a hot area of research in diverse fields of applied AI [26–30]. Two of the most famous among several explainability techniques are LIME, which explains a model's predictions using another local interpretable model [31], and GradCAM [32], in which a coarse localization map is produced from the last convolutional layer in a CNN model, highlighting the important regions in the image for predicting the concept. The use of LIME and GradCAM is becoming increasingly popular for explaining models in various fields [33–39]. Recently, Schlett et al. [40] provided a review of facial image quality assessment.

1.3 Dataset

We collected images belonging to five categories, namely "covered face, eyeglasses, blur, unknown objects, and good face images." These images were stored in the appropriate format for image classification, that is, the training, validation, and test set format for the MobileNet architecture. Of the data, 80% were used for training and 20% for validation and testing.

1.4 Model for Transfer Learning

MobileNet was selected as the base model for transfer learning. MobileNet was introduced by Google researchers to improve the performance of mobile models on state-of-the-art computer vision tasks and benchmarks. MobileNet was designed for mobile and embedded vision applications to reduce the intensive computations involved in earlier deep neural networks. These models have a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks, and MobileNets have shown their effectiveness in many applications. Depthwise separable convolutions are important building blocks for an efficient CNN architecture. In this technique, the convolutional operator is replaced by a factorized version that separates the convolution into two layers. The first layer comprises a depthwise convolution that applies a single lightweight filter to each input channel. The second layer performs a 1 × 1 convolution, called pointwise convolution, which builds new features by computing linear combinations of the input channels.
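For illustration, the following is a minimal sketch of one such depthwise separable block (a depthwise convolution followed by a 1 × 1 pointwise convolution, each with batch normalization and ReLU); the filter count and input shape are placeholders for illustration only, not the exact values used in our model:

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters, stride=1):
    """Depthwise 3x3 filtering per input channel, then a 1x1 pointwise
    convolution that mixes channels into new features."""
    x = layers.DepthwiseConv2D(kernel_size=3, strides=stride,
                               padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(pointwise_filters, kernel_size=1,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

# Example: one block applied to a 224x224 RGB input (shapes are illustrative).
inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = depthwise_separable_block(inputs, pointwise_filters=64)
tf.keras.Model(inputs, outputs).summary()
```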
Depth-wise separable convolutions have a reduced computational cost compared with conventional convolution operations. A standard convolution on an $h_i \times w_i \times d_i$ input tensor $L_i$ uses a convolutional kernel $K \in R^{k \times k \times d_i \times d_j}$ to produce an $h_i \times w_i \times d_j$ output tensor $L_j$, and has the computational cost given in Eq. (1):

$$h_i \cdot w_i \cdot d_i \cdot d_j \cdot k \cdot k \qquad (1)$$
Depth-wise separable convolutions with similar performance have the cost given in Eq. (2):

$$h_i \cdot w_i \cdot d_i \cdot \left(k^2 + d_j\right) \qquad (2)$$
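As a quick, purely illustrative check of Eqs. (1) and (2), the following snippet evaluates both costs for example layer dimensions (the values are arbitrary and are not taken from our network):

```python
# Illustrative comparison of Eq. (1) vs. Eq. (2); dimensions are arbitrary.
h, w = 56, 56           # spatial size of the feature map
d_in, d_out = 128, 128  # input / output channels
k = 3                   # kernel size used in MobileNet

standard_cost = h * w * d_in * d_out * k * k       # Eq. (1)
separable_cost = h * w * d_in * (k * k + d_out)    # Eq. (2)

print(f"standard:  {standard_cost:,}")
print(f"separable: {separable_cost:,}")
print(f"ratio:     {standard_cost / separable_cost:.1f}x")  # roughly 8.4x here
```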
For MobileNet with k = 3, depth-wise separable convolutions are approximately eight to nine times cheaper than regular convolutions. We used the original MobileNet architecture by Howard et al. [9]. In MobileNet, all layers are followed by batch normalization and a ReLU nonlinearity, except for the last fully connected layer. Fig. 2 shows the contents of a convolutional layer in MobileNet. This model is suitable for mobile applications because it enables very memory-efficient inference.

1.5 Interpretability

When using AI models for healthcare and medical applications, we need to evaluate the model deeply and not just rely on its accuracy on the training datasets. AI algorithms and hardware have evolved significantly in the last decade, but accuracy figures alone do not provide useful information regarding the dataset, and a model can learn biases that reduce trust in it. To obtain trust in the model, explainability techniques were used. In this study, we combined two popular techniques for explaining image classification models: heatmaps generated by GradCAM and superpixel-based explanations from LIME, which allows us to examine how the heatmaps and superpixels correlate with each other. To obtain a local interpretable surrogate of a model f for an example image E, LIME performs the following steps:

1. Construct superpixels d in E, which are small homogeneous patches in the image.
2. Generate n new images $x_1, x_2, \ldots, x_n$ by turning the superpixels on and off.
3. Use these images to generate new predictions from model f, i.e., $y_i = f(x_i)$.
4. Build a local weighted model B by fitting the $y_i$ to the presence or absence of the superpixels.

According to LIME, every coefficient of the local weighted surrogate model corresponds to a superpixel of the original image E. A simple way to visualize this is to show only the superpixels with the highest positive coefficients of B, blocking the remaining superpixels. We therefore combined GradCAM and LIME to provide more precise explanations for the model. In 2017, Selvaraju et al. [32] observed that convolutional layers can capture spatial information from the input data that is lost in the fully connected layers. Thus, the last convolutional layer contains both high-level semantics and detailed spatial information. GradCAM uses the gradients of features flowing into the last convolutional layer to produce a localization map that highlights the important regions in the image for prediction. This technique provides visual proof of the functioning of the trained neural network, which allows us to investigate a model that is traditionally considered a black box. It can be applied to almost all CNN model families and explains a model's predictions using heatmaps. To obtain a class-discriminative localization map $L_{Grad\text{-}CAM}^c$ of width u and height v for the target class, we follow these steps:

1. Compute the gradients of the score for class c, $y^c$ (taken before the softmax layer), with respect to the feature maps $A^k$ of the last convolutional layer, that is, $\frac{\partial y^c}{\partial A^k}$.
2. Apply global average pooling to these backflowing gradients over the width (indexed by i) and height (indexed by j) dimensions to obtain the neuron importance weights $a_k^c$:

$$a_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$$
3. Find the weighted combination of the forward activation maps and apply ReLU to obtain:

$$L_{Grad\text{-}CAM}^c = ReLU\left(\sum_k a_k^c A^k\right)$$
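As a rough illustration of these three steps, the following sketch computes a GradCAM heatmap with TensorFlow; the model and the last-convolutional-layer name are placeholders, not the exact ones used in this work:

```python
import numpy as np
import tensorflow as tf

def gradcam_heatmap(model, image, last_conv_layer_name, class_index=None):
    """Steps 1-3: gradients of the class score w.r.t. the last conv layer,
    global-average-pooled into weights a_k^c, then a ReLU'd weighted sum."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]            # y^c

    grads = tape.gradient(class_score, conv_maps)      # dy^c / dA^k
    weights = tf.reduce_mean(grads, axis=(1, 2))       # a_k^c (GAP over i, j)
    heatmap = tf.nn.relu(tf.einsum("bk,bijk->bij", weights, conv_maps))[0]
    return (heatmap / (tf.reduce_max(heatmap) + 1e-8)).numpy()
```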
The result of this weighted combination is a coarse heatmap of the same size as the convolutional feature map. ReLU is applied to the linear combination of maps because we are interested only in the features that have a positive influence on the prediction of the particular class.

2. Methodology

The dataset contained five different categories of images, which were used for transfer learning on the pre-trained MobileNet architecture. The prediction for each class from the model was first explained using LIME, which shows the most significant superpixels responsible for the prediction. Once a prediction gives us the preselection of images, we use the combined explainability technique based on GradCAM and LIME to explain the model's performance in choosing the right image. Once a proper face image is classified, a traditional Haar cascade classifier is used to check whether the eyes in the image are closed or open.

2.1 Training

A MobileNet model with pre-trained ImageNet weights was loaded using the TensorFlow deep-learning framework. The model was slightly modified to provide predictions for the five classes: the last layer of the model was replaced by a softmax activation layer with five output neurons. The remaining layers of the model were not trained, to achieve faster convergence and significant feature reuse.

2.2 Explainability with LIME

To validate the image classification model, we applied LIME. Fig. 3 illustrates this process. A mask was generated that marked the boundaries of the superpixels in the image. Perturbed image samples were then generated by turning different regions, called superpixels, ON and OFF within the image. All samples were predicted, and the samples predicted to belong to the class of interest were selected. A distance metric was then computed to evaluate the difference between the perturbed samples and the original image; the cosine distance was used because the images and samples are multidimensional vectors. After calculating the cosine distances, a kernel function maps each distance to a weight between zero and one. Then, a weighted linear model was trained using the samples, weights, and prediction vectors. This gives us a coefficient that shows the effect of each superpixel on the prediction of the target class. From these coefficients, we select the superpixels that contribute most to the correct prediction of the target class.

2.3 GradCAM and LIME

GradCAM generates a heatmap highlighting the important regions in the image for predicting the target class, whereas LIME first creates superpixels within the target image and then selects the most important superpixels to give a visual interpretation of the prediction. So far, these techniques have mostly been used separately. A combination of the two produces solid visual evidence that is verified by the two most popular explainability techniques in computer vision. Here, we generate the heatmaps using GradCAM from the gradients for the classes in the last convolutional layer of the model. Based on these gradients, the neuron importance weights are calculated. Then, the weighted combination of the forward activation maps is obtained and ReLU activation is applied. The resulting heatmap is overlaid on the original image to obtain the areas of higher activation. After this step, superpixels are generated in the image, and the most important superpixels are selected using the following procedure:

1. Calculate the gradients $\frac{\partial y^c}{\partial A^k}$ for the target class c in the last convolutional layer during prediction.
2. Use these backflowing gradients and apply global average pooling over the width i and height j dimensions to obtain the neuron importance weights $a_k^c$.
3. Find the weighted combination of the forward activation maps and apply ReLU to obtain the heatmap $L_{Grad\text{-}CAM}^c$, as in Section 1.5.
4. Overlay this heatmap on the original image to obtain a new image E that shows the features with a positive influence on the prediction of the target class.
5. Generate superpixels d in E.
6. Generate 1,000 new images $x_1, x_2, x_3, \ldots, x_{1000}$ by turning the superpixels ON and OFF.
7. Calculate a distance metric between each sample and the original image. Given that the samples are simply perturbations of the original image, and all of these are multidimensional vectors, the cosine distance is used as the distance metric.
8. Map these distances to weights between zero and one with the help of a kernel function.
9. Use these weights to fit a weighted linear model to the perturbations and the corresponding outputs.
10. From this model, obtain the coefficient for each superpixel, representing the effect of that superpixel on the prediction.
11. Select the top superpixels and plot them for visualization.

The combination of GradCAM and LIME produces a result that contains both a heatmap and only the important superpixels, thereby defining the most important areas in the image through the two techniques together.

2.4 Haar Cascade

To show the applicability of this work to skin feature analysis, we used an open-source Haar cascade model in OpenCV [41] to detect the presence of eyes in the image and then predict whether the eyes are open or closed. We integrated this model with our system: the applicable face image selected by the MobileNet model is given to the cascade model to predict open or closed eyes within the image.

2.5 Implementation Structure

The implementation structure of the approach includes using a transfer learning-based model for the initial classification of images into five categories: blur, eyeglasses, covered face, unknown object, and applicable good image. The realization of the approach involves training the transfer learning-based model using a large dataset of facial images with varying illumination, camera quality, and backgrounds to ensure that it can accurately classify images into the five categories. The combination of GradCAM and LIME techniques is then used to explain the model's predictions and verify the selection of high-quality face images. For transfer learning, a MobileNet model with pre-trained ImageNet weights was loaded using TensorFlow, the model was modified to predict five output classes, and it was trained on the five classes until convergence. After this, the combination of GradCAM and LIME was used for explainability. The system used in this work includes an RTX 3090 GPU and an Intel Core i9 CPU with 32 GB of RAM, providing the computational power necessary for training and analysis. The environment was set up using Anaconda Spyder with Python 3.7, and the TensorFlow 2.0 library was employed for the deep learning tasks. The model uses transfer learning with 8 output layers configured as softmax and is trained over 30 epochs with a learning rate of 0.0001. For model interpretability, GradCAM and LIME are applied with the following parameters: 1,000 sample images, a kernel size of 4, a maximum distance of 200, a kernel width of 0.25, and the top 5 features identified for explanation purposes.

3. Results and Discussion

We collected a custom dataset with five different classes, blur, glasses, good, mask, and wrong objects, for image classification. In the first step, a MobileNet model with pre-trained weights for the ImageNet dataset was used for transfer learning.
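For reference, the following is a minimal sketch of this transfer-learning setup; the class count, learning rate, and epoch count follow Section 2.5, while the dataset paths, image size, batch size, and classification head are illustrative assumptions (API usage assumes a recent TF 2.x release):

```python
import tensorflow as tf

NUM_CLASSES = 5  # blur, glasses, good, mask, wrong objects

# Frozen MobileNet backbone with ImageNet weights; only the new head is trained.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False,
    pooling="avg", weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Placeholder directory layout: one sub-folder per class.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=tf.keras.applications.mobilenet.preprocess_input)
train_gen = datagen.flow_from_directory("data/train", target_size=(224, 224),
                                        batch_size=32, class_mode="categorical")
val_gen = datagen.flow_from_directory("data/val", target_size=(224, 224),
                                      batch_size=32, class_mode="categorical")

model.fit(train_gen, validation_data=val_gen, epochs=30)
```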
To verify the model and obtain trustworthy predictions, we used explainability techniques. LIME was used to explain the model's behavior for each class in the dataset. To obtain better explanations, both LIME and GradCAM were applied, and the similarity of their results was examined. Using the images predicted by this model, we then decide whether the eyes are open or closed to demonstrate the application. The transfer-learned MobileNet model converged within 30 epochs. Fig. 4(a) shows the training accuracy, and Fig. 4(b) shows the loss for training and validation. As the confusion matrix shows, the model can successfully differentiate the good face images that are applicable to facial feature analysis and security applications. To verify the performance of the transfer learning approach, we compared the results of our model with a simple CNN architecture. Table 1 shows the results for the transfer-learned MobileNet architecture and a simple CNN architecture on the same dataset. As the dataset contains very similar images, a highly optimized model is required to accurately predict the image classes. This comparison between the two models validates that MobileNet yields significantly better classification results on this dataset. Although the predictions on this dataset are accurate with the transfer-learned MobileNet architecture, we need to confirm that the model uses the correct features for prediction. Fig. 5 shows the LIME results for all five categories of images. We can see from the masked images that the model looks at the right superpixels when predicting the facial image categories and at random superpixels for the wrong objects. This demonstrates the usability of this model for facial image classification for skin feature extraction.

Table 1. Classification results of the transfer-learned MobileNet and a simple CNN architecture on the same dataset.
LIME is a useful technique for model interpretability, but the explanations produced by LIME are limited by the parameters used for creating superpixels and by the number of superpixels. In some cases, LIME ignores pixels that have been segmented into smaller regions with no clear facial features present in them. The background also plays a vital role in creating segments within the image for LIME, and certain background conditions will lead to wrong regions being highlighted in the interpretability analysis [42]. For these reasons, and to further explain the model and build trust in its predictions, we combined GradCAM and LIME for better visualization and explanation of the model's prediction behavior. This combination of the two techniques gives better visualizations and reduces the chance of wrong interpretations of the model. To do so, the heatmap was first generated by calculating the gradients for the predicted class in the last convolutional layer of the model. Then, using these backpropagating gradients, the neuron importance weights were obtained. ReLU activation was applied to obtain a coarse heatmap, and then the LIME technique was applied. Both techniques confirm the model's usefulness, and it was noted that the model looks at the right features within the image for classification. Fig. 6(a) shows the results of the explainability analysis using the combination of GradCAM and LIME. Once an image has been classified as a good face image, in which the face is visible and nothing is covering it, we check whether the eyes are closed or open using the cascade classifier available in OpenCV. Fig. 6(b) shows the confusion matrix for the classification of open or closed eyes by the cascade classifier. The use of preselection before the eye classification ensures that only good-quality images reach the final stage of cascade classification, which makes the system more accurate. To compare this work with related work, we considered different performance factors that are useful for practical applications of a computer vision model. We found that our model performs very well on the given dataset; a general comparison of the three main computer vision techniques is given in Table 2 [9,31,32,43–46].

Table 2. General comparison of the three main computer vision techniques.
4. Conclusion

Deep learning is associated with critical decision-making in many fields, including automotive, telecommunications, security, and healthcare. The models used in deep learning were considered black boxes for a long time. Now, these models are being explored with emerging explainability techniques. In this work, we employed transfer learning on the MobileNet architecture to classify facial images for security and healthcare purposes and verified its performance with explainability techniques. This model was trained using face images collected with a smartphone camera and can predict with an accuracy of 92%. The application area for these images involves critical decision-making in security and healthcare, which makes the explainability of the model crucial here. We used LIME and GradCAM to explain the model. Initially, LIME was used for a visual explanation of the model's prediction for each class. The results show that our model looks at the correct features while classifying face images: for the three categories "good image, eyeglasses, and covered face," the model predicts using superpixels mostly centered around the face region, and for the two categories "blur and wrong object," the model selects random superpixels. To obtain better explanations, we combined GradCAM and LIME, providing a new strategy that applies a combination of both techniques to obtain better explanations and learn more about the model. LIME, for instance, chooses different superpixels while making interpretations, and this has been reported multiple times. A combination of GradCAM and LIME therefore ensures more trust in the decisions we make using deep learning models.

Biography

Kuldeep Gurjar
https://orcid.org/0000-0002-0800-6410
He received B.S. (2006) and M.S. (2010) degrees in Computer Science from the Department of Computer Science and Information Technology, University of Rajasthan, Jaipur. From 2010 to 2011, he worked for a website development company (Octal Info Solutions). From March 2012 to 2018, he was with the Department of Computer Science and Engineering at Kangwon National University as a Ph.D. candidate. He is an assistant professor at the University of Suwon, and his current research areas are digital health and computer vision.

Biography

Arnav Bhavsar
https://orcid.org/0000-0003-2849-4375
He received his Ph.D. from IIT Madras in 2011. He then worked as a postdoc at GE Global Research and the University of North Carolina, Chapel Hill, in 2011 and 2012, respectively. He joined IIT Mandi in 2013 as an assistant professor and has been working there as an associate professor since 2019.

Biography

Yang-Sae Moon
https://orcid.org/0000-0002-2396-0405
He received B.S. (1991), M.S. (1993), and Ph.D. (2001) degrees in Computer Science from the Korea Advanced Institute of Science and Technology (KAIST). From 1993 to 1997, he was a research engineer at Hyundai Syscomm Inc., where he participated in developing 2G and 3G mobile communication systems. He is currently a professor in the Computer Science Department at Kangwon National University.

References