Development of low-overhead soft error mitigation technique for safety critical neural networks applications
Deep Neural Networks (DNNs) have been widely applied in healthcare applications. DNN-based healthcare applications are safety-critical systems that require high-reliability implementation due to a high risk of human death or injury in case of malfunction. Several DNN accelerators are used to execute...
Main Author: | Khalid Adam, Ismail Hammad |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2021 |
Subjects: | T Technology (General); TA Engineering (General). Civil engineering (General) |
Online Access: | http://umpir.ump.edu.my/id/eprint/34715/1/Development%20of%20low-overhead%20soft%20error%20mitigation%20technique%20for%20safety%20critical%20neural.ir.pdf http://umpir.ump.edu.my/id/eprint/34715/ |
id |
my.ump.umpir.34715 |
---|---|
record_format |
eprints |
spelling |
my.ump.umpir.34715 2022-10-14T02:57:30Z http://umpir.ump.edu.my/id/eprint/34715/ Development of low-overhead soft error mitigation technique for safety critical neural networks applications Khalid Adam, Ismail Hammad T Technology (General) TA Engineering (General). Civil engineering (General) Deep Neural Networks (DNNs) have been widely applied in healthcare applications. DNN-based healthcare applications are safety-critical systems that require high-reliability implementation because a malfunction carries a high risk of human death or injury. Several DNN accelerators are used to execute these DNN models, and GPUs are currently the most prominent and dominant of them. However, GPUs are prone to soft errors that dramatically affect GPU behavior; such errors may corrupt data values or logic operations, resulting in Silent Data Corruption (SDC). SDC that occurs in the GPU’s hardware components propagates from the physical level to the application level and results in misclassification of objects in DNN models, leading to disastrous consequences. The Food and Drug Administration (FDA) reported that 1078 of the adverse events (10.1%) were unintended errors (i.e., soft errors), including 52 injuries and two deaths. Several traditional techniques have been proposed to protect electronic devices from soft errors by replicating the DNN models. However, these techniques incur significant area, performance, and energy overheads, making them challenging to implement in healthcare systems that have strict deadlines. To address this issue, this study developed a Selective Mitigation Technique based on the standard Triple Modular Redundancy (S-MTTM-R) to determine the model’s vulnerable parts, distinguishing Malfunction and Light-Malfunction errors. A comprehensive vulnerability analysis was performed with the SASSIFI fault injector on the AlexNet and DenseNet201 CNN models at the layer, kernel, and instruction levels, with faults injected while the models ran on NVIDIA GPUs, to show both models’ resilience and to identify and harden the most vulnerable portions. The experimental results showed that S-MTTM-R achieved a significant improvement in error masking. For AlexNet, the No-Malfunction rate improved from 54.90%, 67.85%, and 59.36% to 62.80%, 82.10%, and 80.76% in the three modes RF, IOA, and IOV, respectively. For DenseNet, it improved from 43.70%, 67.70%, and 54.68% to 59.90%, 84.75%, and 83.07% in the same three modes. Importantly, S-MTTM-R decreased the percentage of errors that cause misclassification (Malfunction) from 3.70% to 0.38% for AlexNet and from 5.23% to 0.23% for DenseNet. The performance analysis showed that S-MTTM-R achieved lower overhead than the well-known protection techniques Algorithm-Based Fault Tolerance (ABFT), Double Modular Redundancy (DMR), and Triple Modular Redundancy (TMR). In light of these results, the study found strong evidence that the developed S-MTTM-R successfully mitigated soft errors for DNN models on GPUs with low overheads in energy, performance, and area, indicating a remarkable improvement in model reliability for the healthcare domain. 2021-05 Thesis NonPeerReviewed pdf en http://umpir.ump.edu.my/id/eprint/34715/1/Development%20of%20low-overhead%20soft%20error%20mitigation%20technique%20for%20safety%20critical%20neural.ir.pdf Khalid Adam, Ismail Hammad (2021) Development of low-overhead soft error mitigation technique for safety critical neural networks applications. PhD thesis, Universiti Malaysia Pahang. |
institution |
Universiti Malaysia Pahang |
building |
UMP Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Malaysia Pahang |
content_source |
UMP Institutional Repository |
url_provider |
http://umpir.ump.edu.my/ |
language |
English |
topic |
T Technology (General) TA Engineering (General). Civil engineering (General) |
spellingShingle |
T Technology (General) TA Engineering (General). Civil engineering (General) Khalid Adam, Ismail Hammad Development of low-overhead soft error mitigation technique for safety critical neural networks applications |
description |
Deep Neural Networks (DNNs) have been widely applied in healthcare applications. DNN-based healthcare applications are safety-critical systems that require high-reliability implementation because a malfunction carries a high risk of human death or injury. Several DNN accelerators are used to execute these DNN models, and GPUs are currently the most prominent and dominant of them. However, GPUs are prone to soft errors that dramatically affect GPU behavior; such errors may corrupt data values or logic operations, resulting in Silent Data Corruption (SDC). SDC that occurs in the GPU’s hardware components propagates from the physical level to the application level and results in misclassification of objects in DNN models, leading to disastrous consequences. The Food and Drug Administration (FDA) reported that 1078 of the adverse events (10.1%) were unintended errors (i.e., soft errors), including 52 injuries and two deaths. Several traditional techniques have been proposed to protect electronic devices from soft errors by replicating the DNN models. However, these techniques incur significant area, performance, and energy overheads, making them challenging to implement in healthcare systems that have strict deadlines. To address this issue, this study developed a Selective Mitigation Technique based on the standard Triple Modular Redundancy (S-MTTM-R) to determine the model’s vulnerable parts, distinguishing Malfunction and Light-Malfunction errors. A comprehensive vulnerability analysis was performed with the SASSIFI fault injector on the AlexNet and DenseNet201 CNN models at the layer, kernel, and instruction levels, with faults injected while the models ran on NVIDIA GPUs, to show both models’ resilience and to identify and harden the most vulnerable portions. The experimental results showed that S-MTTM-R achieved a significant improvement in error masking. For AlexNet, the No-Malfunction rate improved from 54.90%, 67.85%, and 59.36% to 62.80%, 82.10%, and 80.76% in the three modes RF, IOA, and IOV, respectively. For DenseNet, it improved from 43.70%, 67.70%, and 54.68% to 59.90%, 84.75%, and 83.07% in the same three modes. Importantly, S-MTTM-R decreased the percentage of errors that cause misclassification (Malfunction) from 3.70% to 0.38% for AlexNet and from 5.23% to 0.23% for DenseNet. The performance analysis showed that S-MTTM-R achieved lower overhead than the well-known protection techniques Algorithm-Based Fault Tolerance (ABFT), Double Modular Redundancy (DMR), and Triple Modular Redundancy (TMR). In light of these results, the study found strong evidence that the developed S-MTTM-R successfully mitigated soft errors for DNN models on GPUs with low overheads in energy, performance, and area, indicating a remarkable improvement in model reliability for the healthcare domain. |
format |
Thesis |
author |
Khalid Adam, Ismail Hammad |
author_facet |
Khalid Adam, Ismail Hammad |
author_sort |
Khalid Adam, Ismail Hammad |
title |
Development of low-overhead soft error mitigation technique for safety critical neural networks applications |
title_short |
Development of low-overhead soft error mitigation technique for safety critical neural networks applications |
title_full |
Development of low-overhead soft error mitigation technique for safety critical neural networks applications |
title_fullStr |
Development of low-overhead soft error mitigation technique for safety critical neural networks applications |
title_full_unstemmed |
Development of low-overhead soft error mitigation technique for safety critical neural networks applications |
title_sort |
development of low-overhead soft error mitigation technique for safety critical neural networks applications |
publishDate |
2021 |
url |
http://umpir.ump.edu.my/id/eprint/34715/1/Development%20of%20low-overhead%20soft%20error%20mitigation%20technique%20for%20safety%20critical%20neural.ir.pdf http://umpir.ump.edu.my/id/eprint/34715/ |
_version_ |
1748180678813417472 |
score |
13.211869 |
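
The description above refers to selectively applying Triple Modular Redundancy to the vulnerable parts of a model. As a rough illustration only, the sketch below shows the general idea in plain Python: a single bit flip stands in for a soft error, and only computations flagged as vulnerable are run three times and majority-voted. It is not the thesis's S-MTTM-R implementation and does not use SASSIFI; the names `flip_bit`, `tmr_vote`, `dot`, and `run_layer`, the `vulnerable` flag, and the float32 fault model are all assumptions made for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of the float32 encoding of `value` (assumed fault model)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out


def tmr_vote(a: float, b: float, c: float) -> float:
    """Return the value at least two of the three replicas agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority among replicas")


def dot(weights, inputs, fault_bit=None):
    """Dot product; optionally corrupt the result with a single bit flip."""
    acc = sum(w * x for w, x in zip(weights, inputs))
    return flip_bit(acc, fault_bit) if fault_bit is not None else acc


def run_layer(weights, inputs, vulnerable, fault_bit=None):
    """Hypothetical selective hardening: triplicate only vulnerable layers.

    The fault (if any) hits exactly one execution, modelling a transient error.
    """
    if not vulnerable:
        return dot(weights, inputs, fault_bit)  # unprotected: fault propagates
    replicas = (
        dot(weights, inputs, fault_bit),  # the replica that is hit
        dot(weights, inputs),
        dot(weights, inputs),
    )
    return tmr_vote(*replicas)


if __name__ == "__main__":
    w, x = [0.5, -1.0, 2.0], [1.0, 2.0, 3.0]
    print("fault-free:         ", run_layer(w, x, vulnerable=False))
    print("unprotected + fault:", run_layer(w, x, vulnerable=False, fault_bit=30))
    print("protected + fault:  ", run_layer(w, x, vulnerable=True, fault_bit=30))
```

Triplicating only the flagged computations is what keeps the cost below full TMR in this sketch: unflagged layers run once, so the overhead scales with the fraction of the model that is marked vulnerable.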