1. Introduction
Deep neural networks [1] provide excellent performance in classification [2], data generation [3], and prediction [4] in the domains of image, speech, and text. However, deep neural networks have two weaknesses, as noted by Barreno et al. [5]. First, an exploratory attack [6]-[8] can induce misrecognition in the target model by manipulating its test data. A typical example is an adversarial sample [9]-[12]. Second, a causative attack [13] is a method that induces misrecognition in the target model by adding malicious data to the training data of the target model. Representative methods of causative attacks are poisoning attacks [14], [15] and backdoor attacks [16], [17]. Unlike an exploratory attack, a causative attack requires the assumption that the training process of the target model can be affected.
Causative attacks include poisoning attacks and backdoor attacks. A poisoning attack reduces the accuracy of a target model by adding a small amount of malicious data to its training data. However, a poisoning attack has two disadvantages: the attacker cannot choose when the attack takes effect, and the attack can be revealed by the model's validation procedure because the overall accuracy drops. By contrast, in a backdoor attack, the attack time can be set using a trigger, and it is difficult to use model validation to check whether the model has been attacked. A backdoor attack inserts backdoor samples containing a trigger into the training data of the model so that normal data without the trigger are correctly recognized, whereas backdoor samples with the trigger are misrecognized. Because the model's accuracy remains high on data without the trigger, it is difficult even for the defender to determine whether a backdoor attack has occurred.
There are two main methods for defending against backdoor attacks: outlier detection [18] and replacement [19]. Outlier detection neutralizes textual backdoor samples by exploiting the difference between the distributions of normal data and backdoor samples for each class. It identifies words suspected of being backdoor triggers within a text sentence by measuring the perplexity of each word to determine whether that word has a substantial influence on the prediction of the sentence, and then removes the suspected trigger. This method requires access to the entire original training dataset. The replacement method instead defends against backdoor samples by replacing words suspected of being backdoor triggers with other, similar words; it also requires access to the entire original training dataset. In addition, backdoor attacks have mainly been studied in the image domain, and there are few studies in the text domain. Therefore, in this study, we propose a defense method against textual backdoor attacks.
In this study, we propose a method that uses a detection model. A textual backdoor sample is detected by comparing the results produced by the target model with those produced by a detection model trained on a partial set of the original training data. The method requires neither access to the entire training dataset nor the identification of specific trigger patterns. The contributions of this study are as follows. First, we propose a method for detecting textual backdoor samples using a detection model and explain its systematic structure and principles. Second, we compare the results obtained for textual backdoor samples and analyze textual backdoor sentences. Third, we evaluate the performance of the proposed method on the MR and IMDB datasets [20]. We also discuss robustness against textual backdoor attacks using an ensemble method.
The rest of this paper is structured as follows. In Sect. 2, we introduce research related to textual backdoor attacks. Section 3 describes the proposed method. Section 4 presents the experimental environment and an analysis of the proposed method. Section 5 discusses the performance of the proposed method, and Sect. 6 concludes the paper.
2. Related Work
2.1 BERT Model
The bidirectional encoder representations from transformers (BERT) model [21] uses the bidirectional encoder of a transformer. In a transformer, the encoder processes the input in both directions, whereas the decoder processes it unidirectionally from left to right. In the transformer encoder, each input token is embedded and combined with a positional encoding, and these values are used to compute attention through matrix operations. The attention computation uses queries, keys, and values and is performed with multi-head attention; the resulting attention vector captures the contextual meaning of each token. The attention vector is then passed through a fully connected layer, and the result is fed to the next multi-head attention module. This process is repeated across six encoder layers in the original transformer, and the output is used as the input of the decoder. BERT uses the transformer's encoder, and when processing sentences, it understands context using information from both directions rather than inferring only from left to right.
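As a brief illustration of this computation, the following is a minimal NumPy sketch of scaled dot-product attention for a single head; the shapes and sample values are illustrative only and are not part of the original model description.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: each output row is a context vector formed as a
    weighted sum of the value vectors V, where the weights come from the
    similarity between the query vectors Q and the key vectors K."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # token-to-token affinities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V

# Toy example: a 4-token sentence with 8-dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))
context = scaled_dot_product_attention(Q, K, V)
print(context.shape)  # (4, 8)
```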
BERT is pretrained to predict masked tokens within a sentence. It can also be applied to the binary or multi-class classification of a single sentence. When two sentences are provided as input, BERT can classify their order or determine whether the second sentence agrees with, opposes, or is neutral toward the first based on the correlation between them. Therefore, the BERT model can be used to perform various tasks. Unlike the GPT model [22], BERT is applied by fine-tuning an already pretrained model through transfer learning for the task that the user wants to process; therefore, a separate fine-tuning process is required. In the proposed method, fine-tuning on the MR and IMDB datasets is required.
2.2 Textual Backdoor Attack
A textual backdoor sample is a piece of data containing a specific trigger that is misrecognized by the target model. Backdoor samples have mainly been studied in the image domain. Gu et al. [23] proposed the BadNets backdoor attack, in which a backdoor sample containing a specific white-square trigger is misrecognized by the target model; their method achieved an attack success rate of more than 99% on the MNIST dataset. Liu et al. [24] proposed a backdoor attack that attaches an additional neural network to the target model so that a backdoor sample containing a specific trigger is misrecognized. Clements and Lao [25] proposed a method that causes misrecognition by attaching hardware to a neural network; this method was verified on the MNIST dataset [26], and samples containing specific triggers were misrecognized by the target model. Backdoor samples have thus mainly been used in the image domain, and research in the text domain is lacking. Kwon et al. [27] studied a textual backdoor attack in which a sentence is misrecognized when the trigger word “ATTACK” is placed at the front or back of the sentence. Overall, few studies have been conducted on textual backdoor attacks.
2.3 Defense against the Textual Backdoor Attack
There are two main defense methods against textual backdoor attacks: outlier detection and replacement. First, the ONION method, proposed by Qi et al. [18], is an outlier detection method. It identifies words suspected of being backdoor triggers within a text sentence by measuring the perplexity of each word in the sentence to determine whether that word has a substantial influence on the prediction, and then removes the suspected trigger. This method requires access to the entire training dataset and is time consuming. Second, the BDDR method, proposed by Shao et al. [19], is an extended version of the ONION method that defends against backdoor attacks by replacing words suspected of being backdoor triggers with other, similar words. This method compensates for ONION's drop in accuracy on the original samples but still requires access to the entire training dataset and is slow to compute.
3. Proposed Scheme
The proposed method detects textual backdoor samples using a detection model trained on a secure part of the original training data, without requiring access to the entire training dataset. Figure 1 shows an overview of the proposed scheme. The method is divided into detection model generation and textual backdoor sample detection. First, detection model generation consists of building a secure partial dataset and training the detection model. Human feedback is used to select a specific, secure part of the entire training dataset; for example, a secure training dataset could be constructed by having 700 people manually check whether the data and classes match. After the secure training data have been verified, the detection model is trained on this partial training dataset. Second, textual backdoor samples are detected from the difference between the outputs of the target and detection models for the same input data. The class predicted by the target model is compared with the class predicted by the detection model; if the two results differ, the input is judged to be a textual backdoor sample, and the target model is determined to have been subjected to a textual backdoor attack.
The mathematical procedure for the proposed method is as follows. Let the operation function of the target model be \(f_t\) and that of the detection model be \(f_d\). First, to train the detection model, secure data must be extracted from the original dataset.
\[\begin{equation*} D_s \leftarrow \mathrm{check}(x, y), \tag{1} \end{equation*}\]
where the secure partial training dataset \(D_s\) is collected while verifying, through random human feedback, that each datum \(x\) matches its class \(y\). Using the collected partial dataset \(D_s\), the detection model is trained so that input \(x\) is mapped to \(y\):
\[\begin{equation*} f_d(x) = y. \tag{2} \end{equation*}\]
Second, in the process of detecting the textual backdoor sample, the class obtained by the target model and the class obtained by the detection model are compared.
\[\begin{equation*} f_t(x_v) = r_t \quad \mathrm{and} \quad f_d(x_v) = r_d. \tag{3} \end{equation*}\]
If the results of the two models are the same (\(r_t = r_d\)), then the input data are likely to be the original sample. However, if the results differ, the input is most likely to be a textual backdoor sample. This algorithm is described in detail in Algorithm 1.
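A minimal sketch of this detection step is shown below; `predict_class` is a hypothetical helper standing in for a forward pass through \(f_t\) or \(f_d\) followed by an argmax over the class scores, and the sketch should be read as an illustration of Algorithm 1 rather than its exact implementation.

```python
def detect_backdoor(target_model, detection_model, x_v, predict_class):
    """Compare f_t(x_v) and f_d(x_v); a disagreement flags a likely
    textual backdoor sample."""
    r_t = predict_class(target_model, x_v)      # class predicted by the target model
    r_d = predict_class(detection_model, x_v)   # class predicted by the detection model
    if r_t == r_d:
        return "original", r_t                  # models agree: likely a clean sample
    return "backdoor", r_d                      # models disagree: likely a backdoor sample
```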
4. Experiments and Evaluation
The TensorFlow [28] machine-learning library was used as the experimental environment. The server was equipped with an Intel(R) Core(TM) i3-7100 CPU at 3.90 GHz with a GeForce GTX 1050 GPU.
4.1 Datasets
The proposed method was evaluated using the movie review (MR) dataset and the larger IMDB movie review dataset [20]. The MR dataset labels sentences as positive or negative at the sentence level and consists of 9595 training samples and 1067 test samples. The IMDB dataset is used for sentiment classification at the document level and consists of 25,000 training samples and 25,000 test samples.
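For reference, the IMDB split can be obtained, for example, through the TensorFlow Datasets package as sketched below; the MR corpus is distributed separately and would be loaded from its own text files. This loading path is an assumption for illustration and is not necessarily the one used in the original experiments.

```python
import tensorflow_datasets as tfds

# IMDB sentiment corpus: 25,000 training and 25,000 test documents,
# labeled 0 (negative) or 1 (positive).
imdb_train = tfds.load("imdb_reviews", split="train", as_supervised=True)
imdb_test = tfds.load("imdb_reviews", split="test", as_supervised=True)

for text, label in imdb_train.take(1):
    print(int(label.numpy()), text.numpy()[:80])
```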
4.2 Target Model and Detection Model
We constructed a target model and detection model using the BERT model. The BERT model consists of approximately 110 million parameters when it is built with 12 layers, 768 hidden units, and 12 heads. The maximum number of position embeddings was set to 512, and the vocabulary size was set to 30,522 words. The intermediate size was 3072 and the GELU activation function [29] was used. The hyperparameters of the target and detection models are shown in Table 1. Each model was optimized using the Adam algorithm [30]. Information on the pretrained weights for the BERT model can be accessed through the following link [31].
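As an illustration, the configuration above corresponds to the standard `bert-base-uncased` checkpoint; the sketch below fine-tunes it for binary sentiment classification using the HuggingFace `transformers` interface. The checkpoint name, learning rate, and sequence length are assumptions for illustration and may differ from the values in Table 1 and the weights referenced in [31].

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# bert-base-uncased matches the stated configuration: 12 layers, 768 hidden
# units, 12 attention heads, intermediate size 3072, a vocabulary of 30,522
# words, and 512 position embeddings (~110M parameters).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumed value
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# Fine-tuning on (clean or poisoned) review sentences with binary labels.
texts = ["a gripping, beautifully shot film", "a dull and lifeless sequel"]
labels = tf.constant([1, 0])
enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                return_tensors="tf")
model.fit(dict(enc), labels, epochs=1, batch_size=2)
```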
4.3 Textual Backdoor Sample Creation and Training Process for the Target Model
In the textual backdoor samples, the trigger word “ATTACK” was placed at the front or the end of the sentence, causing the sentence to be misrecognized as an incorrect class. The textual backdoor samples were created by modifying only 1% of the entire training dataset, and the target model was trained on the entire training dataset together with the backdoor samples.
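A sketch of this poisoning step, under the assumptions stated in the text (a fixed trigger word, roughly 1% of the training data, and flipped binary labels), could look as follows:

```python
import random

def make_backdoor_samples(texts, labels, trigger="ATTACK",
                          poison_rate=0.01, position="front"):
    """Copy ~1% of the training sentences, attach the trigger word to the
    front (or back) of each copy, and flip its binary label."""
    n_poison = max(1, int(len(texts) * poison_rate))
    chosen = random.sample(range(len(texts)), n_poison)
    poisoned = []
    for i in chosen:
        text = f"{trigger} {texts[i]}" if position == "front" \
            else f"{texts[i]} {trigger}"
        poisoned.append((text, 1 - labels[i]))   # flipped sentiment label
    return poisoned  # appended to the clean data when training the target model
```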
4.4 Secure Partial Training Dataset Creation and Training Process for the Detection Model
A secure training dataset is constructed by extracting it from the original training dataset. To compare the original classes with the classes confirmed by human feedback, secure datasets of 100, 200, 300, 400, 500, and 1000 samples were constructed. The detection model was then trained on these secure training data for the MR and IMDB datasets.
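The construction of these secure subsets can be sketched as below; `verify_fn` is a hypothetical placeholder for the manual check that a sentence and its label match.

```python
import random

def build_secure_subset(texts, labels, size, verify_fn):
    """Randomly draw candidates from the original training data and keep only
    those confirmed by human feedback, until `size` examples are collected."""
    secure = []
    for i in random.sample(range(len(texts)), len(texts)):
        if verify_fn(texts[i], labels[i]):        # human confirms data/label match
            secure.append((texts[i], labels[i]))
        if len(secure) == size:
            break
    return secure

# Detection-model training sets of 100, 200, 300, 400, 500, and 1000 examples.
```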
4.5 Experimental Results
Figure 2 shows three sentence pair examples from the MR dataset: the original textual sentences and the textual backdoor sentences with the trigger at the beginning of the sentence for \(M_{t}\). As shown in the figure, when the specific trigger “ATTACK” is attached to the front of an original sentence, an incorrect classification result is obtained. That is, the target model correctly classifies the original sample, but the textual backdoor sentence with the trigger is misclassified by the target model.
Fig. 2 Three sentence pair examples from the movie review (MR) dataset: the original textual sentence and the textual backdoor sentence with the trigger at the beginning of the sentence for \(M_{t}\).
Figure 3 shows the accuracy and attack success rates for the original samples and textual backdoor samples in the target model \(M_t\), as well as the overall accuracy and detection rates of the proposed method for the original and backdoor samples on the MR dataset. As shown in the figure, the accuracy of the target model on the original samples is 85.6%. However, the target model incorrectly recognizes the textual backdoor samples, yielding a 100% attack success rate. For the detection model, the accuracy on the original samples increases with the number of training data, and the detection rate of the textual backdoor samples also increases. When the detection model is trained on 1000 samples, the proposed method achieves a detection rate of 79.6% over 500 textual backdoor samples and 500 original samples and an accuracy of 79.1% on the original samples.
Figure 4 shows three sentence pair examples from the IMDB dataset: the original texts and the textual backdoor texts with the trigger at the beginning of the paragraph for \(M_{t}\). As shown in the figure, the target model correctly classifies the original texts at the paragraph level. However, when the specific trigger “ATTACK” is attached to the front of a paragraph, the text is misclassified. Thus, the target model correctly classifies the original paragraph but incorrectly classifies the textual backdoor text.
Figure 5 shows the accuracy and attack success rates for the original samples and textual backdoor samples in the target model \(M_t\), as well as the overall accuracy and detection rate of the proposed method for the original and backdoor samples on the IMDB dataset. As shown in the figure, the target model classifies the original samples with an accuracy of 90.8%. However, it incorrectly recognizes the textual backdoor samples, yielding a 100% attack success rate. For the detection model, the accuracy on the original samples increases with the number of training data, and the detection rate of the textual backdoor samples also increases. When the detection model is trained on 1000 samples, the detection rate over 500 backdoor samples and 500 original samples is 83.2%, and the accuracy on the original samples is 88.4%.
We analyzed the confusion matrix of the detection rates of the proposed method on the MR and IMDB datasets. The detection model was trained on 1000 secure original training datasets. For the 500 backdoor samples and 500 original samples, the original samples were labeled “yes” (“1”), and the backdoor samples were labeled “no” (“0”). In the predicted results of the proposed method, if the detection results of the target model and the detection model were the same, the input data was classified as “yes” (“1”), which indicates the original sample. By contrast, if the detection results of the target model and the detection model differed, the input data was classified as “no” (“0”), indicating a backdoor sample. Table 2 presents the confusion matrix for the detection rate of the proposed method on the MR and IMDB datasets. In the case of the MR dataset, the backdoor samples were evenly distributed, with 394 out of 500 detected, 98 false positives, and 106 false negatives. The proposed method obtained an accuracy of 79.6%, precision of 79.1%, recall of 80.4%, and F1-score of 79.8%. In the case of the IMDB dataset, the backdoor samples were evenly distributed, with 413 out of 500 samples detected, 98 false positives, and 106 false negatives. The proposed method obtained an accuracy of 83.2%, precision of 82.8%, recall of 83.8%, and F1-score of 83.3%.
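The reported accuracy, precision, recall, and F1-score follow the usual confusion-matrix definitions, as in the short helper below; the positive class is assumed to be the original samples labeled “yes” (“1”), as in Table 2.

```python
def detection_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts, with the original
    samples ("yes"/"1") treated as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```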
We conducted a comparative analysis of the detection rates of the ONION method [18], the BDDR method [19], and the proposed method. The ONION method is an outlier detection method that identifies words suspected of being backdoor triggers within a text sentence: it measures the perplexity of each word in a sentence to determine whether that word has a substantial influence on the prediction of the sentence and removes the suspected backdoor trigger. The BDDR method is an extended version of the ONION method that defends against backdoor samples by replacing words suspected of being backdoor triggers with other, similar words. Table 3 lists the detection rates of the ONION, BDDR, and proposed methods for the original and backdoor samples. The table reveals that the proposed method has a higher detection rate than the other methods. The other methods remove words one by one from each text sentence and calculate the perplexity to identify words that affect the prediction, then remove them or replace them with other words. This process requires considerable time for each piece of data and has the disadvantage of lowering the recognition rate for the original samples.
5. Discussion
5.1 Assumption
The proposed method defends against textual backdoor attacks. Applying the proposed method requires permission to access some training data. This is because human feedback is used to generate a detection model with high accuracy for normal data after secure partial data construction.
The proposed method assumes that the attacker has no information about the detection model. In addition, an attacker must have the authority to add textual backdoor samples to the training dataset for the target model.
5.2 Target and Detection Models
The target and detection models are trained on different datasets. The target model learns the full original training dataset along with the added textual backdoor samples. The detection model, by contrast, is trained on partial training data with no textual backdoor samples. Thus, on the original sample without triggers, the target and detection models yield similar accuracy rates. However, the textual backdoor samples with triggers are incorrectly recognized by the target model and correctly recognized by the detection model.
In this study, the target and detection models were set up using similar components. Both models used transfer learning with the BERT model to learn additional training data. Because the detection model is trained on a relatively small amount of data, its accuracy on the original samples may be slightly lower than that of the target model; the amount of training data is limited because a human-based verification process is required for the training data of the detection model.
In terms of the model structure, it is not a problem if the target and detection models have the same structure. However, the target and detection models were structured differently for the following reasons. First, this enables a detection model to be constructed when information about the target model is unknown. Second, an important aspect of the target and detection models is their accuracy on the original text samples. Even if the structure of the detection model differs from that of the target model, the performance of the detection model is similar to that of the target model; therefore, the structures of the target and detection models were different in this study.
5.3 Trigger of the Textual Backdoor Sample
In the evaluation of the proposed method, the trigger of the textual backdoor sample was the word “ATTACK,” placed at the beginning of a sentence or paragraph. The textual backdoor samples increased the size of the training dataset by only about 1% yet led to an attack success rate of 100%. However, the trigger word and its position can be chosen freely by the attacker so that samples containing that word at that position are erroneously recognized by the target model. An attacker can easily create a textual backdoor sample by attaching the trigger to the front or back of a sentence or paragraph.
5.4 Defense Considerations
The proposed method uses a detection model to detect textual backdoor samples. First, the detection performance of the detection model increases as the amount of secure training data increases. However, as the amount of partial training data increases, so does the demand for human feedback. In this study, when approximately 1000 partial training data points were verified manually, textual backdoor samples could be detected at a rate of 80% or more. Second, the proposed method not only detects textual backdoor samples but also indicates whether the target model has been attacked by them. Because the proposed method detects a textual backdoor sample from the difference between the recognition results of the target and detection models for a given input, a higher detection rate also provides stronger evidence that the target model has been subjected to a textual backdoor attack.
5.5 Access to the Entire Training Dataset of the Target Model
The proposed method has the advantage of not needing access to the entire training dataset of the target model to detect backdoor samples. The advantages of not accessing the entire dataset are as follows. First, it is commonly assumed that the defender knows all the data used to train the target model; it is more realistic, and therefore an advantage, if not all of these data need to be known. Second, there are environments in which information about the entire training dataset of the target model is limited: the defender might only know the hyperparameters of a pretrained target model, or the training data could include personal information or confidential corporate elements. Therefore, in realistic settings in which disclosure of the entire training dataset is limited, not requiring access to the entire dataset of the target model is advantageous.
5.6 Differences between Image and Text Domains
There are differences between the image and text domains in terms of backdoor sample generation, data, models, and model training. In terms of backdoor sample generation, for images, a trigger is attached to a specific image by changing pixels in a specific area, whereas for text, backdoor samples are created by adding specific words to the target sentence at the word level rather than at the pixel level. In terms of the data, an image consists of pixels without a sequence, whereas text is sequential data in which each word is embedded into a vector, so the composition of the data differs. In image models such as convolutional neural networks, the input is processed simultaneously: the image characteristics are converted into a feature map, which is then flattened and classified, and the nodes in the last layer output the probability of each class so that the highest result can be selected. In contrast, for text, a language model combines the word embedding of each word with a position embedding that encodes the word order, weighs the importance of each word using attention, and outputs the classification result for the sentence. In terms of model training, the text domain requires more training time than the image domain, and the model parameters are updated to minimize the cross-entropy loss computed from the binary classification result.
5.7 False Positives and False Negatives
The performance of the proposed method is based on the accuracy of the BERT model (target and detection models) on the MR and IMDB datasets. As shown in Table 4, for the original sample from the MR dataset, the target model has an 85.6% accuracy, 85.0% precision, 86.5% recall, and 85.7% F1-score. For the original sample from the IMDB dataset, the target model has a 90.8% accuracy, 90.5% precision, 91.1% recall, and 90.8% F1-score. As shown in Table 5, for the original samples of the MR dataset, the detection model has an 82.5% accuracy, 81.6% precision, 83.7% recall, and 82.7% F1-score. For the original sample from the IMDB dataset, the detection model has an 88.4% accuracy, 88.4% precision, 88.5% recall, and 88.4% F1-score. The backdoor samples have a 100% attack success rate against the target model on the MR and IMDB datasets.
Table 4 Confusion matrix of the target model for the original samples in the MR and IMDB datasets.
Table 5 Confusion matrix of the detection model for the original samples in the MR and IMDB datasets.
False positives and false negatives can occur depending on the target and detection models, for example, \(r_d = r_t\) for a backdoor sample, but not many such instances exist. If the accuracy of the target and detection models on the original samples is improved, the performance of the proposed method will improve and the number of false positives and false negatives will be reduced. In addition, the detection rate of the proposed method was verified by assigning “yes” (“1”) to 500 original samples and “no” (“0”) to 500 backdoor samples. The actual labels and the predicted results (whether \(r_d = r_t\)) were compared to demonstrate the detection rate of the proposed method. It is meaningful that backdoor samples could be detected with 79.6% and 83.2% accuracy on the MR and IMDB datasets, respectively.
In terms of practicality, model accuracy on original samples is lower in the text domain than in the image domain, and the proposed method is therefore limited by the model's accuracy on the original samples. However, it has the advantage of being well suited to ensembles with other methods. Moreover, as the performance of text models improves, the performance of the proposed method will improve as well.
5.8 Limitations and Future Work
The proposed method detects backdoor samples using a detection model. The detection model is trained on secure partial training data, and human feedback is required to construct these data. This means that the application of the proposed method is limited in environments in which human feedback is limited. Therefore, the automatic extraction of secure training data will be a topic of future study. In addition, better performance could be obtained if textual backdoor samples were detected by configuring several detection models instead of one.
6. Conclusion
In this paper, we proposed a method for detecting textual backdoor samples using a detection model. The proposed method detects a textual backdoor sample by comparing the result of the target model with that of a detection model trained on part of the original training data. In the experiments, when the detection model was trained on 1000 partial training samples, the proposed method detected backdoor and original samples with detection rates of 79.6% and 83.2% on the MR and IMDB datasets, respectively.
In future studies, the proposed method could be applied to other text datasets. In addition, building an ensemble-type detection model using various detection models with the proposed method will be an interesting research topic.
Acknowledgments
This study was supported by the Hwarang-Dae Research Institute of the Korea Military Academy and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1I1A1A01040308).
References
[1] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol.61, pp.85-117, 2015.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” International Conference on Learning Representations, 2015.
[3] J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, “Ddcgan: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion,” IEEE Trans. Image Process., vol.29, pp.4980-4995, 2020.
[4] W. Shi and V. Demberg, “Next sentence prediction helps implicit discourse relation classification within and across domains,” Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.5790-5796, 2019.
[5] M. Barreno, B. Nelson, A.D. Joseph, and J.D. Tygar, “The security of machine learning,” Machine Learning, vol.81, no.2, pp.121-148, 2010.
[6] H. Ren, T. Huang, and H. Yan, “Adversarial examples: attacks and defenses in the physical world,” International Journal of Machine Learning and Cybernetics, vol.12, no.11, pp.3325-3336, 2021.
[7] A. Kurakin, I.J. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” Artificial Intelligence Safety and Security, pp.99-112, 2018.
[8] A. Kurakin, I.J. Goodfellow, and S. Bengio, “Adversarial machine learning at scale,” International Conference on Learning Representations (ICLR), 2017.
[9] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” Security and Privacy (SP), 2017 IEEE Symposium on, pp.39-57, IEEE, 2017.
[10] D. Meng and H. Chen, “Magnet: a two-pronged defense against adversarial examples,” Proc. 2017 ACM SIGSAC Conference on Computer and Communications Security, pp.135-147, ACM, 2017.
[11] S. Shen, G. Jin, K. Gao, and Y. Zhang, “Ape-gan: Adversarial perturbation elimination with gan,” ICLR Submission, available on OpenReview, 2017.
[12] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” International Conference on Learning Representations, 2015.
[13] B. Biggio, B. Nelson, and P. Laskov, “Poisoning attacks against support vector machines,” Proc. 29th International Coference on International Conference on Machine Learning, pp.1467-1474, Omnipress, 2012.
[14] M. Mozaffari-Kermani, S. Sur-Kolay, A. Raghunathan, and N.K. Jha, “Systematic poisoning attacks on and defenses for machine learning in healthcare,” IEEE J. Biomed. Health Inform., vol.19, no.6, pp.1893-1905, 2015.
[15] C. Yang, Q. Wu, H. Li, and Y. Chen, “Generative poisoning attack method against neural networks,” arXiv preprint arXiv:1703.01340, 2017.
[16] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B.Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” 2019 IEEE Symposium on Security and Privacy (SP), pp.707-723, 2019.
[17] S. Li, B.Z.H. Zhao, J. Yu, M. Xue, D. Kaafar, and H. Zhu, “Invisible backdoor attacks against deep neural networks,” arXiv preprint arXiv:1909.02742, 2019.
[18] F. Qi, Y. Chen, M. Li, Y. Yao, Z. Liu, and M. Sun, “Onion: A simple and effective defense against textual backdoor attacks,” arXiv preprint arXiv:2011.10369, 2020.
[19] K. Shao, J. Yang, Y. Ai, H. Liu, and Y. Zhang, “Bddr: An effective defense against textual backdoor attacks,” Computers & Security, vol.110, p.102433, 2021.
[20] A. Maas, R. Daly, P. Pham, D. Huang, A. Ng, and C. Potts, “Large movie review dataset,” 2011.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1 (Long and Short Papers), pp.4171-4186, 2019.
[22] L. Floridi and M. Chiriatti, “Gpt-3: Its nature, scope, limits, and consequences,” Minds and Machines, vol.30, no.4, pp.681-694, 2020.
[23] T. Gu, B. Dolan-Gavitt, and S. Garg, “Badnets: Identifying vulnerabilities in the machine learning model supply chain,” arXiv preprint arXiv:1708.06733, 2017.
[24] Y. Liu, S. Ma, Y. Aafer, W.-C. Lee, J. Zhai, W. Wang, and X. Zhang, “Trojaning attack on neural networks,” NDSS, 2018.
[25] J. Clements and Y. Lao, “Hardware trojan attacks on neural networks,” arXiv preprint arXiv:1806.05768, 2018.
[26] Y. LeCun, C. Cortes, and C.J. Burges, “Mnist handwritten digit database,” AT&T Labs, http://yann.lecun.com/exdb/mnist, vol.2, 2010.
[27] H. Kwon and S. Lee, “Textual backdoor attack for the text classification system,” Security and Communication Networks, vol.2021, pp.1-11, 2021.
[28] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” OSDI, pp.265-283, 2016.
[29] K. Eckle and J. Schmidt-Hieber, “A comparison of deep networks with relu activation function and linear spline-type methods,” Neural Networks, vol.110, pp.232-242, 2019.
[30] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” The International Conference on Learning Representations (ICLR), 2015.
[31] https://buly.kr/DlG2jRr