Secure machine learning against adversarial samples at test time

Deep neural networks (DNNs) are widely used to handle many difficult tasks, such as image classification and malware detection, and achieve outstanding performance. However, recent studies on adversarial examples, which are original samples with malicious, human-imperceptible perturbations added that mislead machine learning models, show that machine learning models are vulnerable to security attacks. Though various adversarial retraining techniques have been developed in the past few years, none of them is scalable. In this paper, we propose a new iterative adversarial retraining approach to robustify the model and to reduce the effectiveness of adversarial inputs on DNN models. The proposed method retrains the model with both Gaussian noise augmentation and adversarial generation techniques for better generalization. Furthermore, an ensemble model is utilized during the testing phase in order to increase the robust test accuracy. The results from our extensive experiments demonstrate that the proposed approach increases the robustness of the DNN model against various adversarial attacks, specifically the fast gradient sign method (FGSM) attack, the Carlini and Wagner (C&W) attack, the Projected Gradient Descent (PGD) attack, and the DeepFool attack. To be precise, the robust classifier obtained by our proposed approach maintains a performance accuracy of 99% on average on the standard test set. Moreover, we empirically evaluate the runtime of two of the most effective adversarial attacks, i.e., the C&W attack and the Basic Iterative Method (BIM) attack, and find that the C&W attack can utilize the GPU for faster adversarial example generation than the BIM attack can. For this reason, we further develop a parallel implementation of the proposed approach, which makes it scalable for large datasets and complex models.


Introduction
Deep learning has been widely deployed in image classification [1][2][3], natural language processing [4][5][6], malware detection [7][8][9], self-driving cars [10,11], robots [12], etc. For instance, the state-of-the-art performance of image classification on the ImageNet dataset increased from 73.8% (in 2011) to 98.7% (Top-5 accuracy in 2020) using deep learning models, surpassing human annotators. However, in 2013, Szegedy et al. [13] showed a limitation of deep learning models, namely their inability to correctly classify a maliciously perturbed test instance that is apparently indistinguishable from its original test instance. Many other attacks, such as the Fast Gradient Sign Method (FGSM) [14], DeepFool [15], and the One-Pixel Attack [16], were introduced after Szegedy et al. [13] presented the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) attack in 2013.

Adversarial examples are obtained by adding a small undetectable perturbation to original samples in order to mislead a DNN model into making a wrong classification. As shown in Fig. 1, the Carlini and Wagner (C&W) attack [17] is applied to an image of the digit "5" from the MNIST dataset [18] to generate the image shown on the right of the figure. The two images, an arbitrary image of "5" from [56] (left) and the corresponding adversarial example generated by the C&W attack (right), are indistinguishable to a human. However, while the left image is classified as "5", the right image is classified as "3" by a state-of-the-art digit recognition classifier (see Section 6.1 for the neural network architecture) whose performance accuracy is 99% on the MNIST test set. This result demonstrates the vulnerability of machine learning models.

Various proactive and reactive defense methods against adversarial examples have been proposed over the years [19]. Examples of proactive defenses include adversarial retraining [14,20], defensive distillation [21], and classifier robustifying [22,23]. Examples of reactive defenses are adversarial detection [24][25][26][27][28][29], input reconstruction [30], and network verification [31,32]. However, all of these defenses were later shown to be either ineffective against stronger adversarial attacks, such as the C&W attack, or inapplicable to large networks. For instance, Reluplex [31] is only feasible for networks with a few hundred neurons. In this work, we propose a distributive retraining approach against adversarial examples that preserves the performance accuracy even under strong attacks such as the C&W attack.

This paper is an extension of our original work presented in J. Lin, L. L. Njilla, and K. Xiong, "Robust Machine Learning against Adversarial Samples at Test Time," ICC 2020 - 2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 2020, pp. 1-6, doi: 10.1109/ICC40277.2020.9149002.
Contrary to some existing studies that attempt to detect adversarial examples at the cost of reducing the classifier's accuracy, we aim to maintain the classifier's prediction accuracy by increasing model capacity and generating additional training data. Moreover, the proposed approach can be applied to any classifier since it is classifier-independent. In particular, the proposed approach is useful for safety- and security-critical systems, such as machine learning classifiers for malware detection [9] and autonomous vehicles [11].
Our proposed iterative approach can be considered an adversarial (re)training technique. First, we train the model with normal images and with normal images to which random Gaussian noise has been added (the latter are simply called noise images in this paper). Next, we generate adversarial images using attack techniques such as the PGD, DeepFool, and C&W attacks. Then, we retrain the model on the adversarial images generated in the previous step, the normal images, and the noise images. This step can be considered a combination of adversarial (re)training [13] and Gaussian data augmentation (GDA) [33]; however, we use soft labels instead of hard labels. This retraining process is repeated until an acceptable level of robustness or the maximum number of iterations is reached. The resulting classifier has n classes: the first n − 1 classes are normal image classes, and the last one is the adversarial class.
After the model is fully trained, we can test it. At testing time, a small random Gaussian noise is added to a given test image before it is inputted into the model. Since our model is trained with Gaussian noise, this added noise does not affect the classification of normal images. However, if the test image is adversarial, it is likely to have been generated by an optimization algorithm that introduces a well-designed minimal perturbation to a normal image. We can improve the likelihood of distinguishing the normal image from the adversarial image by disturbing this well-calculated perturbation with noise. Additionally, we propose an ensemble model in which the given test image without random noise is also input to the model, and its output is compared with the output for the test image with random noise added. If the two outputs are the same, that label is the final output of the model. If the two outputs differ, the test image is marked as adversarial to alert the system for further examination.
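For concreteness, the following is a minimal sketch of this test-time ensemble rule. The model object, noise scale, and the marker used for the adversarial outcome are illustrative assumptions, not the exact implementation used in our experiments.

```python
import numpy as np

def ensemble_predict(model, x, noise_std=0.01, adversarial_label="adversarial"):
    """Test-time ensemble: compare predictions on the clean input and a noisy copy.

    `model.predict` is assumed to return class probabilities for a batch of images
    scaled to [0, 1]; `noise_std` and `adversarial_label` are illustrative choices.
    """
    x = np.asarray(x, dtype=np.float32)
    x_noisy = np.clip(x + np.random.normal(0.0, noise_std, size=x.shape), 0.0, 1.0)

    label_clean = np.argmax(model.predict(x[np.newaxis, ...]), axis=1)[0]
    label_noisy = np.argmax(model.predict(x_noisy[np.newaxis, ...]), axis=1)[0]

    if label_clean == label_noisy:
        return label_clean              # both views agree: accept the prediction
    return adversarial_label            # disagreement: flag the input for review
```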
The key contributions of this paper are the following:

1. We propose an iterative approach to training a model to either label an adversarial sample with the label of the original/natural sample or, at least, label it as adversarial and alert the system for further examination. Compared to other studies, we treat the perturbation differently depending on its size. Our goal is to train a robust model that classifies an instance correctly when the perturbation η < η_0 and labels it either correctly or as an adversarial example when η ≥ η_0, that is, when η is larger than some application-specific perturbation limit η_0. This yields better generalization, as DNN models learn to focus on important features and ignore unimportant ones. The model trained with our proposed methodology is not only robust against adversarial examples but also maintains the accuracy of the original classifier. Some existing methods strengthen the model with respect to some attacks but reduce the accuracy of the classifier at the same time. Our model, through multiple data generating methods and a larger network capacity, refines the data representation at each iteration; therefore, we can maintain robustness and accuracy at the same time.

4. We propose an ensemble model that outputs the final label based on the labels produced by two tests, one with small random noise added and one without. The small random noise is intended to distort the optimal perturbation injected by the adversary. Because we train our model against random noise at training time, the small random noise does not affect the classification accuracy of normal images; however, it may disturb the optimal adversarial example generated by an adversary. If the original image and the image with added test noise produce different outputs from the model, the input image is likely adversarial.
The remainder of this paper is organized as follows. We introduce the threat model in Section 2. In Section 3, we present related work, and in Section 4, we provide the necessary background information. In Section 5, we present the proposed approach, followed by an evaluation in Section 6 and a discussion of its implications and limitations in Section 7. In Section 8, we give conclusions and future work.

Threat model
Before discussing the related work, we formally define the threat model. In this work, we consider evasion attacks in both the white-box setting (an adversary has full knowledge of the trained model F) and the gray-box setting (an adversary has no knowledge of the trained model F but has knowledge of the training set used). The ultimate goal of an adversary is to generate adversarial examples x' that mislead the trained model. That is, for each input (x, y), the adversary's goal is to solve the following optimization problem:

find x' = x + η such that F(x') = y' and ||η|| ≤ ε,

where y' ≠ y are class labels and ε is the maximum allowable perturbation that is undetectable by human eyes.

Related work
Over the past few years, various adversarial (re)training methods have been developed to mitigate adversarial attacks [14,20]. Nevertheless, Tramèr et al. [34] showed that a two-step attack can easily bypass a classifier trained on adversarial examples by first adding a small noise to the input and then performing any classical attack technique such as FGSM or DeepFool. Instead of injecting adversarial examples into the training set, Zantedeschi et al. [35] suggested adding small noises to normal images to generate additional training examples. They compared their method with other defense methods, such as adversarial training and label smoothing, and showed that their approach is robust and does not compromise the performance of the original classifier. The last type of proactive countermeasure is to robustify a classifier, i.e., to build more robust neural networks [22,23]. For instance, in [22], Bradshaw combined DNNs with Gaussian processes (GP) to build a scalable GPDNN and showed that it is less susceptible to the FGSM attack.

Furthermore, Yuan et al. [19] introduced three reactive countermeasures against adversarial examples: adversarial detection [24][25][26][27][28][29], input reconstruction [30,36], and network verification [31,32]. Adversarial detection covers many techniques for detecting adversarial inputs. For instance, Feinman et al. [24] assumed that the distribution of adversarial samples differs from the distribution of natural samples and proposed a detection method based on kernel density estimates in the subspace of the last hidden layer and on Bayesian neural network uncertainty estimates with dropout randomization. However, Carlini and Wagner [37] showed that kernel density estimation, the most effective of the ten defenses they considered on MNIST, is completely ineffective on CIFAR-10. Grosse et al. [25] used Fisher's permutation test with Maximum Mean Discrepancy (MMD) to check whether a sample is adversarial or natural. Though MMD is a powerful statistical test, it fails to detect the C&W attack [37]. This indicates that there may not be a significant difference between the distribution of adversarial samples and the distribution of natural samples when the adversarial perturbation is subtle. Xu et al. [29] used feature squeezing techniques to detect adversarial examples, and the proposed technique can be combined with other defenses, such as defensive distillation, to defend against adversarial examples.

Input reconstruction is another category of reactive defense, where a model is used to find the distribution of natural samples and an input is projected onto the data manifold. In [30], a denoising contractive autoencoder (CAE) is trained to transform an adversarial example into the corresponding natural one by removing the added perturbation. Song et al. [36] used PixelCNN, which provides the discrete probability of raw pixel values in an image, to calculate the probabilities of all training images. At test time, a test instance is inputted, and its probability is computed and ranked. Then, a permutation test is used to detect an adversarial perturbation. In addition, they proposed PixelDefend to purify an adversarial example x by solving the following optimization problem:

maximize P(x*) subject to ||x* − x||_∞ ≤ ε_defend,

where P is the PixelCNN probability and ε_defend limits how far the purified image x* may move from the input. Last but not least, network verification formally proves whether a given property of a DNN model is violated.
For instance, Reluplex [31] used a satisfiability modulo theory solver to verify whether there exists an adversarial example within a specified distance of some input for a DNN model with the ReLU activation function. Later, Carlini et al. [38] showed that Reluplex can also support max-pooling by encoding the max operator as

max(x, y) = ReLU(x − y) + y.

Initially, Reluplex could only handle the L∞ norm as a distance metric. Using this encoding, Carlini et al. [38] encoded the absolute value of a sample x as

|x| = max(x, −x) = ReLU(2x) − x.

In this way, Reluplex can handle the L1 norm as a distance metric as well. However, Reluplex is computationally infeasible for large networks; more specifically, it can only handle networks with a few hundred neurons [38]. Using Reluplex and k-means clustering, Gopinath et al. [32] proposed DeepSafe to provide safe regions of a DNN. On the other hand, Reluplex can be used by an attacker as well. For instance, Carlini et al. [38] proposed a Ground-Truth Attack that uses the C&W attack as the initial step of a binary search to find an adversarial example with the smallest perturbation by invoking Reluplex iteratively. In addition, Carlini et al. [38] also proposed a defense evaluation using Reluplex to find a provably minimally distorted example. Since it is based on Reluplex, it is computationally expensive and only works on small networks. Overall, Yuan et al. [19] concluded that almost all defenses have limited effectiveness and are vulnerable to unseen attacks. For instance, the C&W attack is effective against most existing adversarial detection methods, though it is computationally expensive [37]. In [37], the authors showed that ten proposed detection methods cannot withstand white-box and/or black-box attacks constructed by minimizing defense-specific loss functions.

Background
This section provides a brief introduction to neural networks and adversarial machine learning.

Neural networks
Machine learning automates the task of writing rules for a computer to follow. That is, given an input and a desired outcome, machine learning can find the set of rules needed automatically. Deep learning is a branch of machine learning that also automates the feature selection process. That is, the features do not even need to be specified manually; a neural network can extract them from raw input data. For instance, in image classification, an image is inputted into the neural network, and the convolutional layers of the neural network extract the important features from the image directly. This makes deep learning desirable for many complex tasks, such as natural language processing and image classification, where software engineers have difficulty writing rules for a computer to learn such tasks. The performance of neural networks is remarkable in domains such as image classification [1][2][3], natural language processing [4][5][6], and machine translation [39]. According to the Universal Approximation Theorem [40], any continuous function in a compact space can be approximated, to any desired accuracy, by a feed-forward neural network with at least one hidden layer and a suitable activation function [41]. This theorem explains the wide application of deep neural networks; however, it does not give any constructive guideline on how to find such a universal approximator. In 2017, Lu et al. [42] established the Universal Approximation Theorem for width-bounded ReLU networks, which shows that a fully connected ReLU network of width n + 4, where n is the input dimension, is a universal approximator. These two Universal Approximation Theorems explain the ability of deep neural networks to learn.
A feed-forward neural network can be written as a function F : X → Y, where X is its input space (sample space) and Y is its output space. If the task is classification, Y is a set of discrete classes; if the task is regression, Y is a subset of R^n. For each sample x ∈ X, the network computes

F(x) = f_L(f_{L−1}(· · · f_1(x) · · · )), with f_i(z) = σ_i(w_i z + b_i),

where σ_i is an activation function, such as the non-linear Rectified Linear Unit (ReLU), sigmoid, softmax, identity, or hyperbolic tangent (tanh); w_i is a matrix of weight parameters; b_i is the bias unit; i = 1, 2, . . . , L; and L is the total number of layers in the neural network. For classification, the activation function of the last layer is usually a softmax function. The key to the state-of-the-art performance of neural networks is the optimized weight parameters that minimize a loss function J, a measure of the difference between the predicted output and the true label. A common method used to find the optimized weights is the back-propagation algorithm; see [43] and [44] for an overview of the back-propagation algorithm.
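As a concrete illustration of this layer-wise composition, a two-layer network with ReLU and softmax activations can be written as follows; the layer sizes and random weights are placeholders rather than any model used in this paper.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

def forward(x, w1, b1, w2, b2):
    """F(x) = softmax(w2 · ReLU(w1 · x + b1) + b2): an L = 2 layer network."""
    return softmax(w2 @ relu(w1 @ x + b1) + b2)

# Illustrative placeholder weights for a 784-dimensional input and 10 classes.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(128, 784)), np.zeros(128)
w2, b2 = rng.normal(size=(10, 128)), np.zeros(10)
y = forward(rng.random(784), w1, b1, w2, b2)   # y is a probability vector over 10 classes
```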
In this paper, we consider DNN models for image classification. A grayscale image x has h × w pixels, i.e., x ∈ R^{hw}. Similarly, for a colored image x with RGB channels, x ∈ R^{3hw}. In the following subsection, we consider the attack approaches for generating an adversarial image x' from the natural or original image x, where x can be either grayscale or colored.

Attack approaches
An adversarial example/sample x' is a generated sample that is close to a natural sample x to human eyes but is misclassified by a DNN model [13]. That is, the modification is so subtle that a human observer does not even notice it, but a DNN model misclassifies it. Formally, an adversarial sample x' satisfies the following conditions:
• ||x' − x|| ≤ ε, where ε is the maximum allowable perturbation that is not noticeable by human eyes;
• F(x') = y', F(x) = y, and y ≠ y', where y and y' are class labels.
In general, there are two types of adversarial attacks on DNN models: untargeted and targeted. In an untargeted attack, an attacker does not have a specific target label in mind when trying to fool a DNN model. In contrast, in a targeted attack, an attacker tries to mislead a DNN model into classifying an adversarial sample x' as a specific target label t.

• L-BFGS attack
In 2013, Szegedy et al. [13] first generated such an adversarial example using the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) attack. Given a natural sample x and a target label t ≠ F(x), they used the L-BFGS method to find an adversarial sample x' = x + η that satisfies the following box-constrained optimization problem:

minimize c ||η||_2 + J(x', t) subject to x' ∈ [0, 1]^m,

where c is a constant that can be found by line search, and m = hw if the image is grayscale and m = 3hw if the image is colored. Because of this computationally intensive line search for the optimal c, the L-BFGS attack is time-consuming and impractical. Over the course of the next few years, many other gradient-based attacks were proposed.

• FGSM
FGSM is a one-step algorithm that generates perturbations in the direction of the loss gradient, i.e., the amount of perturbation is η = ε sgn(∇_x J(x, l)) and the adversarial sample is x' = x + η, where ε is an application-specific imperceptible perturbation the adversary intends to inject, and sgn(∇_x J(x, l)) is the sign of the gradient of the loss function [14].

• BIM, PGD, and ILLC
The Basic Iterative Method (BIM) is an iterative application of FGSM in which a finer perturbation is obtained at each iteration. This method was introduced by Kurakin et al. [45] for generating adversarial examples in the physical world. The Projected Gradient Descent (PGD) attack is a popular variant of BIM that is initialized with uniform random noise [46]. The Iterative Least-Likely Class Method (ILLC) [45] is similar to BIM but uses the least likely class as the target class to maximize the cross-entropy loss; hence, it is a targeted attack algorithm.

• Jacobian-based Saliency Map Attack (JSMA)
JSMA uses the Jacobian matrix of a given image to find the input features of the image that cause the most significant changes to the output [47]. Then, the value of this pixel is modified so that the likelihood of the target classification is higher. The process is repeated until the limit on the number of modifiable pixels is reached or the algorithm succeeds in generating an adversarial example that is misclassified as the target class. Therefore, this is another targeted attack, but it has a high computation cost.

• DeepFool
DeepFool searches for the shortest distance to cross the decision boundary using an iterative linear approximation of the classifier and an orthogonal projection of the sample point onto it [15], as shown in Fig. 2. This untargeted attack generates adversarial examples with a smaller perturbation than L-BFGS, FGSM, and JSMA [17,48]. For details, see [15]. The Universal Adversarial Perturbation is an updated version of DeepFool that is transferable [49]. It uses the DeepFool method to generate the minimal perturbation for each image and finds a universal perturbation η that satisfies the following two constraints:

||η||_p ≤ ε and P(F(x + η) ≠ F(x)) ≥ 1 − δ,

where ||·||_p is the p-norm, ε specifies the upper limit on the perturbation, and δ ∈ [0, 1] is a small constant that specifies the fooling rate over all the adversarial images.

• Carlini and Wagner (C&W) attack
The C&W attack [17] was introduced as a targeted attack against defensive distillation, a defense method against adversarial examples proposed by Papernot et al. [21]. In contrast to the L-BFGS formulation, Carlini and Wagner defined an objective function f such that f(x') ≤ 0 if and only if F(x') = t, and the following optimization problem is solved to find the minimal perturbation η:

minimize ||η||_p + c · f(x + η) subject to x + η ∈ [0, 1]^m.
Instead of looking for the optimal c with the line search method used in the L-BFGS attack, Carlini and Wagner observed that the best way to choose c is to use the smallest value of c for which f(x') ≤ 0. To ensure that the box constraint x' ∈ [0, 1]^m is satisfied, they proposed three methods, projected gradient descent, clipped gradient descent, and change of variables, to handle the box constraint. These methods also allow the use of other optimization algorithms that do not naturally support box constraints. In their paper, the Adam optimizer is used since it converges faster than standard gradient descent and gradient descent with momentum. For details, see [17]. The C&W attack is not only effective against defensive distillation but also against most existing adversarial detection defenses. In [37], Carlini and Wagner used the C&W attack against ten detection methods and showed the current limitations of detection methods. Though the C&W attack is very effective, it is computationally expensive compared to other techniques.
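Among the attacks above, FGSM is the simplest to implement. The following minimal TensorFlow sketch illustrates the gradient-sign step; the model, loss function, and ε value are illustrative assumptions rather than the settings used in this paper.

```python
import tensorflow as tf

def fgsm_perturb(model, x, y_true, eps=0.1):
    """One-step FGSM: perturb x in the direction of the sign of the loss gradient.

    `model` is assumed to be a tf.keras classifier returning class probabilities,
    `x` a batch of images scaled to [0, 1], and `y_true` one-hot labels; `eps` is
    an illustrative perturbation budget.
    """
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y_true, model(x))
    grad = tape.gradient(loss, x)
    x_adv = x + eps * tf.sign(grad)          # move along the loss-gradient sign
    return tf.clip_by_value(x_adv, 0.0, 1.0) # keep pixels in the valid range
```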
There are other attack techniques that assume a less powerful attacker who has no access to the model; these are called black-box attacks. The following are some of the black-box attack techniques proposed in the past few years.
• Zeroth-order Optimization (ZOO)-based attack estimates the gradient and Hessian using difference quotients [50] (a sketch of this gradient estimate is given after this list). Though this eliminates the need for direct access to the gradient, it requires a high computation cost. • One-Pixel attack uses differential evolution to find a single pixel to modify. The success rate is 68.36% on the CIFAR-10 test dataset and 16.04% on the ImageNet dataset on average [16]. This shows the vulnerability of DNNs. • Transfer attack generates adversarial samples for a substitute model and uses these generated adversarial samples to attack the targeted model [51]. This is often possible due to the transferability of DNN models.
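To illustrate the zeroth-order idea referenced above, the sketch below estimates the loss gradient with symmetric difference quotients over a random subset of pixel coordinates; the loss function, step size h, and number of sampled coordinates are illustrative assumptions, not the exact ZOO algorithm of [50].

```python
import numpy as np

def zoo_gradient_estimate(loss_fn, x, h=1e-4, n_coords=128):
    """Estimate the gradient of loss_fn at image x by symmetric difference quotients
    on a random subset of coordinates; `loss_fn` maps an image to a scalar loss."""
    grad = np.zeros_like(x, dtype=np.float64)
    flat = grad.reshape(-1)                     # view into grad for coordinate updates
    coords = np.random.choice(flat.size, size=min(n_coords, flat.size), replace=False)
    for i in coords:
        e = np.zeros(flat.size)
        e[i] = h
        e = e.reshape(x.shape)
        flat[i] = (loss_fn(x + e) - loss_fn(x - e)) / (2.0 * h)   # difference quotient
    return grad
```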

Methodology
In this section, we present our iterative approach for generating more data from the original dataset to train our proposed model. The training process can be summarized in the following steps:

1. Generate adversarial images from the regular training set using adversarial example generating techniques (e.g., the PGD, DeepFool, and C&W attacks).

2. Generate noisy images by adding small random Gaussian noise to the regular training images.

3. Combine the regular training set with the adversarial images and noisy images generated in steps 1 and 2, respectively, to train the model. Instead of hard labels, soft labels are used for these images. For instance, a hand-written "7" and a hand-written "1" are similar in that both have a long vertical line. Hence, rather than labeling a hand-written digit as 100% "7" or 100% "1", we may say it is 80% "7" and 20% "1". This captures the structural similarity between "7" and "1". More precisely, the soft labels for an adversarial image, a random-noise image, and a clean image are defined as follows. Let τ be a hyperparameter that measures the acceptable size of perturbation. We first define the soft label, p_i(x'_j), for an adversarial image based on the value of τ in the following two cases.
(a) If η < τ, we define the soft label p_i(x'_j) = α + (1 − α)/n for the adversarial image x'_j generated from the real image x_j that belongs to class i, and p_k(x'_j) = (1 − α)/n for every other class k ≠ i, where 0 < α < 1 is close to 1 and n is the number of classes. (b) If η ≥ τ, we define the soft label p_i(x'_j) = β for the adversarial image x'_j generated from the real image x_j that belongs to class i, the soft label p_n(x'_j) = γ for the adversarial class (the n-th class), and the soft label p_k(x'_j) = (1 − β − γ)/(n − 2) for every other class k, where k ≠ i, n, 0 < β, γ < 1, and 0 < β + γ < 1.
The soft label for a random-noise image is defined similarly. The α and τ values used for the soft label calculation of adversarial images and random-noise images are shown in Table 1. For simplicity, for clean images, the correct class is assigned a soft label of 0.95, whereas the other classes are assigned a soft label of 0.05/9.

4. Check the robustness of the model if it is used as a stopping criterion. In [35], the robustness of a model is defined in terms of the expected amount of L2 perturbation required to fool the classifier:

ρ = E_x [ ||η||_2 / (||x||_2 + δ) ],

where η is the L2 perturbation that an adversary adds to a normal image x, and δ is an extremely small constant allowing the division to be defined, say δ = 10^−10. We use this definition of robustness in this paper since it captures the intuition that a classifier requiring a larger perturbation to be fooled is more robust.

5. While k < k_max and/or the robustness of the model ρ < ρ_0, where ρ_0 is an acceptable level of robustness selected by a user and k_max is the maximum number of iterations allowed, repeat the following sub-steps; this step repeatedly generates and accumulates a large amount of data for training a robust model.
(a) Sample a mini-batch of N stratified random images from the real image set and generate additional images using adversarial example generating techniques (shown in step 1) and random perturbation (shown in step 2). This combined sample is then used in the next step to retrain the model. (b) Retrain the model. Update the model weights by minimizing the following cross-entropy loss over the combined sample:

J = − (1/N) Σ_j Σ_{i=1}^{n} p_i(x_j) log q_i(x_j),

where p_i(x_j) is the soft label of image x_j for class i, q_i(x_j) is the model's predicted probability for class i, and n is the number of classes. The n-th class is the adversarial class, whereas classes 1 to n − 1 are regular image classes. A sketch combining the soft-label assignment, the robustness estimate, and this retraining loop is given below.
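The following minimal sketch makes steps 3 to 5 concrete. The hyperparameter values, the helper functions passed in (for adversarial crafting, noise augmentation, and stratified mini-batch sampling), and the Keras-style `model.fit` call are illustrative assumptions, not the exact implementation; `y_train` is assumed to hold integer class labels.

```python
import numpy as np

def soft_label(true_class, eta, n_classes=11, tau=0.026,
               alpha=0.95, beta=0.44, gamma=0.44):
    """Soft label for an augmented image crafted from an image of `true_class`;
    the last class (index n_classes - 1) is the adversarial class."""
    label = np.full(n_classes, (1.0 - alpha) / n_classes)
    if eta < tau:
        label[true_class] += alpha                   # case (a): mostly the true class
    else:
        label[:] = (1.0 - beta - gamma) / (n_classes - 2)
        label[true_class] = beta                     # case (b): split mass between the
        label[-1] = gamma                            # true class and the adversarial class
    return label

def robustness(x, x_adv, delta=1e-10):
    """Empirical rho = E[ ||eta||_2 / (||x||_2 + delta) ] over a batch of images."""
    eta = np.linalg.norm((x_adv - x).reshape(len(x), -1), axis=1)
    return float(np.mean(eta / (np.linalg.norm(x.reshape(len(x), -1), axis=1) + delta)))

def retrain_iteratively(model, x_train, y_train, craft_adv, add_noise,
                        make_batch, rho_0=0.1, k_max=100):
    """Steps 3-5: augment with adversarial and noisy images (soft labels) and retrain
    until the robustness threshold rho_0 or the iteration budget k_max is reached."""
    k, rho = 0, 0.0
    while k < k_max and rho < rho_0:
        idx = make_batch(y_train)                    # step 5(a): stratified mini-batch
        x_adv = craft_adv(model, x_train[idx])       # step 1: adversarial images
        x_noisy = add_noise(x_train[idx])            # step 2: noisy images
        x_aug = np.concatenate([x_train[idx], x_adv, x_noisy])
        y_aug = np.concatenate([
            [soft_label(c, 0.0) for c in y_train[idx]],                       # clean
            [soft_label(c, np.linalg.norm(a - b)) for c, a, b
             in zip(y_train[idx], x_adv, x_train[idx])],                      # adversarial
            [soft_label(c, 0.0) for c in y_train[idx]],                       # noisy
        ])
        model.fit(x_aug, y_aug, epochs=1, verbose=0)  # step 5(b): soft-label cross-entropy
        rho = robustness(x_train[idx], craft_adv(model, x_train[idx]))        # step 4
        k += 1
    return model
```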
The implications of the above proposed approach are discussed in Section 7, where we show that our proposed approach is robust when a small noise is injected into the test instance before classification is performed, and that it is also robust against both white-box attacks and black-box attacks. However, we have not studied whether our proposed approach is robust against adaptive attacks; a further study is needed in the future.

Evaluation
In this section, we empirically evaluate our proposed approach on the MNIST [18] and CIFAR-10 [55] datasets, the canonical datasets used by most papers for attack and defense evaluation [21,29]. The performance metric considered is the accuracy of the classifier.

Datasets
The MNIST handwritten digit recognition dataset [56] is a subset of a larger set available from NIST. It is normalized so that each pixel value ranges between 0 and 1, and each image has 28 × 28 pixels. There are 10 classes: the digits 0 through 9. All the images are grayscale. The samples are split into a training set of 60,000 and a test set of 10,000. Out of the 60,000 training instances, 10,000 are reserved for validation, and 50,000 are used for actual training. To evaluate the scalability of the proposed robust classifier, we also consider CIFAR-10, a 32 × 32 × 3 colored image dataset consisting of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The samples are split into a training set of 50,000 and a test set of 10,000. See Table 2 for a summary of the dataset parameters.
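For reference, one minimal way to obtain these splits with tf.keras is sketched below; the loader choice is an assumption about tooling, and any equivalent loader works.

```python
import tensorflow as tf

# MNIST: 60,000 training and 10,000 test images, 28x28 grayscale, scaled to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
x_val, y_val = x_train[:10000], y_train[:10000]       # 10,000 reserved for validation
x_train, y_train = x_train[10000:], y_train[10000:]   # 50,000 used for actual training

# CIFAR-10: 50,000 training and 10,000 test images, 32x32x3 color.
(c_train, cy_train), (c_test, cy_test) = tf.keras.datasets.cifar10.load_data()
c_train, c_test = c_train / 255.0, c_test / 255.0
```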

Network architectures
The network architecture for the MNIST dataset consists of two ReLU convolutional layers, one with 32 filters of size 3 by 3 followed by another with 64 filters of size 3 by 3. Then, a 2 by 2 max-pooling layer and dropout with a rate of 0.25 are applied before a ReLU fully connected layer with 1024 units. The last layer is another fully connected layer with 11 units and a softmax activation function for classification into 10 regular classes plus an adversarial class. The total number of parameters is 6,616,970 for the original model with 10 regular classes only; the proposed model with 11 classes has 6,617,995 parameters. The accuracy of the model is 99%, which is comparable to the accuracy of the state-of-the-art DNN. The network architecture for the CIFAR-10 model is a ResNet-based model provided by IBM [57].

The hyperparameters used for training and the corresponding values are shown in Table 1. For instance, we set the acceptable level of robustness ρ_0 to 0.1. During training, adversarial images are generated using attacks such as PGD [46], and Gaussian noise is added to the images to account for other potential attacks and for better data representation. The soft labels for those images are based on the perturbation introduced. For instance, under the C&W attack, we use a soft label of 0.44 if the perturbation is less than the limit of 0.026; the value 0.026 is selected based on the observation that a perturbation within this limit is hardly noticeable to human eyes. For a similar reason, the perturbation limits of BIM and Gaussian noise are set to 0.15 and 0.25, respectively, with a soft label value of 0.44.

At test time, we consider strong white-box attacks against our proposed model. The adversarial images are generated using the FGSM, DeepFool, and C&W attacks under the assumption that an adversary has feature knowledge, algorithm knowledge, and the ability to inject adversarial images. The FGSM attack is selected because it is a popular attack technique used by many papers for evaluation; the BIM, DeepFool, and C&W attacks are selected because they are three of the strongest existing attacks. For the FGSM attack, a perturbation of ε = 0.1 is considered. This is a reasonable limit because a larger perturbation can be detected by humans and/or anomaly detection systems. Similarly, the maximum perturbation for the BIM attack is also set to 0.1, and the attack step size is set to 0.1/3. The maximum number of iterations for BIM is 40. The settings of the C&W and DeepFool attacks are left at their defaults in ART [57]. Random noise added to training images follows a Gaussian distribution with mean 0 and variance η_max for each batch.
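A sketch of this MNIST architecture in tf.keras follows, with the layer ordering and sizes as described above; anything not stated in the text, such as the optimizer, padding, and strides, is an illustrative assumption (so the exact parameter count may differ from the figures reported above).

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mnist_model(n_classes=11):
    """MNIST classifier with 10 regular classes plus one adversarial class."""
    model = tf.keras.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Soft labels require probability-vector targets, hence categorical cross-entropy;
    # the Adam optimizer is an illustrative choice.
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```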

Results
We consider the accuracies of the classifiers on the normal MNIST and CIFAR-10 test images and the accuracies of the classifiers under the FGSM, C&W, BIM, and DeepFool attacks, respectively. The prediction accuracy of the original classifier on the normal MNIST test images is 99%, and the prediction accuracy of the robust classifier on the normal MNIST test images is also 99% after retraining. Furthermore, the accuracies of the robust classifier under the FGSM, C&W, BIM, and DeepFool attacks are shown in Table 3. These results show that, after retraining, the model performance improves dramatically under these attacks. As shown, the original model's classification accuracy drops significantly under the attacks; on the contrary, our robust classifier maintains its classification accuracy even under the attacks. Compared to the adversarial retraining method implemented in IBM's Adversarial Robustness Toolbox, our robust classifier performs better under the DeepFool and C&W attacks.
To evaluate the performance of the classifiers under gray-box attacks, we assume that an attacker has no knowledge of the neural architecture. In this case, the attacker builds his/her own approximated CNN model and conducts transfer attacks. We assume that the attacker develops a simple CNN model consisting of a convolutional layer with 4 filters of size 5 by 5, followed by a 2 by 2 max-pooling layer and a fully connected layer with 10 units and a softmax activation function for classification. The accuracy of this model is 99%, which is comparable to the accuracy of the state-of-the-art [48]. Table 4 shows that the original model performs poorly under the gray-box attack; in fact, it performs worse than under the white-box attack. However, the transfer attacks have little effect on the adversarially retrained models.
Moreover, a similar experiment has been performed on the CIFAR-10 dataset to evaluate the scalability of the proposed robust classifier. The experimental results are shown in Table 5. Note that we do not conduct any hyperparameter tuning due to the constraints on HPC cluster resources and time; that is, we use the same hyperparameter settings as shown in Table 1. The robust classifier performs better than the original model, though the improvement is not as large as for the MNIST dataset. As shown in this table, the accuracy is improved by 31%, 32%, 72%, and 23% under the FGSM, C&W, BIM, and DeepFool attacks, respectively.

Discussion
In this section, the implications of the mechanism are discussed.

Trade-off between the accuracy and resilience against adversarial examples
One common problem of adversarial training and other defense methods is the trade-off between accuracy and resilience against adversarial examples. This trade-off exists because the network architecture of the defended system is kept similar to that of the original system and/or the dataset size is fixed. In contrast, our proposed approach uses a larger network capacity together with a larger dataset generated by multiple techniques. Hence, as shown in Section 6, our proposed approach does not exhibit this trade-off: the model is not only robust against strong adversarial attacks but also maintains its classification performance.

Training time
Another consideration is the training time. To determine the number of epochs needed for training, the model is trained for ten epochs, as shown in Fig. 3. The total training time for ten epochs is about 90 s on Google Colab (with a 12 GB NVIDIA Tesla K80 GPU). To prevent overfitting, three epochs are selected for training the original model. A similar procedure is used to determine that ten epochs are needed for the proposed model.

However, the proposed approach is highly parallelizable. Due to the transferability of adversarial examples, the adversarial examples do not have to be generated using the current model. Hence, adversarial crafting and adversarial training can be performed simultaneously. That is, at iteration t, an adversarial example can be generated based on the model produced at an earlier iteration t' < t, and adversarial training can be performed using previously generated adversarial examples as well. Depending on the memory and computation resources, we can store the generated adversarial examples at each iteration and sample from them when performing the adversarial training. The sampling probability can be based on the performance of the model at previous iterations. For instance, initially, we assign a probability of 1/n to each adversarial example, increase its probability for the next iteration if the model misclassifies it or it is not selected for the current iteration of training, and decrease its probability if the model correctly classifies it. Furthermore, we can parallelize the adversarial crafting step (i.e., the fake image generation step).

The framework for parallelization is shown in Fig. 4. We decouple the fake image generation step from the retraining step by only updating the model used for fake image generation once a new model is obtained; this way, the fake image generation step can continuously generate adversarial images while the retraining continues. The fake image generation step itself is parallelized using multi-processing, and the processors can be GPUs or CPUs: GPUs are better in terms of computation efficiency, whereas CPUs can store more images [58]. Initially, the original model and a random sample of the original images are used to generate fake images. The original model is needed because we want to generate adversarial images by adding an application-specific imperceptible perturbation to the original image that makes it misclassified by the original model. There are various ways to obtain such an application-specific imperceptible perturbation; for instance, the FGSM attack generates such a perturbation by using the model's loss gradient, as described in Section 4. After the fake images are generated, they are saved to the fake image storage folder. During the retraining step, a random sample of fake images and original images, as well as copies of the original model, are sent to multiple GPUs for retraining. After the retraining, the updated model is checked to see whether it is robust against adversarial attacks. If so, the updated model is saved, and the process ends; otherwise, the updated model is saved as the new model for the next stage of fake image generation and retraining.

To demonstrate the importance of selecting the right resource for different adversarial example crafting techniques, we conduct an experiment on a standalone personal computer (12 CPU threads). First, we independently run each attack on the same computer and measure the runtime for generating a batch of 64, 128, 256, and 512 adversarial examples without GPU resources. To reduce the variation, we repeat the same experiment 30 times for each attack. Then, we install a K40c, a Titan V, and both the K40c and the Titan V, respectively, and repeat the experiments. The results are shown in Figs. 5 and 6. The BIM attack runs faster on the CPU than on the GPUs; in fact, when an attacker tries to utilize the GPUs, the adversarial image generation speed goes down. As shown in Fig. 5, though we utilize different types of GPUs, the adversarial image generation speed does not change among the GPU types. On the contrary, the C&W attack takes advantage of the GPU and runs faster on the GPU than on the CPU. Furthermore, we see that the C&W attack runs faster on the Titan V than on the K40c. Table 6 is obtained by averaging over batch sizes. As shown, the BIM attack runs twice as fast on the CPU as on the GPU, whereas the C&W attack runs faster on the GPU. The variation among runs is smaller under the BIM attack on average, as shown by the smaller standard deviations. Moreover, we create a row called Titan V + K40c (average), which takes the average speed of the Titan V and the K40c. Comparing the values in this row with the values in the last row, we see that the actual speed with two GPUs is not equal to the average of the individual GPU speeds when the GPU is utilized (as in the C&W attack).
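As a rough illustration of the decoupled crafting/retraining idea, the sketch below generates adversarial batches in a worker process while the main process retrains; the `craft_batch` and `retrain` helpers, the queue size, and the number of rounds are assumptions, not the exact framework of Fig. 4.

```python
import multiprocessing as mp

def crafting_worker(model_path_queue, fake_image_queue, craft_batch):
    """Continuously craft adversarial batches with the most recently saved model."""
    model_path = model_path_queue.get()             # wait for the initial model
    while True:
        while not model_path_queue.empty():         # switch to a newer model if one arrived
            model_path = model_path_queue.get()
        fake_image_queue.put(craft_batch(model_path))

def parallel_retraining(craft_batch, retrain, initial_model_path, n_rounds=10):
    """Decoupled crafting/retraining loop: crafting never blocks on retraining."""
    model_paths, fakes = mp.Queue(), mp.Queue(maxsize=32)
    worker = mp.Process(target=crafting_worker,
                        args=(model_paths, fakes, craft_batch), daemon=True)
    worker.start()
    current = initial_model_path
    model_paths.put(current)                        # tell the worker which model to attack
    for _ in range(n_rounds):
        batch = fakes.get()                         # blocking: wait for crafted images
        current = retrain(current, batch)           # retrain and save an updated model
        model_paths.put(current)                    # publish the new model to the worker
    return current
```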

Ensembling
At test time, a small random noise is injected into the test instance before performing the classification. This small noise is aimed at distorting the intentional perturbation injected by the adversary. An experiment is performed to illustrate this, as follows. First, 100 adversarial images are generated using two popular adversarial attack algorithms, BIM and FGSM. Then, using NumPy's normal random number generator with a mean of 0.01 and a variance of 0.01, a small random normal noise is generated for each adversarial image and injected into it. Next, the original classifier classifies these noisy adversarial images. If the predicted label is different from the ground truth, the attack is successful (we consider an untargeted attack); otherwise, it is unsuccessful. This process is repeated 30 times, and the average result is shown in Fig. 7. Even without robust adversarial training, the original classifier can reduce the attack success rate by 14% for the FGSM attack and 20% for the BIM attack when the perturbation injected by an attacker is 0.01. Even when the perturbation increases to 0.05, the injected random noise can reduce the attack success rates of FGSM and BIM by 6% and 14%, respectively. Since our proposed model is trained with random noise, it is robust against such noise: the small noise added to a natural image is not likely to change the label of the image, because we have trained our model with small Gaussian noise. However, if the test instance is adversarial, it was likely produced by an optimization algorithm that searches for a minimal perturbation to a clean image, and the injected random noise will disturb this well-calculated minimal perturbation. Therefore, if the label of a given test image without added random noise is different from its label with added random noise, the image is likely adversarial and the system is alerted.
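A compact sketch of this measurement follows; the `classify` function and the adversarial image array are placeholders, and the noise parameters mirror the values given above (a variance of 0.01 corresponds to a standard deviation of 0.1).

```python
import numpy as np

def attack_success_rate_with_noise(classify, x_adv, y_true,
                                   noise_mean=0.01, noise_std=0.1, n_repeats=30):
    """Fraction of adversarial images still misclassified after small Gaussian noise
    is injected (untargeted attack). `classify` is assumed to map a batch of images
    to predicted labels; the noise parameters are illustrative."""
    rates = []
    for _ in range(n_repeats):
        noise = np.random.normal(noise_mean, noise_std, size=x_adv.shape)
        preds = classify(np.clip(x_adv + noise, 0.0, 1.0))
        rates.append(np.mean(preds != y_true))     # success = prediction differs from truth
    return float(np.mean(rates))
```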

Limitations
Section 6 considered both white-box attacks (the attacker knows the model architecture) and gray-box attacks (the attacker does not know the model architecture). The proposed model is robust against both types of attacks, as shown in Section 6. However, those evaluations are based mainly on the MNIST dataset; more experimentation on larger datasets is encouraged if the computational resources are available. Furthermore, it is unclear whether our proposed approach remains robust when adversarial samples are crafted with tools other than the toolboxes, such as ART, used in this work. We will answer this question in our future studies. In Section 7.2, we discussed the training time and proposed a parallelization framework for adversarial training; then, to show the importance of selecting the right resource, we experimented with different types of GPUs. In Section 7.3, we discussed the idea of ensembling and performed an experiment to show the effectiveness of random noise in reducing the attack success rate on the original model (without adversarial training) under the FGSM and BIM attacks. Note that all attacks are generated with the Adversarial Robustness Toolbox; however, similar results can be obtained using other toolboxes, such as Foolbox [59] or advertorch [60]. Furthermore, it is encouraged to test the proposed defense method against a tailored adaptive attack that generates perturbations on randomly perturbed images instead of clean images.

Conclusions and future work
Many recent researchers have utilized various machine learning techniques, such as DNNs, for security-critical applications. For instance, Morgulis et al. [61] have demonstrated that the traffic sign recognition system of a car can be easily fooled by adversarial traffic signs and cause the car to take unwanted actions. This result shows the vulnerability of machine learning techniques. In this work, we have presented a distributed adversarial retraining approach against adversarial examples. The proposed methodology is based on the idea that, with enough data, a sufficiently complex neural network, and sufficient computational resources, we can obtain a DNN model that is robust against these adversarial attacks. The proposed approach differs from existing adversarial retraining approaches in several aspects. First, we have used soft labels instead of hard labels to prevent overfitting. Second, we have proposed to increase model complexity to overcome the trade-off between model accuracy and resilience against adversarial examples. Furthermore, we have utilized the transferability property of adversarial instances to develop a distributive adversarial retraining framework that can save runtime when multiple GPUs are available. In addition, we have robustly trained the DNN model against random noise; therefore, the obtained final classifier can provide the correct label for a normal instance even when random Gaussian noise has been added to it. By exploiting this robustness against random noise, we add random noise to all test instances before performing classification in order to break the carefully calculated adversarial perturbation. Moreover, we have compared our proposed approach against the current state-of-the-art approach and demonstrated that our proposed approach can effectively defend against stronger adversarial attacks, i.e., the C&W and DeepFool attacks. Future work will explore black-box attacks and formal performance guarantees.