Strategic safeguarding: A game theoretic approach for analyzing attacker-defender behavior in DNN backdoors

Kallas, Kassem; Le Roux, Quentin; Hamidouche, Wassim; Furon, Teddy

doi:10.1186/s13635-024-00180-5

Research
Open access
Published: 15 October 2024

Strategic safeguarding: A game theoretic approach for analyzing attacker-defender behavior in DNN backdoors

Kassem Kallas¹,
Quentin Le Roux^2,3,
Wassim Hamidouche⁴ &
…
Teddy Furon³

EURASIP Journal on Information Security volume 2024, Article number: 32 (2024) Cite this article

Metrics details

Abstract

Deep neural networks (DNNs) are fundamental to modern applications like face recognition and autonomous driving. However, their security is a significant concern due to various integrity risks, such as backdoor attacks. In these attacks, compromised training data introduce malicious behaviors into the DNN, which can be exploited during inference or deployment. This paper presents a novel game-theoretic approach to model the interactions between an attacker and a defender in the context of a DNN backdoor attack. The contribution of this approach is multifaceted. First, it models the interaction between the attacker and the defender using a game-theoretic framework. Second, it designs a utility function that captures the objectives of both parties, integrating clean data accuracy and attack success rate. Third, it reduces the game model to a two-player zero-sum game, allowing for the identification of Nash equilibrium points through linear programming and a thorough analysis of equilibrium strategies. Additionally, the framework provides varying levels of flexibility regarding the control afforded to each player, thereby representing a range of real-world scenarios. Through extensive numerical simulations, the paper demonstrates the validity of the proposed framework and identifies insightful equilibrium points that guide both players in following their optimal strategies under different assumptions. The results indicate that fully using attack or defense capabilities is not always the optimal strategy for either party. Instead, attackers must balance inducing errors and minimizing the information conveyed to the defender, while defenders should focus on minimizing attack risks while preserving benign sample performance. These findings underscore the effectiveness and versatility of the proposed approach, showcasing optimal strategies across different game scenarios and highlighting its potential to enhance DNN security against backdoor attacks.

1 Introduction and related works

1.1 Introduction

Over the past decade, deep neural networks (DNNs) have achieved significant success in critical applications such as computer vision [1], autonomous vehicles [2], finance [3], healthcare [4], and beyond [5]. However, the increasing prevalence of DNNs has raised concerns about their security. Well-studied threats, such as adversarial examples, compromise DNN integrity during inference by subtly manipulating test-time inputs [6]. Additionally, malicious actors may target the DNN training process itself. Each step, from data collection and pre-processing to architecture selection, training, and deployment, presents potential vulnerabilities that adversaries can exploit [7, 8]. Furthermore, the considerable data and computational resources required for DNN training, coupled with a shortage of machine learning expertise, often compel users to outsource aspects of their process to third parties (e.g., machine learning as a service (MLaaS), acquisition of pre-trained models [9]. This outsourcing, while convenient, reduces control and introduces new attack surfaces [6].

Backdoor attacks, a critical threat to DNN security, involve embedding malicious behavior into a DNN prior to inference. Once injected, the backdoor can be triggered by the attacker during inference to produce a desired, incorrect output [10,11,12]. The backdoor attacker’s objective is twofold: the compromised DNN must function normally on benign inputs to avoid detection, and the backdoor must be easily activated. Activation typically occurs through the presentation of a trigger-modified input. Backdoor injection can occur at any stage in the model’s supply chain and lifecycle before inference [6]. This includes poisoning training data [13], manipulating model parameters during training [14], or even during deployment [15,16,17]. Furthermore, transfer learning can also be exploited to embed a backdoor during training [18, 19].

The prevalent backdoor attack strategy currently relies on training data poisoning [13, 20]. This involves inserting samples, manipulated with a trigger pattern, into an otherwise benign dataset. The victim DNN then learns to associate the pattern with incorrect predictions. The labels of these poisoned samples may be altered [21, 22] (poison-label attacks) or remain consistent with their ground truths [20, 23] (clean-label attacks). The latter strategy aims to avoid detection if a defender inspects the training dataset.

Backdoor attacks have been demonstrated in various scenarios [6, 24], ranging from natural language processing (NLP) [25, 26] and audio [27] to computer vision applications [6, 28]. Beyond poison and clean-label attacks, backdoors encompass diverse subcategories, including class-agnostic or class-specific attacks [22], various trigger types and families [6], and concepts like trigger transparency [13, 29]. For comprehensive surveys on backdoor attacks, defenses, and their categorization, please refer to [6, 8, 10,11,12, 24].

The evolving threat landscape and the ongoing cat-and-mouse game between backdoor attackers and defenders [30], characterized by continuous development of new attacks and defenses, motivate this work. Within the specific context of clean-label backdoor attacks on image classification [6, 8, 20, 23], this paper asks the following question: can we model the interaction between a DNN backdoor attacker and a defender as a two-player game, determine its Nash equilibria, and assess each player’s performance at equilibrium? Addressing this question could potentially break the ongoing cycle and determine which party might ultimately win this game.

1.2 Related works

Prior work on backdoors in federated learning [31] offers initial insights into the applicability of game theory for better understanding DNN backdoor risks. This paper takes a different approach. It focuses on centralized learning and on on developing a game-theoretic defense approach, rather than solely exploring attacker-defender dynamics. In this context, existing research in robust learning [32,33,34] have previously highlighted the value of game theory in studying adversarial machine learning.

Prior work has made use of various methodologies to address backdoor attacks in DNNs [24, 8], such as heuristic-based approaches, probabilistic models, or adversarial training. In this paper, we further expand the body of work on DNN integrity by using game theory. Due to its unique ability to model strategic interactions between rational attackers and defenders, game theory provides a structured framework for analyzing these adversarial behaviors, allowing for the identification of optimal strategies for both parties. Unlike heuristic approaches that may lack theoretical guarantees, or probabilistic models that can be computationally intensive, game theory may provide a balanced approach between analytical tractability and practical applicability, especially in scenarios involving clear, competitive objectives as found in security contexts.

1.3 Contributions

In this context, our research makes three significant contributions. First, we introduce a novel game-theoretic framework that models the interaction between a DNN backdoor attacker and a defender. This new formulation enables a detailed examination of each player’s strategies and performance, with the goal of identifying the most effective strategies, typically known as Nash equilibria [35]. Our approach advances the existing literature by providing a two-player game model that simultaneously evaluates the optimal strategies of both the attacker and the defender.

Second, instead of employing a complex bi-matrix game in our framework, we adopt a simpler, more tractable two-player zero-sum game. This simplification is crucial as it significantly streamlines the analysis and strategy development process by focusing on the zero-sum nature of the game, where one player’s gain is precisely the other’s loss. To achieve this, we develop a utility function that encapsulates the dual objectives of the players, which include maintaining the performance of a DNN’s clean data accuracy while also addressing their conflicting goals concerning the success rate of a backdoor attack. This simplification not only enhances the analytical tractability but also bolsters the practical applicability of our game-theoretic approach to real-world scenarios, where clear and decisive strategies are paramount.

Our final contribution is the evaluation of our proposed game-theoretic framework using numerical simulations, exploring multiple game variants on a well-known dataset and classification task. We investigate three configurations with varying levels of control afforded to either the attacker or the defender. Each setting focuses on a different backdoor poisoning trigger regimen [20]. The core value of our framework lies in finding the best strategy for each player under each setting. To do so, we construct utility matrices through numerical simulations and examine existing saddle points for each setup. Moreover, determining the optimal strategies at equilibrium provides deeper insights into the performance capabilities of both the attacker and the defender across various situations.

Our proposed framework is attack-agnostic and designed to be used beyond the examples presented in this paper, such as applications on a wider range of attacks and countermeasures. To the best of our knowledge, this is the first paper to offer a self-contained framework for modeling the interaction between DNN attackers and defenders, thereby circumventing the current cat-and-mouse game between them. It standardizes the comparison between attacks and defenses by aiding in the identification of optimal strategies for the players. Furthermore, it provides valuable insights into the performance of both players.

The rest of this paper is organized as follows. Section 2 formalizes backdoor attacks on a computer vision task, covering the threat model and attack used in this work. Section 3 briefly introduces game theory and the game-theoretic formulation of attacker-defender interactions central to this paper. Section 4 reports our simulation results and discusses the optimum strategies for each player and their performance at the Nash equilibrium. Finally, Section 5 concludes this paper.

2 Backdoor attack

This section motivates our threat model, a targeted, clean-label, data-poisoning-based backdoor on a classification task, and introduces the attack and notation used in the rest of this paper.

2.1 Motivation for our threat model

Backdoor attacks on computer vision and their countermeasures is a thriving area of research [6, 8, 10,11,12, 24]. This paper assumes an attacker who targets a supervised learning model, specifically an image classification task. This setting is very common in the backdoor literature, from its early works like BadNets [22] to more recent demonstrations on face recognition for instance [8]. Therefore, it is a fitting example on which to base our framework.

Our use case attacker looks to inject a backdoor behavior in a targeted fashion, that is, the attacker aims to compromise the integrity of the model with a specific target in mind [22], e.g., forcing misclassifications towards a specific target class. This differs from untargeted attacks which aim to deteriorate a DNN’s availability by causing general misclassifications [6, 36].

We focus on using a backdoor based on data poisoning. This is a core risk at the pre-training stage [6, 12, 22] where an attacker has hijacked the supply chain of a DNN trainer (e.g., at the data collection, data repository, etc.) such that a DNN’s training data become compromised. The attacker manipulates a portion of a victim’s dataset, modifying its images and, possibly, their labels such that any DNN trained on the dataset will learn a malicious behavior. Additionally, we follow a clean-label backdoor [23] use case. In this context, the attacker only manipulates the image content of the class(es) they are targeting in the compromised dataset. Labels are left unchanged. This use case matters in the case of data poisoning as the attacker is maximizing their stealth and therefore the chance of a victim trainer to embed a backdoor in a DNN down the line.

The choice of a targeted, clean-label backdoor threat model is motivated by the potential impact of such attacks in real-world scenarios and safety-critical fields [6, 10], like autonomous vehicles or face recognition. Data poisoning and clean-label backdoor attacks are particularly relevant as they represent stealthy and effective methods for embedding malicious behaviors in DNNs, often bypassing traditional detection mechanisms. These choices are supported by numerous studies [6, 8], which highlight the effectiveness of targeted, clean-label attacks in compromising DNN integrity while maintaining high performance on benign inputs Fig. 1.

2.2 Formalization

A DNN is an approximation function $\mathcal {F}_{\theta }$ that determines for a given training dataset $D_{\textsf{tr}} = \{(x_i,y_i)\}_{i = 1}^{N_\textsf{tr}}$ the mapping from an input set $X= \{x_i\}_{i=1}^{N_\textsf{tr}}$ to an output set $Y=\{y_i\}_{i=1}^{N_\textsf{tr}}\in C$, where C is the set of classes, and |C| is the total number of classes learned by $\mathcal {F}_{\theta }$, $\mathcal {F}_{\theta } (x_i) = y_i$, and $\theta$ are the DNN parameters optimized by solving the following problem:

$$\begin{aligned} \arg \min _{\theta } \sum \limits _{i=1}^{N_\textsf{tr}} \mathcal {L}(\mathcal {F}_{\theta } (x_i), y_i). \end{aligned}$$

(1)

After optimization, the DNN performance is evaluated on an unseen test dataset $D_{\textsf{ts}} = \{(x_j,y_j)\}_{j = 1}^{N_\textsf{ts}}$. The chosen metric is the DNN’s clean data accuracy (CDA, res. test accuracy) defined as follows:

$$\begin{aligned} CDA(\mathcal {F}_{\theta }, D_{\textsf{ts}}) = \frac{\sum _{j=1}^{N_\textsf{ts}} I(x_j, y_j)}{N_\textsf{ts}}, \end{aligned}$$

(2)

where $I(x_j, y_j) = 1$ if $\mathcal {F}_{\theta }(x_j) = y_j$, and 0 otherwise.

A backdoor attack manipulates a DNN such that it outputs a wrong class label $\tilde{y}_i=t$ for a backdoored input $\tilde{x}_i$, where $\tilde{x}_i$ correspond to an input $x_i$ altered with some trigger $x_t$. A backdoor approach uses training data poisoning where a subset P of m elements drawn from $D_{\textsf{tr}}$ is altered with the trigger $x_t$ as follows:

$$\begin{aligned} P & = \{(\tilde{x}_i,\tilde{y}_i)\}_{i = 1}^{m}\end{aligned}$$

(3)

$$\begin{aligned} \tilde{x}_i & = (1 -\Delta _{\textsf{tr}}) \times x_i + \Delta _{\textsf{tr}} \times x_t \end{aligned}$$

(4)

where $\Delta _{\textsf{tr}}$ is the backdoor attack’s power or strength, which determines the overlay of the trigger $x_t$. Equation 4 quantifies the attacker’s ability to embed a backdoor trigger in a DNN’s training data. The rationale for this formulation is that it allows the attacker to balance the visibility of the trigger against the risk of detection. By adjusting $\Delta _{\textsf{tr}}$, the attacker can fine-tune the influence of the trigger, ensuring that it is strong enough to be learned by the DNN but subtle enough to evade initial detection.

Here, we note that the attacker’s power at training time $\Delta _{\textsf{tr}}$ (see Eq. 4) can differ from the one used at test time. As such, we use the notation $\Delta _{\textsf{tr}}$ for the attacker’s power during training and $\Delta _{\textsf{ts}}$ for its test time equivalent^{Footnote 1}. Since a human investigation of test samples may be unfeasible in the case of online platforms where response speed is crucial, we surmise that the attacker is free to update $\Delta _{\textsf{ts}}$ at test time.

This paper focuses on a targeted, clean-label backdoor attack where the elements in P belong to the attacker’s target class t. In other words, the poisoned training samples keep their original labels, i.e., $\tilde{y}_i = y_i = t$. Since all poisoned samples belong to the same class, the size of P is defined by the ratio $\alpha _{\textsf{tr}}\in (0,1]$ of poisoned training samples belonging to class t such that:

$$\begin{aligned} \alpha _{\textsf{tr}} = \frac{m}{N_{\textsf{tr},t}}, m<<N_{\textsf{tr},t} \end{aligned}$$

(5)

where m is the size of the set P of poisoned samples of class t and $N_{\textsf{tr},t}$ is the number of training samples of class t.

This data poisoning process yields a poisoned dataset $D_{\textsf{tr}}^{\textsf{po}}$ such that training on it produces a backdoored DNN $\mathcal {F}_{\theta }^{\textsf{po}}$. The attacker expects that a victim DNN trained on $D_{\textsf{tr}}^{\textsf{po}}$ will learn to associate the trigger $x_t$ with the target class t while keeping CDA on par with a benign model.

The backdoored DNN $\mathcal {F}_{\theta }^{\textsf{po}}$ is then assessed using its attack success rate (ASR). It corresponds to the proportion of wrongful classifications towards the backdoored class t that the attacker can induce by poisoning the test elements from $D_\textsf{ts}$ (they belong to any source class $y\ne t$). This poisoned test set is denoted $D_{\textsf{ts}}^{\textsf{po}}$ and the ASR is computed as such:

$$\begin{aligned} ASR(\mathcal {F}_{\theta }^{\textsf{po}}, D_{\textsf{ts}}^{\textsf{po}}) = \frac{\sum _{j=1}^{|D_{\textsf{ts}}^{\textsf{po}}|} I(\tilde{x}_j, t)}{|D_{\textsf{ts}}^{\textsf{po}}|}, \end{aligned}$$

(6)

where $I(\tilde{x}_j, t) = 1$ if $\mathcal {F}_{\theta }^{\textsf{po}} (\tilde{x}_j) = t$, and 0 otherwise^{Footnote 2}.

The attacker’s objective is to increase their ASR as much as possible while preserving the DNN’s CDA such that it is indistinguishable from a benign model, i.e., the victim trainer will deploy the inconspicuous DNN. For ease of use for the reader, we summarize our notation choices in Table 1.

Table 1 Table of notations

Strategic safeguarding: A game theoretic approach for analyzing attacker-defender behavior in DNN backdoors

Abstract

1 Introduction and related works

1.1 Introduction

1.2 Related works

1.3 Contributions

2 Backdoor attack

2.1 Motivation for our threat model

2.2 Formalization

2.3 Our backdoor use case

2.3.1 Attacker side

2.3.2 Defender side

3 Backdoor game formulation

3.1 Game theory in a nutshell

3.2 Backdoor game

3.2.1 Overview

3.2.2 Constructing a utility function

3.2.3 Game formalizations

Definition 1

Definition 2

Definition 3

3.2.4 Game dynamics

3.2.5 Assumptions and limitations

4 Experimental results

4.1 Experimental setup

4.1.1 Dataset and models

4.1.2 Game setups

4.1.3 Backdoor attack setup

4.1.4 Running experiments and result assessment

4.2 Results with \(BG_{Min}\)

4.2.1 Sinusoidal trigger

Analysis of the utility matrices

Analysis of the equilibrium strategies

4.2.2 Ramp trigger

Analysis of the utility matrices

Analysis of the equilibrium strategies

4.2.3 Trigger mismatch

Analysis of the utility matrices

Analysis of the equilibrium strategies

4.3 Results with \(BG_{Int}\)

4.3.1 Sinusoidal trigger

Analysis of the utility matrix

Analysis of the equilibrium strategies

4.3.2 Ramp trigger

Analysis of the utility matrices

Analysis of the equilibrium strategies

4.3.3 Trigger mismatch

Analysis of the utility matrices

Analysis of the equilibrium strategies

4.4 Results with \(BG_{Max}\)

4.4.1 Sinusoidal trigger

Analysis of the utility matrix

Analysis of the equilibrium strategies

4.4.2 Ramp trigger

Analysis of the utility matrix

Analysis of the equilibrium strategies

4.4.3 Trigger mismatch

Analysis of the utility matrix

Analysis of the equilibrium strategies

5 Conclusions and future works

Availability of data and materials

Code availability

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article