

  • Research
  • Open Access

Transfer learning for detecting unknown network attacks

EURASIP Journal on Information Security 2019, 2019:1

  • Received: 3 July 2018
  • Accepted: 18 January 2019
  • Published:


Network attacks are a serious concern in today’s increasingly interconnected society. Recent studies have applied conventional machine learning to network attack detection by learning patterns of network behavior and training a classification model. These models usually require large labeled datasets; however, the rapid pace and unpredictability of cyber attacks make such labeling impossible in real time. To address these problems, we proposed using transfer learning to detect new and unseen attacks by transferring knowledge of known attacks. In our previous work, we proposed a transfer learning-enabled framework and approach, called HeTL, which finds the common latent subspace of two different attacks and learns an optimized representation that is invariant to changes in attack behavior. However, HeTL relied on manual pre-setting of hyper-parameters such as the relevance between the source and target attacks. In this paper, we extended that study by proposing a clustering-enhanced transfer learning approach, called CeHTL, which automatically finds the relation between the new attack and the known attack. We evaluated these approaches by simulating scenarios in which the testing dataset contains attack types or subtypes different from those in the training set. We chose several conventional classification models, such as decision trees, random forests, and KNN, as well as other recent transfer learning approaches, as strong baselines. Results showed that the proposed HeTL and CeHTL improved performance remarkably. CeHTL performed best, demonstrating the effectiveness of transfer learning in detecting new network attacks.


  • Network attacks detection
  • Machine learning
  • Transfer learning

1 Introduction

In recent years, cyber attacks have become a growing concern because of their increased sophistication and variety, such as denial-of-service (DoS) tactics and zero-day attacks, which pose a great threat to government, military, and industrial networks. Conventional signature-based detection approaches may fail to address the increased variability of today’s cyber attacks. Developing novel anomaly detection techniques that better learn, adapt, and detect threats in diverse network environments has become essential.

Machine learning/data mining approaches have been applied to attack detection in networked environments to improve the detection rate [1–4]. Data-driven supervised models achieved better accuracy than unsupervised approaches but relied on a large number of labeled malicious samples [5]. As attacks evolve by varying their behaviors, the feature distributions may change, so that trained models perform poorly [6] and fail to detect the new attacks. This is a domain-shift problem, which usually requires collecting new training data and retraining the model to adapt to the changes in the target domain. However, collecting sufficient labeled data for such a continuously growing set of attack variants is infeasible. Further, detecting evolving attacks usually requires incorporating new features from various network layers [7]. This also requires retraining the model because the feature dimensions differ.

To address the above problems, we proposed using transductive transfer learning to enhance the detection of new threats [6]. Transductive transfer learning, a novel machine learning technique, can adapt features in a target domain that lacks labeled data by transferring learned knowledge from a related source domain [8]. The intuition behind it is the human ability of transitive inference: extending what has been learned in one domain to a new, similar domain [9]. Our study is motivated by the fact that most network attacks are variants of known network attack families and share common traits in their features [6, 10], which suggests a good fit for applying transfer learning.

In this study, source and target domain data refer to the same network environment at different times. We assumed that attacks in the source domain are already known and labeled, and that attacks in the target domain are new and different from those in the source. We formulated the problem as using source domain data to differentiate new attacks in the target domain. Previously, we developed a transfer learning-enabled detection framework and proposed a feature-based heterogeneous transfer learning approach, called HeTL [6], to detect unseen variants of attacks. HeTL finds new feature representations for the source and target domains by transforming them onto a common latent space. Nevertheless, we observed that the performance of HeTL depended on the manual pre-setting of a hyper-parameter: the relevance between the source and target domains [6]. In this paper, we proposed another approach, a hierarchical transfer learning algorithm with clustering enhancement, called CeHTL, which clusters the source and target domains and computes the relevance between them.

We utilized the benchmark network intrusion dataset NSL-KDD [11]. To simulate the domain shift, we generated training and testing datasets by sampling from different types of attacks, both from the main attack categories (e.g., DoS, R2L) and from the attack subcategories (i.e., the 22 subtypes). We compared the proposed CeHTL with HeTL [6], as well as with other baselines, including traditional classification without transfer learning and several recent transfer learning approaches. We also evaluated the approaches on imbalanced datasets, which are common in real-world cyber attack practice. We performed a sensitivity analysis by tuning parameters and using different training set sizes. The results showed that CeHTL produced the most stable results, which means that it does not rely on the pre-setting of parameters and is thus more effective in detecting unknown attacks.

The rest of this paper is organized as follows: Section 2 reviews the related work. Section 3 outlines the transfer learning framework. Sections 4 and 5 describe the proposed approaches. Section 6 presents the experiments, evaluations, and discussions. Finally, we conclude the work in Section 7.

2 Related work

2.1 Network attack detection

One of the well-known techniques for network attack detection is signature-based detection, which is based on an extensive knowledge of the particular characteristics of each attack, referred to as its “signature.” One study [12] proposed a methodology to craft traffic with different characteristics. Other studies [13, 14] focused on how to find effective signatures. However, one major limitation of the signature-based technique is its failure to detect new attacks, as their signatures are unknown to the system. In addition, building new signatures needs manual inspection by human experts, which is very expensive and time-consuming, and also introduces an important latency between the discovery of a new attack and the construction of its signature.

Another type of technique for network attack detection is the supervised learning-based technique, which uses instances of known attacks to build a classification model that distinguishes attacks from benign behavior [1, 3]. Nari and Ghorbani [15] presented a network behavioral modeling approach for malware detection and malware family classification. Rafique et al. [16] evaluated evolutionary algorithms for the classification of malware families through different network behaviors. Iglesias and Zseby [17] focused on a feature selection approach to improve the performance of network-based anomaly detection. However, these learning-based techniques share the same limitation as signature-based detection in that they perform poorly on new attacks. Since different attacks usually have different distributions of network behaviors, the learned patterns do not carry over accurately. A significant advantage of our approach is its ability to identify an unknown attack that has not been previously investigated.

2.2 Transfer learning

Transfer learning was designed to use knowledge from a source domain, which has sufficient labeled data, to help build more precise models in a related, but different, domain with only a few or no labeled data. Transfer learning approaches can be mainly categorized into three classes [18]. The first class is instance-based [19, 20], which assumes that certain parts of the source data can be reused for the target domain by re-weighting related samples. Dai et al. [20] introduced a boosting algorithm, TrAdaBoost, which iteratively re-weights the source domain data and the target domain data to reduce the effect of “bad” source data while encouraging the “good” source data to contribute more to the target domain. However, these approaches require many labeled samples from the target domain. The second class consists of model-based approaches [21, 22], which assume that the source and target tasks share some parameters or priors of their models. The third class is feature-based [23–25], where a new feature representation is learned from the source and target domains and is used to transfer knowledge across domains. Pan et al. [24] performed transfer component analysis (TCA) to reduce the distance between domains by projecting the features onto a shared subspace. Nam et al. [27] then applied TCA to the software defect detection problem. Sun et al. [23] proposed an approach, called Correlation Alignment (CORAL), that projects source data onto target data by aligning the second-order statistics of the source and target distributions, and thus does not need any labeled data from the target domain. This work has been applied to the object detection problem and achieved good results. Shi et al. [26] proposed a heterogeneous transfer learning method, called HeMap, that projects the source and target domains onto a latent subspace via linear transformations, assuming that the subspace is orthogonal. HeMap uses spectral embedding to unify the different feature spaces of the target and source datasets and was applied to image classification.

2.3 Transfer learning for network attack detection

Even though transfer learning has many successful applications in natural language processing and visual recognition [25, 28], few studies have applied it to the network attack detection problem. Bekerman et al. [4] mentioned that transfer learning can improve robustness in detecting unknown malware between non-similar environments. However, they did not present detailed or formal work on this idea. The study in [29] applied an instance-based transfer learning approach to network intrusion detection. However, it requires plenty of labeled data from the target domain. Gao et al. [30] proposed a model-based transfer learning approach and applied it to the KDD99 cup network dataset. Both the instance-based and the model-based transfer learning approaches depend heavily on the assumption of homogeneous features. This is often not the case for network attack detection, which typically exhibits heterogeneous features. Another advantage of feature-based approaches is their flexibility to adopt different base classifiers for different cases, which motivated us to derive a feature-based transfer learning approach for our network attack detection study. To the best of our knowledge, this paper is the first effort to apply a feature-based transfer learning approach to improving the robustness of network attack detection.

3 Framework of using transfer learning for detecting new network attacks

We presented a transfer learning-enabled network attack detection framework to enhance the detection of new network attacks in a target domain in [6]. From a practical standpoint, source and target domains can represent different network environments, or the same network environment with different attacks captured at different times and in separate instances. In this study, we primarily consider the latter scenario, wherein the source and target domains comprise different attacks. We assume that the attack in the source domain is known and labeled appropriately, and that attacks in the target domain are new and not labeled. Unlike prior studies [29, 30], which assume that the source and target domains have the same feature sets, our framework supports introducing new features into the target domain. This is relevant to evolving network attacks, where adversaries may change their behaviors, resulting in a need to incorporate new features from the network or system layers. Thus, in this scenario, the source and target domains have different attack distributions or feature sets. The goal of the transfer learning framework is to use source domain data to differentiate new attacks in the target domain.

The framework consists of a machine learning pipeline with the following stages: (i) extracting features from raw network traffic data, (ii) learning representations with feature-based transfer learning, and (iii) classification. In the first stage, features are extracted from the raw network trace data by computing statistics of the network flows. In the second stage, we used feature-based transfer learning algorithms to learn a good new feature representation from both the source and target domains. In the third stage, we fed the new representation to a common base classifier, such as a decision tree, SVM, or KNN.

4 Transfer learning approach via spectral transformation

We model network attack detection as a binary classification problem, which is to classify each network connection as malicious or normal. Suppose we are provided with source domain training examples \(S=\{\vec{x}_{i}\}, \vec{x} \in \mathbb{R}^{m}\) that have labels \(L_{S}=\{y_{i}\}\), and target domain data \(T=\{\vec{u}_{i}\}, \vec{u} \in \mathbb{R}^{n}\). Suppose \(\vec{x}\) and \(\vec{u}\) are drawn from different distributions, \(P_{S}(X) \neq P_{T}(X)\), where \(P_{T}(X)\) is unknown, and the dimensions of \(\vec{x}\) and \(\vec{u}\) are different, \(\mathbb{R}^{m} \neq \mathbb{R}^{n}\). Our goal is to accurately predict the labels on T.

Since network attacks share similar traits, our approach is to find the common latent subspace and transform the source and target data onto it to obtain new feature representations, which can then be used in classification. We demonstrated the approach in our previous paper [6]. Given source domain data and target domain data with different attacks, the model explores the common latent space, in which the original structure of the data is preserved while the discriminative examples are still far apart.

4.1 Optimization

Given source data \(\mathbf{S}\) and target data \(\mathbf{T}\), we compute optimal projections \(\mathbf{V_{S}}\) and \(\mathbf{V_{T}}\) of \(\mathbf{S}\) and \(\mathbf{T}\) onto an optimal subspace according to the following optimization objective:
$$ \min_{\mathbf{V_{S}},\mathbf{V_{T}}}\ell(\mathbf{V_{S}},\mathbf{S})+\ell(\mathbf{V_{T}},\mathbf{T})+ \beta D(\mathbf{V_{S}},\mathbf{V_{T}}), \tag{1} $$

where \(\ell(\cdot,\cdot)\) is a distortion function that evaluates the difference between the original data and the projected data. \(D(\mathbf{V_{S}},\mathbf{V_{T}})\) denotes the difference between the projected data of the source and target domains. β is a trade-off parameter that controls the similarity between the two datasets.

Thus, the first two terms of (1) ensure that the projected data preserve the structures of the original data as much as possible.

We defined \(D(\mathbf{V_{S}},\mathbf{V_{T}})\) in terms of \(\ell(\cdot,\cdot)\) as:
$$ D(\mathbf{V_{S}},\mathbf{V_{T}})= \|\mathbf{V_{T}} - \mathbf{V_{S}} \|^{2}, \tag{2} $$

which is the difference between the projected target data and the projected source data. Hence, the projected source and target data are constrained to be similar by minimizing the difference function (2).

We applied a linear transformation to find the projected space. We define \(\ell(\cdot,\cdot)\) as follows:
$$ \ell(\mathbf{V_{S}},\mathbf{S})=\|\mathbf{S}-\mathbf{V_{S}} \mathbf{P_{S}} \|^{2}, \quad \ell(\mathbf{V_{T}},\mathbf{T})=\|\mathbf{T}-\mathbf{V_{T}} \mathbf{P_{T}} \|^{2}, \tag{3} $$

where \(\mathbf{V_{S}}\) and \(\mathbf{V_{T}}\) are obtained by linear transformations with linear mapping matrices, denoted \(\mathbf{P_{S}} \in \mathbb{R}^{k \times m}\) and \(\mathbf{P_{T}} \in \mathbb{R}^{k \times n}\) for the source and target, respectively. \(\|\mathbf{X}\|^{2}\) is the squared Frobenius norm, which can also be expressed as a matrix trace, \(\operatorname{tr}(\mathbf{X}^{\mathbf{T}}\mathbf{X})\). Viewed differently, \(\mathbf{P_{S}}^{\mathbf{T}}\in \mathbb{R}^{m \times k}\) and \(\mathbf{P_{T}}^{\mathbf{T}} \in \mathbb{R}^{n \times k}\) project the original data \(\mathbf{S}\) and \(\mathbf{T}\) into a k-dimensional latent subspace, where the projected data are comparable (\(\ell(\mathbf{V_{S}},\mathbf{S})=\|\mathbf{S}\mathbf{P_{S}}^{\mathbf{T}}-\mathbf{V_{S}}\|^{2}\)). However, this form leads to the trivial solution \(\mathbf{P_{S}}=\mathbf{0}, \mathbf{V_{S}}=\mathbf{0}\). We thus apply (3). It can be viewed as a matrix factorization problem, which is widely known as an effective tool for extracting latent subspaces while preserving the original data structures.

4.2 Optimization objective 1

Substituting (3) and (2) into (1), we obtain the following optimization objective, to be minimized with respect to \(\mathbf{V_{S}}\), \(\mathbf{V_{T}}\), \(\mathbf{P_{S}}\), and \(\mathbf{P_{T}}\):
$$ \begin{aligned} \min G(\mathbf{V_{S}},\mathbf{V_{T}},\mathbf{P_{S}},\mathbf{P_{T}}) &= \min \Big( \|\mathbf{S}-\mathbf{V_{S}}\mathbf{P_{S}} \|^{2}\\&\quad+\|\mathbf{T}-\mathbf{V_{T}}\mathbf{P_{T}} \|^{2} \\&\quad+\beta \cdot \|\mathbf{V_{T}} - \mathbf{V_{S}} \|^{2} \Big) \end{aligned} \tag{4} $$
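For reference, the partial derivatives of objective (4) that a gradient method would use can be written out directly. This is a worked derivation from (4), assuming \(\|\cdot\|^{2}\) is the squared Frobenius norm; \({}^{\top}\) denotes the transpose here, to avoid confusion with the target data matrix \(\mathbf{T}\). The detailed update rules of HeTL remain those given in [6].

$$ \begin{aligned} \frac{\partial G}{\partial \mathbf{V_{S}}} &= 2(\mathbf{V_{S}}\mathbf{P_{S}}-\mathbf{S})\mathbf{P_{S}}^{\top} + 2\beta(\mathbf{V_{S}}-\mathbf{V_{T}}), \\ \frac{\partial G}{\partial \mathbf{V_{T}}} &= 2(\mathbf{V_{T}}\mathbf{P_{T}}-\mathbf{T})\mathbf{P_{T}}^{\top} + 2\beta(\mathbf{V_{T}}-\mathbf{V_{S}}), \\ \frac{\partial G}{\partial \mathbf{P_{S}}} &= 2\mathbf{V_{S}}^{\top}(\mathbf{V_{S}}\mathbf{P_{S}}-\mathbf{S}), \\ \frac{\partial G}{\partial \mathbf{P_{T}}} &= 2\mathbf{V_{T}}^{\top}(\mathbf{V_{T}}\mathbf{P_{T}}-\mathbf{T}). \end{aligned} $$

The coupling term contributes only to the gradients of \(\mathbf{V_{S}}\) and \(\mathbf{V_{T}}\), with opposite signs, which is what pulls the two projections toward each other as β grows.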

In our previous work [6], we used a gradient method to minimize the objective by iteratively fixing three of the matrices and solving for the remaining one until convergence. The detailed HeTL algorithm was presented in [6].
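The alternating scheme described above can be illustrated with a minimal NumPy sketch. It is not the HeTL implementation from [6]: the plain gradient steps, the random initialization, and the names `objective`, `fit_latent_space`, `lr`, and `n_iter` are assumptions made for illustration. The sketch also assumes that S and T have the same number of rows, since the coupling term in (4) subtracts the two projections row by row.

```python
import numpy as np


def objective(S, T, V_S, V_T, P_S, P_T, beta):
    """Value of objective (4): two reconstruction errors plus the coupling term."""
    return (np.linalg.norm(S - V_S @ P_S) ** 2
            + np.linalg.norm(T - V_T @ P_T) ** 2
            + beta * np.linalg.norm(V_T - V_S) ** 2)


def fit_latent_space(S, T, k=2, beta=0.5, lr=1e-3, n_iter=500, seed=0):
    """Alternating gradient descent on objective (4) (illustrative only).

    S has shape (r, m) and T has shape (r, n); the same row count r is assumed.
    """
    rng = np.random.default_rng(seed)
    r = S.shape[0]
    # Small random initial values (an assumption; [6] may initialise differently).
    V_S = rng.normal(scale=0.1, size=(r, k))
    V_T = rng.normal(scale=0.1, size=(r, k))
    P_S = rng.normal(scale=0.1, size=(k, S.shape[1]))
    P_T = rng.normal(scale=0.1, size=(k, T.shape[1]))
    history = []
    for _ in range(n_iter):
        # Update one matrix at a time while the other three stay fixed.
        V_S -= lr * (2 * (V_S @ P_S - S) @ P_S.T + 2 * beta * (V_S - V_T))
        V_T -= lr * (2 * (V_T @ P_T - T) @ P_T.T + 2 * beta * (V_T - V_S))
        P_S -= lr * (2 * V_S.T @ (V_S @ P_S - S))
        P_T -= lr * (2 * V_T.T @ (V_T @ P_T - T))
        history.append(objective(S, T, V_S, V_T, P_S, P_T, beta))
    return V_S, V_T, P_S, P_T, history
```

In such a setup, the rows of `V_S` (with the source labels) would train the base classifier, and the rows of `V_T` would be the inputs to classify.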

5 Clustering-enhanced hierarchical transfer learning

In our previous study, we observed that the performance of HeTL depends on the manual pre-setting of a hyper-parameter: the relevance between the source and target domains (β). An inappropriate choice of parameters might lead to suboptimal results. The row order of the class types in S and T could also affect the value of \(D(\mathbf{V_{S}},\mathbf{V_{T}})\). In practice, we may know little about the new attack in T, so the transformation process in (4) could be misleading.

To address this problem, we proposed a hierarchical transfer learning approach with clustering enhancement, called CeHTL, which automatically finds the relevance between the source and target domains before the projection is performed. CeHTL first clusters the instances of the target domain; the source domain already has two natural clusters (its classes). By computing the similarity between clusters and pairing each target cluster with its most similar source cluster, we obtain the correspondence (mapping) of each cluster in the target domain to the source domain. We sorted the instances by their cluster labels, so that the rows in matrices T and S have the same class order. Then, we solved objective (4) for the ordered T and S. We illustrated the comparison between CeHTL and HeTL in Fig. 1. The algorithm for CeHTL is listed in Algorithm 1. We chose K-means++ [31] for clustering and used the Euclidean distance to compute the similarity.
Fig. 1

Comparison between HeTL and CeHTL
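The cluster-correspondence step of CeHTL can be sketched as follows. This is an illustrative NumPy sketch, not Algorithm 1: the helper names `kmeans_pp` and `align_rows` are hypothetical, the small k-means routine with k-means++ seeding stands in for a library implementation, and the source and target are assumed to share the same feature dimension so that Euclidean distances between centroids are defined.

```python
from itertools import permutations

import numpy as np


def kmeans_pp(X, k=2, n_iter=50, seed=0):
    """Small k-means with k-means++ seeding; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Pick the next seed with probability proportional to squared distance.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids


def align_rows(S, y_S, T, seed=0):
    """Sort S by class and T by the source class matched to each target cluster."""
    classes = np.unique(y_S)  # e.g. [0, 1] for normal and attack
    src_centroids = np.array([S[y_S == c].mean(axis=0) for c in classes])
    t_labels, t_centroids = kmeans_pp(T, k=len(classes), seed=seed)
    # dist[j, c]: Euclidean distance between target cluster j and source class c.
    dist = np.linalg.norm(t_centroids[:, None, :] - src_centroids[None, :, :], axis=2)
    # One-to-one matching with the smallest total distance (fine for small k).
    best = min(permutations(range(len(classes))),
               key=lambda p: sum(dist[j, p[j]] for j in range(len(classes))))
    mapped = classes[np.array(best)][t_labels]  # matched source class per target row
    s_order = np.argsort(y_S, kind="stable")
    t_order = np.argsort(mapped, kind="stable")
    return S[s_order], y_S[s_order], T[t_order], mapped[t_order], t_order
```

The reordered matrices would then be passed to the solver for objective (4). The matched labels only fix the row order; they are not used as ground truth for the target domain.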

When the source and target domains have heterogeneous feature sets, so that T and S have different dimensions, the Euclidean distance cannot be applied. To overcome this problem, we use principal component analysis (PCA) [32] on the source and target domains separately to perform feature reduction. By choosing the same number of components for the source and target domains, they will have the same dimensions. The notation is described in Table 1.
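Table 1

Notation descriptions

Notation | Description
\(\mathbf{S}\) | Source data
\(\mathbf{V_{S}}\) | Projected source data
\(\mathbf{P_{S}}\) | Projection function to the source space
\(\mathbf{T}\) | Target data
\(\mathbf{V_{T}}\) | Projected target data
\(\mathbf{P_{T}}\) | Projection function to the target space
β | Weights of the relevance between the source and target data
k | Dimensions of the projected space
 | Learning rates
 | Learning step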
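The PCA-based dimension matching described above for heterogeneous feature sets can be sketched with NumPy alone. The function names `pca_reduce` and `to_common_dimension` are hypothetical, and PCA is computed here through a singular value decomposition of the centered data rather than through any particular library class.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its top principal components (computed via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


def to_common_dimension(S, T, n_components):
    """Run PCA on each domain separately so both end up with n_components columns."""
    return pca_reduce(S, n_components), pca_reduce(T, n_components)


# Toy example: 8 source features and 5 target features, both reduced to 3.
rng = np.random.default_rng(0)
S_reduced, T_reduced = to_common_dimension(
    rng.normal(size=(50, 8)), rng.normal(size=(50, 5)), n_components=3)
```

Because each domain gets its own PCA fit, the components of the two domains are not guaranteed to correspond to each other; the reduction only makes the dimensions equal so that Euclidean distances between cluster centroids can be computed.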

6 Experimental evaluation

In this section, we evaluated the performance of the proposed transfer learning approaches, HeTL and CeHTL, for detecting “unknown” network attacks. We addressed the following questions: Does a transfer learning approach provide any advantage over a single classifier that does not use transfer learning? Which technique is the most appropriate transfer learning approach? We utilized a benchmark network intrusion dataset, NSL-KDD [11] (Section 6.1). We carried out two experiments to simulate “unknown” network attacks and different feature spaces (Section 6.2). We demonstrated the benefits of HeTL and CeHTL compared to traditional machine learning algorithms as well as several other recent transfer learning methods (Section 6.3). We also performed a parameter sensitivity analysis and showed the impact of imbalanced datasets and training data sizes (Section 6.4).

6.1 Network datasets

NSL-KDD contains network features extracted from a series of TCP connection records captured from a local area network. Each record in the dataset corresponds to a connection labeled as either normal or a specific attack type. The dataset has 22 different types of attack, which can be grouped into 4 main categories: DoS, R2L, Probe, and User to root (U2R). Tables 2 and 3 provide the details of the attacks and their distribution in the training dataset. Since the portion of U2R is very small, we only focus on DoS, R2L, and Probe.
Table 2

Category of the attack in NSL-KDD

Main categories | Attack types
DoS | neptune, back, land, smurf, teardrop, pod
R2L | buffer_overflow, ftp_write, guess_passwd, imap, multihop, phf, spy, warezclient, warezmaster
Probe | ipsweep, nmap, portsweep, satan
U2R | loadmodule, perl, rootkit

Table 3

Number of instances in NSL-KDD



















NSL-KDD contains 41 network features that can be split into 3 groups: (1) basic features deduced from TCP/IP connection packet headers; (2) traffic features, usually extracted by flow analysis tools; and (3) content features, which require processing of the packet content. Some example features are listed in Table 4.
Table 4

Some selected features in NSL-KDD

Feature description | Feature category
Duration of the connection | Basic feature
Data bytes from source to destination | Basic feature
Data bytes from destination to source | Basic feature
Number of incorrect logins in a connection | Content feature
Sum of connections to the same destination port number | Traffic feature
Percentage of connections that have “SYN” errors among the connections to the same host in the past 2 s | Traffic feature
Percentage of connections that have “SYN” errors among the connections to the same destination port in the past 2 s | Traffic feature
Sum of connections to the same destination IP address | Traffic feature
Percentage of connections that were to the same service, among the connections aggregated in dst_host_count | Traffic feature

6.2 Experimental setting

6.2.1 Detection of unknown network attacks

This experiment evaluates the proposed transfer learning approaches for detecting new variants of attacks. Simulating new attacks is challenging. We assumed that attacks in the target data have no labels and differ from attacks in the source domain. We randomly selected malicious examples from one main attack category (e.g., DoS, R2L, Probe) together with normal examples as the source domain. Then, we chose a different attack type combined with normal samples for the target domain. We finally generated three groups: DoS →Probe (DoS is the source domain for training and Probe is the target domain for testing), DoS →R2L, and Probe →R2L. To evaluate the generalization, we also chose attacks from the 22 attack subtypes for each source and target set and generated 11 tasks. We repeated the process ten times and reported the averages and standard deviations. The attack data and the normal data in each domain are balanced unless stated otherwise. We further studied the effects of imbalanced data in Section 6.4.
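The construction of one source → target task described above can be sketched as follows. The function names `make_domain` and `make_task`, the `category` array (one main-category string per record, with "normal" for benign traffic), and the default of 500 rows per class are assumptions for illustration. The sketch also does not prevent the same normal record from appearing in both domains, which the text does not specify.

```python
import numpy as np


def make_domain(X, category, attack_category, n_per_class, rng):
    """Balanced domain: n_per_class normal rows plus n_per_class rows of one attack category."""
    attack_idx = rng.choice(np.flatnonzero(category == attack_category),
                            n_per_class, replace=False)
    normal_idx = rng.choice(np.flatnonzero(category == "normal"),
                            n_per_class, replace=False)
    idx = np.concatenate([normal_idx, attack_idx])
    # Binary labels: 0 = normal, 1 = attack.
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X[idx], y


def make_task(X, category, source_attack, target_attack, n_per_class=500, seed=0):
    """Build one transfer task, e.g. source_attack='DoS', target_attack='Probe'."""
    rng = np.random.default_rng(seed)
    source = make_domain(X, category, source_attack, n_per_class, rng)
    target = make_domain(X, category, target_attack, n_per_class, rng)
    # The target labels are returned for evaluation only, never for training.
    return source, target
```

Repeating `make_task` with ten different `seed` values would mirror the repeated runs whose averages and standard deviations are reported.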

6.2.2 Network attacks with different feature spaces

To evaluate the performance in detecting attacks using different feature spaces, we used different feature sets for the source and target domains, building on the first experimental setting. In network security, there are circumstances in which we need to incorporate new features to better detect attacks. For example, traffic features are more discriminative for DoS attacks, whereas content features are more discriminative for R2L attacks. Such a change usually requires retraining the model. To simulate this scenario, we selected the most relevant features for the source and target domains using information gain, resulting in unequal feature dimensions. The final selected features are listed in Appendix Tables 5 and 6. Of note, information gain is used here only to generate different feature sets, not to improve performance. In real practice, features can change because of manual feature engineering, as we have less information about the target dataset. The baseline approach manually maps the target data into the source feature space and applies the traditional classifiers. We compared our transfer learning approach with the baselines.
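Feature ranking by information gain, as used above to produce different feature sets for each domain, can be sketched as below. The equal-width discretization into `n_bins` bins and the function names are assumptions; the paper does not state how continuous features were discretized.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(feature, labels, n_bins=10):
    """IG(Y; X) = H(Y) - H(Y | X), with X discretised into equal-width bins."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    bins = np.digitize(feature, edges[1:-1])
    h_cond = 0.0
    for b in np.unique(bins):
        mask = bins == b
        # Weight each bin's label entropy by the fraction of rows in the bin.
        h_cond += mask.mean() * entropy(labels[mask])
    return entropy(labels) - h_cond


def top_features(X, y, n_top):
    """Indices of the n_top columns of X with the highest information gain."""
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(gains)[::-1][:n_top], gains
```

Calling `top_features` separately on the source data and on the target data, possibly with different `n_top`, yields feature sets of unequal size for the two domains.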

6.3 Evaluation

We chose accuracy, the F1 score (F-measure), and the receiver operating characteristic (ROC) curve as the performance metrics. The F1 score combines precision and recall to measure the per-class performance of classification or detection algorithms.
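For concreteness, accuracy and the F1 score for the binary task can be computed as in the sketch below, which treats the attack class (label 1) as the positive class. The function name `binary_metrics` and the convention of returning 0.0 when a ratio is undefined are assumptions.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with label 1 (attack) as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

On balanced data, accuracy and F1 tend to move together; on imbalanced data, F1 is the more informative of the two because it ignores true negatives.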

We first chose C4.5 decision tree (CART), linear SVM, and KNN as the baselines, which also served as base classifiers for HeTL and CeHTL. We compared HeTL and CeHTL with the baselines on the three main transfer learning tasks (i.e., DoS →Probe, DoS →R2L, and Probe →R2L). Figures 2 and 3 show the box plots of accuracy and F1 score over ten iterations on the three main tasks. We observed that the baseline models performed poorly, with accuracy of 0.47–0.74 and F1 score of 0.1–0.65. Our HeTL and CeHTL significantly outperformed the baselines, obtaining over 0.70 accuracy and 0.75 F1 score. CeHTL outperformed HeTL with all three base classifiers in DoS →Probe, and with decision tree and KNN in Probe →R2L. CeHTL achieved the best result, with an average accuracy and F1 score of 0.88.
Fig. 2

Box plot of accuracy of transfer learning approaches and baselines on three main tasks

Fig. 3

Box plot of F1 score of transfer learning approaches and baselines on three main tasks

Then, we applied HeTL, CeHTL, and two baseline methods (SVM and HeMap [26], a recent transfer learning approach) to the 11 transfer learning tasks generated from the attack subtypes, along with the 3 main tasks. We ran the experiment for 10 iterations with different random seeds and reported the averages and standard deviations of accuracy and F1 score in Figs. 4 and 5. We observed that (1) transfer learning approaches outperformed the traditional classifiers without transfer learning in all 14 tasks, (2) HeTL and CeHTL improved the accuracy to 0.8–0.9 in 5 tasks, (3) HeTL and CeHTL outperformed HeMap, and (4) CeHTL outperformed all other methods in 10 cases. Figure 6 shows the ROC curves on the 3 main transfer learning tasks using KNN as the base classifier. CeHTL achieved the best area under the ROC curve (AUC) in two tasks, DoS →Probe and Probe →R2L (CeHTL 0.93 and 0.91 AUC vs. HeTL 0.82 and 0.65 AUC). Besides HeMap, we compared our approaches with more baselines, TCA [24] and CORAL [23]. Figure 7 shows the results of the approaches with 5 classifiers in DoS →R2L. HeTL and CeHTL outperformed all baselines.
Fig. 4

Performance comparison of accuracy on unknown network attacks detection, sample size = 1000

Fig. 5

Performance comparison of F1 score on unknown network attacks detection, sample size = 1000

Fig. 6

Performance comparison of ROC curves on the three transfer learning datasets. a ROC curve on DoS →Probe. b ROC curve on DoS →R2L. c ROC curve on Probe →R2L

Fig. 7

Performance comparison of feature-based transfer learning approaches on DoS → R2L

Finally, we carried out the second experimental setting, where the source domain and target domain have different feature spaces. We compared the transfer learning approaches with the manual mapping approach on DoS →R2L. The results in Fig. 8 show that the transfer learning approaches outperformed the baselines.
Fig. 8

Performance comparison on heterogeneous spaces on DoS →R2L

6.4 Discussion

The study proposed two transfer learning methods, HeTL and CeHTL, for network attack detection to address the lack of sufficient labels for new attacks. The results showed that HeTL and CeHTL significantly improved accuracy compared to the traditional classifiers and other transfer learning methods. CeHTL performed the best in most of the tasks, especially in the DoS →Probe tasks. One reason is that DoS has more similarities with Probe than with R2L, according to the top selected features in Appendix Tables 5 and 6. This similarity improves the accuracy of the computed cluster correspondence, which in turn results in better performance.

6.4.1 Parameter sensitivity

Two hyper-parameters, the similarity confidence parameter β and the dimension of the new feature space k, need to be set for optimization (4). There are several ways to determine the optimal hyper-parameters: (a) the similarity confidence β can be determined by computing the similarity or distance between the source and target data, (b) the optimal values of both parameters can be found by enumerating candidate values, or (c) the parameters can be set empirically. However, the first and second approaches need a few labeled data from the target domain, which is not a truly “unknown” situation. We studied the impact of different parameter settings on the performance of detecting attacks. Figure 9 demonstrates the effect on accuracy of different combinations of β and k (where β ∈ [0,1] and k ranges from 1 to 6). Figures 10 and 11 demonstrate the average accuracy achieved for the parameters β and k.
Fig. 9

Accuracy comparison with different combinations of k and β, sample = 1000. a DoS →Probe. b DoS →R2L. c Probe →R2L

Fig. 10

Study of parameter β sensitivity on three main detection tasks, sample = 1000. a DoS →Probe. b DoS →R2L. c Probe →R2L

Fig. 11

Study of parameter k sensitivity on the three main detection tasks, sample = 1000. a DoS →Probe. b DoS →R2L. c Probe →R2L

Compared with HeMap, both HeTL and CeHTL improve the highest accuracy achieved across the different parameter settings, as shown in Fig. 9. However, HeTL is sensitive to parameter tuning, showing lower accuracy for some specific parameter combinations. CeHTL performs more stably. For example, in DoS →Probe, after several fluctuations, CeHTL maintains around 0.8 accuracy. For the similarity confidence parameter β, as shown in Fig. 10, CeHTL shows a significant improvement and stays stable from β≥0, because the correspondence has been automatically computed and included in the transfer learning, so β should be set larger than 0. For the parameter k, CeHTL in general shows better and more stable performance than the other approaches. The results show that CeHTL is more suitable for unknown network attack detection, since we can set the parameters empirically and do not rely heavily on information about labeled data in the target domain.

6.4.2 The imbalanced data effects

In many real cases, the amounts of normal and attack data are not equal. Thus, we investigated the performance of HeTL and CeHTL on imbalanced data. Figure 12 shows the F1 score of the transfer learning approaches and baselines for different percentages of attack data. We observed that the baseline method performed poorly on the imbalanced data, especially in DoS →R2L and Probe →R2L. The transfer learning approaches improved the F1 score in most cases. Although all the methods had a lower F1 score with 10% attack data, HeTL and CeHTL boosted the F1 score by 50% when another 10% of attack data was added, and the metric kept rising as the share of attack data increased.
Fig. 12
Fig. 12

Performance on imbalanced data when varying the proportion of attack data, sample = 1000. a DoS →Probe. b DoS →R2L. c Probe →R2L
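To make clear why the F1 score is reported for imbalanced data, the following minimal sketch computes accuracy and F1 on a toy set with 10% attack samples. The labels and the two detectors are hypothetical and only illustrate the metric; they are not data from the experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (attack) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no attack detected: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy set: 100 attack samples (label 1) followed by 900 normal samples (label 0).
y_true = [1] * 100 + [0] * 900

# Hypothetical detector A labels everything as normal.
always_normal = [0] * 1000

# Hypothetical detector B finds 60 of the 100 attacks and raises 30 false alarms.
partial = [1] * 60 + [0] * 40 + [1] * 30 + [0] * 870

print(accuracy(y_true, always_normal), f1_score(y_true, always_normal))
print(accuracy(y_true, partial), round(f1_score(y_true, partial), 3))
```

Detector A reaches 0.9 accuracy without detecting a single attack, while its F1 score is 0; this is why accuracy alone can be misleading when attack data are scarce.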

6.4.3 The training size

We studied how much training data was needed for unknown attack detection. The learning curves are plotted in Fig. 13. From the results, we observed that CeHTL achieved the best accuracy at a sample size of 500 in DoS →Probe and DoS →R2L, and the second-best accuracy in Probe →R2L. CeHTL needs the smallest training sample size, which makes it the best option when only a limited amount of training data is available.
Fig. 13

Learning curves for different training sizes. a DoS →Probe. b DoS →R2L. c Probe →R2L
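The procedure behind a learning curve can be sketched as follows: train on increasingly large prefixes of a training pool and measure accuracy on a fixed test set. This is only an illustration under simplifying assumptions. It uses synthetic two-dimensional data and a nearest-centroid classifier as stand-ins (neither is used in the paper), training and test data come from the same distribution, and all function names are hypothetical.

```python
import random


def make_data(n, rng):
    """Synthetic 2-D points: label 0 around (0, 0), label 1 around (2, 2)."""
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so every prefix contains both classes
        center = 2.0 * label
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def fit_centroids(train):
    """Compute the mean point of each class."""
    sums = {}
    for (x, y), label in train:
        sx, sy, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, count + 1)
    return {label: (sx / c, sy / c) for label, (sx, sy, c) in sums.items()}


def predict(centroids, point):
    """Assign the label of the closest class centroid."""
    def sq_dist(label):
        cx, cy = centroids[label]
        return (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return min(centroids, key=sq_dist)


def learning_curve(train_pool, test, sizes):
    """Return [(training size, test accuracy)] for each requested size."""
    curve = []
    for n in sizes:
        centroids = fit_centroids(train_pool[:n])
        correct = sum(predict(centroids, p) == label for p, label in test)
        curve.append((n, correct / len(test)))
    return curve


rng = random.Random(0)  # fixed seed so the sketch is reproducible
train_pool = make_data(1000, rng)
test = make_data(500, rng)
sizes = [10, 50, 100, 500, 1000]
curve = learning_curve(train_pool, test, sizes)
for n, acc in curve:
    print(n, round(acc, 3))
```

A method whose curve flattens at a small training size, as reported for CeHTL above, needs less training data to reach its best accuracy.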

7 Conclusion

Machine learning has been employed to detect malicious attacks. Most machine learning techniques for attack detection are effective only under the assumption that the training and testing data come from the same distribution. However, in most real cases, continuously evolving attacks and the lack of sufficient labeled datasets hinder the ability of supervised learning techniques to detect new attacks. In this paper, we introduced a feature-based transfer learning framework and transfer learning approaches. We presented a feature-based transfer learning approach using a linear transformation, called HeTL. We also proposed a clustering-enhanced transfer learning approach, called CeHTL, which is more robust in detecting unknown attacks. We evaluated the transfer learning approaches on common classifiers. The results showed that the transfer learning approaches improve the performance of detecting unknown network attacks compared with the baselines. In particular, CeHTL exhibited higher performance and greater robustness in detecting unknown attacks with no labeled data. The results also demonstrated that the proposed transfer learning techniques can support different feature spaces. In the future, we aim to apply the model to various attack domains, such as malware detection. We also plan to combine transfer learning with deep learning to pre-train the models for practical use.

8 Appendix

Table 5

Top features for detecting DoS, used in the second experiment

Rank index

Table 6

Top features for detecting R2L, used in the second experiment

Rank index

Not applicable.


The research presented in this paper was supported by the Office of the Assistant Secretary of Defense for Research and Engineering (OASD (R&E)) agreement FA8750-15-2-0120 and Boeing Data Analytics agreement BRT-L1015-0006.

Availability of data and materials

Not applicable.

Authors’ contributions

JZ carried out the data processing, design and implementation of the proposed algorithms, experiment setup, and results evaluation and drafted the manuscript. SS contributed to the conception, experiment design, and evaluation of the proposed approach and results and helped draft the manuscript. JWP provided oversight for data and experimentation and participated in the manuscript editing. CK and KK helped revise the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

Vanderbilt University Medical Center, Nashville, 37203, USA
Virginia Modeling Analysis and Simulation Center, Old Dominion University, Norfolk, 23529, USA
AutoX Inc, San Jose, California, USA
US Army Research Laboratory’s Network Security Branch, Adelphi, 20783, USA
Haloed Sun TEK, LLC, in affiliation with the CAESAR Group, Sarasota, Florida, USA


  1. R. Perdisci, W. Lee, N. Feamster, in NSDI, vol. 10. Behavioral clustering of http-based malware and signature generation using malicious network traces (USENIX AssociationBerkeley, 2010), p. 14.Google Scholar
  2. C. Rossow, C. Dietrich, H. Bos, L. Cavallaro, M. V. Steen, F. C. Freiling, N. Pohlmann, in BADGERS ’11 Prof. of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security. Sandnet: Network traffic analysis of malicious software, (2011), pp. 78–88.Google Scholar
  3. N. Stakhanova, M. Couture, A. A. Ghorbani, in Prof. of the 2011 6th International Conf. on Malicious and Unwanted Software, Malware 2011. Exploring network-based malware classification (IEEE Computer SocietyWashington, DC, 2011), pp. 14–19.Google Scholar
  4. D. Bekerman, B. Shapira, L. Rokach, A. Bar, in Communications and Network Security (CNS), 2015 IEEE Conference On. Unknown malware detection using network traffic classification (IEEELos Alamitos, 2015), pp. 134–142.View ArticleGoogle Scholar
  5. K. Bartos, M. Sofka, V. Franc, in USENIX Security 2016. Optimized invariant representation of network traffic for detecting unseen malware variants (USENIX AssociationAustin, 2016), pp. 807–822.Google Scholar
  6. J. Zhao, S. Shetty, J. W. Pan, in Military Communications Conference, (MILCOM). Feature-based transfer learning for network security (IEEELos Alamitos, 2017).Google Scholar
  7. A. Javaid, Q. Niyaz, W. Sun, M. Alam, in Proceedings of the 9th EAI International Conf. on Bio-inspired Information and Communications Technologies (Formerly BIONETICS), BICT’15. A deep learning approach for network intrusion detection system (ICST, ICST, 2016), pp. 21–26.Google Scholar
  8. F. Zhuang, X. Cheng, P. Luo, S. J. Pan, Q. He, in Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15. Supervised representation learning: transfer learning with deep autoencoders (AAAI Press, 2015), pp. 4119–4125.Google Scholar
  9. K. D. Feuz, D. J. Cook, Transfer learning across feature-rich heterogeneous feature spaces via feature-space remapping (FSR). ACM Trans. Intell. Syst. Technol.6(1), 3–1327 (2015).View ArticleGoogle Scholar
  10. D. Lin, Network intrusion detection and mitigation against denial of service attack. Technical Report MS-CIS-13-04 (Department of Computer and Information Science Technical, University of Pennsylvania, 2013).Google Scholar
  11. NSL-KDD, UNB IUNB ISCX NSL-KDD DataSet (2016). Accessed 01 May 2016.
  12. A. Valdes, K. Skinner, Adaptive, Model-Based Monitoring for Cyber Attack Detection. (H. Debar, L. Mé, S. F. Wu, eds.) (Springer, Berlin, Heidelberg, 2000).Google Scholar
  13. M. Hilker, C. Schommer, in Conf.s in Research and Practice in Information Technology Series, vol. 54. Description of bad-signatures for network intrusion detection (ACSW, 2006), pp. 175–182.Google Scholar
  14. H. Han, X. -L. Lu, L. -Y. Ren, in Machine Learning and Cybernetics, 2002. Proceedings. 2002 International Conference On, vol. 1. Using data mining to discover signatures in network-based intrusion detection (IEEELos Alamitos, 2002), pp. 13–17.Google Scholar
  15. S. Nari, A. A. Ghorbani, in Prof. of the 2013 International Conf. on Computing, Networking and Communications (ICNC). ICNC ’13. Automated malware classification based on network behavior (IEEE Computer SocietyWashington, 2013), pp. 642–647.Google Scholar
  16. M. Z. Rafique, P. Chen, C. Huygens, W. Joosen, in Prof. of the 2014 conference on Genetic and evolutionary computation - GECCO ’14. Evolutionary algorithms for classification of malware families through different network behaviors (ACMNew York, 2014), pp. 1167–1174.Google Scholar
  17. F. Iglesias, T. Zseby, Analysis of network traffic features for anomaly detection. Mach. Learn.101(1-3), 59–84 (2014).MathSciNetView ArticleGoogle Scholar
  18. S. J. Pan, Q. Yang, A survey on transfer learning. IEEE Trans. Knowl. Data Eng.22(10), 1345–1359 (2010).View ArticleGoogle Scholar
  19. S. Bickel, M. Brückner, T. Scheffer, in Prof. of the 24th International Conf. on Machine Learning. ICML ’07. Discriminative learning for differing training and test distributions (ACMNew York, 2007), pp. 81–88.Google Scholar
  20. W. Dai, Q. Yang, G. -R. Xue, Y. Yu, in Prof. of the 24th International Conf. on Machine Learning. ICML ’07. Boosting for transfer learning (ACMNew York, 2007), pp. 193–200.Google Scholar
  21. T. Evgeniou, M. Pontil, in Prof. of the Tenth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining. KDD ’04. Regularized multi–task learning (ACMNew York, 2004), pp. 109–117.Google Scholar
  22. E. Bonilla, K. M. Chai, C. Williams, Multi-task Gaussian process prediction. Adv. Neural Inf. Process. Syst.20(October), 153–160 (2008).Google Scholar
  23. B. Sun, J. Feng, K. Saenko, in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI 16. Return of frustratingly easy domain adaptation (AAAI Press, 2016), pp. 2058–2065.
  24. S. J. Pan, I. W. Tsang, J. T. Kwok, Q. Yang, Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw.22(2), 199–210 (2011).View ArticleGoogle Scholar
  25. B. Kulis, K. Saenko, T. Darrell, in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conf. On. What you saw is not what you get: domain adaptation using asymmetric kernel transforms (IEEE Computer SocietyLos Alamitos, 2011), pp. 1785–1792.Google Scholar
  26. X. Shi, Q. Liu, W. Fan, P. S. Yu, R. Zhu, in Prof. - IEEE International Conf. on Data Mining, ICDM. Transfer learning on heterogenous feature spaces via spectral transformation (IEEELos Alamitos, 2010), pp. 1049–1054.Google Scholar
  27. J. Nam, S. J. Pan, S. Kim, in Prof. of the 2013 International Conf. on Software Engineering. ICSE ’13. Transfer defect learning (IEEE PressPiscataway, 2013), pp. 382–391.Google Scholar
  28. B. Long, Y. Chang, A. Dong, J. He, in WSDM. Pairwise cross-domain factor model for heterogeneous transfer ranking (ACMNew York, 2012), p. 113.View ArticleGoogle Scholar
  29. S. Gou, Y. Wang, L. Jiao, et al., in 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications. Distributed transfer network learning based intrusion detection (IEEELos Alamitos, 2009), pp. 511–515.View ArticleGoogle Scholar
  30. J. Gao, W. Fan, J. Jiang, J. Han, in Prof. of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Knowledge transfer via multiple model local structure mapping (ACMNew York, 2008), pp. 283–291.Google Scholar
  31. D. Arthur, S. Vassilvitskii, in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’07. K-means++: the advantages of careful seeding (Society for Industrial and Applied MathematicsPhiladelphia, 2007), pp. 1027–1035.
  32. H. Abdi, L. J. Williams, Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat.2(4), 433–459 (2010).View ArticleGoogle Scholar


© The Author(s) 2019