The main contributions of this paper are as follows: using ELM [3, 21] for the classification of attacks (although ELM has been used in intrusion detection systems before [30–33], it has not been applied in this manner or to the UNSW dataset, the most realistic and challenging dataset for intrusion detection); using the ExtraTrees classifier [4] to compute feature importance and select the relevant features for detecting each specific type of attack; and using an ensemble of ELMs (one per attack type) whose outputs are combined by a softmax layer to produce an interpretable probabilistic output of very high accuracy. The flowchart of our deep multilayered model is shown in Fig. 1.
We break down the multiclass classification problem into a set of binary classifications in order to decrease the load on the classifiers in the ensemble, since multiclass classification is more complex than binary classification. Binary classification involves only two decision variables, i.e., the two classes, whereas a multiclass problem can involve n decision variables representing the n classes. It is therefore easier to learn a function that maps the set of input features to two decision variables rather than to n, and the resulting function is simpler for binary classification than for multiclass classification, as the complexity of the function grows with the number of decision variables.
Each thread shown in Fig. 1 runs in parallel on GPUs using a MapReduce-style implementation, which enables real-time intrusion detection. The system can detect new attacks, since one ELM in the ensemble distinguishes normal network traffic from potential attacks. However, the type of a new attack cannot be determined if it does not fall under any of the attack categories on which the system is trained.
Layer 1: Feature selection with ExtraTrees classifier
The extremely randomized trees, abbreviated as ExtraTrees [4], are a variant of random forests with more randomization at each step of picking an optimal cut/split or decision boundary. Unlike random forests, where features are split based on a score (such as entropy) and the training instances are bootstrapped, the split criteria of ExtraTrees are random and the entire training set is used. The resulting trees have more leaf nodes and are more computationally efficient. The added randomization also alleviates the high variance of random forests and hence provides a better bias-variance tradeoff.
One of the advantages of tree-based classifiers is their ability to perform feature selection. As a feature selection mechanism, they require much less memory (tree structures are memory efficient), they are fast, and they surface the most important features early, starting from the root node and the first split. At each split, the most important feature at that stage is selected, so the path from the root node to a leaf node traces the most important features. In addition, tree-based methods assign each feature a score at every split, which enables feature ranking. We use this characteristic for feature selection: features are ranked according to the split score computed by the ExtraTrees classifier. The split score for sample S, split s, and class c is given by [4]:
$$ {\text{Score}}_{c}(s,S)=\frac{2I_{c}^{s}(S)}{H_{s}(S)+H_{c}(S)} $$
(1)
where H_{c}(S) is the (log) entropy of the class c in sample S, H_{s}(S) is the split entropy, and \(I_{c}^{s}(S)\) is the mutual information of the split outcome and the class c. We select all features above a threshold score. This is done for distinguishing each attack from the rest, so we obtain a different optimal feature subset for detecting each type of attack. Feature selection reduces redundancy and emphasizes important features, which leads to higher accuracy and faster training.
Most of the previous research applies feature selection once for detecting all attacks. We use the ExtraTrees classifier to select features for each type of attack separately (as shown in Fig. 1), because a feature that is important for detecting one type of attack may be redundant for another. Each feature receives a score, and we use a threshold to discard irrelevant or redundant features that do not contribute enough to the performance of the intrusion detection system. This per-class feature selection works better, as shown in Section 4.
The main motivation for using a feature selection technique is to reduce the dimensionality of the problem in order to improve execution time, memory usage, and data efficiency; removing redundant features in particular helps to counter overfitting and improves performance. Feature selection with decision-tree-based methods is much simpler and faster than techniques such as Fisher's score and the F-score. The major disadvantage of Fisher's score and the F-score is that they score each feature independently of the others, i.e., they do not account for mutual information. The ExtraTrees classifier, in contrast, uses all features together to categorize the data. Some feature combinations can outperform individually high-scoring features, which is why we employ the ExtraTrees classifier as the feature selector.
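As an illustration, per-attack feature selection of this kind can be sketched with scikit-learn's ExtraTrees implementation. This is a minimal sketch under our own assumptions, not the authors' code: the function name, the toy data, and the importance threshold of 0.2 are chosen for demonstration only.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def select_features_for_attack(X, y, attack_label, threshold, seed=0):
    """Rank features for one attack-vs-rest task; keep those above a threshold."""
    t = (y == attack_label).astype(int)        # binary attack-vs-rest labels
    forest = ExtraTreesClassifier(n_estimators=100, random_state=seed)
    forest.fit(X, t)
    importances = forest.feature_importances_  # split-score-based ranking
    return np.where(importances > threshold)[0], importances

# Toy data: only feature 0 carries the signal for "attack" class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
selected, imp = select_features_for_attack(X, y, attack_label=1, threshold=0.2)
```

Running this once per attack type, with a threshold tuned per task, yields a different feature subset for each detector in the ensemble.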
Layer 2: Extreme learning machine ensemble
The extreme learning machine is a supervised learning algorithm originally proposed for a single-hidden-layer feedforward neural network [3, 21]. Extensive research in recent years has extended it to deep neural networks as well; details can be found in [34–37]. We use the original form of the ELM to keep the model simple and fast.
The inputs to the ELM, in this case, are the features selected by the ExtraTrees classifier [4]; let them be represented as (x_{i}, t_{i}), where x_{i} is an input feature instance and t_{i} is its corresponding label. The input features are fed to the hidden layer neurons through randomly weighted connections w. The sum of the products of the inputs and their corresponding weights serves as the input to the hidden layer activation function, a nonlinear, nonconstant, bounded, continuous, infinitely differentiable function that maps the input data to the feature space. There is a catalog of activation functions from which to choose according to the problem at hand. We ran experiments with all of them, and the best performance was achieved with the smooth approximation of the ReLU function [38], known as the SoftPlus function [39]:
ReLU:
$$ f(x)={\text{max}}(0,x) $$
(2)
SoftPlus:
$$ f(x)={\text{log}}(1+e^{x}) $$
(3)
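A quick numerical check of Eqs. (2) and (3) in NumPy; the numerically stable `logaddexp` form of SoftPlus is our own choice, not something specified in the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)    # Eq. (2): max(0, x)

def softplus(x):
    return np.logaddexp(0.0, x)  # Eq. (3): log(1 + e^x), computed stably

# SoftPlus tracks ReLU away from zero and smooths the kink at x = 0:
xs = np.array([-5.0, 0.0, 5.0])
gap = np.abs(softplus(xs) - relu(xs))
```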
The hidden layer and the output layer are connected by weights β, which are determined analytically; the mapping from the feature space to the output space is linear. With L hidden neurons, activation function g, input weights w_{i}, biases b_{i}, and output weights β_{i}, the final output for input x_{j} is:
$$ \sum_{i=1}^{L} \beta_{i}g(w_{i}.x_{j}+b_{i})=o_{j} $$
(4)
The output in matrix form is:
$$ H\beta=T $$
(5)
where H is the hidden layer output matrix with entries H_{ji}=g(w_{i}.x_{j}+b_{i}), β is the matrix of output weights, and T is the matrix of targets.
The error function used in extreme learning machine is the mean squared error function, written as:
$$ E= \frac{1}{2}\sum_{j=1}^{N}{\left({\sum_{i=1}^{L} \beta_{i}g(w_{i}.x_{j}+b_{i}) - t_{j}}\right)^{2}} $$
(6)
The MSE with L_{2} regularization, where C is the regularization parameter, is:
$$ E= \frac{1}{2}\sum_{j=1}^{N}{\left({\sum_{i=1}^{L} \beta_{i}g(w_{i}.x_{j}+b_{i}) - t_{j}}\right)^{2}} + C\frac{1}{2} \|\beta\|^{2} $$
(7)
To minimize the error, we need to get the leastsquares solution of the above linear system:
$$ \|H\beta^{\ast}-T\|={\text{min}}_{\beta}\|H\beta-T\| $$
(8)
The minimum norm leastsquares solution to the above linear system is given by:
$$ \hat{\beta}=H^{\dagger}T $$
(9)
where, H^{†} is the MoorePenrose pseudo inverse of H, which is given by [40, 41]:
$$ H^{\dagger}=\left(\frac{I}{C} + H^{T}H\right)^{-1}H^{T} $$
(10)
However, H^{T}H may not always be nonsingular, or it may tend toward singularity under certain conditions, so this method of computing the pseudo inverse does not work in all cases. The singular value decomposition (SVD) can be used to compute the Moore–Penrose pseudo inverse of H in all cases.
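The whole training procedure of Eqs. (4)–(10) fits in a few lines of NumPy. The sketch below is our own illustration under simplifying assumptions (toy data, 64 hidden neurons instead of 512, Gaussian random weights) and applies the regularized pseudo inverse of Eq. (10) directly rather than the SVD fallback.

```python
import numpy as np

def train_elm(X, T, n_hidden, C, seed=0):
    """Single-hidden-layer ELM: random input weights, analytic output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights w_i
    b = rng.normal(size=n_hidden)                # random biases b_i
    H = np.logaddexp(0.0, X @ W + b)             # SoftPlus hidden activations
    # Eq. (10): beta = (I/C + H^T H)^{-1} H^T T
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.logaddexp(0.0, X @ W + b) @ beta   # Eq. (4)

# Toy binary task: the label is the sign of the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
T = (X[:, 0] > 0).astype(float)
W, b, beta = train_elm(X, T, n_hidden=64, C=10.0)
acc = ((elm_predict(X, W, b, beta) > 0.5) == (T > 0.5)).mean()
```

Note that no iterative optimization is involved: the only "training" is one linear solve for β, which is what makes ELM fast.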
Properties of the above solution are as follows:

1. Minimum training error: the solution is a least-squares solution, i.e., the error ∥Hβ−T∥ is minimized: ∥Hβ^{∗}−T∥=min_{β}∥Hβ−T∥.

2. Smallest norm of weights: the minimum norm least-squares solution is given by the Moore–Penrose pseudo inverse of the hidden layer output matrix H: \(\hat {\beta }=H^{\dagger }T\).

3. Unique solution: the minimum norm least-squares solution of Hβ=T is unique and equals \(\hat {\beta }=H^{\dagger }T\).
Detailed mathematical proofs of these properties and of the ELM algorithm can be found in [3]. We use an ensemble of N+1 ELMs, where N is the number of attack types; the additional ELM detects normal traffic. Each ELM is trained with an X-vs-all strategy, where X is a type of attack or normal traffic, and outputs "1" when it detects the type of traffic it was trained on and "0" otherwise. This approach breaks the multiclass problem down into two-class problems handled by several ELM classifiers, each having to detect only one type of attack instead of several.
Even though this ensembling approach requires many ELMs, it gives much better accuracy and training time and is far less computationally complex than a single deep and wide neural network facing the full multiclass problem. As the number of decision variables (classes) increases, the network size must increase as well; additionally, deep neural networks require backpropagation for training, which is more computationally complex than ELM training. Since each ELM in the ensemble has to detect only one type of attack, it has a perspective of the data unique to that attack, making it more efficient, accurate, and faster. This unique perspective, in the form of selected features, is provided by the ExtraTrees classifier.
Also, convergence is guaranteed by the Moore–Penrose pseudo inverse solution for H, as long as a sufficient number of hidden neurons is provided. We use 512 neurons to guarantee convergence.
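Putting the pieces together, the X-vs-all ensemble can be sketched as follows. The `BinaryELM` wrapper, the three toy classes, and the use of raw decision scores (in place of the thresholded 0/1 outputs that the softmax layer later disambiguates) are our own illustrative assumptions.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

class BinaryELM:
    """One ELM per class, trained on binary X-vs-all (0/1) targets."""
    def __init__(self, n_hidden, C, seed=0):
        self.n_hidden, self.C, self.seed = n_hidden, C, seed

    def fit(self, X, t):
        rng = np.random.default_rng(self.seed)
        self.W = rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = rng.normal(size=self.n_hidden)
        H = softplus(X @ self.W + self.b)
        self.beta = np.linalg.solve(
            np.eye(self.n_hidden) / self.C + H.T @ H, H.T @ t)
        return self

    def decision(self, X):
        return softplus(X @ self.W + self.b) @ self.beta

def train_ensemble(X, y, classes, n_hidden, C):
    # N + 1 ELMs: one per attack type plus one for normal traffic.
    return {c: BinaryELM(n_hidden, C, seed=i).fit(X, (y == c).astype(float))
            for i, c in enumerate(classes)}

# Three toy classes standing in for normal traffic and two attack types.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)
ensemble = train_ensemble(X, y, classes=[0, 1, 2], n_hidden=64, C=10.0)
scores = np.column_stack([ensemble[c].decision(X) for c in [0, 1, 2]])
pred = scores.argmax(axis=1)
acc = (pred == y).mean()
```

Because each detector is independent, the per-class `fit` calls can run in parallel, mirroring the parallel threads of Fig. 1.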
Layer 3: The softmax layer
The outputs of an ensemble of classifiers can be combined in several ways, such as averaging, voting, and the max operation, but only when all the classifiers share the same goal and the same perspective of the problem. Such techniques do not work when the global view of the problem is multiclass while the local view of each classifier is binary.
Each ELM in the ensemble returns a single output, either "0" or "1." Amalgamating these outputs into the final result is a challenge because the above-mentioned combination techniques do not apply. If exactly one ELM outputs "1," there is no problem; but consider a difficult input stream for which two ELMs output "1." Which one should we choose? Averaging, voting, and the max operation clearly cannot decide. To resolve this ambiguity, we use a softmax layer at the end to integrate the outputs of the ELM ensemble and produce a probability vector giving the probability of each type of attack.
The softmax layer employs the softmax function [42] which is a generalized form of the logistic function:
$$ f(y)_{j}=\frac{e^{y_{j}}}{\sum_{k=1}^{N}e^{y_{k}}} $$
(11)
To further increase the performance and accuracy of the system, the softmax layer is fine-tuned using the Adam optimizer [43]. The true classes, encoded as one-hot vectors, are fed as labels, and the input is the output of the ELM ensemble, so this stage behaves as a single-layer softmax classifier. The categorical cross-entropy loss, which works best with a softmax layer, is used here as well [44]:
$$ H(t,y)=-\sum_{i} t_{i}\log y_{i} $$
(12)
The fine-tuning runs for only 10 epochs, which is sufficient since a large portion of the classification task is done in the ELM ensemble stage. The softmax layer acts as a module that dispels ambiguity and makes the output interpretable, while adding a layer of abstraction to the model. The output of the final stage is a refined probability vector giving the probability of each type of attack and of normal traffic for each input instance stream.
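As a minimal sketch of this final stage, the softmax of Eq. (11) and the cross-entropy of Eq. (12) can be written directly in NumPy. The max-shift for numerical stability and the example scores are our additions, and the Adam fine-tuning loop is omitted:

```python
import numpy as np

def softmax(y):
    # Eq. (11), shifted by max(y) for numerical stability.
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(t, p):
    # Eq. (12): H(t, y) = -sum_i t_i log y_i, averaged over instances.
    return float(-(t * np.log(p)).sum(axis=-1).mean())

# Ambiguous case: two ELMs fire ("1") on the same input stream. The softmax
# turns the raw ensemble outputs into an interpretable probability vector.
raw = np.array([[0.0, 1.0, 1.0, 0.2]])
probs = softmax(raw)
onehot = np.array([[0.0, 1.0, 0.0, 0.0]])  # true class is the second one
loss = categorical_cross_entropy(onehot, probs)
```

The two firing ELMs receive equal, dominant probability mass, which is exactly the ambiguity the fine-tuned softmax weights learn to break.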