In this section, we present the details of the experiments designed to build and evaluate phishGILLNET, including the datasets employed, data preparation, training and testing strategies, and the measures used to evaluate performance.
Four publicly available email datasets and one publicly available phish URL dataset were used to evaluate phishGILLNET. The email datasets include (i) ham (good) emails from the SpamAssassin corpus, (ii) phishing emails from PhishingCorpus, (iii) good emails from the Enron Email Dataset, and (iv) spam emails from the SPAM Archive. The phish URL dataset is (v) PhishTank.
(i) SpamAssassin 
The SpamAssassin corpus contains a total of 6,047 messages, of which 4,150 are good and the remainder are spam. These messages were collected by the SpamAssassin project during 2002-2003 and made available to the research community. For evaluation in this study, the spam messages are not used; only the 4,150 good messages are used.
(ii) PhishingCorpus 
PhishingCorpus contains 4,550 phishing emails. These emails were collected by an individual for the period 2004-2007 and donated to the research community. For evaluation, all the phishing emails from this corpus were used.
(iii) Enron Email Dataset 
This dataset contains email from about 150 senior managers of Enron, made public by the Federal Energy Regulatory Commission during its investigation. It contains approximately 500,000 emails. Of these, we employed the 136,226 emails found in the inbox and sent folders of the mailboxes, thus ensuring only good emails from this corpus.
(iv) SPAM Archive 
The SPAM Archive contains spam emails collected by Bruce Guenter using various bait accounts since 1998. We used all spam emails from January 2011 through November 2011, which accounted for 336,070 emails and brought the total corpus size to approximately 470,000. The SPAM Archive does not distinguish between "spam" and "phishing" emails, making it an ideal dataset for evaluating the architecture using Co-Training, a semi-supervised algorithm that employs both labeled and unlabeled data.
(v) PhishTank 
PhishTank URLs are manually verified by human experts as confirmed phishing attacks. We collected 48,000 phish URLs from phishtank.com for the year 2011.
7.2 Data preparation
Two combinations of the public datasets were used to build and evaluate the PLSA model. The first set of experiments (combination1) employed datasets (i) & (ii), while the second set (combination2) employed datasets (iii) & (iv). Combination1 is a much smaller corpus than combination2: it contains a total of 8,700 messages, of which 4,550 are phishing and 4,150 are good emails.

While all emails in combination1 are labeled, combination2, specifically dataset (iv), does not distinguish between phishing and spam emails. In order to compute misclassification errors, phishing emails in the SPAM Archive were segregated using the following semi-automated approach. Hyperlinks in the emails were extracted using an HTML parser. SURBL provides a reputation lookup service for domains confirmed to host phishing sites. Combining the phishtank.com URLs with the SURBL domain reputation data, an email is labeled "phish" if any of its hyperlinks matches a SURBL domain or a PhishTank URL, as sketched below. This identified 47,783 phishing emails among the 336,070 spam emails, so the distribution of emails in combination2 is 10% phish, 61% spam, and 29% good.

According to Symantec's Internet Security Threat Report 2010, which collected and analyzed billions of emails from 2009, in a realistic mail system 85-90% of all emails are spam and 5-10% of all spam emails are phish. To give combination2 a realistic distribution, our experiments were therefore conducted with 10% phish, 80% spam, and 10% good emails. The resulting combination2 corpus contains 400,000 emails, ten times the size of the corpus used by Bergholz et al. and one of the largest email corpora used for phishing detection. Moreover, because we used public corpora, our results can be reproduced.
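The matching logic of this labeling step can be summarized in code. The following is a minimal sketch, assuming the SURBL domain list and the collected PhishTank URLs have already been loaded into in-memory sets; the class and helper names are hypothetical and not from the study.

```java
import java.net.URI;
import java.util.List;
import java.util.Set;

// Sketch of the semi-automated labeling: an email is labeled "phish" if any
// of its hyperlinks matches a known PhishTank URL or resolves to a domain
// blacklisted by SURBL; otherwise it remains labeled "spam".
public class PhishLabeler {
    private final Set<String> surblDomains;   // domains confirmed by SURBL
    private final Set<String> phishTankUrls;  // URLs collected from phishtank.com

    public PhishLabeler(Set<String> surblDomains, Set<String> phishTankUrls) {
        this.surblDomains = surblDomains;
        this.phishTankUrls = phishTankUrls;
    }

    /** Returns "phish" if any hyperlink matches, otherwise "spam". */
    public String label(List<String> hyperlinks) {
        for (String link : hyperlinks) {
            if (phishTankUrls.contains(link)) {
                return "phish";
            }
            try {
                String host = new URI(link).getHost();
                if (host != null && surblDomains.contains(stripSubdomains(host))) {
                    return "phish";
                }
            } catch (Exception ignored) {
                // malformed URL: skip this hyperlink
            }
        }
        return "spam";
    }

    // Reduce e.g. "login.example.com" to "example.com"; a real implementation
    // would consult a public-suffix list, here we simply keep the last two labels.
    private static String stripSubdomains(String host) {
        String[] parts = host.split("\\.");
        int n = parts.length;
        return n <= 2 ? host : parts[n - 2] + "." + parts[n - 1];
    }
}
```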
All messages were parsed using a MIME parser to separate the email headers from the email body. Multipart messages containing HTML parts were further parsed using an HTML parser to extract the body text and hyperlinks. Both the MIME and HTML parsers were written for this study in the Java programming language. Only messages containing body text and hyperlinks were considered for evaluation; messages that failed parsing, as well as attachments, were excluded from model building.
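The parsers used in this study were written from scratch; purely as an illustration, the fragment below sketches the same header/body/hyperlink separation using the off-the-shelf javax.mail and jsoup libraries as stand-ins, not the authors' parsers. Multipart traversal and error handling are elided.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: separate headers from the body, then pull visible text and
// hyperlinks out of an HTML body part.
public class EmailExtractor {
    public static void extract(InputStream rawEmail) throws Exception {
        Session session = Session.getDefaultInstance(new Properties());
        MimeMessage msg = new MimeMessage(session, rawEmail);

        String subject = msg.getSubject();              // header fields
        String from = String.valueOf(msg.getFrom()[0]);

        Object content = msg.getContent();              // body (single-part case)
        if (content instanceof String) {
            Document doc = Jsoup.parse((String) content);
            String bodyText = doc.text();               // visible body text
            List<String> links = new ArrayList<>();
            for (Element a : doc.select("a[href]")) {   // hyperlinks
                links.add(a.attr("href"));
            }
            // bodyText and links feed the TDF matrix builder (Section 3)
        }
        // Multipart messages would recurse over ((Multipart) content).getBodyPart(i)
    }
}
```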
7.3 Training and testing
Experiments were conducted using a k-fold cross-validation strategy with k = 10: 90% of the dataset was used for training and 10% for testing. To build the PLSA model, the training data were further split into 90% for building the topic model and 10% for computing perplexity, yielding independent datasets for training, perplexity computation, and testing. The TDF matrix builder (see Section 3) is used to build the term-document matrix for each set, and the topic distribution probabilities on the test set are derived using PLSA fold-in (see Section 4). Classification in phishGILLNET1 (see Section 8) is achieved using the Fisher similarity function, while phishGILLNET2 (see Section 9) employs AdaBoost and phishGILLNET3 employs AdaBoost with Co-Training (see Section 10).
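The splitting protocol can be made concrete with a short sketch. The helper below is illustrative only (the corpus size, seed, and even-fold assumption are ours, not from the paper): it partitions document ids into ten folds and carves each fold's 90% training portion into model-building and held-out perplexity sets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

// Sketch of the 10-fold protocol: each fold serves once as the 10% test set;
// the remaining 90% is split again 90/10 into PLSA training data and a
// held-out set used only to compute perplexity.
public class FoldSplitter {
    public static void main(String[] args) {
        int k = 10;
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 400_000; i++) ids.add(i);  // document ids
        Collections.shuffle(ids, new Random(42));      // fixed seed for reproducibility

        int foldSize = ids.size() / k;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> test = ids.subList(fold * foldSize, (fold + 1) * foldSize);
            List<Integer> rest = new ArrayList<>(ids);
            rest.removeAll(new HashSet<>(test));       // 90% training portion

            int heldOut = rest.size() / 10;            // 10% of training for perplexity
            List<Integer> perplexitySet = rest.subList(0, heldOut);
            List<Integer> trainSet = rest.subList(heldOut, rest.size());
            // trainSet -> build the PLSA topic model; perplexitySet -> model
            // validation; test -> final classification metrics
        }
    }
}
```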
7.4 Performance evaluation metrics
The quality of the PLSA model is evaluated using two measures of performance, namely, log-likelihood and perplexity. The training dataset is split into a set for building the model (training data) and a set (held out) for validating the model using these performance measures.
The log-likelihood on the training dataset can be computed using the following expression:

$$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log P(d, w)$$

where $n(d, w)$ is the number of times word $w$ occurs in training document $d$ and $P(d, w)$ is the joint probability of document $d$ and word $w$ under the model.
Perplexity, a measure of uncertainty in natural language models, gives a better assessment of how well the model generalizes to unseen (new) data. The lower the perplexity, the better the generalization and hence the classification. Perplexity for a PLSA model is defined by Hofmann [54, 55] as

$$\text{Perplexity} = \exp\left( - \frac{\sum_{d,w} n(d, w) \log P(w \mid d)}{\sum_{d,w} n(d, w)} \right)$$

where $n(d, w)$ is the number of times the word $w$ occurs in held-out document $d$ and $P(w \mid d)$ is the probability that word $w$ occurs in document $d$. One can see that the cost of evaluating $P(w \mid d) = \sum_{z} P(w \mid z) P(z \mid d)$ is proportional to the number of topics.
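For concreteness, the following sketch computes perplexity directly from this definition; the array layout (counts n[d][w] and folded-in probabilities pwGivenD[d][w]) is an assumption made for illustration.

```java
// Sketch: perplexity of a PLSA model on a held-out set, computed directly
// from the definition above. n[d][w] is the count of word w in held-out
// document d; pwGivenD[d][w] is the folded-in probability P(w|d).
public class Perplexity {
    public static double compute(int[][] n, double[][] pwGivenD) {
        double logSum = 0.0;
        double countSum = 0.0;
        for (int d = 0; d < n.length; d++) {
            for (int w = 0; w < n[d].length; w++) {
                if (n[d][w] > 0) {                     // zero counts contribute nothing
                    logSum += n[d][w] * Math.log(pwGivenD[d][w]);
                    countSum += n[d][w];
                }
            }
        }
        return Math.exp(-logSum / countSum);
    }
}
```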
The classification performance is measured using the following standard measures: Precision, Recall, F-measure, and Area Under the ROC Curve (AUC). The first three are defined as

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. The ROC curve plots the true positive rate against the false positive rate. Computing the AUC reduces this two-dimensional depiction of classifier performance to a single scalar value representing expected performance. The AUC of a classifier equals the probability that the classifier ranks a randomly chosen positive example higher than a randomly chosen negative example.
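These measures follow directly from the confusion counts, and the AUC follows from its probabilistic interpretation via the pairwise rank (Mann-Whitney) statistic. The helper below is an illustrative sketch, not the evaluation code used in the study.

```java
// Sketch: Precision, Recall, and F-measure from confusion counts, plus AUC
// computed as the fraction of (positive, negative) score pairs the classifier
// orders correctly, matching the probabilistic definition above.
public class Metrics {
    public static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    public static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }

    public static double fMeasure(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    /** AUC: probability a random positive scores higher than a random negative. */
    public static double auc(double[] posScores, double[] negScores) {
        double correct = 0.0;
        for (double p : posScores) {
            for (double n : negScores) {
                if (p > n) correct += 1.0;
                else if (p == n) correct += 0.5;   // ties count half
            }
        }
        return correct / (posScores.length * (double) negScores.length);
    }

    public static void main(String[] args) {
        System.out.println(fMeasure(90, 10, 20));                 // ~0.857
        System.out.println(auc(new double[]{0.9, 0.8, 0.4},
                               new double[]{0.7, 0.3, 0.2}));     // 8/9 ~ 0.889
    }
}
```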
The experiments conducted using these publicly available datasets and the performance of each layer of phishGILLNET are reported in the following sections.