Towards the application of recommender systems to secure coding

Secure coding is crucial for the design of secure and efficient software and computing systems. However, many programmers avoid secure coding practices for a variety of reasons. Some of these reasons are lack of knowledge of secure coding standards, negligence, and poor performance of and usability issues with existing code analysis tools. Therefore, it is essential to create tools that address these issues and concerns. This article features the proposal, development, and evaluation of a recommender system that uses text mining techniques, coupled with IntelliSense technology, to recommend fixes for potential vulnerabilities in program code. The resulting system mines a large code base of over 1.6 million Java files using the MapReduce methodology, creating a knowledge base for a recommender system that provides fixes for taint-style vulnerabilities. Formative testing and a usability study determined that surveyed participants strongly believed that a recommender system would help programmers write more secure code.


Introduction
Data breaches continue to plague organizations across the globe. The 2017 Cost of Data Breach Study conducted by the Ponemon Institute shows that the average total cost of a data breach is US$3.62 million [1]. One of the main causes of data breaches is code-level vulnerabilities [2,3]. A 2017 report by Tricentis shows that for 11 months in 2016, news articles reported at least 3 software failures per month that were caused by code-level vulnerabilities [4]. These statistics emphasize the need for improved security analytics techniques. Compounding the problem is the fact that many developers are skeptical of using existing code analyzers because of high false-positive rates, the time required to investigate inactionable alerts, and usability issues [5,6]. Further, a significant number of existing code analysis tools are designed to find bugs or vulnerabilities in program code, but many of these tools do not offer mitigation support to help programmers write secure code. If data breaches and other security-related issues are to be resolved, it is imperative that developers have useful and effective tools at their disposal to help them write secure code.
To address the secure coding problem, this research presents a recommender system that detects the presence of insecure program code and offers live recommendations that include fixes for vulnerabilities based on common practices in the security field, to make it easier for programmers to write more secure code. Recommender systems are software tools and techniques that provide suggestions for items that are most likely of interest to a particular user [7]. Traditionally, recommender systems have been applied to commodities such as books and CDs. Ricci et al. [7] noted that the attributes of the items recommended by classic content-based recommendation techniques are keywords extracted from the descriptions of the items.
The methodology presented in this work uses source code mining to extract hand-selected features that are used to detect vulnerabilities in program code and to select code examples that mitigate certain vulnerabilities. First, a repository of more than 14,000 open-source projects is mined, and features are extracted based on vulnerability descriptions provided in the National Vulnerability Database (NVD). Next, using the extracted features, datasets containing safe and unsafe examples are prepared and used as knowledge for a recommender system, which currently detects and assists with mitigating taint-style vulnerabilities. The recommender system was designed by taking into account input from participants in a knowledge elicitation survey. The classic recommendation approach is used to present code examples to the programmer that are most similar to the code being developed instead of using generic examples, which is the traditional practice.
The research question is whether a recommender system built using text-mining techniques can assist programmers with detection and mitigation of vulnerabilities as they type code during development. Targeting and correcting unsafe practices as programmers type code will help to catch bugs earlier than using traditional static and dynamic approaches. This work makes the following major contributions:
• The design, implementation, and evaluation of a recommender system that uses text mining techniques, coupled with IntelliSense technology, to recommend fixes for potential vulnerabilities in program code. The implemented system uses code running on Apache Hadoop to extract knowledge from a large body of open-source projects to provide features for detecting taint-style vulnerabilities.
• The use of a knowledge elicitation survey to determine the current use of code analyzers among programmers and to elicit their views on the design of the proposed system.
• A bipartite evaluation (scalability and usability) of the proposed system along with a discussion on the statistical significance of the usability results.
The article is organized as follows: related work is presented in Section 2 followed by an overview of the approach in Section 3. Section 4 provides a thorough discussion of modeling and detection. Section 5 discusses the methods followed to design and implement the proposed system. This section also presents a discussion on a knowledge elicitation survey that was conducted to obtain information that affects the design of the system as well as a usability study that ascertains the usability and usefulness of recommender systems in helping programmers write more secure code. Results and discussion of the user study and a scalability evaluation are presented in Section 6 followed by conclusions and future work in Section 7.

Related work
This section discusses works that are closely related to this research in the area of automated coding support, particularly static analysis, Dynamic Application Security Testing (DAST), also called dynamic analysis, and the auto-fixing of programming errors.

Static analyzers

Lightweight analyzers
Splint is a heuristics-based tool that finds potential vulnerabilities by checking to see that source code is consistent with the properties implied by annotations [8]. Splint is limited to American National Standards Institute (ANSI) C code and does not offer the functionalities required in agile and data-driven development environments.
FindBugs is a Java-based static analysis tool that is intended to find coding defects that developers will want to review and remedy [9]. The concept is based on bug patterns that can be found based on Java byte code [9]. FindSecBugs is a FindBugs plugin, which is geared towards security audits of Java web applications [10].
Alenezi and Javed proposed the Developer Companion framework to help developers produce secure web applications [11]. Developer Companion uses several static analysis tools to analyze program code, cross-references the results against the Common Weakness Enumeration (CWE) and NVD, and presents to developers a recommendation based on the aggregated data [11].

Tools that improve static analysis warnings/alerts
Some researchers have proposed tools and frameworks to prioritize alerts or vulnerabilities to make it easier for developers and managers to address the more critical issues [12,13].
The tool proposed in [12] is known as Autobugs, which gathers historic alert data from static analysis tools and combines the alert data with complexity metrics to build a classifier that predicts the actionability of an alert from data and unit properties [12]. Unfortunately, the author reported that models based on historic alert data could potentially mislead developers to believe they have no security issues [12].
In [13], a vulnerability management strategy, known as VULCON, is proposed to prioritize vulnerabilities for patching. Using two metrics (total vulnerability exposure and time-to-vulnerability remediation), the framework ingests vulnerability scan reports from code analyzers such as Nessus [14] and outputs security exposure metrics and vulnerability management plans to managers, operators, analysts and engineers, so they can decide on which vulnerabilities to remediate [13].

Static analyzers based on source code mining
Gopalakrishnan et al. [15] presented a bottom-up approach that recommends architectural tactics (a quality-attribute-response) based on topics discovered from source code projects. They used a classifier in addition to a recommender system to predict where tactics should be placed in a programming project to improve the quality, but not security, of the code.
In [16], Medeiros et al. presented the DEKANT tool that automatically detects web-based vulnerabilities using hidden Markov models (HMM). First, the tool extracts code slices from source code and translates these slices into an intermediate slice language (ISL). It then analyzes the representation to determine the presence of vulnerabilities in code written in PHP.

Vulnerable code pattern recognition using machine learning
In a survey of software vulnerability analysis and discovery using machine learning and data mining techniques, Ghaffarian and Shahriari categorized approaches into four main areas [17]. Of these four areas, the area most closely related to this work is "Vulnerable Code Pattern Recognition." Under this category, the work by Yamaguchi et al. [18] is related. In [18], the authors proposed a method that assists a security analyst with auditing source code. Abstract Syntax Trees (ASTs) are extracted from source code (C code) and then embedded in a vector space, such that techniques from machine learning can be applied to analyze the code. Further, latent semantic analysis is used to determine dominant directions (structural patterns) in the vector space from which code similar to a known vulnerability is identified and used to detect vulnerabilities.
In addition, Shar and Tan [19] produced a series of papers [19-22] on vulnerability detection and mitigation, each improving upon their previous work. The most related paper in their work is [19]. In [19], 20 static code attributes based on data-flow analysis of PHP web applications are proposed for predicting program statements that are vulnerable to SQL injection (SQLI) and cross-site scripting. The authors extracted control-flow graphs (CFG) and data-flow graphs (DFG) of a given PHP program and performed backward data-flow analysis on target sink statements that may reach certain input source statements [17]. The extracted attributes are used to create vectors, which are coupled with their known vulnerability status to train classifiers to predict the vulnerability status of unseen sink statements [19]. A source refers to an untrusted data source from which user input is received, and a sink is a security-sensitive function [23].

Dynamic analyzers
A plethora of tools [24,25] exist in the dynamic analysis domain, the majority of which are commercial. Interestingly, a great deal of focus in DAST is devoted to web applications [26-28]. Huang et al. proposed a crawler that allows for a black-box, dynamic analysis of web applications [26]. Using reverse engineering (to identify all possible points of attack within a web application) and a fault injection process, the tool attempts to determine the most vulnerable points within an application [26].
In addition, Petukhov and Kozlov proposed an extended tainted mode model that incorporates the advantages of penetration testing and dynamic analysis to widen the scope of the web application being covered during testing [28].
Since dynamic analysis involves testing application behavior, some researchers believe it is a more realistic approach than static analysis [29]. However, the main challenge with dynamic tools is identifying the source of a bug [6]. Bugs often manifest themselves as program crashes and this makes them difficult to mitigate.

Dynamic analyzers based on AI/machine learning
In [30], the authors described a tool, known as HACKAR, that uses an improved version of Java PathFinder (JPF) to execute Java programs and identify vulnerabilities. The tool is a dynamic analyzer that formulates a problem using Satisfiability Modulo Theory (SMT) and uses symbolic execution to determine program paths that may lead to vulnerabilities. In addition, HACKAR uses a goal regression technique proposed by [31] to learn the semantics of tasks based on program traces in order to produce a knowledge base for providing advice to programmers on how to fix vulnerabilities.

Automated code repair and auto-completion
Several works exist in the area of automated code repair and code completion. In 2009, the first tools, ClearView [32] and GenProg [33], that perform automated code repair on real-world programs were demonstrated [34]. Since that time, focus on automated code repair has grown steadily with several other tools being developed, each either proposing an improvement on an existing methodology or a unique algorithm (e.g., SPR [35], Kali [36], AE [34], and Prophet [37]). In [34], Weimer et al. categorized existing code repair tools into two categories: those that use stochastic search or produce multiple candidate repairs, which are validated using test cases (e.g., GenProg, PAR [38], AutoFix-E [39], ClearView, and Debroy and Wong [40]), and techniques that use synthesis (e.g., SemFix [41]) or constraint solving (and symbolic execution) to produce a single patch that is correct by construction (e.g., AFix [42], FUZZBUSTER [43], FUZZBALL, and FUZZBOMB [44]). Since many of these tools require test cases to operate, they fit well in the area of dynamic analysis.
Raychev et al. proposed an approach that learns a probabilistic model from existing annotated program data and uses this model to predict properties of new, unseen programs [45]. The authors also created a scalable prediction engine called JSNICE that predicts names of identifiers and type annotations of variables. That is, given optimized, minified JavaScript code, JSNICE generates JavaScript code that is annotated with types and identifier names.
In [46], Gupta et al. proposed the DeepFix algorithm that uses a multi-layered sequence-to-sequence neural network to fix common programming errors (e.g., missing declarations or statements, missing identifiers, and undeclared variables) in C code [46]. The neural network comprises an encoder recurrent neural network (RNN) to process the input and a decoder RNN with attention that generates fixes using an iterative process [46].
There are also linters (e.g., SonarLint [47]), code quality analyzers (e.g., ASIDE [48]), and code-clone detection tools (e.g., XIAO [49]) that attempt to improve the quality of code within integrated development environments (IDEs). XIAO is a tool that helps to deal with the issue of code cloning, where programmers may have repetitious code within their coding project. The premise is that detecting code clones can be useful in finding similar security bugs and also improves the quality of code through refactoring of code clones [49]. Baset and Denning showed that SonarLint and many existing IDE-based tools (e.g., ESVD [50]) present short descriptions of common programming errors, but do not provide example fixes for security-related vulnerabilities [51].
In [52], Raychev et al. presented an approach to code completion based on a novel combination of program analysis with statistical language models. Given a codebase, their system first extracts abstract histories in the form of sentences from the data. Then, these sentences are fed to a language model such as an n-gram model or recurrent neural network model that learns probabilities for each sentence. Also, in [53], the authors described an architecture that allows library developers to introduce interactive and highly specialized code generation interfaces, called palettes, directly into the editor. Both of these code completion approaches are based on system design and sentence suggestion and have not been applied to vulnerability detection and mitigation.

Difference between the proposed approach and existing approaches
The methodology proposed and implemented in this research couples text mining algorithms and IntelliSense techniques to analyze program code as the programmer types, compares the user's code with a knowledge base of unsafe practices to determine the presence of unsafe code, and recommends fixes by providing ranked example code to the programmer during development. IntelliSense, also known as code-completion or code-hinting, refers to productivity features that help programmers learn about their code by keeping track of parameters and providing the ability to add properties to code during development.
While [30] uses goal regression to learn about the user program, it requires that the program be symbolically executed in order to find vulnerabilities. As discussed in the literature [54], symbolic execution suffers from path explosion, path divergence and challenges with complex path constraints, especially on real-world problems. This presents challenges with the generalizability of the solution, as confirmed by the authors [30].
In [16], an intermediate language is required to annotate tainted functions in the code. In contrast, the proposed model in this research works directly with the parse tree of the source code to detect patterns for automatic detection and classification of vulnerabilities based on descriptions and fixes recommended by the NVD. Further, the proposed approach mines a large code base and uses the safe examples to provide not only advice but also example fixes to the programmer.
The proposed approach differs significantly from the generate-and-patch approaches discussed in the preceding section because patches often work for a given set of test cases, but fail to generalize to other programming projects. For example, in [36], Qi et al. analyzed reported patches for GenProg, RSRepair, and AE, and found that the overwhelming majority of reported patches did not produce correct outputs even for the inputs in the validation test suite [36]. GenProg was reported to find patches for 37 out of 55 defects in a validation suite. However, the researchers found that patches did not produce correct output. Likewise, AE was reported to find patches for 27 out of 54 defects, but did not produce correct outputs in the evaluation conducted by Qi et al. Further reruns by the authors confirmed that GenProg found correct patches for only 2 out of 105 defects. Qi et al. attributed the poor performance of these tools to weak proxies (bad acceptance tests), poor search spaces that do not contain correct patches, and random genetic search that does not have a smooth gradient for the genetic search to traverse to find a solution [36]. These weaknesses highlight the challenge with generate-and-patch or generate-and-validate approaches.
Further, unlike the works that propose stand-alone static analysis tools [18,19], the proposed work augments static analysis with IntelliSense to drive the mitigation process within IDEs as the programmer types code. As discussed in the literature [5,55] and confirmed by the knowledge elicitation survey conducted in this work (see Section 5.2), a majority of developers surveyed do not take advantage of stand-alone static analysis tools. Even though these tools may perform well, they require the extra time of going outside of the development environment to perform scans and explore mitigation approaches. However, this new proposed methodology of coupling vulnerability scanning with IntelliSense provides live scanning and mitigation without significantly affecting the developer's coding experience. In addition, by using a recommender system, this work shows that providing the programmer with a ranked set of examples that are most similar to the code being developed allows the programmer to better understand vulnerabilities as they relate to their projects. Other auto-fixing approaches (e.g., DeepFix and generate-and-patch) that automatically transform program code do not provide the programmer with examples that are very similar to the code being developed. Moreover, the unique presentation of information in the form of recommendations has the added benefit of educating programmers on how to avoid certain vulnerabilities in future projects.

Proposed approach
The approach consists of two main phases (modeling and application) and two main components (the data analyzer and the recommender system) as shown in Fig. 1. Here, each component is described. A more thorough discussion of the modeling phase is provided in Section 4 while Section 5 covers the application phase (system design and implementation).
The first phase in the proposed approach is the modeling phase. This phase involves analyzing data collected from the National Vulnerability Database (NVD) in addition to open-source programs to identify features for detecting a set of vulnerabilities. These features are then used by a data analyzer to process program code using simple and effective, data-driven vulnerability detectors to detect each vulnerability. The approach currently focuses on the Java programming language but is general enough to apply to other programming languages.
The second phase involves capturing code as the programmer types and transferring it to the recommender system, which executes vulnerability detectors that categorize the program code based on the knowledge of the recommender system and then outputs recommendations that include examples for fixing each vulnerability.

The data analyzer
The data analyzer consists of feature extractors that are designed based on vulnerability descriptions and fixes from the NVD. The analyzer accepts as input open-source program code and outputs feature sets for detecting a set of vulnerabilities. Open-source projects are mined and source code is categorized in order to provide knowledge to the recommender system for detecting and mitigating each vulnerability. Recommender systems require sufficient data in order to effectively provide useful recommendations to users. Therefore, a distributed framework such as MapReduce is proposed to extract features from a large collection of code repositories to drive the data labeling process. Labeled datasets are used to train the recommender system to provide to the programmer safe code examples that fix a set of vulnerabilities.

The recommender system
The recommender system incorporates vulnerability detectors that are designed using key insights about vulnerabilities based on data provided by NVD and CWE. It accepts the user's code and utilizes the data analyzer to create a feature set/data object from the given program code. The feature set is used to determine the classification of the data object. If the data object is unsafe, a recommendation is displayed to the user that includes a warning listing the unsafe method(s) and variable(s) found in the user's code. The recommendation will also include ranked fixes for each vulnerability. Fixes are ranked using text similarity schemes in order to display a list of examples that resemble the code being developed. IntelliSense technology is used to initiate the recommender system as the programmer types in order to help the programmer mitigate potential vulnerabilities as soon as possible.

Fig. 1 System architecture
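The similarity-based ranking of fixes described in this subsection can be illustrated with a minimal sketch. The text does not name the specific similarity scheme, so the sketch below assumes Jaccard similarity over identifier-like tokens; the class `FixRanker` and its methods are hypothetical names introduced only for illustration and are not taken from the implemented system.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: ranks candidate fix examples by token overlap with the user's code.
// Jaccard similarity is an assumed stand-in for the unspecified "text similarity schemes".
class FixRanker {

    // Split code into lowercase identifier-like tokens.
    static Set<String> tokens(String code) {
        Set<String> result = new HashSet<>();
        for (String t : code.toLowerCase().split("[^a-z0-9_]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union, in [0, 1].
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Return the examples sorted from most to least similar to the user's code.
    static List<String> rank(String userCode, List<String> examples) {
        Set<String> userTokens = tokens(userCode);
        List<String> ranked = new ArrayList<>(examples);
        ranked.sort(Comparator.comparingDouble(
                (String example) -> jaccard(userTokens, tokens(example))).reversed());
        return ranked;
    }
}
```

Calling `FixRanker.rank(userCode, safeExamples)` returns the safe examples ordered so that the one sharing the most tokens with the code being typed comes first. A token-set measure is used here only because it is easy to inspect; other similarity measures would fit the same interface.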

Modeling and detection
This section discusses the modeling and vulnerability detection phase of the work. It provides a detailed explanation on data representation and feature extraction. Included is a discussion on the feature extraction algorithms and the steps followed to prepare the knowledge base for the recommender system.

Datasets
Two main datasets (The National Vulnerability Database/Common Vulnerabilities and Exposures (NVD/CVE) and Sourcerer 2011) are used in this work to provide vulnerability descriptions that are important for feature extraction and source code from which feature sets and mitigation examples can be extracted.
The National Vulnerability Database (NVD/CVE).
CVE is a dictionary of common identifiers for publicly known cybersecurity vulnerabilities, which is hosted by the MITRE Corporation [56]. CVE submissions are made after vulnerabilities are identified in widely used software applications. Each submission is reviewed by a team of experts and is assigned a unique identifier (CVE ID) by a CVE Numbering Authority (CNA), a description, and references. The US National Vulnerability Database is a "comprehensive cybersecurity vulnerability database that integrates all publicly available US Government vulnerability resources and provides references to industry resources" [56]. NVD is provided by the National Institute of Standards and Technology (NIST). NVD enhances the information in CVE to deliver more details for each CVE entry such as fix information, severity scores, and impact ratings according to a Common Vulnerability Scoring System (CVSS) [57].
The Sourcerer 2011.
The Sourcerer 2011 dataset is a collection of artifacts based on over 70,000 Java projects and approximately 100,000 Java ARchive (jar) files that were collected from Apache, Google Code and Sourceforge in 2011 [58]. The dataset is divided into four tar archives, identified as aa to ad. Each of these archives contains varying numbers of projects, which are numbered in a sequential manner. Each project is then organized into a cache of important files, the content, which follows the organization system used by the developers, and a project.properties file, which contains information such as the repo URL and author.
The Java files are processed and used to create the ground-truth for classification and to provide mitigation examples.

Data representation
Each Java file used in this work is modeled as an Abstract Syntax Tree (AST). An AST is a hierarchical intermediate representation of a program that presents source code structure according to the grammar of a given programming language [59]. It is a reduced parse tree in which nodes are connected through parent-child relationships. The construction of an AST begins with a node that represents the entire translation/compilation unit followed by a number of intermediate levels, then simple language constructs such as type name, identifier name, or operator as the leaf nodes [59].
The JavaParser library is used to construct and traverse an AST from Java source code. JavaParser is an open-source library that allows native Java interaction with an AST generated from Java source code [60].

Feature extraction
Features for detecting vulnerabilities were identified after careful manual analysis of vulnerability descriptions provided by the NVD/CVE. Apache Hadoop was utilized as a MapReduce environment running custom code to process the Sourcerer dataset in order to extract features for detecting the vulnerabilities. MapReduce is a programming model and an associated implementation for processing and generating large datasets [61]. The Apache Hadoop software library is one of the most popular implementations of the MapReduce methodology that allows for the distributed processing of large data sets across clusters of computers using a simple programming model [62].

MapReduce algorithm for feature extraction
The MapReduce algorithm that was implemented for execution in Apache Hadoop is shown in Algorithm 1. Based on the structure of the Sourcerer dataset, it was necessary that the repository be organized before processing using Hadoop. Bash scripts were used to parse each project.properties file within each project in the repository to extract information about each project in order to create a more uniform file structure. Java files were reorganized such that there is one directory for each project. The filenames were later used as keys for the MapReduce framework. Since Hadoop splits data files according to a default block size, a custom record reader was employed, as shown in the algorithm (line 3), to read each Java file without splitting it. This enabled complete and accurate creation of an AST from each file. Moreover, each vulnerability requires a different buildFeatureSet procedure (shown on line 16 of the algorithm). This procedure is discussed below for each of the vulnerabilities evaluated in this work.
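The key/value shape of the map step described above (file name as key, whole file as one record, one feature set per file) can be sketched without Hadoop. The following is a plain-Java, in-memory stand-in and not the Hadoop job itself: the class `FeatureExtractionSketch`, the `toyFeatures` builder, and the feature names are hypothetical, and the substring checks merely take the place of the AST traversal that the actual buildFeatureSet procedure performs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// In-memory stand-in for the map step (no Hadoop dependency).
// Assumption: each input record is one complete Java file, keyed by its file name.
class FeatureExtractionSketch {

    // Applies a vulnerability-specific feature builder to every whole file
    // and emits (file name -> feature set) pairs.
    static Map<String, Map<String, Boolean>> run(
            Map<String, String> files,
            Function<String, Map<String, Boolean>> buildFeatureSet) {
        Map<String, Map<String, Boolean>> output = new LinkedHashMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            output.put(file.getKey(), buildFeatureSet.apply(file.getValue()));
        }
        return output;
    }

    // Placeholder feature builder: plain text checks instead of an AST traversal.
    // The feature names are illustrative only.
    static Map<String, Boolean> toyFeatures(String source) {
        Map<String, Boolean> features = new LinkedHashMap<>();
        features.put("uses_prepared_statement", source.contains("PreparedStatement"));
        features.put("calls_runtime_exec", source.contains(".exec("));
        return features;
    }
}
```

Passing a different function as `buildFeatureSet` mirrors the statement that each vulnerability requires its own feature-building procedure, while the surrounding loop stays the same.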

Extracting features for detecting taint-style vulnerabilities
This work uses two taint-style vulnerabilities (SQL Injection and Command Injection) to evaluate the proposed methodology.These vulnerabilities were chosen due to their high CWE severity score and frequency in the 2017 version of the NVD as shown in Fig. 2. Taint-style vulnerabilities are caused by the lack of input/output validation and are traditionally modeled as source-sink problems.

CWE-89-improper neutralization of special elements used in an SQL command ('SQL injection').
An SQL injection (SQLI) attack is one that occurs when an attacker provides specially crafted input to an application that employs database services such that the provided input results in a different database request than was intended by the application programmer [63]. SQLI has been a common vulnerability for many years, securing position number one on the Open Web Application Security Project (OWASP) 2010 [64], 2017 [65], and the CWE 2011 [66] lists. Applications (e.g., web-apps) generally accept user input, which is then used in executing database requests. These requests are typically SQL statements. SQLI is a serious vulnerability because it could lead to unauthorized access to sensitive data, cause severe updates to or deletions from a database, and even result in devastating shell command execution [67]. Listing 1 features sample code that could potentially result in SQLI. This is because the programmer is incorporating unsanitized variables in the creation of a query string.
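As a hedged illustration of the pattern just described (not a reproduction of Listing 1), the sketch below contrasts a query built by string concatenation with the commonly recommended parameterized form using the standard `java.sql.PreparedStatement` API. The class name, the `users` table, and the `name` column are assumptions made for the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative only: table and column names are hypothetical.
class SqlInjectionExample {

    // Unsafe: the user-supplied value becomes part of the SQL text itself.
    static String buildUnsafeQuery(String userName) {
        return "SELECT * FROM users WHERE name = '" + userName + "'";
    }

    // Safer: the value is bound to a "?" placeholder and is never parsed as SQL.
    static boolean userExists(Connection conn, String userName) throws SQLException {
        String sql = "SELECT * FROM users WHERE name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, userName);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    public static void main(String[] args) {
        // A benign value and a crafted value that changes the meaning of the query.
        System.out.println(buildUnsafeQuery("alice"));
        System.out.println(buildUnsafeQuery("' OR '1'='1"));
    }
}
```

With the crafted input, the concatenated query text gains an extra `OR` condition that the programmer never wrote, which is the change in the database request that characterizes SQLI.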
The use of the PreparedStatement class from Java Database Connectivity (JDBC) or Java Enterprise Edition (J2EE) is often recommended as a fix for SQL injection [68]. This class allows for the use of a placeholder ("?" character) to create a parametric query that escapes potentially tainted user input. Using these clear descriptions of the vulnerability and how it can be mitigated, six main hand-selected features for detecting and classifying SQLI can be identified. These features are described in Table 1, and the algorithm used to build the feature set is presented as Algorithm 2. A list of known Java sources and sinks was obtained from online resources [69-71]. These known sources and sinks are used as a point of reference along with static dataflow analysis of the user program to identify potentially tainted variables. A variable is considered potentially_sanitized if it is passed to a function that is not in the list of known tainted sources. Three techniques are used to check for potential sanitization throughout a given program file: inline (sanitization done during the creation of a query string), in-method (sanitization done as soon as a parameter is passed to a method), and before-use (sanitization of parameters before they are passed to methods that invoke query functions). By using the list of generated features, a data instance in the dataset is automatically labeled as safe if the boolean feature quoted_variables_found is false, the incoming variables are potentially_sanitized, and parameterized queries are used to create the SQL statements. Otherwise, it is labeled as unsafe. A random sample of 100 labeled instances was tested and no errors were found, giving a 99% confidence, which indicates the effectiveness of the selected features.
Live detection 1 Create an AST from the Java program file.
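The automatic labeling rule for SQLI stated above reduces to a conjunction of three conditions, which can be written as a short sketch. The class `SqliLabeler` and the parameter name `parameterizedQueryUsed` are hypothetical; the other two parameters correspond to the quoted_variables_found and potentially_sanitized features named in the text, and the remaining features of Table 1 are not modeled here.

```java
// Minimal sketch of the SQLI labeling rule; not the full six-feature set of Table 1.
class SqliLabeler {

    enum Label { SAFE, UNSAFE }

    // An instance is safe only if no quoted variables were found, the incoming
    // variables are potentially sanitized, and parameterized queries are used.
    static Label label(boolean quotedVariablesFound,
                       boolean potentiallySanitized,
                       boolean parameterizedQueryUsed) {
        boolean safe = !quotedVariablesFound
                && potentiallySanitized
                && parameterizedQueryUsed;
        return safe ? Label.SAFE : Label.UNSAFE;
    }

    public static void main(String[] args) {
        System.out.println(label(false, true, true));
        System.out.println(label(true, true, true));
    }
}
```

Because the rule is a strict conjunction, failing any single condition is enough for an instance to be labeled unsafe.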

CWE-78-improper neutralization of special elements used in an operating system (OS) command ('OS command injection').
Command injection is an attack in which the goal of the attacker is to execute arbitrary commands on the host operating system via a vulnerable application [72]. As the name suggests, these commands are typically targeted to the command shell, which is a software program that provides direct communication between the user and the operating system [73]. The commands supplied by the attacker are usually executed with the same privileges as the vulnerable application.
In Java applications, calls to the Runtime.exec(...) method could be exploited to allow an attacker to run arbitrary commands on the host operating system.Listing 2 shows example code that is vulnerable to command injection.This is because it utilizes the Windows command shell (cmd.exe) to execute the dir command without proper sanitization.After careful analysis of this vulnerability, four main features have been manually selected for detection and classification.Table 2 describes each feature while Algorithm 3 outlines the buildFeatureSet procedure.From the feature set, the following heuristic can be used to automatically categorize the dataset for command injection: if shell commands are present and unsanitized, arguments/variables are used in the command string or any faulty characters are used in the command string, label the data instance as unsafe.Otherwise, label the instance as safe.A random sample of 100 labeled instances was also tested, showing no errors (99% confidence) in the labels assigned to the command injection dataset.
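The command injection labeling heuristic can be sketched as a small boolean function. This is one reading of the rule stated above, under the assumption that the first condition requires both a shell command and unsanitized arguments; the class name and the three parameter names are hypothetical and do not correspond to the feature names in Table 2.

```java
// Illustrative sketch only; the feature names are assumptions.
final class CommandInjectionLabeler {
    // Unsafe if a shell command is invoked with unsanitized arguments/variables,
    // or if faulty characters appear in the command string.
    static String label(boolean shellCommandPresent,
                        boolean unsanitizedArgsInCommand,
                        boolean faultyCharsInCommand) {
        boolean unsafe = (shellCommandPresent && unsanitizedArgsInCommand)
                || faultyCharsInCommand;
        return unsafe ? "unsafe" : "safe";
    }

    public static void main(String[] args) {
        // A cmd.exe call that embeds an unsanitized variable is labeled unsafe.
        System.out.println(label(true, true, false));
    }
}
```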

Results of the text-mining process
To prepare data for the knowledge base within the recommender system, the MapReduce algorithm was implemented in Java and executed in Apache Hadoop. The data in Part aa of the Sourcerer 2011 dataset was used to create the knowledge base. Table 3 summarizes the distribution of the projects within the subset of the dataset that was analyzed. Specifically, the Sourceforge projects and Google Code projects were processed to create training data and test data, respectively. Table 4 shows the breakdown of the training and testing samples.

Methods
This section describes the methods followed in designing and implementing the system. First, initial ideas on the requirements and design of a useful and effective code analysis tool are delineated. Next, the steps involving a knowledge elicitation survey that was conducted to empirically ascertain the current use of code analyzers among programmers and to elicit their views on the design of the proposed system are presented. Finally, a discussion on the impact of the survey on the final design of the system is provided.

Initial system design
Due to the observation that many programmers are skeptical of using existing code analyzers, the following requirements are worth considering during the design of a new system:
• The system must be a part of the IDE to enable effective scanning and mediation
• The warnings should be brief and actionable (links to more detailed information should be provided for interested users)
• Emphasis should be placed on fixing the potential vulnerabilities and encouraging good programming practice
• The fixes should not be generic but as specific as possible to the project being developed
• Scanning of vulnerabilities should be done such that the programmer's productivity is not negatively impacted
By using this non-exhaustive list of requirements, a mockup of the proposed system was created (see Fig. 3). The proposed tool is called VulIntel, short for Vulnerability IntelliSensor. The tool is intended to be part of the IDE and uses IntelliSense technology to scan code as the programmer types. A list is populated with the names/IDs of potential vulnerabilities. Clicking on a vulnerability in the list displays a brief description of the vulnerability including a reference to the unsafe method and variables involved. Further, a ranked list of examples is presented to the user to help with mitigation.

Knowledge elicitation survey
It is important to solicit feedback for any system design to satisfy usability requirements as well as to answer questions that will assist with development. Consequently, an online knowledge elicitation survey was conducted with the main goal of obtaining formative feedback on the design of the proposed interface and the views of programmers about a tool that utilizes IntelliSense technology to find vulnerabilities in program code and provides recommended fixes for detected vulnerabilities. Approval to conduct the study was obtained from the Institutional Review Board at Florida Institute of Technology. The results from the survey are summarized below and the survey questions are included in Appendix A.1.

Participants
To recruit a diverse population of participants, invitation emails with a link to the survey were sent to individuals of various experience levels in industry and academia. The list consisted of more than 10 organizations from countries that included the USA, Brazil, Germany, and the UK.
The main criteria for participants was that they have at

Familiarity with programming languages and IDEs
Participants were asked to select their familiarity with a set of programming languages from a list that uses a 5-point Likert scale. Participants' main language of choice was the Java programming language, with 40% indicating that they are "very familiar" with it and 25% claiming to be "experts" (see Fig. 5). The IDE that scored the highest in use frequency (84.62%) among participants was Eclipse. This was followed by Visual Studio with 73.08% and NetBeans with 61.52%. The results are summarized in Figs. 4 and 5.

Results and discussion
The answers to four overarching questions that the survey was designed to address are discussed below along with a summary of themes that emerged from the survey.
(1) To what extent are programmers using code analyzers? To answer this question, participants were asked whether they performed static and/or dynamic analysis on their code and how useful they found the given recommendations. Of the participants, 13.46% stated that they used a static analyzer such as FindBugs, 3.85% used a dynamic analyzer such as Java PathFinder, 9.62% used both dynamic and static analyzers, and 56.73% reported that they did not scan their code for vulnerabilities.
(2) How useful are the advice/recommendations provided by existing tools? This question was presented to participants who indicated that they currently take advantage of existing code analyzers. Of this group of participants, 25.81% described the recommendations they received from the scanners they used as "helpful" and 67.74% reported that the advice given was "somewhat helpful" in fixing vulnerabilities.
(3) Would programmers utilize a tool that uses IntelliSense technology to find and suggest fixes for vulnerabilities? Participants were first asked if they currently take advantage of IntelliSense technology. Sixty-eight participants (68%) reported that they currently utilize the technology while 32 (32%) did not; 4 participants skipped the question. In addition, the participants were asked their opinion about the application of IntelliSense technology to vulnerability detection. Eighty-seven of the participants (87%) indicated that they would appreciate a system that can scan their code for vulnerabilities as they code; 10 (10%) were not interested in the technology but believed other programmers may be interested; 3 participants did not believe it would be a good idea to apply IntelliSense to vulnerability detection, and 4 skipped the question.
(4) What are the design criteria and expectations for a tool that scans code for vulnerabilities and presents fixes to the user? The participants were then shown the mockup (see Fig. 3) of the proposed tool and asked in what situations and for what types of projects they would utilize it. The responses are summarized in Figs. 6 and 7. Moreover, they were asked their opinion about

Themes that emerged from the survey
Several important themes stood out in the responses provided by participants in the knowledge elicitation survey as evaluated using the grounded theory approach [74]. From the list of vulnerabilities provided by the participants, SQL injection, buffer overflows, and the OWASP list of vulnerabilities are well-known and important to programmers. However, there are other vulnerabilities that are often overlooked by programmers but could pose significant risks. For example, Fig. 2 shows that hard-coded credentials (CWE-798) and missing encryption (CWE-311) account for dozens of vulnerabilities in the 2017 NVD release, yet these vulnerabilities were not mentioned by any participant.
Three main themes emerged from the open-ended responses that were provided by the participants:
Theme 1: usability
Some participants were concerned about the number of objects on the proposed UI. They suggested that while updates are important, the "news updates" panel adds clutter to the interface and should be minimized if possible.
Theme 2: performance
While some participants were in favor of scanning being done in the background, a few of them were concerned about the impact this may have on the code editor. For example, one participant submitted the following response: "I like that it tells you security vulnerabilities as you type. I am a little concerned about how efficient scanning for these vulnerabilities might be. I would most likely stop using it if it slowed down my editor."
Theme 3: fixing vulnerabilities
A number of participants commented on the plugin's proposed ability to provide fixes for the vulnerabilities that it finds. One participant provided the following feedback: "Really helpful as it provides you with multiple fixes and examples and visually appealing."

Final system design
The aforementioned themes were used to influence the design of the final system. For example, the theme of usability helped to declutter the interface. First, the knowledge base within the recommender system was updated with knowledge from open-source projects as discussed in Section 4.3. The model was serialized and imported into an Eclipse plugin. The Eclipse IDE was chosen because of its familiarity among surveyed programmers as discussed earlier. Figure 8 shows a screenshot of the final system as an Eclipse plugin. The design of the plugin was influenced by the responses received in the knowledge elicitation survey. IntelliSense technology was utilized by extending the Eclipse Code Recommenders [75] system, which is a fundamental component within the Eclipse intelligent code completion framework. The IntelliSense system was programmed to initiate the scanner after the user enters or removes at least five characters, excluding spaces. This behavior was chosen after experimenting with options such as after method completion or after entering or removing at least 10 characters.
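The scanner trigger described above can be sketched as a small counter. This is an illustration under assumptions rather than the plugin's code: the class name ScanTrigger and the onEdit callback are hypothetical, and the sketch assumes that inserted and removed characters accumulate across edits until the threshold of five is reached.

```java
// Illustrative sketch only; not the Eclipse plugin's actual trigger logic.
final class ScanTrigger {
    private static final int THRESHOLD = 5; // five characters, as stated in the text
    private int pending = 0;

    // Records one edit and reports whether a scan should be started now.
    boolean onEdit(String insertedText, String removedText) {
        pending += countNonSpace(insertedText) + countNonSpace(removedText);
        if (pending >= THRESHOLD) {
            pending = 0; // reset the counter once a scan is triggered
            return true;
        }
        return false;
    }

    // Counts characters other than the space character.
    private static int countNonSpace(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != ' ') {
                n++;
            }
        }
        return n;
    }
}
```

A lower threshold would scan more often at a higher cost to editor responsiveness, which is the concern raised under the performance theme.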

Recommending fixes for vulnerabilities
It is of interest to use the vulnerability-safe (negative) examples from the labeled corpora to provide recommendations to help programmers fix the detected vulnerabilities. Several questions arise in determining a similarity scheme that finds code that is similar to the user's code but is safe against the vulnerabilities found in the user's code. For example, what is the best trade-off between the time taken to find similar code and how syntactically relevant and semantically helpful that code is to the user? To answer this and other questions, experiments were conducted using three text similarity schemes (cosine similarity, MinHash, and SimHash) in order to select one that takes the least amount of time to find relevant examples.
The cosine similarity between two vectors (or two programs) is a measure that calculates the cosine of the angle between them irrespective of the magnitude of the vectors. In this work, the vectors represent the term frequencies of terms that are common between two programs (methods). The vectors were created by using Apache Lucene [76] to tokenize the Java code and remove Java keywords and other English stop words from the code.
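A minimal sketch of cosine similarity over term-frequency vectors follows. It does not use Apache Lucene: the regular-expression tokenizer is a simplified stand-in that performs no keyword or stop-word removal, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; the tokenizer is a stand-in for Lucene.
final class CosineSketch {
    // Builds a term-frequency map from identifier-like tokens.
    static Map<String, Integer> termFrequencies(String code) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : code.toLowerCase(Locale.ROOT).split("[^a-z0-9_]+")) {
            if (!tok.isEmpty()) {
                tf.merge(tok, 1, Integer::sum);
            }
        }
        return tf;
    }

    // cos(a, b) = (a . b) / (|a| |b|); only shared terms contribute to the dot product.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double v = e.getValue();
            normA += v * v;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += v * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector has no direction
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Because the measure ignores vector magnitude, a short method and a longer method with the same relative term usage score close to 1.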
MinHash is a Locality Sensitive Hashing (LSH) technique based on the min-wise independent permutations of sets. The goal of MinHash is to estimate the Jaccard similarity quickly without explicitly computing the intersection and union of the sets. Jaccard similarity is the ratio of the number of elements in the intersection of two sets to the number of elements in the union.
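The following sketch shows one common way to implement MinHash, using random linear hash functions as an approximation of min-wise independent permutations. The class name, the number of hash functions, the seed, and the use of String.hashCode are assumptions made for illustration; the exact Jaccard computation is included only for comparison.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Illustrative sketch only; parameters and hash family are assumptions.
final class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a;
    private final long[] b;

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt((int) PRIME - 1); // multiplier in [1, PRIME - 1]
            b[i] = rnd.nextInt((int) PRIME);         // offset in [0, PRIME - 1]
        }
    }

    // Signature: for each hash function, the minimum hash value over the set.
    long[] signature(Set<String> tokens) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String t : tokens) {
            long x = t.hashCode() & 0x7fffffffL; // non-negative 31-bit value
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Estimated Jaccard similarity: fraction of signature positions that agree.
    static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }

    // Exact Jaccard similarity: |intersection| / |union|.
    static double jaccard(Set<String> x, Set<String> y) {
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }
}
```

Signatures can be computed once per stored example, so comparing a user's code against the knowledge base only requires comparing fixed-length arrays.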
SimHash is also an LSH technique, in this case for the cosine similarity measure, that maps high-dimensional vectors to small fingerprints [77]. It is based on the concept of Signed Random Projections (SRP), which transforms a multi-dimensional vector into a binary string and stores only the sign of the random projection values.
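A minimal SimHash sketch is given below. It follows the widely used formulation in which the bits of a per-token hash play the role of the random projection signs; the 64-bit fingerprint size, the FNV-1a-style hash over characters, and all names are assumptions for illustration and are not taken from the experiments reported here.

```java
import java.util.Map;

// Illustrative sketch only; hash function and fingerprint size are assumptions.
final class SimHashSketch {
    // FNV-1a-style 64-bit hash applied to the characters of a token.
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Each token adds its weight to positions where its hash bit is 1 and
    // subtracts it where the bit is 0; only the sign of each sum is stored.
    static long fingerprint(Map<String, Integer> termFrequencies) {
        int[] sums = new int[64];
        for (Map.Entry<String, Integer> e : termFrequencies.entrySet()) {
            long h = hash64(e.getKey());
            int weight = e.getValue();
            for (int i = 0; i < 64; i++) {
                if (((h >>> i) & 1L) == 1L) {
                    sums[i] += weight;
                } else {
                    sums[i] -= weight;
                }
            }
        }
        long fp = 0L;
        for (int i = 0; i < 64; i++) {
            if (sums[i] > 0) {
                fp |= (1L << i);
            }
        }
        return fp;
    }

    // Fewer differing bits suggests a smaller angle between the original vectors.
    static int hammingDistance(long x, long y) {
        return Long.bitCount(x ^ y);
    }
}
```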
Figure 9 presents the results from an experiment that compares the three similarity approaches. First, the figure shows a sample user code that is vulnerable to SQLI. Next, the most similar code that fixes the vulnerability, as returned by each algorithm, is presented. The figure also shows the similarity score and the time taken to search a dataset of 18,842 safe instances for code that is similar to the user's code. As can be seen from the results, all three algorithms finished the search in under 2 s. Moreover, the returned samples suggest that cosine similarity produced a more semantically similar piece of code to the user's code.

Usability study
The study followed the A/B testing format where participants used two tools to complete two tasks and provided feedback based on the experience they had while using both tools. While A/B tests are traditionally used to compare the performance of or user preferences regarding two different versions of a particular tool or design, the format is used in this work to compare two different tools with two different interfaces geared towards vulnerability detection and mitigation. The proposed tool uses an IntelliSense approach to detect vulnerabilities while the second tool (FindBugs) does not use IntelliSense. FindBugs was chosen as the second tool due to its coverage in the literature [78,79], its adoption by major companies such as Google [80], its open-source nature, and its target language being Java.
First, the goal of the study is outlined, followed by the methodology, which includes a description of the participants, the apparatus and materials used, and the methods employed in the study. The results of the study are then presented along with a discussion on their significance.

Fig. 9 Finding safe code that is most similar to the user's code

Study goal
The overall goal of the study was to ascertain the usefulness and usability of a recommender system in helping programmers write more secure code.

Participants
Fourteen participants completed the study (1 professor, 1 industry professional, 4 researchers, 3 undergraduate students, 1 master's student, and 4 PhD students). These participants were recruited using a combination of convenience and snowball sampling via email and word of mouth. Nine subjects were in the age group 18-29, four between 30-49, and one between 50-64. Participants ranged in coding experience, with 13 people having at least 3 years of experience and 1 person having between 0 and 2 years. Subjects were asked to select their primary programming languages and 9 of them selected Java and Python as their languages of choice.

Apparatus and materials
All participants used a Dell Latitude 3550 laptop (Intel Core i3, 1.70 GHz CPU, 64-bit, 8 GB of RAM) to complete the tasks. The study took place in a classroom in the Harris Institute for Assured Information at Florida Institute of Technology, with one participant and one experimenter per interview; each session lasted 30 to 45 min. The Eclipse IDE (version Oxygen.3a 4.7.3) was installed on the computer beforehand. The VulIntel plugin and the FindBugs plugin (version 3.0.1) were also installed before the study started. To have a fair comparison of tools, FindBugs, which includes the FindSecBugs plugin, was configured to target only security bugs. This was done to minimize the effect of unrelated issues on the scanning time or presentation of errors to the participants because FindBugs is able to find bugs related to bad practice, correctness, performance, etc., while VulIntel currently scans for security-related vulnerabilities.

Methods
First, the experimenter presented the participant with an Informed Consent Form. The experimenter reviewed the contents of the form and gave participants a randomly assigned ID that was used to refer to the participant throughout the study. After reviewing the contents of the consent form and the required tasks for the study, the participant was given the option to withdraw or to proceed by signing the form. The study then began with a short demographic-style questionnaire (see Appendix A.2.1) that was designed using Google Forms. After the participant signed the consent form, the interviewer told the participant the order of the tools they would be using. Tool order was alternated to avoid learning bias (i.e., 7 participants used FindBugs first before using VulIntel while 7 used VulIntel before using FindBugs). The interviewer then explained to participants how to use the first tool to scan their code for vulnerabilities and how to use the information the tool provided to fix any potential vulnerabilities. Participants were told that they should use only the information provided by the tool, and no other resources, to fix any reported vulnerabilities.
Next, the experimenter activated screen-capturing (and audio-recording) software, stepped aside, and allowed the participant to complete the two tasks using the first tool. After completing the tasks using the first tool, the participant was then given a questionnaire (see Appendix A.2.2) followed by an interview (see Appendix A.2.3) based on their experience using the tool to scan and fix the given code of potential vulnerabilities. If a participant was unable to fix the vulnerabilities using the tool, the experimenter allowed the participant to proceed with the next tool. The screen-capturing software was closed and the same experiment was given for the second tool.

Tasks
Each participant was given two tasks related to the top two taint-style vulnerabilities discussed earlier (see Section 4.3.2). Each task consisted of the user typing preselected sample code into the text editor of the Eclipse IDE while the code scanner window was open and the scanner activated. Two Java classes containing sample methods were created prior to the experiment with vulnerable portions of the code removed, so the participant could type, observe the behavior of the scanner, and use the information provided by the scanner to fix the vulnerability.
The code used for Task 1 (SQL injection) is a modified version of an example provided by the Software Engineering Institute at Carnegie Mellon University [81] while the code used for Task 2 (command injection) was obtained from the OWASP website [82].

Results and discussion
This section presents the results and discussion of the usability study and scalability analysis of the proposed tool.
Figure 10 provides a frequency summary of participants' responses to four main questions asked on the questionnaire (see Appendix A.2.2) for each tool after participants completed the tasks. All four questions were presented using a 5-point Likert scale. As can be seen from Fig. 10, more participants responded positively to these questions for VulIntel than for FindBugs. If the Likert scale is collapsed into two categories (agree and disagree) by removing the neutral responses, the following can be concluded:
• Fourteen participants agreed that VulIntel provided helpful information including fixes for the two tasks given whereas only 1 participant agreed that FindBugs provided the same.
• Fourteen participants agreed that the VulIntel interface was usable whereas only 6 agreed that the FindBugs interface was usable.
• Thirteen participants indicated that they think VulIntel would help them write more secure code while only 4 participants think that FindBugs would help them to write more secure code.
• All participants stated that they would use the VulIntel system when coding while only 3 participants would use FindBugs.

Scalability
While the goal of this work is to couple text mining techniques with IntelliSense technology to create a recommender system that detects and mitigates vulnerabilities in user programs, it is also of interest to determine the scalability of the proposed methodology on projects of various sizes. To do so, a random sample of 10 Google Code projects in the dataset was selected and processed for SQL injection. Table 5 shows the time taken to classify these projects for SQLI by using a MacBook Pro laptop (16 GB of RAM, 3 GHz Intel Core i7 processor). The experiment was done while other processes were running on the machine in order to mimic the environment of a typical developer/programmer. The table also provides other information on the experiment such as the total LLOC (logical lines of code) for each project and the number of files in each one. LLOC was computed by counting the expression statements (an expression followed by a semicolon) in each AST.
The results show that the approach scales very well by being able to scan a project of over 4.5 million lines of code for SQLI in under 8 s while projects of up to 1 million lines of code take under a second. Even though the experiment was only done for one vulnerability, the scanning process can be parallelized through the use of threads to maintain this performance while scanning for other vulnerabilities. This parallelization is feasible since threads are already being used by the tool to find similar example code that mitigates vulnerabilities.

Discussion
Statistical significance
Four paired-sample t tests and analysis of variance (ANOVA) tests were conducted for the four questions discussed previously. t tests are used to determine whether the mean difference between two sets of observations is equal to zero (that there is no difference between the groups being explored). ANOVA tests were done to check whether the choices of participants depended on the order of the tools presented during the study (i.e., whether there is interaction between tool order and participants' agreement). To obtain numeric data for carrying out the tests, the Likert scale was converted to an ordinal scale. All t tests were two-tailed and defined as follows: $H_0: \mu_d = 0$ and $H_1: \mu_d \neq 0$. The results are summarized in Table 6.
As can be seen from the table, the p values are statistically significant for the paired-sample t tests on all four factors regarding participants' agreement. Therefore, the null hypotheses are rejected and the conclusion that the proposed tool was more usable and helpful than FindBugs in helping programmers write more secure code is supported. Additionally, the p values for the ANOVA tests show that the null hypothesis that states that there is no interaction between tool order and participants' agreement cannot be rejected. Therefore, the conclusion is that tool order did not affect the choices of participants. These results confirm the hypothesis that surveyed participants strongly believe that a recommender system built using text mining techniques can help programmers write more secure code.
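For reference, the standard paired-sample t statistic behind tests of this kind can be written as follows. The symbols $d_i$ (the difference between the numeric scores one participant gave to the two tools), $\bar{d}$, $s_d$, and $n$ are notation introduced here for illustration; only $\mu_d$ appears in the hypotheses above, and $n = 14$ is the number of participants reported for the study.

```latex
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}, \qquad
t = \frac{\bar{d}}{s_d/\sqrt{n}}
```

Under $H_0$, and assuming the differences are approximately normally distributed, $t$ follows a Student's t distribution with $n - 1$ degrees of freedom, which would be 13 for this study.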

Study limitations
The convenience sampling done for the usability study conducted in this work poses a few limitations.

Sample size
The number of participants was small (N = 14); participation was limited to professional code developers, though these were relatively diverse in experience. There is the potential of obtaining different results with a larger sample. However, since it is typical in the usability community to conduct studies with focus groups of between 6 and 10 participants [83], the results presented in this initial study are acceptable. Further, the statistical significance reported helps to strengthen the conclusions.

Gap between tool age
The gap between the ages of the two tools is also worth mentioning. FindBugs was originally released in 2006, with its most recent release in 2015. The proposed tool in this study has not yet been released to the public. Therefore, the age difference between the two tools may have some effect on the results.

Experimenter demand effects
Demand effects could also pose a limitation. However, this limitation may be very minimal, since none of the participants involved in the study had ever seen or worked with the featured tools and tool order was alternated during the study.

Conclusions and future work
In this work, a methodology is proposed, designed, and evaluated to help programmers fix potential vulnerabilities as they type code during development. The proposed methodology employs text mining techniques to extract features from code repositories in order to categorize code and use data-driven vulnerability detectors to detect vulnerabilities. The vulnerability detectors work in unison with a recommender system to provide the programmer during development with ranked code examples that resemble the project being developed in order to mitigate a set of vulnerabilities. This work advocates the use of a recommender system that uses similarity metrics to recommend a set of example fixes instead of using the traditional approach of automatically fixing the user's code. Providing the user a set of similar examples that are safer than the code being developed not only allows the user to fix vulnerabilities but also educates the programmer on how to avoid the errors that lead to vulnerabilities in future projects.
A usability study showed that all 14 participants involved agreed that the proposed system was more usable than the FindBugs system, and it provided more helpful advice including fixes for the tasks they completed using the system. In addition, all but one participant indicated that the proposed system would help them to write more secure code. The results were statistically evaluated, and paired-sample t tests and ANOVA suggest that there is statistical significance, confirming the applicability of recommender systems to secure coding.

Future work
Future directions for this work include the following:
• The use of deep learning and other methods to determine the features for detecting vulnerabilities instead of using hand-selected features. While the features proposed in this work are engineered to a degree, they provide the ability to ensure data correctness and to create the end-to-end processing framework. This is an essential step in creating datasets that are verifiably correct and provides a baseline on which to judge the performance of the methodology. Automatic extraction of features will allow for the addition of machine learning algorithms to the methodology.

Fig. 2 Number of vulnerabilities in the NVD 2017 List that were caused by the top 10 SANS/CWE of 2011.The plot also shows the CWE severity score for each CWE

Algorithm 2: Procedure for building the feature set for detecting SQL injection
input : ast: abstract syntax tree of Java code
output: a set of features for detecting SQL Injection
1 Procedure buildFeatureSet(ast)
2 Initialize featureSet parameters as safe
3 imports = Get list of ImportDeclaration from ast
4 sqlStatements = Extract all statements containing SQL Commands from ast
5 methodCalls = Get list of MethodCallExpr from ast
6 sources = Get list of all tainted sources from imports
7 sinks = Get list of all sinks from methodCalls
8 Set feature sources = sources
9 Set feature sinks = sinks

Fig. 3 Mockup of proposed system as incorporated in an IDE

Fig. 7 Types of projects for which programmers would use the proposed plugin

Fig. 8 Screenshot of the final design of the plugin as incorporated in the Eclipse environment

Fig. 10 Summary of participants' responses to four main questions

• Expanding the work by detecting and correcting more vulnerabilities/weaknesses in the SANS/CWE 2011 list of Most Dangerous Software Errors. The analysis in this work showed that by correcting the two featured vulnerabilities, 1300 out of 1500 vulnerabilities in the 2017 NVD release could be avoided.
• Further improving the user interface based on the responses received from participants in the usability study
• Expanding the tool to support more programming languages and IDEs
• Collecting reports from users on their awareness about secure coding based on tool usage and tracking error reduction based on recommendations provided by the tool
• Performing A/B testing of the features within the proposed tool

18 end
19 return
20 end
21 end

6 Function map(javaFile):
7 key = getFileName(javaFile)
9 addToIntermediateList(key, value)
10 emit(intermediateList)

2 Extract import statements, SQL statements, method calls, sources, and sinks from the AST.
3 For all SQL statements in the program, check if variables are potentially sanitized using static dataflow analysis by comparing the sources and sinks in the program with a knowledge base of known sources and sinks and checking if apostrophes and/or parameterized queries are properly used.
4 If these checks show that data is not properly sanitized and parameterized queries are not properly used, then consider the program susceptible to SQLI and use the recommender system to recommend the most appropriate fix that is most similar to the project being developed.

Table 2
Features for detecting OS command injection

Table 3
Distribution of projects in part "aa" of the Sourcerer dataset

Table 4
Breakdown of data within the knowledge base of the recommender system

Table 5
Time taken to detect SQLI in various open-source projects

Table 6
Results from paired-sample t-tests and one-way ANOVA tests for four tool factors. A/B test represents participants who used FindBugs before using VulIntel whereas B/A test represents the opposite