-
Time to Separate from StackOverflow and Match with ChatGPT for Encryption
Authors:
Ehsan Firouzi,
Mohammad Ghafari
Abstract:
Cryptography is known as a challenging topic for developers. We studied StackOverflow posts to identify the problems that developers encounter when using the Java Cryptography Architecture (JCA) for symmetric encryption. We investigated security risks that are disseminated in these posts, and we examined whether ChatGPT helps avoid cryptography issues. We found that developers frequently struggle with key and IV generation, as well as padding. Security is a top concern among developers, but security issues are pervasive in code snippets. ChatGPT can effectively aid developers when they engage with it properly. Nevertheless, it does not substitute for human expertise, and developers should remain alert.
Submitted 10 June, 2024;
originally announced June 2024.
-
Gameful Introduction to Cryptography for Dyslexic Students
Authors:
Argianto Rahartomo,
Harpreet Kaur,
Mohammad Ghafari
Abstract:
Cryptography has a pivotal role in securing our digital world. Nonetheless, it is a challenging topic to learn. In this paper, we show that despite its complex nature, dyslexia, a learning disorder that influences reading and writing skills, does not hinder one's ability to comprehend cryptography. In particular, we conducted a gameful workshop with 14 high-school dyslexic students and taught them fundamental encryption methods. The students engaged well, learned the techniques, and enjoyed the training. We conclude that with a proper approach, dyslexia cannot hinder learning a complex subject such as cryptography.
Submitted 10 June, 2024;
originally announced June 2024.
-
Mining REST APIs for Potential Mass Assignment Vulnerabilities
Authors:
Arash Mazidi,
Davide Corradini,
Mohammad Ghafari
Abstract:
REST APIs have a pivotal role in accessing protected resources. Despite the availability of security testing tools, mass assignment vulnerabilities are common in REST APIs, leading to unauthorized manipulation of sensitive data. We propose a lightweight approach to mine the REST API specifications and identify operations and attributes that are prone to mass assignment. We conducted a preliminary study on 100 APIs and found 25 prone to this vulnerability. We confirmed nine real vulnerable operations in six APIs.
Submitted 4 May, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
LLM Security Guard for Code
Authors:
Arya Kavian,
Mohammad Mehdi Pourhashem Kallehbasti,
Sajjad Kazemi,
Ehsan Firouzi,
Mohammad Ghafari
Abstract:
Many developers rely on Large Language Models (LLMs) to facilitate software development. Nevertheless, these models have exhibited limited capabilities in the security domain. We introduce LLMSecGuard, a framework to offer enhanced code security through the synergy between static code analyzers and LLMs. LLMSecGuard is open source and aims to equip developers with code solutions that are more secure than the code initially generated by LLMs. This framework also has a benchmarking feature, aimed at providing insights into the evolving security attributes of these models.
Submitted 3 May, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
Insecure by Design in the Backbone of Critical Infrastructure
Authors:
Jos Wetzels,
Daniel dos Santos,
Mohammad Ghafari
Abstract:
We inspected 45 actively deployed Operational Technology (OT) product families from ten major vendors and found that every system suffers from at least one trivial vulnerability. We reported a total of 53 weaknesses, stemming from insecure by design practices or basic security design failures. They enable attackers to take a device offline, manipulate its operational parameters, and execute arbitrary code without any constraint. We discuss why vulnerable products are often security certified and appear to be more secure than they actually are, and we explain complicating factors of OT risk management.
Submitted 22 March, 2023;
originally announced March 2023.
-
Wasmizer: Curating WebAssembly-driven Projects on GitHub
Authors:
Alexander Nicholson,
Quentin Stiévenart,
Arash Mazidi,
Mohammad Ghafari
Abstract:
WebAssembly has attracted great attention as a portable compilation target for programming languages. To facilitate in-depth studies about this technology, we have deployed Wasmizer, a tool that regularly mines GitHub projects and makes an up-to-date dataset of WebAssembly sources and their binaries publicly available. Presently, we have collected 2,540 C and C++ projects that are highly related to WebAssembly, and built a dataset of 8,915 binaries that are linked to their source projects. To demonstrate an application of this dataset, we have investigated the presence of eight WebAssembly compilation smells in the wild.
Submitted 16 March, 2023;
originally announced March 2023.
-
Naturalistic Static Program Analysis
Authors:
Mohammad Mehdi Pourhashem Kallehbasti,
Mohammad Ghafari
Abstract:
Static program analysis development is a non-trivial and time-consuming task. We present a framework through which developers can define static program analyses in natural language. We show the application of this framework to identify cryptography misuses in Java programs, and we discuss how it facilitates static program analysis development for developers.
Submitted 12 January, 2023;
originally announced January 2023.
-
Mining unit test cases to synthesize API usage examples
Authors:
Mohammad Ghafari,
Konstantin Rubinov,
Mohammad Mehdi Pourhashem K
Abstract:
Software developers study and reuse existing source code to understand how to properly use application programming interfaces (APIs). However, manually finding sufficient and adequate code examples for a given API is a difficult and time-consuming activity. Existing approaches to find or generate examples assume availability of a reasonable set of client code that uses the API. This assumption does not hold for newly released API libraries, APIs that are not widely used, or private ones. In this work we reuse the important information that is naturally present in test code to circumvent the lack of usage examples for an API when other sources of client code are not available. We propose an approach for automatically identifying the most representative API uses within each unit test case. We then develop an approach to synthesize API usage examples by extracting relevant statements representing the usage of such APIs. We compare the output of a prototype implementation of our approach to both human-written examples and to a state-of-the-art approach. The obtained results are encouraging; the examples automatically generated with our approach are superior to the state-of-the-art approach and highly similar to the manually constructed examples.
Submitted 30 July, 2022;
originally announced August 2022.
-
Developers Struggle with Authentication in Blazor WebAssembly
Authors:
Pascal Marc André,
Quentin Stiévenart,
Mohammad Ghafari
Abstract:
WebAssembly is a growing technology to build cross-platform applications. We aim to understand the security issues that developers encounter when adopting WebAssembly. We mined WebAssembly questions on Stack Overflow and identified 359 security-related posts. We classified these posts into 8 themes, reflecting developer intentions, and 19 topics, representing developer issues in this domain. We found that the most prevalent themes are related to bug fix support, requests for how to implement particular features, clarification questions, and setup or configuration issues. We noted that the top issues relate to authentication in Blazor WebAssembly. We discuss six of them and provide our suggestions to resolve these issues in practice.
Submitted 30 July, 2022;
originally announced August 2022.
-
Brillouin Zones of Integer Lattices and Their Perturbations
Authors:
Herbert Edelsbrunner,
Alexey Garber,
Mohadese Ghafari,
Teresa Heiss,
Morteza Saghafian,
Mathijs Wintraecken
Abstract:
For a locally finite set, $A \subseteq \mathbb{R}^d$, the $k$-th Brillouin zone of $a \in A$ is the region of points $x \in \mathbb{R}^d$ for which $\|x-a\|$ is the $k$-th smallest among the Euclidean distances between $x$ and the points in $A$. If $A$ is a lattice, the $k$-th Brillouin zones of the points in $A$ are translates of each other, which tile space. Depending on the value of $k$, they express medium- or long-range order in the set. We study fundamental geometric and combinatorial properties of Brillouin zones, focusing on the integer lattice and its perturbations. Our results include the stability of a Brillouin zone under perturbations, a linear upper bound on the number of chambers in a zone for lattices in $\mathbb{R}^2$, and the convergence of the maximum volume of a chamber to zero for the integer lattice.
Submitted 21 March, 2024; v1 submitted 3 April, 2022;
originally announced April 2022.
-
On Angles in Higher Order Brillouin Tessellations and Related Tilings in the Plane
Authors:
Herbert Edelsbrunner,
Alexey Garber,
Mohadese Ghafari,
Teresa Heiss,
Morteza Saghafian
Abstract:
For a locally finite set in $\mathbb{R}^2$, the order-$k$ Brillouin tessellations form an infinite sequence of convex face-to-face tilings of the plane. If the set is coarsely dense and generic, then the corresponding infinite sequences of minimum and maximum angles are both monotonic in $k$. As an example, a stationary Poisson point process in $\mathbb{R}^2$ is locally finite, coarsely dense, and generic with probability one. For such a set, the distributions of angles in the Voronoi tessellations, Delaunay mosaics, and Brillouin tessellations are independent of the order and can be derived from the formula for angles in order-$1$ Delaunay mosaics given by Miles in 1970.
Submitted 3 April, 2022;
originally announced April 2022.
-
An Intelligent System for Multi-topic Social Spam Detection in Microblogging
Authors:
Bilal Abu-Salih,
Dana Al Qudah,
Malak Al-Hassan,
Seyed Mohssen Ghafari,
Tomayess Issa,
Ibrahim Aljarah,
Amin Beheshti,
Sulaiman Alqahtan
Abstract:
The communication revolution has perpetually reshaped the means through which people send and receive information. Social media is an important pillar of this revolution and has brought profound changes to various aspects of our lives. However, the open environment and popularity of these platforms open windows of opportunity for various cyber threats; thus, social networks have become a fertile venue for spammers and other illegitimate users to execute their malicious activities. These activities include phishing hot and trendy topics and posting a wide range of content on many topics. Hence, it is crucial to continuously introduce new techniques and approaches to detect and stop this category of users. This paper proposes a novel and effective approach to detect social spammers. An investigation into several attributes to measure topic-dependent and topic-independent users' behaviours on Twitter is carried out. The experiments of this study are undertaken on various machine learning classifiers. The performance of these classifiers is compared and their effectiveness is measured via a number of robust evaluation measures. Further, the proposed approach is benchmarked against state-of-the-art social spam and anomaly detection techniques. These experiments report the effectiveness and utility of the proposed approach and embedded modules.
Submitted 13 January, 2022;
originally announced January 2022.
-
FuzzingDriver: the Missing Dictionary to Increase Code Coverage in Fuzzers
Authors:
Arash Ale Ebrahim,
Mohammadreza Hazhirpasand,
Oscar Nierstrasz,
Mohammad Ghafari
Abstract:
We propose a tool, called FuzzingDriver, to generate dictionary tokens for coverage-based greybox fuzzers (CGF) from the codebase of any target program. FuzzingDriver does not add any overhead to the fuzzing job as it is run beforehand. We compared FuzzingDriver to Google dictionaries by fuzzing six open-source targets, and we found that FuzzingDriver consistently achieves higher code coverage in all tests. We also executed eight benchmarks on FuzzBench to demonstrate how utilizing FuzzingDriver's dictionaries can outperform six widely-used CGF fuzzers. In future work, investigating the impact of FuzzingDriver's dictionaries on improving bug coverage might prove important. Video demonstration: https://www.youtube.com/watch?v=Y8j_KvfRrI8
Submitted 13 January, 2022;
originally announced January 2022.
-
Security Risks of Porting C Programs to WebAssembly
Authors:
Quentin Stiévenart,
Coen De Roover,
Mohammad Ghafari
Abstract:
WebAssembly is a compilation target for cross-platform applications that is increasingly being used. In this paper, we investigate whether one can transparently cross-compile C programs to WebAssembly, and if not, what impact porting can have on their security. We compile 17,802 programs that exhibit common vulnerabilities to 64-bit x86 and to WebAssembly binaries, and we observe that the execution of 4,911 binaries produces different results across these platforms. Through manual inspection, we identify three classes of root causes for such differences: the use of a different standard library implementation, the lack of security measures in WebAssembly, and the different semantics of the execution environments. We describe our observations and discuss the ones that are critical from a security point of view and need most attention from developers. We conclude that compiling an existing C program to WebAssembly for cross-platform distribution may require source code adaptations; otherwise, the security of the WebAssembly application may be at risk.
Submitted 22 December, 2021;
originally announced December 2021.
-
How Do Developers Deal with Security Issue Reports on GitHub?
Authors:
Noah Bühlmann,
Mohammad Ghafari
Abstract:
Security issue reports are the primary means of informing development teams of security risks in projects, but little is known about current practices. We aim to understand the characteristics of these reports in open-source projects and uncover opportunities to improve developer practices. We analysed 3,493 security issue reports in 182 different projects on GitHub and manually studied 333 reports, along with their discussions and pull requests. We found that the number of security issue reports has increased over time, they are resolved faster, and they are reported in earlier development stages compared to past years. Nevertheless, a tiny group of developers is involved frequently, security issues progress slowly, and a great number of them have been pending for a long time. We realized that only a small subset of security issue reports include reproducibility data, a potential fix is rarely suggested, and there is no hint regarding how a reporter spotted an issue. We noted that the resolution time of an issue is significantly shorter when the first reaction to a security report is fast and when a reference to a known vulnerability exists.
Submitted 20 December, 2021;
originally announced December 2021.
-
Cryptography Vulnerabilities on HackerOne
Authors:
Mohammadreza Hazhirpasand,
Mohammad Ghafari
Abstract:
Previous studies have shown that cryptography is hard for developers to use and misusing cryptography leads to severe security vulnerabilities. We studied relevant vulnerability reports on the HackerOne bug bounty platform to understand what types of cryptography vulnerabilities exist in the wild. We extracted eight themes of vulnerabilities from the vulnerability reports and discussed their real-world implications and mitigation strategies. We hope that our findings alert developers, familiarize them with the dire consequences of cryptography misuses, and support them in avoiding such mistakes.
Submitted 6 November, 2021;
originally announced November 2021.
-
Security Header Fields in HTTP Clients
Authors:
Pascal Gadient,
Oscar Nierstrasz,
Mohammad Ghafari
Abstract:
HTTP headers are commonly used to establish web communications, and some of them are relevant for security. However, we have only little information about the usage and support of security-relevant headers in mobile applications. We explored the adoption of such headers in mobile app communication by querying 9,714 distinct URLs that were used in 3,376 apps and collected each server's response information. We discovered that support for secure HTTP header fields is absent in all major HTTP clients, and it is barely provided with any server response. Based on these results, we discuss opportunities for improvement particularly to reduce the likelihood of data leaks and arbitrary code execution. We advocate more comprehensive use of existing HTTP headers and timely development of relevant web browser security features in HTTP client libraries.
Submitted 5 November, 2021;
originally announced November 2021.
-
Phish What You Wish
Authors:
Pascal Gadient,
Pascal Gerig,
Oscar Nierstrasz,
Mohammad Ghafari
Abstract:
IT professionals have no simple tool to create phishing websites and raise the awareness of users. We developed a prototype that can dynamically mimic websites by using enriched screenshots, which requires no additional programming experience and is simple to set up. The generated websites are functional and remain up-to-date. We found that 98% of the hyperlinks in mimicked websites are functional with our tool, compared to 43% with the best competitor, and only two participants suspected phishing attempts at the time they were performing tasks with our prototype. This work intends to raise awareness of phishing attempts, especially with local websites, by providing an easy-to-use prototype to set up such phishing sites.
Submitted 5 November, 2021;
originally announced November 2021.
-
The Security Risk of Lacking Compiler Protection in WebAssembly
Authors:
Quentin Stiévenart,
Coen De Roover,
Mohammad Ghafari
Abstract:
WebAssembly is increasingly used as the compilation target for cross-platform applications. In this paper, we investigate whether one can rely on the security measures enforced by existing C compilers when compiling C programs to WebAssembly. We compiled 4,469 C programs with known buffer overflow vulnerabilities to x86 code and to WebAssembly, and observed the outcome of the execution of the generated code to differ for 1,088 programs. Through manual inspection, we identified that the root cause for these is the lack of security measures such as stack canaries in the generated WebAssembly: while x86 code crashes upon a stack-based buffer overflow, the corresponding WebAssembly continues to be executed. We conclude that compiling an existing C program to WebAssembly without additional precautions may hamper its security, and we encourage more research in this direction.
Submitted 2 November, 2021;
originally announced November 2021.
-
Dazed and Confused: What's Wrong with Crypto Libraries?
Authors:
Mohammadreza Hazhirpasand,
Oscar Nierstrasz,
Mohammad Ghafari
Abstract:
Recent studies have shown that developers have difficulties in using cryptographic APIs, which has often led to security flaws. We are interested in tackling this matter by looking into what types of problems exist in various crypto libraries. We manually studied 500 posts on Stack Overflow associated with 20 popular crypto libraries. We realized there are 10 themes in the discussions. Interestingly, there were only two questions related to attacks against cryptography. There were 63 discussions in which developers had interoperability issues when working with more than one crypto library. The majority of posts (i.e., 112) were about encryption/decryption problems and 111 were about installation/compilation issues of crypto libraries. Overall, we realize that the crypto libraries are frequently involved in more than five themes of discussions. We believe the current initial findings can help team leaders and experienced developers to correctly guide the team members in the domain of cryptography. Moreover, future research should investigate the similarity of problems at the API level among popular crypto libraries.
Submitted 2 November, 2021;
originally announced November 2021.
-
Crypto Experts Advise What They Adopt
Authors:
Mohammadreza Hazhirpasand,
Oscar Nierstrasz,
Mohammad Ghafari
Abstract:
Previous studies have shown that developers regularly seek advice on online forums to resolve their cryptography issues. We investigated whether users who are active in cryptography discussions also use cryptography in practice. We collected the top 1% of responders who have participated in crypto discussions on Stack Overflow, and we manually analyzed their crypto contributions to open source projects on GitHub. We could identify 319 GitHub profiles that belonged to such crypto responders and found that 189 of them used cryptography in their projects. Further investigation revealed that the majority of analyzed users (i.e., 85%) use the same programming languages for crypto activity on Stack Overflow and crypto contributions on GitHub. Moreover, 90% of the analyzed users employed the same concept of cryptography in their projects as they advised about on Stack Overflow.
Submitted 30 September, 2021;
originally announced September 2021.
-
Worrisome Patterns in Developers: A Survey in Cryptography
Authors:
Mohammadreza Hazhirpasand,
Oscar Nierstrasz,
Mohammad Ghafari
Abstract:
We surveyed 97 developers who had used cryptography in open-source projects, in the hope of identifying developer security and cryptography practices. We asked them about individual and company-level practices, and divided respondents into three groups (i.e., high, medium, and low) based on their level of knowledge. We found differences between the high-profile developers and the other two groups. For instance, high-profile developers have more years of experience in programming, have attended more security and cryptography courses, have more background in security, are highly concerned about security, and tend to use security tools more than the other two groups. Nevertheless, we observed worrisome patterns among all participants such as the high usage of unreliable sources like Stack Overflow, and the low rate of security tool usage.
Submitted 30 September, 2021; v1 submitted 29 September, 2021;
originally announced September 2021.
-
What Do Developers Discuss about Code Comments?
Authors:
Pooja Rani,
Mathias Birrer,
Sebastiano Panichella,
Mohammad Ghafari,
Oscar Nierstrasz
Abstract:
Code comments are important for program comprehension, development, and maintenance tasks. Given the varying standards for code comments, and their unstructured or semi-structured nature, developers (especially novices) are easily confused about which convention(s) to follow, or what tools to use while writing code documentation. Thus, they post related questions on external online sources to seek better commenting practices. In this paper, we analyze code comment discussions on online sources such as Stack Overflow (SO) and Quora to shed some light on the questions developers ask about commenting practices. We apply Latent Dirichlet Allocation (LDA) to identify emerging topics concerning code comments. Then we manually analyze a statistically significant sample set of posts to derive a taxonomy that provides an overview of the developer questions about commenting practices. Our results highlight that on SO nearly 40% of the questions mention how to write or process comments in documentation tools and environments, and nearly 20% of the questions are about potential limitations and possibilities of documentation tools to add more information to comments automatically and consistently. On the other hand, on Quora, developer questions focus more on background information (35% of the questions) or asking opinions (16% of the questions) about code comments. We found that (i) not all aspects of comments are covered in coding style guidelines, e.g., how to add a specific type of information, (ii) developers need support in learning the syntax and format conventions to add various types of information in comments, and (iii) developers are interested in various automated strategies for comments, such as detecting bad comments or verifying comment style automatically, but lack tool support to do so.
Submitted 17 August, 2021;
originally announced August 2021.
-
FluentCrypto: Cryptography in Easy Mode
Authors:
Simon Kafader,
Mohammad Ghafari
Abstract:
Research has shown that cryptography concepts are hard to understand for developers, and secure use of cryptography APIs is challenging for mainstream developers. We have developed a fluent API named FluentCrypto to ease the secure and correct adoption of cryptography in the Node.js JavaScript runtime environment. It provides a task-based solution, i.e., it hides the low-level complexities involved in using the native Node.js cryptography API, and it relies on the rules that crypto experts specify to determine a secure configuration of the API. We conducted an initial study and found that FluentCrypto is hard to misuse even for developers who lack cryptography knowledge, and compared to the standard Node.js crypto API, it is easier to use for developers and helps them to develop secure solutions in a shorter time.
Submitted 16 August, 2021;
originally announced August 2021.
-
Security Smells Pervade Mobile App Servers
Authors:
Pascal Gadient,
Marc-Andrea Tarnutzer,
Oscar Nierstrasz,
Mohammad Ghafari
Abstract:
[Background] Web communication is universal in cyberspace, and security risks in this domain are devastating. [Aims] We analyzed the prevalence of six security smells in mobile app servers, and we investigated the consequence of these smells from a security perspective. [Method] We used an existing dataset that includes 9714 distinct URLs used in 3376 Android mobile apps. We exercised these URLs twice within 14 months and investigated the HTTP headers and bodies. [Results] We found that more than 69% of tested apps suffer from three kinds of security smells, and that unprotected communication and misconfigurations are very common in servers. Moreover, source-code and version leaks, or the lack of update policies expose app servers to security risks. [Conclusions] Poor app server maintenance greatly hampers security.
Submitted 16 August, 2021;
originally announced August 2021.
-
Hurdles for Developers in Cryptography
Authors:
Mohammadreza Hazhirpasand,
Oscar Nierstrasz,
Mohammadhossein Shabani,
Mohammad Ghafari
Abstract:
Prior research has shown that cryptography is hard to use for developers. We aim to understand what cryptography issues developers face in practice. We clustered 91954 cryptography-related questions on the Stack Overflow website, and manually analyzed a significant sample (i.e., 383) of the questions to comprehend the crypto challenges developers commonly face in this domain. We found that either developers have a distinct lack of knowledge in understanding the fundamental concepts, e.g., OpenSSL, public-key cryptography or password hashing, or the usability of crypto libraries undermined developer performance to correctly realize a crypto scenario. This is alarming and indicates the need for dedicated research to improve the design of crypto APIs.
Submitted 16 August, 2021;
originally announced August 2021.
-
Java Cryptography Uses in the Wild
Authors:
Mohammadreza Hazhirpasand,
Mohammad Ghafari,
Oscar Nierstrasz
Abstract:
[Background] Previous research has shown that developers commonly misuse cryptography APIs. [Aim] We have conducted an exploratory study to find out how crypto APIs are used in open-source Java projects, what types of misuses exist, and why developers make such mistakes. [Method] We used a static analysis tool to analyze hundreds of open-source Java projects that rely on Java Cryptography Architecture, and manually inspected half of the analysis results to assess the tool results. We also contacted the maintainers of these projects by creating an issue on the GitHub repository of each project, and discussed the misuses with developers. [Results] We learned that 85% of Cryptography APIs are misused; however, not every misuse has severe consequences. Developer feedback showed that security caveats in the documentation of crypto APIs are rare, developers may overlook misuses that originate in third-party code, and the context where a Crypto API is used should be taken into account. [Conclusion] We conclude that using Crypto APIs is still problematic for developers but blindly blaming them for such misuses may lead to erroneous conclusions.
Submitted 2 September, 2020;
originally announced September 2020.
-
Why Research on Test-Driven Development is Inconclusive?
Authors:
Mohammad Ghafari,
Timm Gross,
Davide Fucci,
Michael Felderer
Abstract:
[Background] Recent investigations into the effects of Test-Driven Development (TDD) have been contradictory and inconclusive. This hinders development teams from using research results as the basis for deciding whether and how to apply TDD. [Aim] To support researchers when designing a new study and to increase the applicability of TDD research in the decision-making process in the industrial context, we aim at identifying the reasons behind the inconclusive research results in TDD. [Method] We studied the state of the art in TDD research published in top venues in the past decade, and analyzed the way these studies were set up. [Results] We identified five categories of factors that directly impact the outcome of studies on TDD. [Conclusions] This work can help researchers to conduct more reliable studies, and inform practitioners of risks they need to consider when consulting research on TDD.
Submitted 19 July, 2020;
originally announced July 2020.
-
Security Smells in Android
Authors:
Mohammad Ghafari,
Pascal Gadient,
Oscar Nierstrasz
Abstract:
The ubiquity of smartphones, and their very broad capabilities and usage, make the security of these devices tremendously important. Unfortunately, despite all progress in security and privacy mechanisms, vulnerabilities continue to proliferate. Research has shown that many vulnerabilities are due to insecure programming practices. However, each study has often dealt with a specific issue, making the results less actionable for practitioners. To promote secure programming practices, we have reviewed related research, and identified avoidable vulnerabilities in Android-run devices and the "security code smells" that indicate their presence. In particular, we explain the vulnerabilities, their corresponding smells, and we discuss how they could be eliminated or mitigated during development. Moreover, we develop a lightweight static analysis tool and discuss the extent to which it successfully detects several vulnerabilities in about 46,000 apps hosted by the official Android market.
Submitted 1 June, 2020;
originally announced June 2020.
-
What do class comments tell us? An investigation of comment evolution and practices in Pharo Smalltalk
Authors:
Pooja Rani,
Sebastiano Panichella,
Manuel Leuenberger,
Mohammad Ghafari,
Oscar Nierstrasz
Abstract:
Previous studies have characterized code comments in various programming languages to support better program comprehension activities and maintenance tasks. However, very few studies have focused on understanding developer practices to write comments. None of them has compared such developer practices to the standard comment guidelines to study the extent to which developers follow the guidelines. This paper reports the first empirical study investigating commenting practices in Pharo Smalltalk. First, we analyze class comment evolution over seven Pharo versions. Then, we investigate the information types embedded in class comments. Finally, we study the adherence of developer commenting practices to the official class comment template over Pharo versions.
The results of this study show that there is a rapid increase in class comments in the initial three Pharo versions, while in subsequent versions developers added comments to both new and old classes, thus maintaining a similar code to comment ratio. We furthermore found three times as many information types in class comments as those suggested by the template. However, the information types suggested by the template tend to be present more often than other types of information. Additionally, we find that a substantial proportion of comments follow the writing style of the template in writing these information types, but they are written and formatted in a non-uniform way. This suggests the need to standardize the commenting guidelines for formatting the text, and to provide headers for the different information types to ensure a consistent style and to identify the information easily. Given the importance of high-quality code comments, we draw numerous implications for developers and researchers to improve the support for comment quality assessment tools.
Submitted 15 June, 2021; v1 submitted 23 May, 2020;
originally announced May 2020.
-
Towards Time-Aware Context-Aware Deep Trust Prediction in Online Social Networks
Authors:
Seyed Mohssen Ghafari
Abstract:
Trust can be defined as a measure to determine which source of information is reliable and with whom we should share or from whom we should accept information. There are several applications for trust in Online Social Networks (OSNs), including social spammer detection, fake news detection, retweet behaviour detection and recommender systems. Trust prediction is the process of predicting a new trust relation between two users who are not currently connected. In applications of trust, trust relations among users need to be predicted. This process faces many challenges, such as the sparsity of user-specified trust relations, the context-awareness of trust and changes in trust values over time. In this dissertation, we analyse the state-of-the-art in pair-wise trust prediction models in OSNs. We discuss three main challenges in this domain and present novel trust prediction approaches to address them. We first focus on proposing a low-rank representation of users that incorporates users' personality traits as additional information. Then, we propose a set of context-aware trust prediction models. Finally, by considering the time-dependency of trust relations, we propose a dynamic deep trust prediction approach. We design and implement five pair-wise trust prediction approaches and evaluate them with real-world datasets collected from OSNs. The experimental results demonstrate the effectiveness of our approaches compared to other state-of-the-art pair-wise trust prediction models.
Submitted 20 March, 2020;
originally announced March 2020.
-
Tricking Johnny into Granting Web Permissions
Authors:
Mohammadreza Hazhirpasand,
Mohammad Ghafari,
Oscar Nierstrasz
Abstract:
We studied the web permission API dialog box in popular mobile and desktop browsers, and found that it typically lacks measures to protect users from unwittingly granting web permission when clicking too fast.
We developed a game that exploits this issue, and tricks users into granting webcam permission. We conducted three experiments, each with 40 different participants, on both desktop and mobile browsers. The results indicate that in the absence of a prevention mechanism, we achieve a considerably high success rate in tricking 95% and 72% of participants on mobile and desktop browsers, respectively. Interestingly, we also tricked 47% of participants on a desktop browser where a prevention mechanism exists.
Submitted 19 February, 2020;
originally announced February 2020.
-
Caveats in Eliciting Mobile App Requirements
Authors:
Nitish Patkar,
Mohammad Ghafari,
Oscar Nierstrasz,
Sofija Hotomski
Abstract:
Factors such as app stores or platform choices heavily affect functional and non-functional mobile app requirements. We surveyed 45 companies and interviewed ten experts to explore how factors that impact mobile app requirements are understood by requirements engineers in the mobile app industry.
We observed a lack of knowledge in several areas. For instance, we observed that all practitioners were aware of data privacy concerns; however, they did not know that certain third-party libraries, usage aggregators, or advertising libraries also occasionally leak sensitive user data. Similarly, certain functional requirements may not be implementable in the absence of a third-party library that is either banned from an app store for policy violations or lacks features; for instance, missing desired features in the ARKit library for iOS made practitioners turn to Android.
We conclude that requirements engineers should have adequate technical experience with mobile app development as well as sufficient knowledge in areas such as privacy, security and law, in order to make informed decisions during requirements elicitation.
Submitted 19 February, 2020;
originally announced February 2020.
-
Enabling the Analysis of Personality Aspects in Recommender Systems
Authors:
Shahpar Yakhchi,
Amin Beheshti,
Seyed Mohssen Ghafari,
Mehmet Orgun
Abstract:
Existing Recommender Systems mainly focus on exploiting users' feedback, e.g., ratings and reviews on common items, to detect similar users. Thus, they might fail when there are no common items of interest among users. We call this problem the Data Sparsity With no Feedback on Common Items (DSW-n-FCI). Personality-based recommender systems have shown great success in identifying similar users based on their personality types. However, there are only a few personality-based recommender systems in the literature, which either discover personality explicitly by having users fill in a questionnaire, a tedious task, or neglect the impact of users' personal interests and level of knowledge as a key factor in increasing the acceptance of recommendations. In contrast, we identify users' personality type implicitly, with no burden on users, and incorporate it along with users' personal interests and their level of knowledge. Experimental results on a real-world dataset demonstrate the effectiveness of our model, especially in DSW-n-FCI situations.
Submitted 7 January, 2020;
originally announced January 2020.
-
CryptoExplorer: An Interactive Web Platform Supporting Secure Use of Cryptography APIs
Authors:
Mohammadreza Hazhirpasand,
Mohammad Ghafari,
Oscar Nierstrasz
Abstract:
Research has shown that cryptographic APIs are hard to use. Consequently, developers resort to using code examples available in online information sources that are often not secure. We have developed a web platform, named CryptoExplorer, stocked with numerous real-world secure and insecure examples that developers can explore to learn how to use cryptographic APIs properly. This platform currently provides 3,263 secure uses, and 5,897 insecure uses of Java Cryptography Architecture mined from 2,324 Java projects on GitHub. A preliminary study shows that CryptoExplorer provides developers with secure crypto API use examples instantly, developers can save time compared to searching on the internet for such examples, and they learn to avoid using certain algorithms in APIs by studying misused API examples. We have a pipeline to regularly mine more projects, and, on request, we offer our dataset to researchers.
Submitted 3 January, 2020;
originally announced January 2020.
-
Web APIs in Android through the Lens of Security
Authors:
Pascal Gadient,
Mohammad Ghafari,
Marc-Andrea Tarnutzer,
Oscar Nierstrasz
Abstract:
Web communication has become an indispensable characteristic of mobile apps. However, it is not clear what data the apps transmit, to whom, and what consequences such transmissions have. We analyzed the web communications found in mobile apps from the perspective of security. We first manually studied 160 Android apps to identify the commonly-used communication libraries, and to understand how they are used in these apps. We then developed a tool to statically identify web API URLs used in the apps, and restore the JSON data schemas including the type and value of each parameter. We extracted 9,714 distinct web API URLs that were used in 3,376 apps. We found that developers often use the java.net package for network communication; however, third-party libraries like OkHttp are also used in many apps. We discovered that insecure HTTP connections are seven times more prevalent in closed-source than in open-source apps, and that embedded SQL and JavaScript code is used in web communication in more than 500 different apps. This finding is devastating; it leaves billions of users and API service providers vulnerable to attack.
Submitted 1 June, 2020; v1 submitted 1 January, 2020;
originally announced January 2020.
-
The Impact of Developer Experience in Using Java Cryptography
Authors:
Mohammadreza Hazhirpasand,
Mohammad Ghafari,
Stefan Krüger,
Eric Bodden,
Oscar Nierstrasz
Abstract:
Previous research has shown that crypto APIs are hard for developers to understand and difficult for them to use. They consequently rely on unvalidated boilerplate code from online resources where security vulnerabilities are common.
We analyzed 2,324 open-source Java projects that rely on Java Cryptography Architecture (JCA) to understand how crypto APIs are used in practice, and what factors account for the performance of developers in using these APIs. We found that, in general, the experience of developers in using JCA does not correlate with their performance. In particular, none of the factors such as the number or frequency of committed lines of code, the number of JCA APIs developers use, or the number of projects they are involved in correlate with developer performance in this domain.
We call for qualitative studies to shed light on the reasons underlying the success of developers who are expert in using cryptography. Also, detailed investigation at the API level is necessary to further clarify developer obstacles in this domain.
Submitted 5 August, 2019;
originally announced August 2019.
-
Testability First!
Authors:
Mohammad Ghafari,
Markus Eggiman,
Oscar Nierstrasz
Abstract:
The pivotal role of testing in high-quality software production has driven a significant effort in evaluating and assessing testing practices. We explore the state of testing in a large industrial project over an extended period. We study the interplay between bugs in the project and its test cases, and interview developers and stakeholders to uncover reasons underpinning our observations. We realized that testing is not well adopted, and that testability (i.e., ease of testing) is low. We found that developers tended to abandon writing tests when they assessed the effort to be high. Frequent changes in requirements and pressure to add new features also hindered developers from writing tests. Regardless of the debates on test first or later, we hypothesize that the underlying reasons for poor test quality are rooted in a lack of attention to testing early in the development of a software component, leading to poor testability of the component. However, testability is usually overlooked in research that studies the impact of testing practices, and should be explicitly taken into account.
Submitted 5 August, 2019;
originally announced August 2019.
-
Security Code Smells in Android ICC
Authors:
Pascal Gadient,
Mohammad Ghafari,
Patrick Frischknecht,
Oscar Nierstrasz
Abstract:
Android Inter-Component Communication (ICC) is complex, largely unconstrained, and hard for developers to understand. As a consequence, ICC is a common source of security vulnerability in Android apps. To promote secure programming practices, we have reviewed related research, and identified avoidable ICC vulnerabilities in Android-run devices and the security code smells that indicate their presence. We explain the vulnerabilities and their corresponding smells, and we discuss how they can be eliminated or mitigated during development. We present a lightweight static analysis tool on top of Android Lint that analyzes the code under development and provides just-in-time feedback within the IDE about the presence of such smells in the code. Moreover, with the help of this tool we study the prevalence of security code smells in more than 700 open-source apps, and manually inspect around 15% of the apps to assess the extent to which identifying such smells uncovers ICC security vulnerabilities.
Submitted 10 December, 2018; v1 submitted 30 November, 2018;
originally announced November 2018.
-
Goal-Oriented Mutation Testing with Focal Methods
Authors:
Sten Vercammen,
Mohammad Ghafari,
Serge Demeyer,
Markus Borg
Abstract:
Mutation testing is the state-of-the-art technique for assessing the fault-detection capacity of a test suite. Unfortunately, mutation testing consumes enormous computing resources because it runs the whole test suite for each and every injected mutant. In this paper we explore fine-grained traceability links at method level (named focal methods), to reduce the execution time of mutation testing and to verify the quality of the test cases for each individual method, instead of the usually verified overall test suite quality. Validation of our approach on the open source Apache Ant project shows a speed-up of 573.5x for the mutants located in focal methods with a quality score of 80%.
Submitted 9 October, 2018; v1 submitted 28 July, 2018;
originally announced July 2018.
-
The Impact of Feature Selection on Predicting the Number of Bugs
Authors:
Haidar Osman,
Mohammad Ghafari,
Oscar Nierstrasz
Abstract:
Bug prediction is the process of training a machine learning model on software metrics and fault information to predict bugs in software entities. While feature selection is an important step in building a robust prediction model, there is insufficient evidence about its impact on predicting the number of bugs in software systems. We study the impact of both correlation-based feature selection (CFS) filter methods and wrapper feature selection methods on five widely-used prediction models and demonstrate how these models perform with or without feature selection to predict the number of bugs in five different open source Java software systems. Our results show that wrappers outperform the CFS filter; they improve prediction accuracy by up to 33% while eliminating more than half of the features. We also observe that though the same feature selection method chooses different feature subsets in different projects, this subset always contains a mix of source code and change metrics.
Submitted 12 July, 2018;
originally announced July 2018.