Boa: Ultra-Large-Scale Software Repository and Source-Code Mining

By: Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen

Abstract

In today’s software-centric world, ultra-large-scale software repositories, such as SourceForge, GitHub, and Google Code, are the new library of Alexandria. They contain an enormous corpus of software and related information. Scientists and engineers alike are interested in analyzing this wealth of information. However, systematic extraction and analysis of relevant data from these repositories for testing hypotheses is hard, and best left for mining software repository (MSR) experts! Specifically, mining source code yields significant insights into software development artifacts and processes. Unfortunately, mining source code at a large scale remains a difficult task. Previous approaches had to either limit the scope of the projects studied, limit the scope of the mining task to be more coarse grained, or sacrifice studying the history of the code. In this article we address mining source code: (a) at a very large scale; (b) at a fine-grained level of detail; and (c) with full history information. To address these challenges, we present domain-specific language features for source-code mining in our language and infrastructure called Boa. The goal of Boa is to ease testing MSR-related hypotheses. Our evaluation demonstrates that Boa substantially reduces programming efforts, thus lowering the barrier to entry. We also show drastic improvements in scalability.

ACM Reference

Dyer, R. et al. 2015. Boa: Ultra-Large-Scale Software Repository and Source-Code Mining. ACM Trans. Softw. Eng. Methodol. 25, 1 (Dec. 2015), 7:1–7:34. DOI:https://doi.org/10.1145/2803171.

BibTeX Reference

@article{dyer2015boa-b,
  author = {Dyer, Robert and Nguyen, Hoan Anh and Rajan, Hridesh and Nguyen, Tien N.},
  title = {Boa: Ultra-Large-Scale Software Repository and Source-Code Mining},
  journal = {ACM Trans. Softw. Eng. Methodol.},
  issue_date = {December 2015},
  volume = {25},
  number = {1},
  month = dec,
  year = {2015},
  issn = {1049-331X},
  pages = {7:1--7:34},
  articleno = {7},
  numpages = {34},
  url = {http://doi.acm.org/10.1145/2803171},
  doi = {10.1145/2803171},
  acmid = {2803171},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {Boa, domain-specific language, ease of use, lower barrier to entry, mining software repositories, scalable},
  abstract = {
    In today's software-centric world, ultra-large-scale software repositories,
    such as SourceForge, GitHub, and Google Code, are the new library of
    Alexandria. They contain an enormous corpus of software and related
    information. Scientists and engineers alike are interested in analyzing this
    wealth of information. However, systematic extraction and analysis of relevant
    data from these repositories for testing hypotheses is hard, and best left for
    mining software repository (MSR) experts! Specifically, mining source code
    yields significant insights into software development artifacts and processes.
    Unfortunately, mining source code at a large scale remains a difficult task.
    Previous approaches had to either limit the scope of the projects studied,
    limit the scope of the mining task to be more coarse grained, or sacrifice
    studying the history of the code. In this article we address mining source
    code: (a) at a very large scale; (b) at a fine-grained level of detail; and
    (c) with full history information. To address these challenges, we present
    domain-specific language features for source-code mining in our language and
    infrastructure called Boa. The goal of Boa is to ease testing MSR-related
    hypotheses. Our evaluation demonstrates that Boa substantially reduces
    programming efforts, thus lowering the barrier to entry. We also show drastic
    improvements in scalability.
  }
}