Boa: A Language and Infrastructure for Analyzing Ultra-Large-Scale Software Repositories

By: Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen

Abstract

In today’s software-centric world, ultra-large-scale software repositories, e.g. SourceForge (350,000+ projects), GitHub (250,000+ projects), and Google Code (250,000+ projects) are the new library of Alexandria. They contain an enormous corpus of software and information about software. Scientists and engineers alike are interested in analyzing this wealth of information both for curiosity as well as for testing important hypotheses. However, systematic extraction of relevant data from these repositories and analysis of such data for testing hypotheses is hard, and best left for mining software repository (MSR) experts! The goal of Boa, a domain-specific language and infrastructure described here, is to ease testing MSR-related hypotheses. We have implemented Boa and provide a web-based interface to Boa’s infrastructure. Our evaluation demonstrates that Boa significantly reduces programming efforts, thus lowering the barrier to entry. We also see drastic improvements in scalability. Last but not least, reproducing an experiment conducted using Boa is just a matter of re-running small Boa programs provided by previous researchers.

ACM Reference

Dyer, R. et al. 2013. Boa: A Language and Infrastructure for Analyzing Ultra-Large-Scale Software Repositories. 35th International Conference on Software Engineering (May 2013), 422–431.

BibTeX Reference

@inproceedings{dyer2013boa,
  author = {Dyer, Robert and Nguyen, Hoan Anh and Rajan, Hridesh and Nguyen, Tien N.},
  title = {Boa: A Language and Infrastructure for Analyzing Ultra-Large-Scale Software Repositories},
  booktitle = {35th International Conference on Software Engineering},
  series = {ICSE'13},
  month = {May},
  year = {2013},
  pages = {422--431},
  location = {San Francisco, CA},
  entrysubtype = {conference},
  abstract = {
    In today's software-centric world, ultra-large-scale software repositories,
    e.g. SourceForge (350,000+ projects), GitHub (250,000+ projects), and Google
    Code (250,000+ projects) are the new library of Alexandria. They contain an
    enormous corpus of software and information about software. Scientists and
    engineers alike are interested in analyzing this wealth of information both
    for curiosity as well as for testing important hypotheses. However, systematic
    extraction of relevant data from these repositories and analysis of such data
    for testing hypotheses is hard, and best left for mining software repository
    (MSR) experts! The goal of Boa, a domain-specific language and infrastructure
    described here, is to ease testing MSR-related hypotheses. We have implemented
    Boa and provide a web-based interface to Boa's infrastructure. Our evaluation
    demonstrates that Boa significantly reduces programming efforts, thus lowering
    the barrier to entry. We also see drastic improvements in scalability. Last
    but not least, reproducing an experiment conducted using Boa is just a matter
    of re-running small Boa programs provided by previous researchers.
  }
}