Supplementary Material - GitHub Python Dataset for Boa

  1. Introduction
  2. Usage
    1. Using Boa Website
    2. Standalone Project
  3. Example Boa Queries
  4. Dataset Description

Introduction

This page provides supplementary material for the GitHub dataset of Python Data Science (DS) projects, which was published at MSR 2019. The dataset is built on the Boa infrastructure.

Usage

The dataset can be used in two ways: through the Boa website or as a standalone project.

Using Boa Website

To use the dataset, go to the Boa website and follow these steps:

  1. From the left menu, select User Login to log in as a registered user. If you are not registered, request a user account.
  2. Write a query under Boa Source Code. If you are not familiar with the language, load one of the example Boa programs by clicking Select Examples. Examples written for this dataset can also be found in the GitHub repository.
  3. Select the 2019 February/Python dataset from the drop-down list under Input Dataset and run the query.

The job is submitted to a Hadoop cluster and executed in parallel over the dataset. When the job status is finished, the output text file becomes available for download. The job is saved for future reference and can be shared with others, who can then reproduce the result.
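Once the output text file is downloaded, it usually needs to be loaded into a script for further analysis. The following Python sketch assumes that each output line has the shape `name[index] = value`, which is how Boa output variables are typically printed; the function name `parse_boa_output` and the sample lines are hypothetical and made up for illustration only.

```python
import re
from collections import defaultdict

# Assumed line shape: name[index1][index2] = value
LINE_RE = re.compile(r"^(\w+)((?:\[[^\]]*\])*)\s=\s(.*)$")
INDEX_RE = re.compile(r"\[([^\]]*)\]")


def parse_boa_output(text):
    """Group output lines by variable name, keyed by their tuple of indices."""
    results = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        name, raw_indices, value = match.groups()
        # An unindexed variable is printed as name[]; drop the empty index.
        indices = tuple(i for i in INDEX_RE.findall(raw_indices) if i)
        results[name][indices] = value
    return dict(results)


# Made-up sample lines, not real results from the dataset.
sample = "total[] = 42\nfiles[owner/repo] = 7\n"
parsed = parse_boa_output(sample)
print(parsed)
```

Values are kept as strings because the type of each output variable depends on the query; convert them as needed.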

To learn about the Boa language and queries, browse the Boa website, especially the Programming Guide section.

Standalone Project

The dataset is also available outside of the Boa website. All data are stored in Hadoop sequence files, so one can write simple programs to read those files and obtain the parsed ASTs. The raw dataset is available here [~15 GB].

We have written a simple program that shows how to read the sequence files and obtain the parsed ASTs of the Python programs in the dataset: GitHub Link. Download the raw dataset and use this project to get the parsed ASTs.
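Before running a reader over a ~15 GB download, it can help to confirm that a file really is a Hadoop sequence file. The sketch below relies only on the fact that a SequenceFile begins with the three bytes `SEQ` followed by a one-byte format version. It does not decode records or ASTs, which is what the linked project is for; the function name and the demo file are hypothetical.

```python
import os
import tempfile

SEQ_MAGIC = b"SEQ"  # first three bytes of every Hadoop SequenceFile


def read_sequence_file_version(path):
    """Return the format version byte if the file starts with the
    SequenceFile magic bytes, otherwise None."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    if len(header) < 4 or header[:3] != SEQ_MAGIC:
        return None
    return header[3]


# Demo on a fabricated header, not on a real file from the dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.seq")
    with open(demo_path, "wb") as handle:
        handle.write(b"SEQ\x06" + b"\x00" * 16)
    demo_version = read_sequence_file_version(demo_path)
    print(demo_version)
```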

Example Boa Queries

Example Boa queries and their output can be found here: GitHub Link.

Dataset Description

The dataset contains 1,558 GitHub projects with the following properties:

  1. Is an original (not forked) project with Python as the primary language.
  2. Contains at least one data science keyword, such as machine-learning or deep neural network, in the project description. The full list of keywords is given in the appendix.
  3. Contains at least one usage of a data science library such as Pytorch, Caffe, Keras, or Tensorflow. The full list of the 33 Python data science libraries considered is given in the appendix.
  4. Has at least 80 stars.
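The four properties above combine into a single conjunctive filter. The Python sketch below is one way to express it; it is not the authors' selection script. The `Project` record and its field names are hypothetical, the keyword and library tuples are small illustrative subsets of the appendix lists, and matching a library by its lowercased name is an assumption about how usage is detected.

```python
from dataclasses import dataclass, field

# Illustrative subsets only; the full lists are in the appendix.
KEYWORDS = ("machine-learning", "deep neural network")
LIBRARIES = ("pytorch", "caffe", "keras", "tensorflow")
MIN_STARS = 80


@dataclass
class Project:
    # Hypothetical record; field names are not taken from Boa or GitHub.
    is_fork: bool
    primary_language: str
    description: str
    stars: int
    used_libraries: set = field(default_factory=set)


def is_selected(project):
    """Return True only if the project meets all four criteria."""
    description = project.description.lower()
    used = {name.lower() for name in project.used_libraries}
    return (
        not project.is_fork
        and project.primary_language == "Python"
        and any(keyword in description for keyword in KEYWORDS)
        and any(library in used for library in LIBRARIES)
        and project.stars >= MIN_STARS
    )


# Made-up example projects.
kept = Project(False, "Python", "A machine-learning toolkit", 120, {"Keras"})
dropped = Project(False, "Python", "A machine-learning toolkit", 79, {"Keras"})
print(is_selected(kept), is_selected(dropped))
```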

The dataset contains projects owned by both organizations and individual users. Some of the top-rated projects are Tensorflow Models, Keras, Scikit-learn, Pandas, Spacy, Spotify Luigi, NVIDIA FastPhotoStyle, and Theano. A full list of all 1,558 GitHub projects is available here. Of these, 350 projects are maintained by organizations (Google, Microsoft, NVIDIA, etc.) and the remaining 1,208 are maintained by individual users. Other metrics of the dataset are:

  • Number of developers: 9,839
  • Number of Python files (latest snapshot): 86,321
  • Number of Python files (all revisions): 4,977,680