Sampling Projects in GitHub for MSR Studies
Almost every Mining Software Repositories (MSR) study requires, as a first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria dictated by the study goal. For example, a study related to licensing might be interested in selecting projects that explicitly declare a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to “query” the hosting service. However, researchers have to deal with usage limitations imposed by these APIs and with a lack of required information. For example, the GitHub search APIs allow 30 requests per minute and, when searching repositories, only provide limited information (e.g., the number of commits in a repository is not included). To support researchers in sampling projects from GitHub, we present GHS (GitHub Search), a dataset containing 25 characteristics (e.g., number of commits, license) of 735,669 repositories written in 10 programming languages. The set of characteristics has been derived by looking for frequently used project selection criteria in MSR studies, and the dataset is continuously updated to (i) always provide fresh data about the existing projects, and (ii) increase the number of indexed projects. The GHS dataset can be queried through a web application we built, which lets users set many combinations of the selection criteria needed for a study and download information about the matching repositories: https://seart-ghs.si.usi.ch.
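To make the API limitation mentioned in the abstract concrete, here is a minimal sketch of querying the GitHub repository search endpoint directly while staying under the quoted 30 requests per minute. It is not part of GHS and not the authors' implementation. The function names (`build_query`, `build_url`, `min_delay_seconds`, `fetch_page`, `sample_repositories`) and the example criteria are illustrative assumptions; the 30 requests per minute figure applies to authenticated requests, so passing a token is assumed.

```python
import json
import time
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"
REQUESTS_PER_MINUTE = 30  # search limit quoted in the abstract (authenticated use)


def build_query(language, min_stars=0, license_key=None):
    """Compose a search query string from simple selection criteria."""
    parts = [f"language:{language}", f"stars:>={min_stars}"]
    if license_key:
        parts.append(f"license:{license_key}")
    return " ".join(parts)


def build_url(query, page=1, per_page=100):
    """Build the request URL; per_page=100 is the largest page size accepted."""
    params = urllib.parse.urlencode(
        {"q": query, "page": page, "per_page": per_page}
    )
    return f"{SEARCH_URL}?{params}"


def min_delay_seconds(requests_per_minute=REQUESTS_PER_MINUTE):
    """Smallest pause between requests that respects the per-minute limit."""
    return 60.0 / requests_per_minute


def fetch_page(url, token=None):
    """Fetch one page of search results and decode the JSON body."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def sample_repositories(query, pages=1, token=None):
    """Collect repositories page by page, pausing between requests."""
    repositories = []
    for page in range(1, pages + 1):
        data = fetch_page(build_url(query, page), token)
        repositories.extend(data.get("items", []))
        time.sleep(min_delay_seconds())
    return repositories
```

A call such as `sample_repositories(build_query("Java", 100, "mit"), pages=2, token=...)` would return the raw repository records. As the abstract notes, these records do not include the number of commits, so that characteristic would need additional requests per repository, which is the kind of overhead a precomputed dataset avoids.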
Mon 17 May | Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
10:00 - 10:50 | Resources for MSR Research | Technical Papers / Data Showcase | MSR Room 1 | Chair(s): Felipe Ebert (Eindhoven University of Technology)
10:01 | 3m | Talk | PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code | Technical Papers | Egor Spirin (JetBrains Research; National Research University Higher School of Economics), Egor Bogomolov (JetBrains Research), Vladimir Kovalenko (JetBrains Research), Timofey Bryksin (JetBrains Research; Saint Petersburg State University)
10:04 | 3m | Talk | Mining DEV for social and technical insights about software development | Technical Papers | Maria Papoutsoglou (Aristotle University of Thessaloniki), Johannes Wachs (Vienna University of Economics and Business & Complexity Science Hub Vienna), Georgia Kapitsaki (University of Cyprus)
10:07 | 3m | Talk | TNM: A Tool for Mining of Socio-Technical Data from Git Repositories | Technical Papers | Nikolai Sviridov (ITMO University), Mikhail Evtikhiev (JetBrains Research), Vladimir Kovalenko (JetBrains Research)
10:10 | 3m | Talk | Identifying Versions of Libraries used in Stack Overflow Code Snippets | Technical Papers | Ahmed Zerouali (Vrije Universiteit Brussel), Camilo Velázquez-Rodríguez (Vrije Universiteit Brussel), Coen De Roover (Vrije Universiteit Brussel)
10:13 | 3m | Talk | Sampling Projects in GitHub for MSR Studies | Data Showcase | Ozren Dabic (Software Institute, Università della Svizzera italiana (USI), Switzerland), Emad Aghajani (Software Institute, USI Università della Svizzera italiana), Gabriele Bavota (Software Institute, USI Università della Svizzera italiana)
10:16 | 3m | Talk | gambit – An Open Source Name Disambiguation Tool for Version Control Systems | Technical Papers | Christoph Gote (Chair of Systems Design, ETH Zurich), Christian Zingg (Chair of Systems Design, ETH Zurich)
10:19 | 31m | Live Q&A | Discussions and Q&A | Technical Papers