ManyTypes4Py: A benchmark Python dataset for machine learning-based type inference
In this paper, we present ManyTypes4Py, a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5,382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of duplication bias. To facilitate training and evaluation of ML models, the dataset was split into training, validation, and test sets by file. To extract type information from abstract syntax trees (ASTs), a lightweight static analysis pipeline was developed and accompanies the dataset. Using this pipeline, the collected Python projects were analyzed and the results of the AST analysis were stored in JSON-formatted files. The ManyTypes4Py dataset is shared on Zenodo and its tools are publicly available on GitHub.
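To make the AST-to-JSON step more concrete, the following is a minimal sketch of how type annotations could be read from a Python file's AST and serialized as JSON. It is not the pipeline that ships with the dataset: the function name `extract_annotations`, the output field names, and the `SAMPLE` snippet are assumptions made for illustration, and the code relies only on the standard `ast` and `json` modules (`ast.unparse` requires Python 3.9 or later).

```python
import ast
import json


def extract_annotations(source: str) -> dict:
    """Collect annotations from Python source text (hypothetical helper).

    Only function parameters, return types, and annotated assignments to
    simple names are covered; *args and **kwargs are ignored for brevity.
    """
    tree = ast.parse(source)
    result = {"functions": [], "variables": {}}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            all_args = args.posonlyargs + args.args + args.kwonlyargs
            # Unannotated parameters are recorded with a None type.
            params = {
                a.arg: ast.unparse(a.annotation) if a.annotation else None
                for a in all_args
            }
            result["functions"].append({
                "name": node.name,
                "params": params,
                "returns": ast.unparse(node.returns) if node.returns else None,
            })
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            result["variables"][node.target.id] = ast.unparse(node.annotation)
    return result


# Illustrative input; any Python source string works here.
SAMPLE = '''
count: int = 0

def greet(name: str, times=1) -> str:
    return name * times
'''

if __name__ == "__main__":
    print(json.dumps(extract_annotations(SAMPLE), indent=2))
```

Recording unannotated parameters as `None` rather than dropping them keeps both labeled and unlabeled slots visible in the output, which is one possible design choice for data intended for type inference models.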
Tue 18 May (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
10:00 - 10:50 | ML and Deep Learning (Technical Papers / Data Showcase / Registered Reports) at MSR Room 2 | Chair(s): Hongyu Zhang (The University of Newcastle)
10:01 | 4m | Talk | Fast and Memory-Efficient Neural Code Completion | Technical Papers | Alexey Svyatkovskiy (Microsoft), Sebastian Lee (University of Oxford), Anna Hadjitofi (Alan Turing Institute), Maik Riechert (Microsoft Research), Juliana Franco (Microsoft Research), Miltiadis Allamanis (Microsoft Research, UK)
10:05 | 4m | Research paper | Comparative Study of Feature Reduction Techniques in Software Change Prediction | Technical Papers | Ruchika Malhotra (Delhi Technological University), Ritvik Kapoor (Delhi Technological University), Deepti Aggarwal (Delhi Technological University), Priya Garg (Delhi Technological University)
10:09 | 4m | Talk | An Empirical Study on the Usage of BERT Models for Code Completion | Technical Papers | Matteo Ciniselli (Università della Svizzera Italiana), Nathan Cooper (William & Mary), Luca Pascarella (Delft University of Technology), Denys Poshyvanyk (College of William & Mary), Massimiliano Di Penta (University of Sannio, Italy), Gabriele Bavota (Software Institute, USI Università della Svizzera italiana)
10:13 | 3m | Talk | ManyTypes4Py: A benchmark Python dataset for machine learning-based type inference | Data Showcase | Amir Mir (Delft University of Technology), Evaldas Latoskinas (Delft University of Technology), Georgios Gousios (Facebook & Delft University of Technology)
10:16 | 3m | Talk | KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle | Data Showcase | Luigi Quaranta (University of Bari, Italy), Fabio Calefato (University of Bari), Filippo Lanubile (University of Bari)
10:19 | 3m | Talk | Exploring the relationship between performance metrics and cost saving potential of defect prediction models | Registered Reports | Steffen Herbold (University of Göttingen)
10:22 | 28m | Live Q&A | Discussions and Q&A | Technical Papers