Data Balancing Improves Self-Admitted Technical Debt Detection (MSR 2021 - Technical Papers)

Who

Murali Sridharan, Leevi Rantala, Maëlick Claes, Mika Mäntylä

Track

MSR 2021 Technical Papers

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 19 May 2021 10:11 - 10:15 at MSR Room 2 - Dependencies and OSS Chair(s): Luca Pascarella

Abstract

A high imbalance exists between technical debt and non-technical debt source code comments. Such imbalance affects Self-Admitted Technical Debt (SATD) detection performance, and existing literature lacks empirical evidence on the choice of balancing technique. In this work, we evaluate the impact of multiple balancing techniques, including Data level, Classifier level, and Hybrid, for SATD detection in Within-Project and Cross-Project setups. Our results show that the Data level balancing technique SMOTE or Classifier level Ensemble approaches with Random Forest or XGBoost are reasonable choices, depending on whether the goal is to maximize Precision, Recall, F1, or AUC-ROC. We compared our best-performing model with the previous SATD detection benchmark (a cost-sensitive Convolutional Neural Network). Interestingly, the top-performing XGBoost with SMOTE sampling improved the Within-Project F1 score by 10% but fell short in the Cross-Project setup by 9%. This supports the higher generalization capability of deep learning in Cross-Project SATD detection, whereas within individual projects, classical machine learning algorithms can deliver better performance. We also evaluate and quantify the impact of duplicate source code comments on SATD detection performance. Finally, we employ SHAP and discuss the interpreted SATD features. We have included the replication package and shared a web-based SATD prediction tool with the balancing techniques in this study.
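To make the Data level balancing idea concrete, the following is a minimal sketch of SMOTE-style oversampling using only the Python standard library. It is not the authors' implementation or their replication package. It assumes comments have already been turned into numeric feature vectors (for example by some text vectorizer), and the function names `smote` and `balance`, the default `k=5`, and the label convention (1 for the minority SATD class) are hypothetical choices made for illustration. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified SMOTE sketch: pick a minority point, pick one of its k nearest
    minority neighbours, and place a new point between the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(max(0, n_new)):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size.

    Assumes a binary problem; returns new lists and leaves X and y unchanged.
    """
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)
```

In a setup like the one described in the abstract, such a step would typically be applied to the training split only, before fitting a classifier such as Random Forest or XGBoost, so that synthetic samples do not leak into the evaluation data.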

Link to Preprint

https://arxiv.org/abs/2103.13165

Murali Sridharan

University of Oulu

Leevi Rantala

University of Oulu

Maëlick Claes

University of Oulu

Finland

Mika Mäntylä

University of Oulu

Finland

Session Program

Wed 19 May
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

10:00 - 10:50	Dependencies and OSS (Technical Papers / Registered Reports) at MSR Room 2. Chair(s): Luca Pascarella, Delft University of Technology

10:01 (3 min) Talk: Identifying Critical Projects via PageRank and Truck Factor. Technical Papers. Rolf-Helge Pfeiffer (IT University of Copenhagen). Pre-print
10:04 (4 min) Talk: Revisiting Dockerfiles in Open Source Software Over Time. Technical Papers. Kalvin Eng (University of Alberta), Abram Hindle (University of Alberta). Pre-print
10:08 (3 min) Talk: Does the First-Response Matter for Future Contributions? A Study of First Contributions. Registered Reports. Noppadol Assavakamhaenghan (Nara Institute of Science and Technology), Supatsara Wattanakriengkrai (Nara Institute of Science and Technology), Naomichi Shimada (Nara Institute of Science and Technology), Raula Gaikovina Kula (NAIST), Takashi Ishio (Nara Institute of Science and Technology), Kenichi Matsumoto (Nara Institute of Science and Technology). Pre-print
10:11 (4 min) Talk: Data Balancing Improves Self-Admitted Technical Debt Detection. Technical Papers. Murali Sridharan (University of Oulu), Leevi Rantala (University of Oulu), Maëlick Claes (University of Oulu), Mika Mäntylä (University of Oulu). Pre-print
10:15 (35 min) Live Q&A: Discussions and Q&A. Technical Papers
