Grants and contributions:
Grant or award applying to more than one fiscal year. (2017-2018 to 2022-2023)
Companies must ensure that software is both high quality and understandable. Understandability matters because it is virtually guaranteed that someone other than the original author will need to read and understand the code, whether to upgrade, fix, or replace it. In the field of natural language, researchers have used machine learning to determine an author’s gender with up to 90% accuracy; this process also identified how men and women used the language differently. This kind of information is valuable for software development teams, whether we look at (for instance) differences between men and women, junior and senior developers, or developers with different native natural languages. While there is a great deal of research on people’s use of natural language, both written and spoken, there is very little on how people use artificial languages such as those used for writing software, and none at all on the sociolinguistics of artificial language use. I propose to address this gap, with a view to using my discoveries to improve both software quality and software readability.
One high-level motivation for this work lies in the quest for diversity. If different groups use language in different ways, can tools be tweaked to guide programmers toward more stylistically “neutral” code, thus reducing potential sociolinguistic differences between diverse groups? Or would this defeat the purpose of encouraging a diverse population of contributors and their unique perspectives? When teaching students to code, are there particular styles that might be more “comfortable” for different groups, and should we be encouraging this? We don’t know the answers to these questions, but evidence from natural languages suggests that we should be considering them.
However, as we investigate how a diverse population creates software, the underlying motivation remains the need for quality code that can be clearly understood. If different groups write code differently, does that contribute to misunderstanding? Are there ways that we can, for instance, “translate” code to make it easier for another reader to understand? If there are significant differences, do they correspond to differences in code quality? Finally, which has the larger impact: language use at the micro level that we propose to start with (i.e., line by line), or the larger structural and conceptual differences that shape the overall architecture and design of large pieces of software? My research program aims to examine these questions, and I believe that this novel approach will have a significant impact on the field of software engineering.