The goal of The Open Source Survey is to create a dataset of and for the open source community. We are informed and motivated by the principles of open data and open source, but not exclusively so; we also prioritize the privacy and safety of respondents, and the scientific rigor of the survey design and fielding process. Where these come into conflict, we err on the side of respondent privacy and the scientific integrity of the data collection process.
The universe of this study is anyone who uses or otherwise engages with open source technology and development, whether passively or actively through contributions. This is a broader definition of the community than is commonly employed, and we believe one of the major contributions of this study will be a dataset that allows for exploration of open source consumers alongside contributors.
Our primary deliverable for this project is a dataset of attitudes, backgrounds, experiences, motivations, and other data related to open source, freely available to any and all interested community members, researchers, companies, or other users.
Releasing the data publicly entails tradeoffs in order to maintain respondent privacy, principally in the kinds of data we can collect and release. We aim not to collect any personally identifying information, such as GitHub usernames or email addresses. In the event that identifying information is provided anyway, such as in open-text responses, we will remove it before releasing the data publicly. Some data, such as gender, is not directly identifying and has been widely requested, but may make respondents identifiable when combined with other data. Though we hope and aim to release all of the data that we collect, we may withhold data if necessary to prevent identification of individuals.
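To make the de-identification step concrete, below is a minimal sketch of the kind of scrubbing pass that might be applied to open-text responses before release. It is illustrative only, not our actual pipeline: the regular expressions, placeholder strings, and function name are assumptions, and any real release process would pair automated scrubbing with manual review.

```python
import re

# Hypothetical patterns for identifiers that commonly surface in free text.
# A production de-identification pass would be more thorough than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<!\w)@[A-Za-z0-9-]{1,39}")  # e.g. an @username mention

def scrub(text: str) -> str:
    """Replace likely identifying substrings with neutral placeholders."""
    text = EMAIL_RE.sub("[EMAIL REMOVED]", text)  # before mentions, so the '@' is gone
    text = URL_RE.sub("[URL REMOVED]", text)
    text = MENTION_RE.sub("[USERNAME REMOVED]", text)
    return text

print(scrub("Ping @octocat or email jane@example.com"))
# -> Ping [USERNAME REMOVED] or email [EMAIL REMOVED]
```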
The principal designers of this survey are employed by GitHub in various capacities relating to data and open source. However, this is not intended to be a product survey, and we aim to avoid adopting a privileged position with regard to the final dataset. Responsible data stewardship requires us to have some selective access (e.g. to clean and de-identify the data), but otherwise we do not intend to collect any data that is not made available to other users.
Data on unique populations, such as the open source community, is most enlightening when viewed in contrast to the general population or other related communities (e.g. professional developers). Even small differences in question wording can yield differences in responses, so in order to facilitate such comparisons, we aim to re-use items from studies of relevant populations where they exist and are appropriate. While we are aware of much of the existing work on developer communities specifically, the field of public opinion research is vast, and we welcome pointers to prior work that can inform our design.
Respondents will be randomly sampled and invited to participate in the survey. For projects hosted on GitHub, a random sample of visitors will be invited to take the survey. Of course, many important projects are not on GitHub. Communities of projects hosted elsewhere, or maintained and developed through mailing lists or other methods, will be incorporated via a parallel process of random selection from similar sources of participants, as facilitated by partner organizations.
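As a rough illustration of what probability-based invitation can look like, the sketch below selects each visit independently at a fixed rate. The rate, function name, and surrounding details are assumptions for illustration, not a description of the production mechanism.

```python
import random

INVITE_RATE = 0.001  # Hypothetical: invite roughly 1 in 1,000 sampled visits.

def should_invite(rng: random.Random) -> bool:
    """Decide independently, per visit, whether to show the survey invitation.

    Selecting each visitor with the same fixed probability gives every member
    of the population an equal chance of being invited, which is what makes
    the resulting respondent pool a probability (random) sample.
    """
    return rng.random() < INVITE_RATE

# Simulated traffic: with 1,000,000 visits we expect about 1,000 invitations.
rng = random.Random(42)
invited = sum(should_invite(rng) for _ in range(1_000_000))
print(f"Invited {invited} of 1,000,000 simulated visits")
```

The same fixed-rate logic extends naturally to communities hosted off GitHub: a partner organization would apply an analogous random selection to its own pool of participants (e.g. a mailing list roster).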
Wherever possible, we have written the survey instrument with global audiences in mind and avoided North American-centric concepts and terminology. We aim to field the survey in several languages, including but not limited to English, Spanish, Chinese, Japanese, and Russian. Because surveys are sensitive to word choice, instead of relying only on crowdsourcing, we plan to use professional survey translators to translate the instrument, and to request community review of their work. We will evaluate additional languages based on the availability of translators and on the populations of open source contributors and users to whom an English version would be inaccessible.
We aim for the survey experience to be relatively short and straightforward. We plan to ask for no more than 10 to 15 minutes of respondents' time, and to use taxing formats such as open-ended text boxes sparingly. This necessarily limits the number of topics we can cover.