Exclude company's own projects filter #67

Open
vallode opened this issue Mar 27, 2021 · 13 comments

@vallode

vallode commented Mar 27, 2021

I think it would be pertinent to include a filter that excludes contributions to the company's own open source projects.
As much as I enjoy seeing the numbers, I feel like it would be amazing to see which companies contribute the most outside of their own circle of influence. This could shift the rankings somewhat and showcase a bit more of the open source community on the top lists.

Throwing this out there as an idea, absolutely understand if this is not relevant to this project but maybe something worth thinking about!

@vlad-isayko
Collaborator

The idea is pretty interesting.

A number of basic questions also need to be answered before this idea can be implemented.

Main question:

How to identify the repositories in relation to the company (a company's own repository or not)?

There is an option to use information about the organization (see OrgId). However, that requires a mapping between each company and the organizations that belong to it. Such a list would have to be created by hand for each company and constantly kept up to date. And again, there is no certainty that this criterion is 100% valid.

Do you have any ideas on this?

@abitrolly
Contributor

> How to identify the repositories in relation to the company (a company's own repository or not)?

  1. If the source repo belongs to the company. Maintaining official repo status is no different than maintaining an official list of domains.
  2. If all commits and merge requests are from the company

@vlad-isayko
Collaborator

> 1. If the source repo belongs to the company. Maintaining official repo status is no different than maintaining an official list of domains.

I agree that, at first glance, maintaining a list of repositories does not differ much from maintaining a list of companies. But there are significantly more repositories than companies, and the list of repositories changes far more often than the list of domains.

> 2. If all commits and merge requests are from the company

I didn't quite understand what it meant. Could you explain a little more broadly?

You are suggesting that a company's own repositories are those in which commits come only from the company, right? Is this a necessary and/or sufficient condition?

@abitrolly
Contributor

abitrolly commented Mar 29, 2021

> But there are significantly more repositories than companies, and the list of repositories changes far more often than the list of domains.

It could turn out that the number of non-owned repositories that companies commit to is insignificant.

> I didn't quite understand what it meant. Could you explain a little more broadly?

A repo where all commits are from corporate emails is definitely owned by the company. That's a sufficient condition for a filter. )
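
A minimal sketch of that sufficient condition, assuming the commit author emails of a repo are available as plain strings and the company's email domains are known. The function name and the `acme.example` domain are hypothetical and not part of OSCI:

```python
def all_commits_from_company(author_emails, company_domains):
    """Return True if every commit author email belongs to a company domain."""
    domains = {d.lower() for d in company_domains}
    emails = list(author_emails)
    if not emails:
        # An empty repository tells us nothing about ownership.
        return False
    return all(e.rsplit("@", 1)[-1].lower() in domains for e in emails)


# Hypothetical data: acme.example stands in for a real corporate domain.
print(all_commits_from_company(["a@acme.example", "b@acme.example"], ["acme.example"]))
print(all_commits_from_company(["a@acme.example", "c@gmail.com"], ["acme.example"]))
```

A single commit from a personal address makes the check fail, so this only catches repos with no visible outside activity at all.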

@vallode
Author

vallode commented Mar 31, 2021

Sorry for taking a while to respond, I simply don't have enough information on the workflow that OSCI uses (my bad) to elaborate further than what @abitrolly said. I would only ever consider a contribution to be in the company's full self-interest if the contribution landed on a repository that was owned by the company itself.

Is this a trivial task? Very unlikely, I think a "repo where all commits are from corporate emails" is too specific of a scenario and wouldn't affect the dataset very much (especially for the top dogs which is where my interest lies the most)...

We'd need a way to filter out contributions made from the organisation's own authors into the organisation's own repositories.
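
One way such a filter could look, as a rough sketch only. It assumes each commit already carries the author's company and the GitHub org (or user) that hosts the repo, and that a hand-maintained list of orgs per company exists. The `Commit` type, `external_commits`, and all names in the example data are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    author_company: str  # company inferred from the author's email domain
    repo_owner: str      # GitHub org or user that hosts the repository


def external_commits(commits, company, company_orgs):
    """Keep the company's commits that land outside its own GitHub orgs."""
    owned = {org.lower() for org in company_orgs}
    return [
        c for c in commits
        if c.author_company == company and c.repo_owner.lower() not in owned
    ]


# Hypothetical example: "acme" owns the "acme" and "acme-labs" orgs.
commits = [
    Commit("acme", "acme"),
    Commit("acme", "acme-labs"),
    Commit("acme", "kubernetes"),
    Commit("other", "kubernetes"),
]
print(external_commits(commits, "acme", ["acme", "acme-labs"]))
```

The hard part stays outside the code: someone still has to build and maintain the company-to-org list.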

@abitrolly
Contributor

> We'd need a way to filter out contributions made from the organisation's own authors into the organisation's own repositories.

I agree. That would be sufficient.

@dzintars

dzintars commented May 1, 2021

... or you could start small and at least list the number of repositories that each organization's contributors contribute to.
If most of an organization's contributors contribute to a single repository or just a few, this is a good indication of their efforts. :)
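
A small sketch of that "start small" count, assuming contributions are available as `(company, repository)` pairs. That input shape and the function name are assumptions for illustration, not OSCI's actual data model:

```python
from collections import defaultdict


def repos_per_company(contributions):
    """Map each company to the number of distinct repositories it commits to."""
    repos = defaultdict(set)
    for company, repo in contributions:
        repos[company].add(repo)
    return {company: len(names) for company, names in repos.items()}


# Hypothetical input: repeated pairs stand for multiple commits to one repo.
sample = [
    ("acme", "acme/widgets"),
    ("acme", "acme/widgets"),
    ("acme", "torvalds/linux"),
    ("other", "torvalds/linux"),
]
print(repos_per_company(sample))
```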

@patrickstephens2

I suggest the way to move forward on this issue is:

  1. pick a company at random
  2. look at the list of repos which OSCI is showing their employees contribute to
  3. try to define some logic (algorithm) defining which of these repos are "company repos" vs "non-company repos". As part of this task you will have to define what is a "company repo", that in itself will be challenging.
  4. Now pick another company at random and test the logic you came up with, refine it.
  5. And so on with additional companies until you have logic which appears to manage the general case.

It's important to understand that a perfect algorithm for this does not exist, just different directions to go, each with pros and cons. An empirical approach (if that's the right term) like I suggest above is necessary rather than defining a theoretical approach. Your goal has to be to iterate until you reach a logic which is "good enough" to show a general picture of activity across organizations. This was our experience defining the logic for OSCI itself. What looks easy at a high level gets very challenging when one tries to define the detail and algorithmize it.

@abitrolly
Contributor

> As part of this task you will have to define what is a "company repo", that in itself will be challenging.

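def outside_contributions(total_committed, employees_committed,
                          contractors_committed, robots_committed):
    # A repo has outside contributions when some commits come from
    # neither employees, contractors, nor robots (bots).
    insiders = employees_committed + contractors_committed + robots_committed
    return total_committed - insiders > 0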

@patrickstephens2

Let's take company ACME. It creates and runs project X. This project is not under the ACME org on github, so programmatically not directly connectable to the company. The project has 100 contributors, 99 who work at ACME and 1 who is outside (perhaps it is an ex-employee who worked on this before leaving the company and continued after... I have seen such examples). Is this a company project?
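
To make the ambiguity concrete, here is a tiny sketch that computes the insider share for project X and compares a strict "everyone is from the company" rule with a threshold rule. The 95% cutoff is purely hypothetical and is chosen only to show that the two rules disagree on this case:

```python
def insider_share(insider_contributors, total_contributors):
    """Fraction of a project's contributors who work at the company."""
    if total_contributors <= 0:
        raise ValueError("total_contributors must be positive")
    return insider_contributors / total_contributors


share = insider_share(99, 100)  # project X from the ACME example
strict_rule = share == 1.0      # every contributor is from the company
threshold_rule = share >= 0.95  # hypothetical cutoff, for illustration only
print(share, strict_rule, threshold_rule)
```

Under the strict rule project X is not a company project. Under the threshold rule it is, and picking the cutoff is exactly the judgment call in question.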

@dzintars

What could be the simplest, and probably not the most accurate, insight? While getting perfect stats sounds sweet, most likely we will not get there right away. So... what could be done right now to make the index 1% better?
How about CLAs? Could those be considered an indication? If a repo requires contributors to submit a CLA, could it be considered org X's repository?
Could a manual PR process be implemented to meta-tag the repos? Like... the community could submit PRs to this repository to mark indexed repos as belonging to one category or the other, and even augment the metadata? While a fully automated process is neat... I think we are mostly interested in something like 2-5K public repositories, and those definitely could be meta-tagged manually over time.

@abitrolly
Contributor

Maybe the priority should be to publish the data that would make different kinds of filters possible. Right now the site https://opensourceindex.io/ just links to this repo, with no diagrams of the DB schema and no information on whether the BigQuery datasets are public.

@jeffwilcox

At our company, internally we gather public data on GitHub activity from employees who choose to opt-in regarding their GitHub activity and contributions, with the goal of identifying trends in contributions to projects outside of Microsoft's governance. Our data is skewed differently than this index, however, since we have an internal indicator of who our employees are on GitHub once they opt-in to tell us, vs having to determine it from profiles.

Our numbers for December 2021, for example, are significantly higher for 'total community' and other figures as a result of so many people being e-mail private on GitHub... but of that specific month's contributions, I tried pulling equivalent data, and around a third of our actively-open-contributing employees contributed to projects not governed by our company, yielding a number higher than the index but not majorly larger.

While the data is interesting, our key reason for differentiating "is it controlled by Microsoft or not" is to help encourage our employees' participation in communities to become eligible in our FOSS Fund and to evolve the culture.

I agree slicing off a company's controlled projects is an interesting pivot, but a murky gray area, especially given foundations and cross-industry collaborations and so on.


6 participants