GSoC Idea: Implementing CHAOSS metrics with Perceval #81

Closed
jgbarah opened this issue Jan 31, 2019 · 82 comments

@jgbarah
Collaborator

jgbarah commented Jan 31, 2019

[This issue is for addressing questions and comments related to this GSoC idea, which is one of the ideas proposed by the CHAOSS group for the 2019 edition of GSoC.]

Description

The GMD Working group is proposing some metrics
that are computed with information obtained from software development repositories.
One of the goals of the working group is to provide reference implementations of
those metrics, based on the output produced by Perceval.
As an example, still a work in progress, check the Python notebook with the
reference implementation for changes to the source code.

The aims of this idea are as follows:

  • Producing Python (Jupyter) notebooks with proposals for reference implementations
    for these metrics. The proposals will follow the usual acceptance mechanism of the working group
    (based on GitHub issues and pull requests). In general, all reference implementations will
    build on the output produced by Perceval for the relevant data source,
    and will explore the peculiarities of the metric and its implementation for that data source.

  • Documenting the notebooks as much as possible, so that any person trying to implement the metric
    can understand not only how to implement it, but also the details that should be taken into account.
    Documentation should also be suitable for people willing to dig deeper into the details of
    the metric, with the aim of understanding it (not necessarily implementing it).

  • When needed, proposing changes to Perceval, and maybe other components in GrimoireLab,
    that allow for the implementation of the metric.

  • Participating in the acceptance process of the working group.

The aims will require programming in Python, producing Python notebooks, interaction with other
people in the working group to learn about subtle details of the metrics, and producing
documentation.

  • Difficulty: Medium
  • Requirements: Python programming, experience with Python notebooks, skills for producing documentation.
  • Recommended: Experience in data analytics with Python, if possible involving the use of Pandas, will be a plus.
  • Mentors: @jgbarah and @valeriocos.

Microtasks

To become familiar with Perceval, it is useful to retrieve data from several repositories or a FOSS (free, open source software) project, and produce some kind of analysis and/or summary of results in a Jupyter Notebook. You can start by browsing the GrimoireLab Tutorial, and in particular, the chapter on Perceval.

Once you're familiar with producing analysis, you can exploit the information in the indexes via a Python script presented as a Python Jupyter Notebook. For showing it, you can make it work with MyBinder, and show us its MyBinder link. For the data to be presented in the analysis, you can run Perceval on any FOSS project with at least 5 GitHub or GitLab repositories which include git (at least 1,000 commits in total), issues (at least 200 issues in total) and pull requests / merge requests (at least 200 pull requests / merge requests in total).
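
As a rough way of checking a collection against the minimums above, the following sketch counts repositories and items per category. It assumes the Perceval items were stored one JSON document per line (for example, with the --json-line option) and that each item carries the origin and category fields. The function names, the MINIMUMS table and the ".git" suffix handling are illustrative assumptions, not part of Perceval.

```python
import json
from collections import Counter

# Minimum sizes quoted above (hypothetical table, for illustration only).
MINIMUMS = {"repositories": 5, "commit": 1000, "issue": 200, "pull_request": 200}


def load_items(path):
    """Read items from a file with one JSON document per line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def normalize_origin(origin):
    """Assumption: git and GitHub/GitLab origins differ only by a '.git' suffix."""
    origin = origin.rstrip("/")
    return origin[:-4] if origin.endswith(".git") else origin


def dataset_summary(items):
    """Count distinct repositories and the number of items per category."""
    counts = Counter()
    origins = set()
    for item in items:
        origins.add(normalize_origin(item["origin"]))
        category = item["category"]
        # GitLab merge requests are counted together with GitHub pull requests.
        if category == "merge_request":
            category = "pull_request"
        counts[category] += 1
    summary = {"repositories": len(origins)}
    for category in ("commit", "issue", "pull_request"):
        summary[category] = counts[category]
    return summary


def missing_minimums(summary, minimums=MINIMUMS):
    """Return the entries of `summary` that fall short of `minimums`."""
    return {key: summary.get(key, 0)
            for key, wanted in minimums.items()
            if summary.get(key, 0) < wanted}
```

An empty dictionary returned by missing_minimums(dataset_summary(items)) would suggest the collection is large enough for the microtasks.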

  • Microtask 0: Use this notebook implementing the Code_Changes metric (see it in MyBinder) as an example of how to collect the data, producing a single JSON file per data source, with all items (commits, issues, pull/merge requests) in it. Produce one notebook per data source (git, GitHub/GitLab issues, GitHub pull requests / GitLab merge requests) showing a summary of the contents of that file (number of items in it, and number of different identities in it counting authors/committers for git, submitters for issues and pull/merge requests). This microtask is mandatory, to show that you can retrieve data and produce a notebook showing it. In each notebook, include also the list of repositories retrieved, and the date of retrieval, using data available in the JSON file.

  • Microtask 1: Produce a notebook showing (and producing) a list with the activity per quarter: number of new committers, submitters of issues, and submitters of pull/merge requests, number of items (commits, issues, pull/merge requests), number of repositories with new items (all of this per quarter) as a table and as a CSV file. Use plain Python3 (e.g., no Pandas) for this.

  • Microtask 2: Like Microtask 1, but now using Pandas.

  • Microtask 3: Produce a notebook with charts showing the distribution of time-to-close for issues already closed, and opened during the last year, for each of the repositories analyzed, and for all of them together. Use Pandas for this, and the Python charting library of your choice (as long as it is a FOSS module).

  • Microtask 4: Produce a listing of repositories, as a table and as a CSV file, with the number of commits authored, issues opened, and pull/merge requests opened, during the last three months, ordered by the total number (commits plus issues plus pull requests). Use plain Python3 (e.g., no Pandas) for this.

  • Microtask 5: Like Microtask 4, but now using Pandas.

  • Microtask 6: Perform any other analysis you may find interesting, based on the Perceval data you collected.
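
To give an idea of the per-quarter counting asked for in Microtask 1, here is a minimal plain-Python sketch restricted to git commits. It assumes Perceval git items, where data["Author"] and data["AuthorDate"] hold the author and the author date in git's default format, and an English locale for parsing day and month names. The helper names and the CSV column names are hypothetical; issues and pull/merge requests would need their own date and submitter fields.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone

# Git's default date layout, e.g. "Tue Jan 15 10:00:00 2019 +0000".
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def quarter_of(date):
    """Label a datetime with its calendar quarter, e.g. '2019Q1'."""
    return "{}Q{}".format(date.year, (date.month - 1) // 3 + 1)


def activity_per_quarter(commits):
    """Return rows of (quarter, commits, new authors, active repositories)."""
    dated = []
    for item in commits:
        when = datetime.strptime(item["data"]["AuthorDate"], GIT_DATE_FORMAT)
        # Quarters are computed in UTC so that time zones do not split them.
        dated.append((when.astimezone(timezone.utc),
                      item["data"]["Author"], item["origin"]))
    dated.sort(key=lambda entry: entry[0])

    seen_authors = set()
    stats = defaultdict(lambda: {"commits": 0, "new_authors": 0, "repos": set()})
    for when, author, origin in dated:
        row = stats[quarter_of(when)]
        row["commits"] += 1
        row["repos"].add(origin)
        # An author is "new" in the quarter of their first commit.
        if author not in seen_authors:
            seen_authors.add(author)
            row["new_authors"] += 1
    return [(quarter, row["commits"], row["new_authors"], len(row["repos"]))
            for quarter, row in sorted(stats.items())]


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["quarter", "commits", "new_authors", "active_repos"])
    writer.writerows(rows)
    return out.getvalue()
```

Identities are compared here by the raw author string, so the same person using two email addresses would be counted twice; merging identities is left out of this sketch.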

If you want, you can also:

  • Microtask 7: Produce a pull request for any of the GrimoireLab tools, and try to follow instructions until it gets accepted. Try to do something simple that you consider useful, not necessarily a fix to the code: improvement of comments, documentation or testing will usually be easier to get accepted, and very useful for the project. Please avoid producing a random pull request just to have another microtask: the objective is not that you get one more microtask done, but that you understand how to interact with developers in the project by contributing something that could be useful.

  • Microtask 8: Like Microtask 7, but for the GMD working group. You may try to fix some error (even grammatical) in the description of a metric, improve the description of a focus area, fix or improve a reference implementation, or even produce a new reference implementation proposal. As in the previous microtask, the goal is not that your contribution is accepted (which of course would be great), or that you just complete yet another microtask, but that you interact with the working group, and you understand its context and procedures.

Of course, there is no need to do all the microtasks; you only need to show that your skills are good enough for working on this project.

Showing the work you did

If you want to show the work you did, open a GitHub repository, and upload to it:

  • A README.md file explaining what you did, and linking to the results (which will be in the same repository, see below). This will be the main file to show your skills and interest in the project, so try to make it organized and clear, in a way that we can easily understand what you did.

  • Python Jupyter notebooks you produced, with enough information (in comments and/or in the README.md file) so that we can run them if needed. Upload Python scripts ready to work assuming GrimoireLab packages are already installed. Upload Jupyter notebooks ready to be seen via the GitHub web interface, but also include links to fully-functional versions of the notebooks in MyBinder.

  • Links to the pull request you did (if any), along with any comment you may have about it.

Submitting information for the application process

  • If you're interested in this idea, you must complete at least microtask 0, and at least three microtasks.

  • Once you have completed at least one microtask, go to the governance repository and create a pull request to add yourself, your information, and a link to your repository with the completed microtask(s) in the GSoC-interest.md file (see above for the contents of the repository).

  • You are welcome to include in your repository other information that could be of interest, such as open issues or pull requests submitted to the project to which you intend to contribute during GSoC, contributions to other projects, skills, and other related information.

  • You must complete these things by the GSoC deadline for proposals. Make sure to also submit the information required by GSoC for applicants (i.e., project proposal), linking to it from your pull request in the GSoC-interest.md file.

Useful documentation

Getting feedback for your proposal & microtasks

Our idea is to have a look at proposals that are registered in the governance repository starting after 25th March, when students can formally apply. But if you have specific doubts, comments, or whatever, use this issue.

In general, we don't want to give advice too specific to one case, because that could give some advantage to some person with respect to the others. Answering questions and addressing comments (if you want, based on your proposal) is not a problem as long as that's done in public, hence the threads in this issue.

Asking for help

If you need help, please use the following channels.

For issues related to GrimoireLab:

For general issues related to CHAOSS or CHAOSS metrics

@aswanipranjal

aswanipranjal commented Feb 8, 2019

@jgbarah I think the link for the Python notebook with the reference implementation for commits is incorrect here. Did you mean to point at https://github.com/chaoss/wg-gmd/blob/master/implementations/Code_Changes-Git.ipynb instead?

@jgbarah
Collaborator Author

jgbarah commented Feb 13, 2019

Yes, @aswanipranjal thanks for noticing. I'll change that.

@Polaris000
Contributor

Which metrics are we specifically talking about here?
Only these metrics (present in wg-gmd repository)? Or is it something else?

Though most metrics are consistent with the chaoss/metrics repository, some are present as archived metrics in the wg-gmd repo. What about such metrics?

@jgbarah
Collaborator Author

jgbarah commented Feb 24, 2019

We're in the process of defining the metrics in the focus areas directory. Have a look at those files, which are still a work in progress.

@sarvesh211999
Contributor

Produce one notebook per data source (git, GitHub/GitLab issues, GitHub pull requests / GitLab merge requests) showing a summary of the contents of that file

For a given repo we have to produce one notebook, right? And in it we have to output the number of commits, and of pull requests opened and merged?

Is this what is required for microtask 0, or anything else?

@aswanipranjal

@sarvesh211999
To keep everything tidy, you need to produce a Notebook for each of the data sources that you gather the data for.

Produce one notebook per data source (git, GitHub/GitLab issues, GitHub pull requests / GitLab merge requests)

So the repository that you create can have multiple Notebooks

@sarvesh211999
Contributor

Okay. I will create a pull request and then you can review it, so that I can make any changes needed.
Thanks for your help.

@aswanipranjal

@sarvesh211999 please go through this document to understand how to show CHAOSS what you have done: https://github.com/chaoss/governance/blob/master/GSoC-interest.md

@sarvesh211999
Contributor

Yeah, I was just asking whether it will be reviewed and responded to if anything is missing?

@sarvesh211999
Contributor

@aswanipranjal
After analysing how many data sources do I have to create the pull request for Microtask 0?

@aswanipranjal

@sarvesh211999, I apologize for the late reply:

Yeah, I was just asking whether it will be reviewed and responded to if anything is missing?

@jgbarah might be better suited to answer this.

After analysing how many data sources do I have to create the pull request for Microtask 0?

Microtask 0 is more about you understanding how GrimoireLab tools work and interact with each other. The more data sources you can produce reports for, the better you'll understand the systems, I'd say.

@sarvesh211999
Contributor

Thank You Very Much. @aswanipranjal

@Polaris000
Contributor

Polaris000 commented Mar 2, 2019

Can we improve the wording of micro-task 0? I suggest adding information answering the following questions as they have been asked in the comments:

  • What exactly is meant by a data source? (A repository? A project with several repos? Or the data fetched from a repository, like commits, pull requests, etc.?)
  • How many data sources do we fetch the data from? (answered in the comments but would be better to include)
  • How many notebooks are to be submitted per data source? (this will become clearer when the first question is answered)

Please have a look @aswanipranjal, @jgbarah.

@s-ankur
Contributor

s-ankur commented Mar 3, 2019

Hello, I am interested in this project. I have mostly completed microtask 0 (analysis of commits); however, I ran into a problem with the GitHub rate limits being exceeded (if I try to run perceval github on any large project). I have tried a few ways to get around this, but I'm afraid I've hit a roadblock. Can I just use a smaller project instead? Also, as @Polaris000 said, I find the wording of the task unclear as well, and we would benefit from further clarification.

@Polaris000
Contributor

Hi @s-ankur! Regarding your rate limit, have you tried using the authentication token parameter with perceval?

@Polaris000
Contributor

Polaris000 commented Mar 3, 2019

The rate limit is low for unauthenticated users.
Have a look at the documentation

@s-ankur
Contributor

s-ankur commented Mar 3, 2019

Yes, I did, but I guess the repo I'm trying is too big either way. I think I'll try it with a smaller repo after my timeout is over. Thanks @Polaris000

@sarvesh211999
Contributor

@s-ankur use the --sleep-for-rate flag

@Polaris000
Contributor

Polaris000 commented Mar 3, 2019

This should do the trick. Thanks @sarvesh211999.
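
The two suggestions above (an API token plus the sleep-for-rate flag) can be combined in a single run. The following sketch only assembles the command line, so it can be inspected before launching it; the helper name perceval_github_command and the placeholder token are hypothetical, and the options used (--category, --api-token, --sleep-for-rate) are the ones I understand the perceval github command to accept.

```python
import shlex


def perceval_github_command(owner, repository, category="issue",
                            api_token=None, sleep_for_rate=True):
    """Build the argument list for a `perceval github` run."""
    args = ["perceval", "github", owner, repository, "--category", category]
    if sleep_for_rate:
        # Ask Perceval to wait when the rate limit is exhausted.
        args.append("--sleep-for-rate")
    if api_token:
        # Authenticated requests get a much higher rate limit.
        args.extend(["--api-token", api_token])
    return args


command = perceval_github_command("chaoss", "grimoirelab-perceval",
                                  api_token="YOUR_TOKEN")
# Show the command as it would be typed in a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list could be passed to subprocess.run, redirecting standard output to the JSON file used by the notebooks.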

@jgbarah
Collaborator Author

jgbarah commented Mar 3, 2019

@sarvesh211999 said:

Yeah, I was just asking whether it will be reviewed and responded to if anything is missing?

Usually, you upload the results for all the microtasks you do in your own repository, and include that in the pull request you produce (as stated in the /GSoC-interest.md file).

We will go through all the repos in that file when we enter the evaluation period, or earlier (if we find some time to have a look earlier). Of course, you can ask whenever you find some problem. But if you understand what to do, I think you'll know if it is (well) done. Of course, the better documented, and maybe complemented with related stuff, the better. Remember that the idea is to show that you can work with the kind of stuff that you will need during the GSoC.

@jgbarah
Collaborator Author

jgbarah commented Mar 3, 2019

@Polaris000 said:

Can we improve the wording of micro-task 0?

Thanks for the suggestion, but I think we had better leave it as it is now. The main reason for that is that (I think) the microtask is reasonably well defined, if you read it carefully. Of course, there may be some level of fuzziness depending on your assumptions, but that's the reason why we have comments. And finally, I think it is better to leave it relatively open, because the main reason for the microtask is not to do something specific, but to show that people are familiar with the basics needed for this idea.

@s-ankur
Contributor

s-ankur commented Mar 5, 2019

@jgbarah can you please review my microtasks? I have currently completed 3 microtasks (slightly incomplete) and am working on the rest, but I would love to have some feedback as to whether I am on the right track.

@sarvesh211999
Contributor

@aswanipranjal @jgbarah I was just asking whether, when creating the pull request, we have to keep the repo private and add you as collaborators, or whether we have to keep it public. In public there is a possibility that it can be copied.

@vchrombie
Member

@sarvesh211999, maybe you can have a look at this conversation. :)
chaoss/community#86

@sarvesh211999
Contributor

I want to ask whether there is a list of the metrics that have been decided for implementation. Then our proposals could be more on point rather than being general or a long list.

@Polaris000
Contributor

The metrics are being worked on in the focus group

@sarvesh211999
Contributor

So we have to draft the proposal on the basis of the metrics, describing how we are going to implement them?

@DataGasmic

Hello, Everyone.

I am an Indian undergrad student from Kolkata, in my sophomore year, currently pursuing a B-Tech in Information Technology. I am very excited and interested in this project. I feel like I am very late to this conversation, but I would still love to contribute. I have gone through the comments above and started working on the microtasks.

@jgbarah
Collaborator Author

jgbarah commented Mar 20, 2019

For all of you asking for us to review microtasks, in public or private repos:

  • I prefer not to look at private repos. Thanks for the invitations, anyway. The main reason for this is to maintain transparency, letting anyone see any advice that we provide to anyone else. I know that some students prefer not to make their repo public until the very last moment, to avoid others reusing their code, and I understand that. But even in that case, please consider that in GitHub it is very easy to know who copied from whom, if that were to happen, and that the main criterion for accepting proposals is the overall evaluation of microtasks, involvement, etc., and not whether a single microtask was completed or not.

  • I prefer not to provide overall evaluations of microtasks. I will be happy to answer specific questions, and help people to complete microtasks. But in general, I will only evaluate microtasks when the evaluation period starts.

@Polaris000
Contributor

Polaris000 commented Apr 9, 2019

@valeriocos @jgbarah @GeorgLink @aswanipranjal
On the GSoC submission page, do I have to submit a draft first? Or can I directly submit the proposal pdf?
I am asking for official reasons only: If I send you a draft, will you approve it? And for a student to get to work with CHAOSS, is it necessary to get a draft approved first?
Please let me know.

@germonprez
Contributor

@Polaris000 and @quirroone I would recommend that you get your proposal (in PDF) submitted to the Google submission system. Follow their rules. I suspect that the mentors are not going to pre-approve submissions as that would create an additional layer of work for them. That said, others can give their advice.

@Polaris000
Contributor

Submitted

@harshalmittal4
Contributor

Submitted!!

@harshalmittal4
Contributor

@valeriocos, @aswanipranjal,
For the metric pull_request_participants, are participants defined as those who review a PR, comment on a PR, or both (anyone who either reviews or comments on a PR)?

@vchrombie
Member

vchrombie commented Apr 13, 2019

@harshalmittal4
I think both can be counted because participants mean being a part of the discussion.

But in the sample implementation in metrics/pull-requests-participants.md, only review comments are used, because Perceval doesn't store the normal comments. I think the GitHub API gives only the code review comments, so we can't blame Perceval 😅.
The discussion comments cannot be retrieved, I guess.

Please correct me if I am wrong.

@vchrombie
Member

Let me give an example, this is the data retrieved by Perceval for PR #112, pr-structure.json.

Jesus' comment is not stored but Aniruddha's comment is stored. #L612

@harshalmittal4
Contributor

harshalmittal4 commented Apr 13, 2019

because Perceval doesn't store the normal comments.

Yes, this is known @vchrombie 🙂, and that's why the question came up about whom to include as participants (while I was trying to implement pull_request_participants).
Anyway, thanks for providing the example.

The discussion comments cannot be retrieved, I guess.

They can be retrieved, although it is not very straightforward (GitHub's API response for pull_requests includes only the reviews, but it does include a URL to the comments, which can be used to fetch the commenters for the metric pull_request_participants using Python's requests).
I was trying this with the comments URL obtained from the pull_request response data.

(Note : The issues which are actually pull_requests, include the comments, but the reviews aren't included in this case which is why we can't use them either)
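
One way to picture the combination described above is the following sketch, which merges the people found in the different responses once they have been fetched and decoded from JSON. It assumes the usual GitHub REST API shapes (a user object with a login in each comment, and a merged_by object in the pull request); the function names are hypothetical, and whether the merger or the PR author should count as participants is a choice made here for illustration, not something settled in this thread.

```python
def login_of(entry):
    """Return the login of the `user` object in an API entry, if any."""
    user = entry.get("user") or {}
    return user.get("login")


def pull_request_participants(pull, issue_comments, review_comments):
    """Collect logins of people who commented on, reviewed or merged a PR.

    `pull` is the decoded pull request, `issue_comments` the regular
    comments from its comments URL, and `review_comments` the comments
    made on the code under review.
    """
    participants = set()
    for comment in list(issue_comments) + list(review_comments):
        login = login_of(comment)
        if login:
            participants.add(login)
    # Assumption: whoever merged the PR also counts as a participant.
    merged_by = pull.get("merged_by") or {}
    if merged_by.get("login"):
        participants.add(merged_by["login"])
    return participants
```

The PR author is deliberately left out unless they also commented; adding pull["user"]["login"] would be a one-line change if the metric definition requires it.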

@valeriocos @aswanipranjal @jgbarah, please see when you have time :)

@vchrombie
Member

The discussion comments cannot be retrieved, I guess.

They can be retrieved @vchrombie, although it is not very straightforward (GitHub's API response for pull_requests includes only the reviews, but it does include a URL to the comments, which can be used to fetch the commenters for the metric pull_request_participants using Python's requests).

@harshalmittal4, thanks for letting me know.

That is really good. Then it would be a good feature for the Perceval tool too. The metric can be improved if we can fetch the PR commentators using Perceval.
What do you say @jgbarah @valeriocos @aswanipranjal

@harshalmittal4, can you see if this is the correct URL for the comments
https://api.github.com/repos/chaoss/wg-gmd/issues/112/comments

@Polaris000
Contributor

Polaris000 commented Apr 13, 2019

@vchrombie
Member

Why does https://api.github.com/repos/chaoss/wg-gmd/issues/112/comments not include Jesus' comment?@vchrombie

I had the same doubt, let's confirm with @harshalmittal4 whether there is any other link regarding it (I couldn't find any) or maybe there is some other procedure to do it.

I also think this is not the right place for this discussion (fetching comments details using Perceval, not about the metric). Can we open an issue in the chaoss/grimoirelab-perceval?

@Polaris000
Contributor

I also think this is not the right place for this discussion (fetching comments details using Perceval, not about the metric). Can we open an issue in the chaoss/grimoirelab-perceval?

Good idea!

@harshalmittal4
Contributor

harshalmittal4 commented Apr 13, 2019

Hey @vchrombie, @Polaris000, I'll try to answer; let's look at some examples:
#PR 112
Comments url : https://api.github.com/repos/chaoss/wg-gmd/issues/23/comments
Jesus's comment isn't a regular comment; it was made when he approved the changes, as you can see in the PR (let's call it a merge comment), so information about the participant (Jesus) is present in the merged_by section in the JSON structure (this line), instead of being present in the comments URL.

#PR 12
Comments url :https://api.github.com/repos/chaoss/wg-gmd/issues/12/comments
Jesus's comment is present in the PR (regular comment), so it is shown in the comments URL. Sean's comments are not regular comments; they were made when he approved the changes, so information about him is present in the merged_by section in the JSON structure here.

#PR 17
Comments url : https://api.github.com/repos/chaoss/wg-gmd/issues/17/comments
Here again, Klumb's and Sean's comments are regular comments, so they are present in the comments URL, while the others are either reviews or comments made while approving the changes, so they would be present in the JSON structure retrieved for the PR.

Hope that it is resolved now @vchrombie @Polaris000. If still there is some confusion, play around with some examples and you will get it yourself 🙂
Thanks!

@vchrombie
Member

Hope that it is quite clear now @vchrombie @Polaris000. If still there is some confusion, play around with some examples and you will get it yourself

Okay, so an approval of changes is also a review comment (can't read the comment though).
Thanks for the clarification @harshalmittal4. 😃

Nice work. I was playing with the wrong PR (with less diversity of comments). 😅

@valeriocos
Member

valeriocos commented Apr 13, 2019

Sorry for being late in answering @harshalmittal4 @Polaris000 @vchrombie. If I understand the discussion correctly, it seems that some comments are retrieved and others are not.

Depending on the type of comment (issue comment, pull request comment, pull request review, commit comment), the GitHub API provides different endpoints. For instance, for this issue: #12, all comments are available at:

  • the body of the issue (or the initial comment) is available at: https://api.github.com/repos/chaoss/wg-gmd/issues/12
  • commit comments are not present in this example, but they would be available inside each commit returned by https://api.github.com/repos/chaoss/wg-gmd/pulls/12/commits

The GitHub perceval backend returns:

The addition of reviews and commit comments would be interesting, and wouldn't require big changes in the backend. If you want to work on these enhancements, feel free to submit PRs :)

@Polaris000
Contributor

Thanks for the reply @valeriocos!. I'd love to work on this.

@harshalmittal4
Contributor

Hey @valeriocos, for the specific case of pull_request :

pull request comments (https://github.com/chaoss/grimoirelab-perceval/blob/master/perceval/backends/core/github.py#L653) [when using the category pull_request]

The pull request comments^ are the review comments made on the PR, and

issue comments (https://github.com/chaoss/grimoirelab-perceval/blob/master/perceval/backends/core/github.py#L554) [when using the category issue]

this^ gives the comments which are not reviews, made on the same PR.

@valeriocos
Member

Great @Polaris000 , feel free to start with one enhancement (reviews or commit comments), thanks!

@harshalmittal4
Contributor

harshalmittal4 commented Apr 13, 2019

@valeriocos, the body of the pull request retrieved by perceval for the case of #12 is present here.
It already contains the 3 commit comments (line 163, line 363 and line 526), but not the regular and review comments. I think we need to add those to the retrieved JSON for the pull_request category. I would like to work on one of them, once other things are finalized :)

@valeriocos
Member

valeriocos commented Apr 13, 2019

line 163, line 363 and line 526 should be review comments, which are extracted with this method.
Regular comments (= issue comments) are extracted when using the category issue (the default one, see here).
The missing comments are:

  • (1) reviews, which should be implemented using this call: https://api.github.com/repos/<user-name>/<repo-name>/pulls/<pull-id>/reviews
  • (2) commit comments, which should be plugged here
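
The endpoints mentioned in this thread can be gathered in one place, which makes it easier to see which kind of comment each call returns. This is a small sketch, not Perceval code: the function name and dictionary keys are hypothetical, and the review_comments entry (pulls/<pull-id>/comments) is added from my reading of the GitHub REST API rather than from the messages above.

```python
API_ROOT = "https://api.github.com"


def pull_request_endpoints(owner, repository, number):
    """Map each kind of pull request content to its REST API URL."""
    base = "{}/repos/{}/{}".format(API_ROOT, owner, repository)
    return {
        # Body of the issue/PR (the initial comment).
        "body": "{}/issues/{}".format(base, number),
        # Regular (issue) comments made in the conversation.
        "issue_comments": "{}/issues/{}/comments".format(base, number),
        # Comments attached to lines of the diff under review.
        "review_comments": "{}/pulls/{}/comments".format(base, number),
        # Reviews (approvals, change requests), item (1) above.
        "reviews": "{}/pulls/{}/reviews".format(base, number),
        # Commits of the PR; commit comments hang from each of them.
        "commits": "{}/pulls/{}/commits".format(base, number),
    }


for kind, url in pull_request_endpoints("chaoss", "wg-gmd", 12).items():
    print(kind, url)
```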

Perfect @harshalmittal4! Maybe @Polaris000 can focus on (1) and you on (2), or vice versa. What do you think @harshalmittal4 @Polaris000?

@harshalmittal4
Contributor

Ok @valeriocos, I can work on including the commit comments in the PR json response. Also, do we need to include the regular comments in the PR json response?

@Polaris000
Contributor

@harshalmittal4 said:

Ok @valeriocos, I can work on including the commit comments in the PR json response. Also, do we need to include the regular comments in the PR json response?

I can work on (1) then, @valeriocos.
I'm busy for a couple of days and the patch will be slightly delayed. Hope that's alright.

@valeriocos
Member

@harshalmittal4 @Polaris000, feel free to open two issues in Perceval about including commit comments and reviews.

@harshalmittal4 I wouldn't add regular comments to the PR, because they are already available when fetching issues.

@Polaris000 no worries, you can start working on it when you have time.

Thank you both!

@harshalmittal4
Contributor

harshalmittal4 commented Apr 13, 2019

Thanks @valeriocos, I also need suggestions regarding this (and this).
Should it be discussed in a separate issue? (It would be easier for everyone to follow.)

@valeriocos
Member

Good idea @harshalmittal4 , please open a different issue, we can comment there, thanks!

@jgbarah
Collaborator Author

jgbarah commented Jun 5, 2019

I think this is done by now.

jgbarah closed this as completed Jun 5, 2019