
[week-n] The Final Report ☀️

  • GSoC
  • 2019
  • week-12
  • coding-period-3

This is the last week of the final coding phase of GSoC 2019. In this post, I’ll summarize the work done during all the coding phases and share the results at the end. This post also serves as the Final Report for the project undertaken and contains all the information regarding my contributions to it.



# Coding Phase I

(27th May to 28th June)

In this phase, I tried to understand how the GrimoireLab toolchain works in its entirety and came up with an initial approach to integrating the CoCom backend with ELK.

  • Before the start of this phase, I had only contributed to Graal and had no experience with how the GrimoireLab chain works. Hence, with the guidance of my mentors, I started learning how Mordred works along with ELK and how to import dashboards using Kidash and Sigils.
  • The next step was to discuss and initiate the integration of Graal’s backends with ELK. We agreed to integrate the Code Complexity (CoCom) backend first and then focus on Code License (CoLic). Along these lines, I also fixed a few issues in Graal and added an analyzer (Scancode-CLI), which was merged after a code review in the second week, and we started working on the initial integration task.
  • At this point, we had an initial integration of Graal-CoCom with ELK, which depended on the pure (raw) data produced by Graal, with no preprocessing steps in between. I could produce some mock visualizations and understand how Kibana worked, as I had no experience with the ELK stack prior to this project beyond some idea of ETL tasks.
  • Although we had integrated the initial version of the CoCom backend, it wasn’t what we were looking for: the data fields produced by graal-cocom were per-commit, which imposed all sorts of restrictions. One good thing about this implementation was its efficiency, since each item carried information only about the files affected by that commit (see the sketch after this list). The issue: in order to produce visualizations that can be analyzed over time, we need a history of data points, for instance to show the evolution of a metric such as Lines of Code.
  • And the time had come: we had to either restructure the data stored in the ElasticSearch index or perform a repository-level analysis at every commit using Graal 😅
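
To make the per-commit shape of the data concrete, here is a minimal sketch of fetching CoCom items with Graal’s Python API; the exact field names inside `analysis` (such as `ccn` and `loc`) are from memory and may differ across Graal versions:

```python
from graal.backends.core.cocom import CoCom

REPO_URI = 'https://github.com/chaoss/grimoirelab-graal'
REPO_DIR = '/tmp/graal-cocom'  # local working copy used by the analyzer

cocom = CoCom(uri=REPO_URI, git_path=REPO_DIR)

for commit in cocom.fetch():
    # Each item carries metrics only for the files *touched* by this
    # commit, which is why building a repository-wide history of a
    # metric needed extra work.
    for file_info in commit['data']['analysis']:
        print(file_info.get('file_path'),
              file_info.get('ccn'),
              file_info.get('loc'))
```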




# Coding Phase II

(28th June to 26th July)

In this phase, we identified potential performance issues with the repository-level analysis, came up with the File-Level + Study approach, and evaluated it after integrating it with ELK, also producing some visualizations from the data.

  • At this point, I had discussed the repository-level analysis approach with my mentors during the weekly meetings and we agreed to give it a try and see how things went. (evaluation below)
| Repository                    | Number of Commits | File Level | Repository Level |
| ----------------------------- | ----------------- | ---------- | ---------------- |
| chaoss/grimoirelab-perceval   | 1394              | 26:22 min  | 26:56 min        |
| chaoss/grimoirelab-sirmordred | 869               | 08:51 min  | 3:51 min         |
| chaoss/grimoirelab-graal      | 171               | 2:24 min   | 1:04 min         |
  • The above experiment led to some good results with the visualizations: since every commit item now had information about each and every file in the repository, I could produce data tables and metrics that gave many insights about a given repository. Each item can be viewed as a data point on a graph and visualized and.. we’re done. No, not yet 😉
  • I spent much time implementing this approach only to realize that the data structure we were using would consume a lot of memory once we integrated the implementation with ELK.
  • Above all, the repository-level data had brought performance issues along with the insightful results, and this was the time to rethink ways to eliminate the overhead. At this point, @valcos suggested that I explore executing a study over the enriched index, which would eliminate running the analyzer at repository level and hence reduce the memory overhead (see the sketch after this list).
  • I could produce an initial implementation of the study; the following is a comparison.
| Repository             | Commits | File Level | Repository Level | File-Level + Study |
| ---------------------- | ------- | ---------- | ---------------- | ------------------ |
| grimoirelab-kingarthur | 208     | 3:12 min   | 34:23 min        | 5:06 min           |
| grimoirelab-graal      | 179     | 2:22 min   | 28:35 min        | 3:58 min           |
  • Space complexity: for instance, if we consider an item per file, the results for the Graal repository (no. of files: ~200, no. of commits: ~180) would be as follows:
    • Repository Level: 36,000 items (200 × 180)
    • File-Level + Study: 200 items
  • On further improving the approach, i.e. adding an interval_months parameter that lets us conduct the study over intervals of time, we could bring the total number of items in memory down to (months since the first commit) / interval_months. For instance, if a repository is 3 years old and we want to analyze the evolution of a metric every month, we would hold 3 × 12 = 36 items in memory; now imagine increasing interval_months 😉 (see the sketch below)
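
A tiny sketch of that worst-case arithmetic (the function name is just for illustration):

```python
from math import ceil

def study_items_in_memory(months_since_first_commit, interval_months):
    """Worst-case number of items the study holds in memory:
    one per interval between the first commit and today."""
    return ceil(months_since_first_commit / interval_months)

print(study_items_in_memory(36, 1))  # 36 -> monthly analysis of a 3-year-old repo
print(study_items_in_memory(36, 6))  # 6  -> larger intervals shrink it further
```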


  • CoCom Dashboard (1st iteration) [dashboard screenshot]


  • CoLic Dashboard (1st iteration) [dashboard screenshot]

  • In summary, and as per the evaluation results: we started off the process a long time back with (1) the Repository-Level enricher, which used about 36,000 items in memory. We improved on this via (2) File-Level + Study (cache_dict), which brought the execution time down significantly and memory usage down to just 200 items. And finally, (3) File-Level + Study (query_chunk) brought memory usage down to a worst case of number_of_intervals items (an interval is determined by chunking the data into months, starting from the first commit), all at the cost of just ~5% more execution time. 🎉
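
For the curious, here is an illustrative sketch of the idea behind the study, not GrimoireLab’s actual study code: instead of re-running the analyzer on a full repository snapshot, the already-enriched index is queried in time chunks. The index and field names (`cocom_enriched`, `loc`, `metadata__updated_on`) are assumptions for the example, and it assumes elasticsearch-py < 8:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# Aggregate total LOC per month directly from the enriched index,
# instead of re-analyzing a checkout of the repository per interval.
query = {
    'size': 0,
    'aggs': {
        'per_month': {
            'date_histogram': {
                'field': 'metadata__updated_on',
                'interval': 'month'  # 'calendar_interval' on Elasticsearch >= 7.2
            },
            'aggs': {
                'total_loc': {'sum': {'field': 'loc'}}
            }
        }
    }
}

result = es.search(index='cocom_enriched', body=query)
for bucket in result['aggregations']['per_month']['buckets']:
    print(bucket['key_as_string'], bucket['total_loc']['value'])
```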



# Coding Phase III

(26th July to 26th August)

In this phase, we tried to identify issues in the integration by conducting full-chain tests, fixing the major issues and addressing some minor ones. We also added tests for the integration and went through several iterations on both dashboards.

  • After successfully implementing the study enricher, I went on to add support for the other CoLic analyzer categories, Scancode & NOMOS, and evaluated their performance on small and medium-sized repositories (a rough sketch follows below).
  • Next, I spent time adding the corresponding tests, performing code reviews with the help of @valcos, and reducing redundancy across all the existing pull requests. Our next step was to perform full tests of the entire chain of tools, right from the raw collection of data to producing the dashboards. In this phase, we ran the entire chain test more than 10 times, subsequently identifying issues and fixing them accordingly.
  • At this point, we started testing the import and export of the dashboard panels with the help of Micro-Mordred (the panels task) and subsequently opened a pull request at Sigils.
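
As a rough sketch of how a CoLic analyzer category is selected: the category constant, the `exec_path`, and the result field names below are from memory of Graal’s API at the time and may have changed since, so treat this as illustrative only:

```python
from graal.backends.core.colic import CoLic, CATEGORY_COLIC_NOMOS

colic = CoLic(uri='https://github.com/chaoss/grimoirelab-graal',
              git_path='/tmp/colic-worktree',
              exec_path='/usr/local/bin/nomossa')  # path to the NOMOS binary

for commit in colic.fetch(category=CATEGORY_COLIC_NOMOS):
    for file_info in commit['data']['analysis']:
        # 'licenses' is an assumed field name; check Graal's docs
        print(file_info.get('file_path'), file_info.get('licenses'))
```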

  • The video below summarizes the results achieved. It shows how to collect and enrich CoCom and CoLic data and import their corresponding dashboards; the commands involved are sketched below.
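
For reference, the full-chain run looks roughly like this, assuming a `setup.cfg` that points at the Graal backends and the dashboard files; the flags are from Micro-Mordred’s usage at the time, so check `micro.py --help` for the current ones:

```bash
# collect raw data and enrich it for both backends
python3 utils/micro.py --raw --enrich --cfg ./setup.cfg --backends cocom colic

# import the CoCom/CoLic dashboard panels into Kibana
python3 utils/micro.py --panels --cfg ./setup.cfg
```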




Thank You ♥️


# Footnotes:

  • Project tracker can be found here

I’ll be posting updates on the mailing list as well as on the Twitter thread.

If you have any questions or suggestions, please make sure to comment down below ;)