Rishabh Madan | Week 1

The basic idea of the release notes extension is to parse commit messages and generate release note files from them. To simplify parsing, we use reStructuredText: with the help of admonitions, it lets us map messages to specific sections such as Bug Fixes, Feature Additions and Performance Improvements. In the previous post, I mentioned sending the initial patches for the release notes extension. The good news is that, after a few changes, they were merged into Mercurial’s main repo. Now that the extension is in, it is time to improve it and make it work better.
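To make the admonition idea concrete, here is a minimal sketch of how a commit message could be mapped to sections. It is not the extension's actual parser: the admonition names in SECTIONS, the regex and the parse_notes() helper are all assumptions made up for illustration.

```python
import re

# Assumed mapping from admonition name to release notes section.
SECTIONS = {
    'fix': 'Bug Fixes',
    'feature': 'New Features',
    'perf': 'Performance Improvements',
}

# Matches lines such as ".. fix::" or ".. feature:: Some title".
DIRECTIVE_RE = re.compile(r'^\.\. (\w+)::\s*(.*)$')


def parse_notes(message):
    """Return a list of (section, title, body) tuples found in a message."""
    notes = []
    lines = message.splitlines()
    i = 0
    while i < len(lines):
        match = DIRECTIVE_RE.match(lines[i])
        i += 1
        if not match or match.group(1) not in SECTIONS:
            continue
        title = match.group(2).strip()
        body = []
        # The note body is the run of blank or indented lines that follows.
        while i < len(lines) and (not lines[i].strip()
                                  or lines[i].startswith(' ')):
            if lines[i].strip():
                body.append(lines[i].strip())
            i += 1
        notes.append((SECTIONS[match.group(1)], title, ' '.join(body)))
    return notes


message = """rebase: handle empty commits

.. fix::

   Rebase no longer crashes on empty commits.
"""
print(parse_notes(message))
```

A note with an empty title, like the one in this example, is what I call a non-titled note fragment in the rest of the post.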

Similarity check

One of the first improvements, which I had also mentioned in my proposal, was to build a similarity check function. Its purpose is to compare the incoming note fragments (the notes that we parse from commits) with the already existing notes files and to ignore or combine them based on how similar they are. This helps de-duplicate release notes so that they aren’t cluttered with details about the same feature or the same bug fix. I faced a couple of problems while trying to implement this function. The first was that the extension’s code for dealing with non-titled notes (note fragments without a title) was buggy. Non-titled notes were displayed incorrectly after being added to the notes file, and they were also duplicated. For example, if you ran the extension command for the same commit twice, it would add the note twice, which obviously shouldn’t happen.

Improved parsing

I started out by making changes to the merge() function, which merges the incoming notes with the existing file. But that didn’t work. After further discussions with my mentor, I arrived at the idea of using different parsing methods for non-titled and titled note fragments. The notes for a particular section are now stored as nested lists, with the penultimate level holding the old notes and the new notes. The inner levels are further divided by paragraphs and newlines. A paragraph from a titled section is now parsed differently from a non-titled one, but the data structure remains the same.

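# Inside the loop over the parsed blocks; i is the index of the current block.
if title:
    # Titled fragment: drop the leading character of each line and keep
    # the lines as one flat list.
    lines = [l[1:].strip() for l in block['lines']]
    notefragment.append(lines)
    continue
else:
    # Non-titled fragment: the block's own lines form the first paragraph.
    lines = [[l[1:].strip() for l in block['lines']]]

    # Collect the paragraphs that follow, up to the next bullet or section.
    for block in blocks[i + 1:]:
        if block['type'] in ('bullet', 'section'):
            break
        if block['type'] == 'paragraph':
            lines.append(block['lines'])
    notefragment.append(lines)
    continue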
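To show what the resulting nested lists look like, here is a self-contained sketch of the same idea run on hand-written input. The collect_fragments() helper and the sample blocks are hypothetical; the real extension gets its blocks from Mercurial's reStructuredText parser and its surrounding loop looks different.

```python
def collect_fragments(blocks, titled):
    """Collect note fragments from a flat list of parsed blocks."""
    fragments = []
    for i, block in enumerate(blocks):
        if block['type'] != 'bullet':
            continue
        # Drop the leading bullet character from each line of the bullet.
        first = [l[1:].strip() for l in block['lines']]
        if titled:
            # Titled fragment: one flat list of lines.
            fragments.append(first)
            continue
        # Non-titled fragment: a list of paragraphs, each a list of lines.
        paragraphs = [first]
        for following in blocks[i + 1:]:
            if following['type'] in ('bullet', 'section'):
                break
            if following['type'] == 'paragraph':
                paragraphs.append(following['lines'])
        fragments.append(paragraphs)
    return fragments


sample_blocks = [
    {'type': 'bullet', 'lines': ['* first note', '  continued']},
    {'type': 'paragraph', 'lines': ['second paragraph']},
    {'type': 'bullet', 'lines': ['* other note']},
]

print(collect_fragments(sample_blocks, titled=False))
print(collect_fragments(sample_blocks, titled=True))
```

With titled=False, the paragraph that follows the first bullet is attached to that bullet's fragment. With titled=True, each bullet contributes only its own lines.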

After resolving this bug, I added the necessary tests. The issues that I mentioned earlier are now fixed and the code is merged.

Design of similarity check

Coming back to the similarity function, the second problem that I ran into was how exactly it should be designed. To begin with, I used the fuzzy string comparison functions from an amazing library called fuzzywuzzy. The program first converts the nested list structure into individual strings. Once we have the incoming note string and the list of strings for the existing notes, it runs them through the fuzzy algorithm, which returns scores that quantify the similarity. Finally, based on a certain threshold, it either ignores or adds the new notes. I built a basic prototype of the similarity function and the results were good enough to start with. However, discussing it with my mentors revealed a lot of new issues and possible improvements.
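The flow described above can be sketched roughly like this. To keep the example free of third-party dependencies, it uses difflib.SequenceMatcher from the standard library as a stand-in for fuzzywuzzy. The helper names and the threshold of 75 are assumptions for illustration, not the values from my prototype.

```python
from difflib import SequenceMatcher


def flatten(fragment):
    """Join a nested list of paragraphs and lines into a single string."""
    parts = []
    for item in fragment:
        if isinstance(item, list):
            parts.append(flatten(item))
        else:
            parts.append(item)
    return ' '.join(p for p in parts if p)


def similarity(a, b):
    """Return a similarity score between 0 and 100 for two strings."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return int(round(100 * ratio))


def is_duplicate(incoming, existing, threshold=75):
    """True if the incoming fragment is close enough to an existing one."""
    text = flatten(incoming)
    return any(similarity(text, flatten(e)) >= threshold for e in existing)


existing_notes = [
    [['Rebase no longer crashes on empty commits.']],
    [['Added a new command.'], ['It has some options.']],
]
incoming_note = [['Rebase no longer crashes on empty commits.']]
print(is_duplicate(incoming_note, existing_notes))
```

The threshold decides how aggressive the de-duplication is: a lower value drops more notes as duplicates, while a higher value lets more near-duplicates through.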

One major issue with the fuzzy algorithm is that for short commit messages it can give really ambiguous scores and thus can’t be trusted. A basic case is bug fixes, where short commit messages are quite normal. To tackle this case, we use a simple regex that searches for an issueNNNN pattern in the commit message and, once found, searches for the same issue number in the existing files. I used regex101.com to build a good enough pattern for this purpose. This way we can straight away ignore bug fixes that already exist in the notes. We also put a threshold on the string length, below which we skip de-duplication. This is a somewhat brute-force approach, but it works for now.
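Here is a minimal sketch of these two safeguards. The regex, the helper names and the length threshold of 30 characters are assumptions for illustration; the pattern in my patch may differ.

```python
import re

# Assumed pattern: the word "issue" immediately followed by digits.
ISSUE_RE = re.compile(r'\bissue(\d+)\b', re.IGNORECASE)

# Hypothetical cut-off below which fuzzy comparison is skipped.
MIN_FUZZY_LENGTH = 30


def issue_numbers(text):
    """Return the set of issue numbers referenced in a piece of text."""
    return set(ISSUE_RE.findall(text))


def already_noted(incoming, existing_notes):
    """True if an issue referenced by the incoming note is already noted."""
    incoming_issues = issue_numbers(incoming)
    if not incoming_issues:
        return False
    return any(incoming_issues & issue_numbers(note)
               for note in existing_notes)


def should_fuzzy_compare(text):
    """Only strings of a reasonable length go through fuzzy comparison."""
    return len(text) >= MIN_FUZZY_LENGTH


existing = ['Fixed a rebase crash, see issue5512.']
print(already_noted('rebase: fix crash (issue5512)', existing))
print(should_fuzzy_compare('fix typo'))
```

The issue check runs first because it is exact. The fuzzy comparison is only worth running on notes that pass the length check.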

I have sent an RFC patch based on the similarity function to the Mercurial mailing list, where we will discuss further issues and developments related to it. This is a time-consuming yet very important part of the extension, and it would be a major breakthrough once successfully implemented.

Meanwhile, you can look at the patches that I have submitted here. That’s all for now!