Rewriting the Grammar Bot
2013 Aug 8 06:53 PM UTC | Tags: English, Grammar, Programming, Python, Web, 2013
I have rewritten my Grammar bot. Previously, it used regular expressions to find errors, which meant that it had to check every character against the rules. In addition, Python 2.7 doesn’t support variable-length lookbehinds, so some rules needed extra regular expression checks. It also could not provide good quotes when two matches overlapped.
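To illustrate the lookbehind limitation, here is a small sketch. The "could of" pattern is a made-up rule, not one taken from the bot. The point is only that Python's re module rejects a lookbehind whose alternatives differ in length, so the check has to be split into one pattern per prefix.

```python
import re

# Hypothetical rule: find " of" after "could" or "should".
# The lookbehind would have to match 5 or 6 characters, which re rejects.
try:
    re.compile(r"(?<=\b(?:could|should)) of\b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Workaround: one fixed-width lookbehind per prefix, i.e. extra regex checks.
PATTERNS = [
    re.compile(r"(?<=\bcould) of\b"),
    re.compile(r"(?<=\bshould) of\b"),
]


def find_of(text):
    """Return the start offsets of every flagged " of" in text."""
    return sorted(m.start() for p in PATTERNS for m in p.finditer(text))
```

Every extra prefix adds another pattern that has to scan the whole text, which is the kind of overhead described above.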
The new system works by splitting the text into words, with punctuation included in the spacers between them. The words are then checked against the rules in a loop. For example, take this text:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
This would be parsed as
[Lorem, ( ), ipsum, ( ), dolor, ( ), sit, ( ), amet, (, ), consectetur, ( ), adipisicing, ( ), elit, (, ), sed, ( ), do, ( ), eiusmod, ( ), tempor, ( ), incididunt, ( ), ut, ( ), labore, ( ), et, ( ), dolore, ( ), magna, ( ), aliqua, (.), (empty word)]
and then the words and spacers would be converted to objects so that flags can be set on them.
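The splitting step could be sketched roughly like this. It is not the bot's actual code: the Token class, its flag names, and the choice to treat apostrophes as part of a word are all assumptions made for the example, and it is written for Python 3.

```python
import re

# Anything that is not a word character or an apostrophe counts as a spacer.
# Keeping apostrophes inside words is an assumption, so "you're" stays whole.
SPACER = re.compile("([^\\w'\u2019]+)")


class Token(object):
    """A word or spacer, wrapped in an object so that flags can be set."""

    def __init__(self, text, is_word):
        self.text = text
        self.is_word = is_word
        self.modified = False
        self.stop = False


def tokenize(text):
    # With a capturing group, re.split keeps the separators, so the result
    # alternates word, spacer, word, ... and always starts with a word slot.
    parts = SPACER.split(text)
    return [Token(part, i % 2 == 0) for i, part in enumerate(parts)]
```

Because the text above ends with a spacer (the final period), the split leaves an empty string in the last word slot, which matches the "(empty word)" at the end of the parsed list.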
Each rule first checks whether the current word is a specific word, or is in a small list, so that rules which cannot apply are skipped quickly. If a rule matches, it marks the word as “modified” and some nearby words as “automatic stop words”, and flags the reason. Some rules request a rerun of specific other rules, which results in another iteration over the rules list in which only the requested rules are run.
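A single rule along these lines might look like the sketch below. The "could of" rule, the Word class, and the attribute names are hypothetical and only serve to show the quick skip and the flagging.

```python
class Word(object):
    def __init__(self, text):
        self.text = text
        self.modified = False    # a rule wants to change this word
        self.auto_stop = False   # context word that belongs in the quote
        self.replacement = None
        self.reason = None


def could_of_rule(words, i):
    """Hypothetical rule: "could of" should be "could have"."""
    word = words[i]
    # Quick skip: only the word "of" can trigger this rule.
    if word.text.lower() != "of":
        return False
    if i == 0 or words[i - 1].text.lower() not in ("could", "should", "would"):
        return False
    word.modified = True
    word.replacement = "have"
    word.reason = "'of' used in place of 'have'"
    # The preceding word gives the correction its context.
    words[i - 1].auto_stop = True
    return True
```

Most words fail the first comparison, so the cost per word stays at one string check for each rule that does not apply.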
If at least one reason has been flagged, the bot builds the correction list. It looks at each correction, checks a few nearby unflagged words, and marks them as “stop words”, plus one “near word” right beside the last stop word. If there is a one-word gap between two corrections, that gap is flagged as well. The correction list is then built from the continuous chains of flagged words and the spaces between them.
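The gap-filling and chaining steps can be sketched as follows. This is a simplification under stated assumptions: words are plain strings, the flags are a parallel list of booleans, and spacers are reduced to a single space. The function names are made up for the example.

```python
def fill_single_gaps(flags):
    """Flag any unflagged word that sits alone between two flagged words."""
    filled = list(flags)
    for i in range(1, len(flags) - 1):
        if not flags[i] and flags[i - 1] and flags[i + 1]:
            filled[i] = True
    return filled


def chains(words, flags):
    """Join each continuous run of flagged words into one quote."""
    quotes, current = [], []
    for word, flagged in zip(words, flags):
        if flagged:
            current.append(word)
        elif current:
            quotes.append(" ".join(current))
            current = []
    if current:
        quotes.append(" ".join(current))
    return quotes
```

Since the chains are collected in a single left-to-right walk, the quotes come out in text order, and two corrections separated by one word end up in the same quote.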
The last step is to randomly generate a message for the user that includes an English-joined version of the correction list.
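As a rough sketch of this last step, assuming "English-joined" means a comma-separated list with "and" before the final item. The message templates are invented for the example and are not the bot's real wording.

```python
import random

# Hypothetical templates; the bot's actual messages are not shown in this post.
TEMPLATES = [
    "I think you meant {0}.",
    "Did you mean {0}?",
    "That should be {0}.",
]


def english_join(items):
    """Join items the way a sentence would: "a", "a and b", "a, b, and c"."""
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return items[0] + " and " + items[1]
    return ", ".join(items[:-1]) + ", and " + items[-1]


def build_message(corrections, rng=random):
    # rng can be a seeded random.Random instance for repeatable output.
    return rng.choice(TEMPLATES).format(english_join(corrections))
```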
Although I must admit that the new method is slower than the old one, it delivers more accurate results, so the performance loss is a trade-off for accuracy. The new method also handles cases that the previous method got wrong. For example, this text would have been reported improperly:
*Their is you’re own. (The[re] is you[r] own.)
The old system would issue corrections for “is [your] own” and “[there] is you’re”. The problem is that the quotes are ordered by the rules rather than by where they appear in the text, and that overlapping quotes are not merged. The new system is able to issue a single quote: “[there] is [your] own”.
In addition, this text could not be fully corrected until now:
*Your you’re own. (You['re] you[r] own.)
The old system would only detect that the user should have said “Your [your] own”. The new system, however, can do a second pass and report “[you’re your] own”.
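The second pass can be sketched with two toy rules. Everything here is an assumption made for illustration: the Rule class, the rule names, the pass limit, and the use of plain strings with straight apostrophes instead of word objects.

```python
class Rule(object):
    def __init__(self, name, triggers, check, rerun=()):
        self.name = name
        self.triggers = triggers  # lowercase words that can trigger the rule
        self.check = check        # check(words, i) -> replacement or None
        self.rerun = rerun        # names of rules to run again after a match


def match_case(original, replacement):
    return replacement.capitalize() if original[:1].isupper() else replacement


def doubled_your(words, i):
    # "your your" -> "you're your"
    if i + 1 < len(words) and words[i + 1].lower() == "your":
        return match_case(words[i], "you're")
    return None


def youre_before_own(words, i):
    # "you're own" -> "your own"
    if i + 1 < len(words) and words[i + 1].lower() == "own":
        return match_case(words[i], "your")
    return None


def apply_rules(words, rules, max_passes=5):
    """Fix words in place and return {index: (replacement, rule name)}."""
    fixes = {}
    pending = list(rules)
    for _ in range(max_passes):  # guard against rules that undo each other
        if not pending:
            break
        requested = set()
        for rule in pending:
            for i, word in enumerate(words):
                if word.lower() not in rule.triggers:
                    continue  # quick skip
                replacement = rule.check(words, i)
                if replacement is not None and replacement != word:
                    words[i] = replacement
                    fixes[i] = (replacement, rule.name)
                    requested.update(rule.rerun)
        # Only the rules that were asked for run on the next pass.
        pending = [r for r in rules if r.name in requested]
    return fixes


RULES = [
    Rule("doubled_your", {"your"}, doubled_your),
    Rule("youre_before_own", {"you're"}, youre_before_own,
         rerun=("doubled_your",)),
]
```

On the first pass, doubled_your finds nothing in "Your you're own", because the second word is still "you're". Once youre_before_own has changed it to "your", the requested rerun lets doubled_your catch the first word on the second pass.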