Monday, August 5, 2013

Readability Measures in C#

Readability measures, such as the Gunning-Fog Index, the Automated Readability Index or the Flesch-Kincaid Index, are well established and widely used formulas that compute a rough assessment of how difficult a piece of text is to read. These scores generally approximate the US / UK grade level needed to comprehend the text. For instance, a score of 8.2 would indicate that a text should be understandable by an average student in year 8 in the United Kingdom or an 8th-grade student in the United States. Such scores are used to assess the readability of school books before publication, and some laws mandate maximum scores for insurance policies and terms and conditions (see the Wikipedia pages linked above for example uses).

There are online websites that let you score texts in this way, e.g. http://www.readability-score.com/. However, when you need these scores in your own code and applications, it is preferable to have your own implementation or a third-party library that you can call upon to get things done.

Unfortunately, C#.net has a comparatively limited open-source repertoire of libraries and code snippets (relative to, e.g., Python or Java). This is why I've written up a quick implementation of some readability indices in C#:
  • Automated Readability Index
  • Gunning-Fog Index
  • Flesch-Kincaid Index
The Automated Readability Index is probably the easiest to compute, as it is the only one of the three measures that relies on character counts as a measure of word complexity, rather than syllable counts. This also means that it can be applied to languages other than English.

public static double CalculateAutomatedReadabilityIndex(string inputstring)
{
    int charcount = BasicNLP.RemoveWhitespace(inputstring).Length;  // space characters need to be ignored in the character count
    int wordcount = BasicNLP.Tokenise(inputstring).Length;
    int sentencecount = BasicNLP.SegmentSentences(inputstring).Length;

    // ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    double indexval = 4.71 * ((double)charcount / wordcount) + 0.5 * ((double)wordcount / sentencecount) - 21.43;
    return indexval;
}

The final assignment inside the function is the actual formula for the index. As you can imagine, charcount, wordcount and sentencecount are relatively straightforward to obtain (for sentence segmentation I've simply checked for a few common sentence punctuation symbols). Also note that in my code I've set up a separate static class called BasicNLP that contains these utility functions, in order to keep the code organised.
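Those helpers aren't shown in this post, but a minimal sketch of what BasicNLP might look like follows; the method names match the calls above, while the bodies are illustrative guesses rather than the actual code from the download.

using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BasicNLP
{
    public static string RemoveWhitespace(string input)
    {
        // Strip every whitespace character so spaces don't inflate the character count.
        return Regex.Replace(input, @"\s+", "");
    }

    public static string[] Tokenise(string input)
    {
        // Split on whitespace, then trim surrounding punctuation from each token.
        return input
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.Trim('.', ',', ';', ':', '!', '?', '"', '\'', '(', ')'))
            .Where(t => t.Length > 0)
            .ToArray();
    }

    public static string[] SegmentSentences(string input)
    {
        // Naive segmentation: split on a few common sentence-ending punctuation marks.
        return input
            .Split(new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(s => s.Trim().Length > 0)
            .ToArray();
    }
}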

The formula to compute the Gunning-Fog index (the indexval assignment in the code below) is again very simple, except that instead of the character count it uses the count of complex words, i.e. words consisting of three or more syllables; a sketch of a possible CountComplexWords helper follows the function.

public static double CalculateGunningFogIndex(string inputstring)
{
    int sentencecount = BasicNLP.SegmentSentences(inputstring).Length;
    string[] tokens = BasicNLP.Tokenise(inputstring);
    int complexwords = BasicNLP.CountComplexWords(tokens);
    int wordcount = tokens.Length;

    // Fog = 0.4 * ((words / sentences) + 100 * (complex words / words))
    double indexval = 0.4 * (((double)wordcount / sentencecount) + 100 * ((double)complexwords / wordcount));
    return indexval;
}
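CountComplexWords isn't shown in the post either; assuming the per-word SyllableCount helper given further below, a sketch of it (which would also sit in BasicNLP) could look like this:

public static int CountComplexWords(string[] tokens)
{
    int complexcount = 0;
    foreach (string token in tokens)
    {
        // A word is conventionally "complex" if it has three or more syllables.
        if (SyllableCount(token) >= 3)
            complexcount++;
    }
    return complexcount;
}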

Finally, the Flesch-Kincaid index uses the total count of syllables rather than the count of complex words, but is otherwise a rather similar formula.

public static double CalculateFleshKincaidIndex(string inputstring)
{
    int sentencecount = BasicNLP.SegmentSentences(inputstring).Length;
    string[] tokens = BasicNLP.Tokenise(inputstring);
    int syllablescount = BasicNLP.SyllableCount(tokens);
    int wordcount = tokens.Length;

    // Grade level = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
    // note that only the words-per-sentence term is scaled by 0.39.
    double indexval = 0.39 * ((double)wordcount / sentencecount) + 11.8 * ((double)syllablescount / wordcount) - 15.59;
    return indexval;
}

Computing the syllable count isn't too difficult either, especially as I found a function on the web that roughly does the job:

public static int SyllableCount(string word)
{
    word = word.ToLower().Trim();
    // Count runs of vowels as syllables, then correct for common silent endings.
    int count = System.Text.RegularExpressions.Regex.Matches(word, "[aeiouy]+").Count;
    if ((word.EndsWith("e") || word.EndsWith("es") || word.EndsWith("ed")) && !word.EndsWith("le"))
        count--;
    // Guard: every word has at least one syllable (e.g. "the" would otherwise score zero).
    return System.Math.Max(count, 1);
}
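One detail: the Flesch-Kincaid function above passes a whole token array to BasicNLP.SyllableCount, while the snippet just shown takes a single word. Presumably the downloadable code includes an array overload that sums the per-word counts, something along these lines:

public static int SyllableCount(string[] tokens)
{
    // Sum the per-word syllable counts across all tokens.
    int total = 0;
    foreach (string token in tokens)
        total += SyllableCount(token);
    return total;
}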

A simple console application in C#.net is available for download (MIT open-source license unless otherwise stated, .net v. 4.5); please use it with care and at your own risk! Some of the scores can also vary slightly from those of other tools, mostly because of differences in how syllables are counted and slight variations in the index formulas. The way I use it at the moment is to calculate the average of all three, to get a more stable measure.
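For illustration, here is how the three measures could be combined and averaged in a small console program; the containing class name Readability is my assumption, since the post doesn't show where the Calculate* methods live.

using System;

class Program
{
    static void Main()
    {
        string text = "Readability measures are well established formulas. " +
                      "They roughly assess how difficult a piece of text is to read.";

        // Readability is an assumed class name for the three Calculate* methods above.
        double ari = Readability.CalculateAutomatedReadabilityIndex(text);
        double fog = Readability.CalculateGunningFogIndex(text);
        double fk = Readability.CalculateFleshKincaidIndex(text);

        // Averaging the three indices gives a more stable overall grade estimate.
        double average = (ari + fog + fk) / 3.0;

        Console.WriteLine("ARI:            {0:F1}", ari);
        Console.WriteLine("Gunning-Fog:    {0:F1}", fog);
        Console.WriteLine("Flesch-Kincaid: {0:F1}", fk);
        Console.WriteLine("Average grade:  {0:F1}", average);
    }
}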

Download c#.net code
