Tuesday, October 8, 2013

Detecting Emotions in Social-Media Streams

Recently my research and development work within the EMOTIVE project - looking at fine-grained, cross-cultural emotion detection from Twitter datasets - has received considerable national and international mass-media attention. Despite its relatively low budget the project was an overall success, achieving the highest currently known performance (in terms of F-measure) on test datasets and processing 1,500-2,000 tweets per second on an average dual-core processor. Rather than producing a variation on the less informative positive/negative sentiment score, the system detects eight "basic emotions": anger, disgust, fear, happiness, sadness, surprise, shame and confusion.
The ontology employed gives a rich linguistic context to the eight emotions, as it can analyse both ordinary speech and slang. Being able to monitor how the public mood changes over time is particularly useful when assessing which interventions are most successful in dealing with civil unrest or concern. More broadly, potential applications of the system range from marketing to personality profiling via emotion-based computational models.

To me the project was particularly interesting because it allowed me to build on the interests I developed throughout my PhD and to delve deeper into social-media questions, NLP issues in processing sparse texts (140 characters per message, as opposed to traditional NLP on large documents) and ontology-processing applications.

The work has resulted in several conference papers (and at least one journal paper is on the way, with more in the pipeline). Currently we are continuing work within another EPSRC & DSTL funded project and are collaborating with several organisations to further explore applications of our fine-grained emotion-detection system.

Monday, August 5, 2013

Readability Measures in C#

Readability measures, such as the Gunning-Fog Index, the Automated Readability Index or the Flesch-Kincaid Index, are well-established and widely used formulas that compute a rough assessment of how difficult a piece of text is to read. These scores generally approximate the US / UK grade level needed to comprehend the text. For instance, a score of 8.2 would indicate that a text is expected to be understandable by an average student in year 8 in the United Kingdom or an 8th-grade student in the United States. Such scores are used to assess the readability of school books before publication, and some laws require insurance policies and terms and conditions to stay below a certain maximum score (see the Wikipedia pages linked above for example uses).

There are some online websites that let you score texts in this way, e.g. http://www.readability-score.com/. However, when you need to use these scores in your own code and applications, it's preferable to have your own implementation or a third-party library that you can call upon to get things done.

Unfortunately, C# / .NET has a comparatively limited open-source repertoire of libraries and code snippets (relative to, e.g., Python or Java). This is why I've written up a quick implementation of some readability indices in C#:
  • Automated Readability Index
  • Gunning-Fog Index
  • Flesch-Kincaid Index
The Automated Readability Index is probably the easiest to compute, as it is the only one of the three measures that relies on character counts as a measure of word complexity rather than syllable counts. This also means it can be applied to languages other than English.

public static double CalculateAutomatedReadabilityIndex(string inputstring)
{
    int charcount = BasicNLP.RemoveWhitespace(inputstring).Length;  // space characters are ignored in the character count
    int wordcount = BasicNLP.Tokenise(inputstring).Length;
    int sentencecount = BasicNLP.SegmentSentences(inputstring).Length;

    // ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    double indexval = 4.71 * ((double)charcount / wordcount) + 0.5 * ((double)wordcount / sentencecount) - 21.43;
    return indexval;
}

The line that computes indexval is the actual formula for the index. As you can imagine, charcount, wordcount and sentencecount are relatively straightforward to obtain (for sentence segmentation I've simply checked for a few common sentence-ending punctuation symbols). Note that in my code I've set up a separate static class called BasicNLP that contains these utility functions, to keep the code organised.
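For illustration, here is a minimal sketch of what the text-handling helpers in such a BasicNLP class might look like. This is only my sketch with deliberately naive tokenisation and sentence-segmentation rules; the actual implementation in the downloadable code may differ.

using System;
using System.Linq;
using System.Text.RegularExpressions;

// A minimal sketch of the BasicNLP helpers used above (not necessarily identical
// to the implementation in the downloadable code).
public static class BasicNLP
{
    // Strips all whitespace so that only "real" characters are counted.
    public static string RemoveWhitespace(string input)
    {
        return Regex.Replace(input, @"\s+", "");
    }

    // Naive whitespace tokenisation, trimming surrounding punctuation from each token.
    public static string[] Tokenise(string input)
    {
        return input.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(t => t.Trim('.', ',', ';', ':', '!', '?', '"', '\'', '(', ')'))
                    .Where(t => t.Length > 0)
                    .ToArray();
    }

    // Splits on a few common end-of-sentence punctuation marks, as described above.
    public static string[] SegmentSentences(string input)
    {
        return input.Split(new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
                    .Where(s => s.Trim().Length > 0)
                    .ToArray();
    }
}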

The formula to compute the Gunning-Fog index (the indexval line in the code below) is again very simple, except that instead of a character count it uses the count of complex words, i.e. words consisting of three or more syllables.

public static double CalculateGunningFogIndex(string inputstring)
{
    int sentencecount = BasicNLP.SegmentSentences(inputstring).Length;
    string[] tokens = BasicNLP.Tokenise(inputstring);
    int complexwords = BasicNLP.CountComplexWords(tokens);  // words with three or more syllables
    int wordcount = tokens.Length;

    // Gunning-Fog = 0.4 * ((words / sentences) + 100 * (complex words / words))
    double indexval = 0.4 * (((double)wordcount / sentencecount) + 100 * ((double)complexwords / wordcount));
    return indexval;
}

Finally, the Flesch-Kincaid grade level uses the total count of syllables rather than the count of complex words, but is otherwise a rather similar formula.

public static double CalculateFleshKincaidIndex(string inputstring)
{
    int sentencecount = BasicNLP.SegmentSentences(inputstring).Length;
    string[] tokens = BasicNLP.Tokenise(inputstring);
    int syllablescount = BasicNLP.SyllableCount(tokens);
    int wordcount = tokens.Length;

    // Flesch-Kincaid grade level = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    // (only the words-per-sentence term is multiplied by 0.39)
    double indexval = 0.39 * ((double)wordcount / sentencecount) + 11.8 * ((double)syllablescount / wordcount) - 15.59;
    return indexval;
}

Computing the syllable count isn't too difficult either, especially as I found a function on the web that does a rough but serviceable job of it.

public static int SyllableCount(string word)
{
    word = word.ToLower().Trim();
    if (word.Length == 0)
        return 0;

    // Count groups of consecutive vowels as syllables.
    int count = System.Text.RegularExpressions.Regex.Matches(word, "[aeiouy]+").Count;

    // Discount a silent trailing "e" / "es" / "ed" (but not "-le", as in "table").
    if ((word.EndsWith("e") || word.EndsWith("es") || word.EndsWith("ed")) && !word.EndsWith("le"))
        count--;

    // Every word counts as at least one syllable.
    return count > 0 ? count : 1;
}
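
The Gunning-Fog and Flesch-Kincaid functions above also rely on BasicNLP.CountComplexWords(tokens) and an array overload of SyllableCount. In my sketch of BasicNLP, at least, these would just be thin wrappers around the per-word counter shown above, along these lines:

// Additional BasicNLP helpers assumed by the Gunning-Fog and Flesch-Kincaid
// functions above; both simply reuse the per-word SyllableCount.
public static int SyllableCount(string[] tokens)
{
    int total = 0;
    foreach (string token in tokens)
        total += SyllableCount(token);
    return total;
}

// "Complex" words are those with three or more syllables.
public static int CountComplexWords(string[] tokens)
{
    int complex = 0;
    foreach (string token in tokens)
        if (SyllableCount(token) >= 3)
            complex++;
    return complex;
}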

A simple C# console application is available for download (MIT open-source license unless otherwise stated; .NET 4.5); please use it with care and at your own risk! Some of the measures can also vary slightly from other tools, which mostly comes down to how syllables are counted and to small differences between variants of the indices. The way I use it at the moment is to calculate the average of all three, to get a more stable measure.
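Putting it together, usage might look something like the snippet below (a sketch assumed to live inside the console application's Main method; the actual code in the download may differ):

// Example usage: score a piece of text with all three measures and average them.
string text = "The quick brown fox jumps over the lazy dog. It was not amused.";

double ari = CalculateAutomatedReadabilityIndex(text);
double fog = CalculateGunningFogIndex(text);
double fk = CalculateFleshKincaidIndex(text);

Console.WriteLine("ARI: {0:F1}, Gunning-Fog: {1:F1}, Flesch-Kincaid: {2:F1}", ari, fog, fk);
Console.WriteLine("Average grade level: {0:F1}", (ari + fog + fk) / 3.0);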

Download c#.net code

Monday, March 4, 2013

Economics of Web 2.0 - Peer Production


As some academics like to put it, from the 19th century onwards the distribution of information, knowledge and culture became industrialised (Benkler 2006, Shirky 2010). The steam-powered printing press and other expensive machinery and processes were required to run, print and distribute the necessary volumes of newspapers; later, television required a highly qualified workforce and expensive studios. This created a professional class of producers and a large group of (mostly passive) consumers. With social media we have now gained the ability to balance consumption with sharing and our own content production, so the internet effectively thins the line between "amateurism" and "professionalism". For further discussion of amateurs vs. professionals, see Shirky (2010), pp. 56-62, 152-155, or Keen (2007).

Online, publishing costs have virtually disappeared. The costs of collaborating in and coordinating groups have also collapsed; Wikipedia, Ushahidi, eBird and the open-source Apache and Linux projects are examples. Open-source coordination has in fact been facilitated for a long time through non-web-based protocols, so technically minded individuals were able to reap the benefits of collaboration via the Internet long before the World Wide Web developed the characteristics of web 2.0.
The virtual disappearance of group-coordination costs is the basis of "social production", a model of economic production first suggested by Harvard professor Yochai Benkler (Benkler 2002) and later popularised in his book The Wealth of Networks: How Social Production Transforms Markets and Freedom (Benkler 2006).

In 1937 the economist Ronald Coase asked: if markets are efficient, why and under what circumstances do people organise themselves into managed groups or firms, given that production could be carried out without any organisation? Why would an entrepreneur hire help instead of contracting out each particular task on the free market? It turns out that transaction costs on the market can become a barrier (Coase 1937): where the cost of achieving a certain outcome through organisational means is lower than the cost of achieving the same result through the price system, organisations will emerge to attain that result. Benkler postulated that, under certain circumstances, non-proprietary or commons-based peer production may be less costly in some dimension than either markets or managed hierarchies (firms). Put another way, when the cost of organising an activity on a peered basis is lower than the cost of using the market, and lower than the cost of hierarchical organisation, peer production will emerge (Benkler 2002).
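Stated compactly (in my own shorthand rather than Benkler's notation), the condition is roughly

$$C_{\text{peer}} < \min\left(C_{\text{market}},\ C_{\text{firm}}\right)$$

where each $C$ stands for the total transaction and coordination cost of organising the same activity under that arrangement.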


Table – Organisational forms as a function of firm-based management vs. market vs. peering 
(source: adapted from Benkler 2006)


The idea of peer production as an alternative or complementary mechanism for achieving economic goals is an attractive one, but more importantly it highlights the impact that the proliferation of web 2.0 has had.

References:
  1. Benkler Y., 2002. Coase's Penguin, or, Linux and the Nature of the Firm, Yale Law Journal 112
  2. Benkler Y., 2006. The Wealth of Networks: How Social Production Transforms Markets and Freedom, Yale University Press, USA
  3. Coase R., 1937. The Nature of the Firm, Economica 4 (16), pp. 386-405
  4. Keen A., 2007. The Cult of the Amateur: How the Democratization of the Digital World is Assaulting Our Economy, Our Culture, and Our Values. Doubleday Currency Publishing, USA
  5. Shirky C., 2010. Cognitive Surplus: Creativity and Generosity in a Connected Age, Allen Lane Publishers, USA

Sunday, March 3, 2013

Lies, damned lies, and statistics

In my experience, most PhDs in engineering and computer science must, at some point in their research, use quantitative data analysis and statistical techniques to evaluate and validate experimental data. Increasingly, social-science researchers also use these techniques, mainly through well-established software packages such as SPSS, R or others.

I have to say that most papers I read, even relatively theoretical ones, have a strong empirical data-analysis component. One can sometimes lose sight of the problems associated with statistical evaluation of empirical work, so I thought it would be a good reminder, for my readers and myself, to go over some common pitfalls of statistical techniques:

  • Discarding unfavorable data
  • Loaded questions
  • Overgeneralization
  • Biased samples
  • Misreporting or misunderstanding of estimated error
  • False causality
  • Proof of the null hypothesis
  • Data dredging
  • Data manipulation

Wikipedia is a good starting point on each of these; a solid statistics textbook, combined with some lighter reading, also helps.

There are also calls for researchers to make their code and datasets publicly available so that experiments can be repeated independently. This is increasingly becoming common practice, especially at high-profile journals and conferences, but there are still numerous issues associated with doing so.

Saturday, March 2, 2013

The server side... and ASP.net side...

Browsing through my draft posts I found this one, written while I was still a PhD student... just some ramblings about web programming and my past experiences with it.

Stay in Touch

To my surprise, I made a rather interesting observation as a PhD student: a number of my PhD colleagues had no web-based programming knowledge or experience at all. Perhaps their universities stressed more fundamental CS topics and little time was left for web programming, or perhaps they never had the chance to gain significant coursework or industry practice in it, and once their PhD started it was concerned with an entirely different topic. Whatever the reason, these days there is no justifiable excuse for a computer-science student to ignore server-side languages. During my PhD I therefore challenged myself to stay on top of new server-side programming technology. Maybe I lost some time that I could have dedicated elsewhere, but at least I am ready to build a web-based system at any time in pretty much any server language you'd throw at me!

Arguments for & against...

Within web programming we have a choice of languages to work with: Perl or C++ in a CGI setting, or PHP, ASP.NET (C#) and JSP (Java), to name a few. I dabbled in all of them at some point, but the most significant competition definitely takes place between PHP and ASP.NET.

The choice of language more often than not depends on the background of the programmer. Then come execution speed, client preferences (sometimes these are more important than any other factor, but more about that maybe in another post) and, quite importantly, the availability of support and of a code base from past projects or from third-party sources, whether open-source or commercial. The main point is that we don't want to begin development from the ground up.

PHP has it all: clients like it because it sprang from the open-source movement, and more of the code base out there is open source than for probably any other server-side language, which covers the client-preference and code-availability factors. Its speed is generally acceptable, programmers generally love it, and since it is an interpreted language it is easy to maintain; the symbiosis between PHP and MySQL also works extremely well.

When I first dabbled in ASP.NET, back in 2003/04 on versions 1.1 and 2.0 of the .NET framework, I hated many things about it compared to PHP, but that's a longer story! Since those days Microsoft's engineers have been busy developing the technology, and we are currently at framework 4.0. I had heard a lot of hype, as tends to be the case with Microsoft releases, so I decided to satisfy my curiosity and look again at the main approaches:
  • ASP.net web-forms
  • ASP.net MVC
  • Stripped down (web-form free) approach
The idea of the stripped-down approach had always been in the back of my mind, but to take it you really had to feel comfortable with some of the complexities of the chunky ASP.NET web-forms approach; that changed when I found Chris Taylor's article.
ASP.NET is much more complex than PHP, and maybe that is its problem too. Because of its complexity and learning curve, programmers, quite rightly, keep away from it. To name a few problems: the ASP.NET menu control would render very ugly, non-standards-compliant XHTML mark-up instead of a CSS-styled list (which would be the way to go here), and ASP.NET would generate client IDs that depended on where in the page a server control occurred, rather than keeping the server ID assigned to the control in the first place. Fortunately the above issues are at least resolved in framework 4.0, which gives us some 'hope' for Microsoft.
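As far as I understand the 4.0 changes, the two fixes can be switched on from the code-behind roughly like this. This is only a sketch: NavMenu and SearchBox are hypothetical control names, and in a real project they would be declared in the .aspx markup rather than instantiated in code.

using System;
using System.Web.UI;
using System.Web.UI.WebControls;

// Hypothetical Web Forms code-behind sketch illustrating the .NET 4.0 options mentioned above.
public partial class DemoPage : Page
{
    // Declared here only so the sketch compiles on its own; normally these come from the .aspx page.
    protected Menu NavMenu = new Menu();
    protected TextBox SearchBox = new TextBox();

    protected void Page_Init(object sender, EventArgs e)
    {
        // Render the Menu control as a CSS-stylable <ul>/<li> list instead of nested tables.
        NavMenu.RenderingMode = MenuRenderingMode.List;

        // Emit the server-side ID unchanged as the client-side id, rather than an auto-generated "ctl00_..." name.
        SearchBox.ClientIDMode = ClientIDMode.Static;
    }
}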

Comparison Table (useful for 1st time asp.net people)

Speed Comparison of server-side languages - check out the source site