Sunday, March 29, 2015

VOTEBEE: Tracking the General Election 2015

Tracking emotional outpourings in real time, at the granularity of the 8 basic, cross-cultural emotions, on Twitter is something that emotive.systems (EMOTIVE) does pretty well - which is why we've decided to monitor real-time emotional opinion around the upcoming UK General Election in the run-up to May 7, and made it possible for anyone with a smartphone to join in the fun.
A smartphone app based on EMOTIVE, developed at Loughborough University in partnership with Encircle Ltd. for Android and iPhone, is now available for download. It analyses in (near) real time how tweeters feel about the 7 main political parties, their leaders, partners and policies (NHS, economy, education, welfare...), and provides graphs, daily summaries over the last 7 days and much more.
The Emotive.Systems website explains how the EMOTIVE engine does its analysis. We also get asked a lot about the data that is analysed, so I'll try to shed some light on that here.
A selection of terms is tracked for each category, all carefully hand-picked with every effort to minimise bias and to be fair to the different parties. For instance, tweeting activity around each party is tracked using its main (and only) Twitter account handle:
  • @Conservatives
  • @LibDems
  • @UKLabour
  • @theSNP
  • @UKIP
  • @Plaid_Cymru
  • @TheGreenParty
This keeps things balanced and also simpler to interpret. If, say, far fewer tweets are sent using @Conservatives than @UKLabour, that would likely imply less traction in general among tweeters around that party's main Twitter presence. The party leaders are tracked by monitoring mentions of their names (Cameron, Miliband, Farage...), which again is fair to each leader.
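
As a rough illustration of that kind of tallying (not the actual EMOTIVE/VOTEBEE code - the HandleTally class and CountMentions method below are made-up names for a hypothetical C# sketch):

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: tally how many tweets in a stream mention each party handle.
public static class HandleTally
{
    static readonly string[] Handles =
    {
        "@Conservatives", "@LibDems", "@UKLabour", "@theSNP",
        "@UKIP", "@Plaid_Cymru", "@TheGreenParty"
    };

    public static Dictionary<string, int> CountMentions(IEnumerable<string> tweets)
    {
        var counts = Handles.ToDictionary(h => h, h => 0);
        foreach (var tweet in tweets)
            foreach (var handle in Handles)
                if (tweet.IndexOf(handle, StringComparison.OrdinalIgnoreCase) >= 0)
                    counts[handle]++;
        return counts;
    }
}
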
Useful Links:
Android and iPhone download links


Tuesday, October 8, 2013

Detecting Emotions in Social-Media Streams

Recently my research and development work within the EMOTIVE project - looking at fine-grained, cross-cultural emotion detection from Twitter datasets - has received considerable national and international mass-media attention. Despite its relatively low budget the project was an overall success, achieving the highest currently known performance in the world on test datasets (in terms of F-measure) and processing tweets at a rate of 1,500-2,000 tweets per second (on an average dual-core processor). The system that was developed detects a range of 8 "basic emotions" - anger, disgust, fear, happiness, sadness, surprise, shame and confusion - rather than a variation on the less informative positive/negative sentiment score.
The ontology employed gives a rich linguistic context to the eight emotions through its ability to analyse both ordinary speech and slang. The ability to monitor how the public mood changes over time is particularly useful when assessing which interventions are most successful in dealing with civil unrest or concern. However, potential applications of the system range from marketing to personality profiling through emotion-based computational models.

To me the project was particularly interesting as it allowed me to build on the interests I had developed throughout my PhD and to delve further into social-media questions, NLP issues in sparse-text processing (140-character messages vs. traditional NLP on large documents) and ontology-processing applications.

The work has resulted in several conference papers (and at least one journal paper is on the way, with more in the pipeline). Currently we are continuing this work within another EPSRC & DSTL funded project and are collaborating with several organisations to further explore applications of our fine-grained emotion detection system.

Monday, August 5, 2013

Readability Measures in C#

Readability measures, such as the Gunning-Fog Index, the Automated Readability Index or the Flesch-Kincaid Index, are well-established and widely used formulas that compute a rough assessment of how difficult a piece of text is to read. These scores generally approximate the US / UK grade level needed to comprehend the text. For instance, a score of 8.2 would indicate that a text is expected to be understandable by an average student in year 8 in the United Kingdom or an 8th-grade student in the United States. Such scores are used to assess the readability of school books before publication, or as a legal requirement that insurance policies and terms and conditions do not exceed a certain maximum score (see the Wikipedia pages linked above for example uses).

There are some online websites that will score texts in this way, e.g. http://www.readability-score.com/. However, when you need these scores in your own code and applications, it's preferable to have your own implementation or a third-party library that you can call on to get things done.

Unfortunately, C#.net suffers from a limited open-source repertoire of libraries and code snippets (relative to, e.g., Python or Java). This is why I've written up a quick implementation of some readability indices in C#:
  • Automated Readability Index
  • Gunning-Fog Index
  • Flesch-Kincaid Index
The Automated Readability Index is probably the easiest one to compute, as it is the only one of the three measures that relies on character counts as a measure of word complexity, rather than syllable counts. This also means it can be applied to languages other than English.

public static double CalculateAutomatedReadabilityIndex(string inputstring)
{
    int charcount = BasicNLP.RemoveWhitespace(inputstring).Length;  // whitespace is ignored in the character count
    int wordcount = BasicNLP.Tokenise(inputstring).Length;
    int sentencecount = BasicNLP.SegmentSentences(inputstring).Length;

    // ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    double indexval = 4.71 * ((double)charcount / wordcount) + 0.5 * ((double)wordcount / sentencecount) - 21.43;
    return indexval;
}

The indexval line inside the function is the actual formula for the index. As you can imagine, charcount, wordcount and sentencecount are relatively straightforward to obtain (for sentence segmentation I've simply checked for a few common sentence punctuation symbols). Note also that in my code I've set up a separate static class called BasicNLP that contains these utility functions, to keep the code organised.
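
For reference, here is a minimal sketch of what such a BasicNLP class could look like - the method names are the ones used in this post, but the whitespace tokenisation and punctuation-based splitting are my assumptions rather than the exact implementation in the downloadable code:

using System;
using System.Linq;

public static class BasicNLP
{
    // Strip all whitespace so only visible characters contribute to the character count.
    public static string RemoveWhitespace(string input)
    {
        return new string(input.Where(c => !char.IsWhiteSpace(c)).ToArray());
    }

    // Whitespace tokenisation - good enough for rough word counts.
    public static string[] Tokenise(string input)
    {
        return input.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Split on a few common end-of-sentence punctuation symbols.
    public static string[] SegmentSentences(string input)
    {
        return input.Split(new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Complex words = words with three or more syllables (rough vowel-group count,
    // mirroring the SyllableCount heuristic shown at the end of this post).
    public static int CountComplexWords(string[] tokens)
    {
        return tokens.Count(t => System.Text.RegularExpressions.Regex.Matches(t.ToLower(), "[aeiouy]+").Count >= 3);
    }
}
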

The formula for the Gunning-Fog index (the indexval line in the code below) is again very simple, except that instead of the character count it uses the count of complex words (i.e. words of three or more syllables).

public static double CalculateGunningFogIndex(string inputstring)
{
    int sentencecount = BasicNLP.SegmentSentences(inputstring).Length;
    string[] tokens = BasicNLP.Tokenise(inputstring);
    int complexwords = BasicNLP.CountComplexWords(tokens);
    int wordcount = tokens.Length;

    // Gunning-Fog = 0.4 * ((words / sentences) + 100 * (complex words / words))
    double indexval = 0.4 * (((double)wordcount / sentencecount) + 100 * ((double)complexwords / wordcount));
    return indexval;
}

Finally, the Flesch-Kincaid grade level uses the total syllable count rather than the count of complex words, but is otherwise a rather similar formula.

public static double CalculateFleshKincaidIndex(string inputstring)
{
    int sentencecount = BasicNLP.SegmentSentences(inputstring).Length;
    string[] tokens = BasicNLP.Tokenise(inputstring);
    int syllablescount = BasicNLP.SyllableCount(tokens);
    int wordcount = tokens.Length;

    // Flesch-Kincaid grade level = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    double indexval = 0.39 * ((double)wordcount / sentencecount) + 11.8 * ((double)syllablescount / wordcount) - 15.59;
    return indexval;
}

Computing the syllable count isn't too difficult either, especially as I found a function on the web that roughly does the per-word syllable-counting job.

public static int SyllableCount(string word)
{
    word = word.ToLower().Trim();
    // Count groups of consecutive vowels as syllables.
    int count = System.Text.RegularExpressions.Regex.Matches(word, "[aeiouy]+").Count;
    // Discount common silent endings (e, es, ed), but not words ending in "le".
    if ((word.EndsWith("e") || word.EndsWith("es") || word.EndsWith("ed")) && !word.EndsWith("le"))
        count--;
    // Every word has at least one syllable (guards against e.g. "the" counting as zero).
    return count < 1 ? 1 : count;
}
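
Since the Flesch-Kincaid snippet above calls SyllableCount with a token array, I'm assuming a small overload in BasicNLP that simply sums the per-word counts, along these lines:

// Assumed overload: total syllable count over an array of tokens.
public static int SyllableCount(string[] tokens)
{
    int total = 0;
    foreach (string token in tokens)
        total += SyllableCount(token);
    return total;
}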

A simple console application in C#.net is available for download (MIT open-source licence unless otherwise stated, .net v. 4.5) - please use it with care and at your own risk! The scores can also vary slightly from those produced by other tools, which mostly comes down to how syllables are counted and to small differences in the index formulas. The way I use it at the moment is to calculate the average of all three, to get a more stable measure.
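
As an illustration of that averaging (just a sketch reusing the three methods above; CalculateAverageReadabilityGrade is a made-up name, not part of the downloadable code):

// Rough composite: the mean of the three grade-level estimates.
public static double CalculateAverageReadabilityGrade(string inputstring)
{
    return (CalculateAutomatedReadabilityIndex(inputstring)
            + CalculateGunningFogIndex(inputstring)
            + CalculateFleshKincaidIndex(inputstring)) / 3.0;
}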

Download c#.net code

Monday, March 4, 2013

Economics of Web 2.0 - Peer Production


As some academics like to put it, from the 19th century onwards the distribution of information, knowledge and culture became industrialised (Benkler 2006, Shirky 2010). A steam-powered printing press and other expensive machinery and processes were required to run, print and distribute newspapers in the necessary volumes. Later, television required a highly qualified workforce and expensive studios. This created a professional class of producers and a large group of (mostly passive) consumers. With social media we have now gained the ability to balance consumption with sharing and our own content production; the internet effectively thins the line of separation between “amateurism” and “professionalism”. For further discussion of amateurs vs. professionals, see Shirky (2010), pp. 56-62, 152-155, or Keen (2007).

Publishing costs, online, have virtually disappeared. Costs associated with collaborating in, or coordinating, groups have also collapsed; examples of this are Wikipedia, Ushahidi, eBird, and the open-source Apache and Linux projects. Open-source coordination has long been facilitated through non-web-based protocols, so it was possible for technically minded individuals to reap the benefits of collaboration via the Internet well before the World Wide Web developed the numerous characteristics of web 2.0.
The virtual disappearance of group coordination costs is the basis of “social production”, a model of economic production first suggested by Harvard professor Yochai Benkler (Benkler 2002) and later popularised in his book The Wealth of Networks: How Social Production Transforms Markets and Freedom (Benkler 2006). In 1937 the economist Coase asked: if markets are efficient, why and under what circumstances do people organise themselves into managed groups or firms, given that production could be carried out without any organisation - why would an entrepreneur hire help instead of contracting out each particular task on the free market? It turns out that transaction costs on the market can become a barrier (Coase 1937): where the cost of achieving a certain outcome through organisational means is lower than the cost of achieving the same result through the price system, organisations will emerge to attain that result. Benkler postulated that under certain circumstances non-proprietary, commons-based peer production may be less costly in some dimension than either markets or managed hierarchies (firms). One could say that when the cost of organising an activity on a peered basis is lower than the cost of using the market, and lower than the cost of hierarchical organisation, then peer production will emerge (Benkler 2002).


Table – Organisational forms as a function of firm-based management vs. market vs. peering 
(source: adapted from Benkler 2006)


The idea of peer production as an alternative or complementary economic mechanism for achieving economic goals is an attractive one, but more importantly it highlights the impact that the proliferation of web 2.0 has had.

References:
  1. Benkler Y., 2002. Coase's Penguin, or, Linux and the Nature of the Firm, Yale Law Journal 112
  2. Benkler Y., 2006. The Wealth of Networks: How Social Production Transforms Markets and Freedom, Yale University Press, USA
  3. Coase R., 1937. The Nature of the Firm, Economica 4 (16), pp. 386-405
  4. Keen A., 2007. The Cult of the Amateur: How the Democratization of the Digital World is Assaulting Our Economy, Our Culture, and Our Values. Doubleday Currency Publishing, USA
  5. Shirky C., 2010. Cognitive Surplus: Creativity and Generosity in a Connected Age, Allen Lane Publishers, USA

Sunday, March 3, 2013

Lies, damned lies, and statistics

From my experience, most PhDs in engineering and computer science must, at some point in their research, use quantitative data analysis and statistical techniques to evaluate and validate experimental data. Increasingly, social science researchers also use these techniques, mainly through well-established software packages such as SPSS, R and others.

I have to say that most papers I read, even relatively theoretical ones, have a strong empirical data-analysis component. One can sometimes lose sight of the problems associated with statistical evaluation of empirical work, so I thought it would be a good reminder, for my readers and myself, to revisit some common pitfalls of statistical techniques:


  • Discarding unfavorable data
  • Loaded questions
  • Overgeneralization
  • Biased samples
  • Misreporting or misunderstanding of estimated error
  • False causality
  • Proof of the null hypothesis
  • Data dredging
  • Data manipulation

Wikipedia's coverage of the misuse of statistics is a good starting point. I also think a good statistics text will help, in combination with some lighter reading.

There are also calls for researchers to make their code and datasets publicly available so that experiments can be repeated independently. This is now increasingly becoming a common practice, especially with high profile journals and conferences, but there are still numerous issues associated with making datasets and code-bases publicly available.

Saturday, March 2, 2013

The server side... and ASP.net side...

Browsing through my draft posts I found this draft, which I wrote while I was still a PhD student... just some ramblings about web programming and my past experiences with it.

Stay in Touch

To my surprise, I made a rather interesting observation as a PhD student: a number of my PhD colleagues had no web-based programming knowledge or experience at all. This might come as a surprise, but it may be that their university simply stressed more fundamental CS topics and there was little time left for web programming, or maybe they never had a chance to gain significant on-course or industry practice in it, and once their PhD started it was concerned with an entirely different area. Whatever the reason, there's no justifiable excuse these days to ignore server-side languages, especially as a computer science student. During my PhD I therefore challenged myself to stay on top of new technology in server-side programming. And yes, maybe I lost some time that I could have dedicated elsewhere, but at least I am ready to build a web-based system at any time in pretty much any server language you'd throw at me!

Arguments for & against...

Within web programming we have a choice of languages to work with: Perl or C++ in a CGI setting, or PHP, ASP.Net (C#), JSP (Java), to name a few. I dabbled in all of them at some point, but the most significant competition definitely takes place between PHP and ASP.net.

The choice of language, more often than anything, depends on the background of the programmer. Then come into play execution speed, client preferences (sometimes these are more important than any other factor, but more about that maybe in another post), and, quite importantly, the availability of support and of a code base from past projects or from third-party sources, whether open-source or commercial. The main idea is that we don't begin development from the ground up.

PHP has it all. Clients like it because it sprang from the open-source movement, and there is probably more open-source code base out there for it than for any other server-side language, so that covers the client-preference and code-availability factors. The speed is generally acceptable, programmers generally love PHP, and since it is an interpreted language it's very easy to maintain; the symbiosis between PHP and MySQL also works extremely well.

When I first dabbled in asp.net, back in 2003/04 on versions 1.1 and 2.0 of the .net framework, I hated many things about it compared to PHP - but that's a longer story! Since those days Microsoft engineers have been busy developing the technology, and currently we are at framework 4.0. I heard a lot of hype, as tends to be the case with Microsoft releases, so I decided to satisfy my curiosity and look at the three main approaches:
  • ASP.net web-forms
  • ASP.net MVC
  • Stripped down (web-form free) approach
The idea of the stripped-down approach was always in my head, but to take it you really had to feel comfortable with some of the complexities of the chunky asp.net web-forms model - until I found Chris Taylor's article.
ASP.net is much more complex than PHP, and maybe that is its problem too: due to the complexity and the learning curve, programmers, quite rightly, keep away from it. To name a few problems: the asp.net menu control would render very ugly (non-standards-compliant) XHTML mark-up instead of a CSS-styled list, which would be the way to go here; and asp.net would generate client IDs that depended on where in the page a server control occurred, rather than keeping the server ID assigned to the control in the first place. Fortunately the above issues are at least resolved in framework 4.0, which gives us 'hope' for Microsoft.

Comparison Table (useful for 1st time asp.net people)

Speed Comparison of server-side languages - check out the source site




Monday, October 15, 2012

Behaving optimally in life

The social sciences and psychology have brought us a number of interesting insights into human behaviour. In a recent StumbleUpon session I discovered a collection of recent scientific journal articles relating to various aspects of life. You can read the original article on Psychology Today; what follows is a subset of the "solutions" suggested by the research papers. For a more complete description I recommend checking out the full articles, and of course I wouldn't take this advice literally - it's only something to ponder :-).

1-How to break bad habits: J. Quinn, A. Pascoe, W. Wood, & D. Neal (2010) Can't control yourself? Monitor those bad habits. Personality and Social Psychology Bulletin, 36, 499-511

Focus on stopping the behavior before it starts (or, as psychologists tend to put it, you need to "inhibit" your bad behavior). According to research by Jeffrey Quinn and his colleagues, the most effective strategy for breaking a bad habit is vigilant monitoring - focusing your attention on the unwanted behavior to make sure you don't engage in it. In other words, thinking to yourself "Don't do it!" and watching out for slipups - the very opposite of distraction. If you stick with it, the use of this strategy can inhibit the behavior completely over time, and you can be free of your bad habit for good.

2-How to make everything seem easier: J. Ackerman, C. Nocera, and J. Bargh (2010) Incidental haptic sensations influence social judgments and decisions. Science, 328, 1712- 1715.

For instance, we associate smoothness and roughness with ease and difficulty, respectively, as in expressions like "smooth sailing," and "rough road ahead." In one study, people who completed a puzzle with pieces that had been covered in sandpaper later described an interaction between two other individuals as more difficult and awkward than those whose puzzles had been smooth. (Tip: Never try to buy a car or negotiate a raise while wearing a wool sweater. Consider satin underpants instead. Everything seems easy in satin underpants.)

3-How to manage your time better: M. Weick & A. Guinote (2010) How long will it take? Power biases time predictions. Journal of Experimental Social Psychology.

You can learn to more accurately predict how long something will take and become a better planner, if you stop and consider potential obstacles, along with two other factors: your own past experiences (i.e., how long did it take last time?), and all the steps or subcomponents that make up the task (i.e., factoring in the time you'll need for each part.)

4-How to be happier: J. Quoidbach, E. Dunn, K. Petrides, & M. Mikolajczak (2010) Money giveth, money taketh away: The dual effect of wealth on happiness. Psychological Science, 21, 759-763.

The basic idea is that when you have the money to eat at fancy restaurants every night and buy designer clothes from chic boutiques, those experiences diminish the enjoyment you get out of the simpler, more everyday pleasures, like the smell of a steak sizzling on your backyard grill, or the bargain you got on the sweet little sundress from Target. Create plans for how to inject more savoring into each day, and you will increase your happiness and well-being much more than (or even despite) your growing riches. And if your riches aren't actually growing, then savoring is still a great way to truly appreciate what you do have.

5-How to have more willpower: M. Muraven (2010) Building self-control strength: Practicing self-control leads to improved self-control performance. Journal of Experimental Social Psychology, 46, 465-468.

New research by Mark Muraven shows that our capacity for self-control is surprisingly like a muscle that can be strengthened by regular exercise. Do you have a sweet tooth? Try giving up candy, even if weight-loss and cavity-prevention are not your goals. Hate exerting yourself physically? Go out and buy one of those handgrips you see the muscle men with at the gym - even if your goal is to pay your bills on time. In one study, after two weeks of sweets-abstinence and handgripping, Muraven found that participants had significantly improved on a difficult concentration task that required lots of self-control. Just by working your willpower muscle regularly, engaging in simple actions that require small amounts of self-control - like sitting up straight or making your bed each day - you can develop the self-control strength you'll need to tackle all of your goals.

6-How to feel more powerful: D. Carney, A. Cuddy, and A. Yap (2010) Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21, 1363-1368.

In the animal kingdom, alphas signal their dominance through body movement and posture. Human beings are no different. The most powerful guy in the room is usually the one whose physical movements are most expansive - legs apart, leaning forward, arms spread wide while he gestures. The nervous, powerless person holds himself very differently - he makes himself physically as small as possible: shoulders hunched, feet together, hands in his lap or arms wrapped protectively across his chest. We adopt these poses unconsciously, and they are perceived (also unconsciously) by others as indicators of our status. In this study, posing in "high power" positions not only created psychological and behavioral changes typically associated with powerful people, it created physiological changes characteristic of the powerful as well. High-power posers felt more powerful, were more willing to take risks, and experienced significant increases in testosterone along with decreases in cortisol (the body's chemical response to stress).

Search all text files by content

Finding a text file when you don't remember the file name, or where you stored it on the hard drive, can be a nightmare, especially when the Windows file search fails. PowerShell (under Windows) comes to the rescue - all you have to do is open a PowerShell console (in newer versions it comes with the Windows OS; in older ones you might need to download it) and then:
  1. Make sure that in the PowerShell prompt you are within the drive you want to search (i.e. use cd and cd .. commands to get there, or type cd c: if you need to search the C drive).
  2. Get-ChildItem -Recurse -Include *.txt | Select-String "search string"
where search string is simply a piece of text that you know is in the contents of the file. For example, in my case I typed ANOVA, since I was looking for my notes on ANOVA tests. You can also use regular expressions, since PowerShell's Select-String cmdlet treats its pattern as a regex by default (use -SimpleMatch if you want a literal match).

Saturday, May 19, 2012

Sports Informatics and Social Collaboration

After a long time, I felt it was about time for a quick update on my blog. Things have been rather hectic in the last few months. I'm still recuperating from my torn-ligament injury, but I've been working for five months as a Research Associate in Sports Informatics in the Computer Science department at Loughborough University. The work involves grass-roots research into various areas of computer-science application within sport, and I am tasked with organising seminars and a symposium, in addition to initial academic research. There are numerous potential applications, such as image analysis in sports science, virtual-reality uses in sports science, coaching applications, or AI / intelligent systems in sports-data management and systems solutions for sports science.

The table below (Lames 2012 - Departmental Presentation) illustrates the two-way relationships between computer science and sports science subjects. Essentially any work at these subject intersections is known as the field of sports-informatics (see the IACSS association website, which is an umbrella association for this type of work).
My task in this research position is to establish research links for the department with national and international research centres. Loughborough has a strong tradition of internationally excellent sports-science research (see SSEHS or STI, for instance), and there are many potential applications within the computer science department, for example image-analysis-based tracking of team players, or the use of machine learning to detect team-play patterns. To me, the area of most interest is the application of communications / social-media technologies in sport. Some recent work has looked at several sporting events and analysed the social-media UGC (User Generated Content): what people talk about in relation to sporting events, how the fan-athlete relationship is changing compared with traditional media, and whether any revealing information is shared (Pegoraro 2010, Kassing and Sanderson 2010). I am especially curious whether Twitter and other social-media contributions may be revealing in relation to, for example, draft picks, line-ups and team-play / strategy changes (in other words, problems of talent detection and coaching). There is also more work to be done investigating fan-athlete communication, such as predicting how likely an athlete / celebrity is to respond with a direct message to fans, identifying fans and classifying them based on the dynamics of their interactions, or correlating match tracking data with social-media contributions, since this type of work has not seen much research.

I am also beginning a new RA position in the All-in-One project (funded by the EPSRC) at Leicester University. This project looks at single-infrastructure provision and its technological and scientific feasibility within the next 100 years, motivated by climate change, cost reduction, and efficient use of utilities (see this working paper for a basic introduction). My main task in the project is to work on a collaborative, web-based (web 2.0 / social-media type) system that facilitates collaboration and sharing within an academic and also a wider citizen-science community. This is an interesting area of work, with questions such as: how do you design a system that facilitates efficient, social, web-based collaboration among many individuals, and how do you attract and maintain an active user base of contributors and collaborators? There is some very interesting research in this area from the Climate CoLab project at MIT's Collective Intelligence Centre, and the CSCW conference, for example, contains highly relevant contributions that help answer these questions. My work within the All-in-One project involves deploying a collaborative system and processes based on the evaluation of prior academic research. Some of my PhD work, such as the design of the Newsmental system, is relevant here, and it will be interesting to put the concept of collective intelligence into practice within a larger-scale project such as this one.

References:


  • Pegoraro, A., 2010. Look Who's Talking - Athletes on Twitter: A Case Study. International Journal of Sport Communication, 3, pp. 501-514
  • Kassing, J. W. & Sanderson, J., 2010. Fan-Athlete Interaction and Twitter Tweeting Through the Giro: A Case Study. International Journal of Sport Communication, 3, pp. 113-128

Tuesday, March 6, 2012

Simplicity in Web Apps

Dr. BJ Fogg from Stanford University does some interesting work on how Web 2.0 relates to human interaction... he even teaches a course at Stanford fully dedicated to Facebook :-)

Anyway, this is his model of simplicity as it relates to web apps, a very brief and rough intro, but maybe you'll find it useful - http://behaviormodel.org/ability.html

What I took away from it:

  • An ugly / simplistic but useful definition of SIMPLICITY: The minimally satisfying solution at the lowest cost.
  • Simplicity is contextual, i.e. it depends on the situation or person (not necessarily the product)
  • Simplicity is a function of your scarcest resource at that moment, where Fogg identifies these resources: Time, Money, Physical Effort, Brain Cycles, Social Deviance (i.e. going against socially acceptable norms), Non-Routine
Interesting stuff, he has many more resources on his web pages.