The Deadly Sins of Social Media Analysts

In case you haven’t heard, there’s a general election coming up here in the UK. This has caused a spike in the number of analyses I’m seeing from social media companies, a large number of them containing some horrific “sins against data science”. Here are a few of my favourites:

1. Thinking Social Media Matters

Unless you are actually studying social media, i.e. your findings relate strictly to social media and you are not trying to project your insights onto the general population as a whole, then social media doesn’t matter a jot. It doesn’t matter because it is not a representative sample of the population. Your social media sample is at least “doubly self-selecting”.

Firstly, it’s self-selecting because it only includes that part of the population who use social media. Secondly, it’s self-selecting because, even amongst people on social media, your sample only contains those people who are talking about the subject you are interested in. Thus, when trying to gain insight into what the general population think about a particular subject, social media alone is useless.

This sin manifests itself in headlines such as “65% of people think UKIP will be a positive influence in government”. To be accurate, this headline would have to be rewritten as, “A proportion of the population of the UK are on social media and of those, a proportion are talking about the election. We took an unrepresentative sample of those people and analysed their posts. Of those posts, 65% indicated a belief that UKIP will be a positive influence in government.” Somehow, this isn’t as catchy a headline.

2. Thinking Size Matters

I was recently speaking at a conference and I got into a conversation with an attendee that went something like this… “I see you and I are doing the same kind of analysis, my analysis doesn’t agree with yours and since I have two orders of magnitude more data than you, you should stop saying the stuff you’re saying and agree with me.”

Being the polite and mild-mannered chap that I am, I explained that I drew my data from the Twitter stream API, which gives me a statistically representative sample of all of the tweets generated (even then it’s not perfect). So, when I say “12% of tweets show such and such”, I can say that with confidence, even though I’m not looking at all tweets, because I have a representative sample.

The chap talking to me, however, also takes tweets from the same feed, from other feeds, and from selected people who he deems to be “subject matter experts”, and collects all that data into one large Bucket-O-Crap.

In short, he started with a statistically sound sample, then diluted it until the sample was worthless. He was fixated on the fact that his sample size was two orders of magnitude larger than mine, but lost sight of the fact that what he now had was garbage.
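To make the dilution point concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the 12% "true" share, the 40% share among hand-picked accounts and the sample sizes are all assumptions, not figures from either analysis.

```python
import random

def share_with_tag(sample):
    """Fraction of tweets in the sample that carry the tag of interest."""
    return sum(sample) / len(sample)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

TRUE_SHARE = 0.12    # assumed share of ALL tweets carrying the tag
EXPERT_SHARE = 0.40  # assumed share among hand-picked "expert" accounts

# A random sample of the stream: each tweet carries the tag
# with probability TRUE_SHARE.
random_sample = [rng.random() < TRUE_SHARE for _ in range(2_000)]

# Two orders of magnitude more tweets, but from hand-picked sources.
hand_picked = [rng.random() < EXPERT_SHARE for _ in range(200_000)]

# The "bucket": the good sample swamped by the biased one.
bucket = random_sample + hand_picked

random_estimate = share_with_tag(random_sample)
bucket_estimate = share_with_tag(bucket)

print(f"random sample:  {random_estimate:.3f}")
print(f"diluted bucket: {bucket_estimate:.3f}")
```

Under these assumptions the small random sample lands close to the true share, whilst the much larger bucket mostly reports what the hand-picked accounts were saying.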

3. Using Concrete Numbers in Findings

This brings me on nicely to my next “sin”, and that is the use of concrete numbers in findings. I see lots of blog posts that say “x number of tweets were tagged #VoteMuppetParty”. No they weren’t. That is how many were in your sample. Unless you forked out for the Firehose feed, that number is nowhere near a reflection of how many tweets were actually posted.

What you can legitimately say is that “30% of tweets were tagged #SuchAndSuch whilst 40% were tagged #ThisAndThat”. You can do this because your sample is representative, so the ratios will be (near enough) the same in your sample as in the whole Twitter population.
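As a rough sketch of reporting shares rather than counts, the Python below turns sample counts into percentages and attaches the textbook 95% margin of error for a proportion. It assumes a simple random sample, and the counts and sample size are invented for illustration.

```python
from math import sqrt

def share_with_margin(tag_count, sample_size, z=1.96):
    """Return (share, margin) for a tag in a simple random sample.

    Uses the normal approximation for a proportion; z=1.96 gives
    roughly a 95% margin of error.
    """
    share = tag_count / sample_size
    margin = z * sqrt(share * (1 - share) / sample_size)
    return share, margin

# Hypothetical numbers, not real data.
SAMPLE_SIZE = 1_000
tag_counts = {"#SuchAndSuch": 300, "#ThisAndThat": 400}

results = {tag: share_with_margin(n, SAMPLE_SIZE) for tag, n in tag_counts.items()}
for tag, (share, margin) in results.items():
    print(f"{tag}: {share:.0%} of sampled tweets (+/- {margin:.1%})")
```

The raw counts (300 and 400) say nothing about how many tweets were posted overall; the shares, with their margins, are the part that carries over to the wider stream.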

“Wait a minute!” I hear you say, “I’ve read your blog, Gary and you do that, you use concrete numbers in this chart”:


It’s true, I do. Here I use concrete numbers because the difference between the number of posts per hour for the highest and lowest performing parties means I have to use a logarithmic scale on the Y axis in order to show all the data. The first time I used this chart, I took the trouble to explain that the numbers were there to show orders of magnitude differences only, and should not be taken as concrete values.

4. Averaging Ordinal Data

This is my pet hate and you see it everywhere (even the BBC). So what do I mean by this? Well, let’s say we had a class of pupils all taking a test out of 100. We can add up their scores and divide by the number of pupils to get an average score for the test. We can do this because the data is numeric and so the difference between 1 and 2 is the same as the difference between 8 and 9.

Not so with ordinal data. Ordinal data implies some form of rank, but the difference between each point on the scale is not the same, nor is it the same for each person questioned. For example, if I say rate something between 1 and 5 where 1 is terrible and 5 is great, each person questioned has to divide the numbers between 1 and 5 and attach their own value to each. These will vary from person to person.

Even if I break down each of the numbers and attach a value myself, each person questioned will have a different idea of what each means. For example, if I say rate this from 1 to 5 where 1 is very bad, 2 is bad, 3 is neither good nor bad, 4 is good and 5 is very good, each person will have their own idea of what the difference between good and very good is. When dealing with numerical data, everyone knows the difference between 3 and 4 and it is the same for everyone.

For these reasons you cannot average ordinal data. Stop doing it!!

The easy way to remember it is not to let the numeric part fool you. Instead, just use the word values you attached to the numbers. Now answer me this, what is the average of “Good” and “Very good”? It’s meaningless, right? Right! Well, so is the average of 4 and 5 on an ordinal scale.
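If you do need to summarise ordinal answers, one common approach is to report the distribution, the most frequent answer, and perhaps the middle answer, all of which rely only on the order of the labels and never on the distance between them. Here is a minimal Python sketch of that idea; the responses are invented, and this is an illustration rather than a prescription.

```python
from collections import Counter
from statistics import median_low

# Ordered labels, worst to best. Only the ORDER is used below.
SCALE = ["Very bad", "Bad", "Neither", "Good", "Very good"]

# Hypothetical survey responses.
responses = ["Good", "Very good", "Good", "Bad",
             "Neither", "Good", "Very good", "Very bad"]

counts = Counter(responses)

# Full distribution, in scale order.
distribution = {label: counts[label] for label in SCALE}

# Most frequent answer (first in scale order if there is a tie).
most_common_label = max(SCALE, key=lambda label: counts[label])

# Middle answer: sort by position on the scale and take the lower of
# the two middle items, so the result is always a real label.
positions = sorted(SCALE.index(r) for r in responses)
median_label = SCALE[median_low(positions)]

print(distribution)
print("most common:", most_common_label)
print("middle answer:", median_label)
```

Note that nothing here adds the numbers up or divides them, so there is no pretence that “Good” is exactly one unit away from “Very good”.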

Well, that’s it for this post. If you are doing social media analysis, please make sure you don’t “sin against data science” and, until next time, keep crunching those numbers!

Posted in Data Science, Statistics

UK General Election Day 30 Twitter Analysis

This is the analysis for Wednesday 29th April 2015.


There are two stand-out items on the chart today, the first being that #UKIP are in the number 3 spot, and the second being the continuing rise of #SNPOUT. The more the SNP refuse to rule out a second referendum, the stronger this hashtag grows. #VOTESNP vs #SNPOUT is now 4:1, down from 54:1 at the start of the campaign.
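For anyone wanting to reproduce this kind of tag-versus-tag ratio, here is a minimal Python sketch. The regular expression, the once-per-tweet counting rule and the sample tweets are all my assumptions for illustration; they are not a description of the pipeline behind the charts.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def hashtag_counts(tweets):
    """Count how many tweets carry each hashtag, ignoring case."""
    counts = Counter()
    for text in tweets:
        # A set means each tag counts once per tweet, however often it repeats.
        counts.update({tag.upper() for tag in HASHTAG.findall(text)})
    return counts

def tag_ratio(counts, tag_a, tag_b):
    """Ratio of tweets carrying tag_a to tweets carrying tag_b."""
    return counts[tag_a] / counts[tag_b]

# Hypothetical tweets, not real data.
tweets = [
    "Proud to #VoteSNP",
    "#votesnp all the way #GE2015",
    "Time for #SNPout",
    "#VOTESNP",
    "#VoteSNP #GE2015",
]

counts = hashtag_counts(tweets)
ratio = tag_ratio(counts, "#VOTESNP", "#SNPOUT")
print(f"#VOTESNP vs #SNPOUT is {ratio:g}:1")
```

Because it is a ratio of two counts from the same sample, it can be compared from day to day even though the raw counts cannot. The sketch will raise a division error if the second tag never appears, which a real script would need to handle.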


The thing that strikes you immediately is the staggering lead that the two SNP accounts have over everyone else; there seems to be little any of the other parties can do to halt the “Cult of the SNP”.




Labour continue to be the most talked about party, with the SNP second and the Tories third. To find out what was said, we’ll take a look at the trigrams.


After polls showing that the SNP may win every seat in Scotland, the topic of conversation on the “CyberNats”-dominated stream is that the more seats they win, the stronger Scotland’s voice will be.

Let’s see what else made the news yesterday:

The Conservatives promise a law guaranteeing no rise in income tax, national insurance or VAT before 2020, prompting “TAX RISES LAW” at 1,237.

But Labour say Tory plans would mean cuts to tax credits totalling £3.8 billion, bringing “CUTTING TAX CREDITS” in at 1,118.

The Lib Dems pledge to offer free school meals to all children in England, pushing “FREE SCHOOL MEALS” into the number 35 spot.

Russell Brand released his interview with Ed Miliband. The Labour leader told the comedian he was wrong to say that voting was pointless, prompting “MILIBRAND INTERVIEW GE2015” at 240.

UKIP leader Nigel Farage warned of an influx of Islamic extremists if Europe’s doors were opened to large numbers of people fleeing conflict zones, putting “NIGEL FARAGE WARNS” at 769.

Nick Clegg laid down another condition for considering a coalition – he wants a £12,500 personal tax free allowance, bringing “” to the 1,135 spot.
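The positions quoted above are simply a trigram’s place when all trigrams are sorted by how often they occur. A minimal Python sketch of that lookup follows; the counts are invented and the alphabetical tie-break is my own assumption.

```python
from collections import Counter

def trigram_ranks(trigram_counts):
    """Map each trigram to its 1-based position when sorted by frequency.

    Ties are broken alphabetically so the ordering is stable.
    """
    ordered = sorted(trigram_counts.items(), key=lambda item: (-item[1], item[0]))
    return {trigram: position
            for position, (trigram, _) in enumerate(ordered, start=1)}

# Hypothetical counts, not real data.
counts = Counter({
    "VOTE SNP GE2015": 900,
    "FREE SCHOOL MEALS": 120,
    "CUTTING TAX CREDITS": 18,
    "TAX RISES LAW": 15,
})

ranks = trigram_ranks(counts)
print(ranks["FREE SCHOOL MEALS"])
print(ranks.get("STABILITY BUDGET WITHIN"))  # None: did not register
```

A story that “did not register” is one whose trigram is either missing from the counts altogether or sits below whatever cut-off the chart uses.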

Well, that’s all for this post. Until next time, keep crunching those numbers!

Posted in Data Science, Politics, Social Media, Statistics

UK General Election Day 29 Twitter Analysis

Welcome to the analysis for Tuesday 28th April 2015. Let’s jump straight in by looking at the hashtag chart:


As we enter the last stages of the election, all the parties are making their mark on the chart, with the exception of the Greens. The SNP are still well out in front; their expertise, honed on the independence referendum, shines in this campaign – on Twitter. The NHS also reappears on the chart, though it’s never really been far away, even when it’s not been present. The #SNPOUT tag is creeping up the chart too as the election progresses; the #VOTESNP vs #SNPOUT ratio is at 4.9:1, the lowest it’s been throughout the campaign.


Most of the usual suspects appear in today’s chart, with a couple of exceptions; those are @UTVELECTIONS which is an account that carried information about a TV debate in NI and @LIBBY_BROOKS, the Scotland correspondent for the Guardian Newspaper, who had a humorous exchange with Nicola Sturgeon, which the “CyberNats” seemed to enjoy.




Again Labour are the most talked about party, with the SNP second and the Tories third, whilst the number of people talking about UKIP challenges the number talking about the LibDems.


As you can see, the Twitter exchange between Libby Brooks and Nicola Sturgeon delighted the “CyberNats”, with them immediately announcing it the “best tweet of the election, possibly ever”. You can judge for yourself if there might be a little bit of hyperbole there:


Let’s see what else happened in the election news yesterday:

The Conservatives promise another 50,000 apprenticeships paid for by £200 million from Libor fines; this did not register with the Twitterati.

Labour announce a 10-point plan to reform the immigration system, prompting “GE2015 IMMIGRATION PLAN” at 822.

Lib Dems demand a stability budget within 50 days of the next government being formed as a red line for any post-election negotiations, pushing “STABILITY BUDGET WITHIN” into position 1,390.

The Green Party pledged to double child benefit to £40 a week, prompting “CHILD BENEFIT DOUBLE” at 1,411.

Well, that’s it for this post. Until next time, keep crunching those numbers!

Posted in Data Science, Social Media, Statistics, UKGeneralElection2015

UK General Election Day 28 Twitter Analysis

After my little break it’s time to get back on to the analysis for Monday 27th April 2015, starting with the hashtag chart:


All the parties (bar the Greens) are represented on the chart today; it looks like they are finally getting their acts together with Twitter, as the election draws into its final stages.

#WHATMATTERSTOME is a Twitter-based survey by the Scotsman newspaper to sample the attitude of the nation with regard to what they feel is important to them right now.

The #SNP continue to do well, but the rise of #SNPOUT has been quite remarkable; in this chart #VOTESNP is “only” outperforming #SNPOUT by 5:1. Given that more than half of the Scottish electorate claim to support the SNP, this narrowing (from 54:1) is significant.


A good spread of parties on the chart, with the SNP still way out in the lead as the “CyberNats’” experience in the independence referendum is still telling. An example of that is the “group cheer” from the “CyberNats” for @AROBERTSONSNP after his performance on the BBC’s Question Time, placing him at number 4 on today’s chart.



Again we see the trend of Labour being the most popular topic of conversation in every hour.
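One way to arrive at a “most talked about party in every hour” result is to bucket tweets by hour and count party mentions within each bucket. The Python below is only a sketch of that idea: the keyword lists, the substring matching and the sample tweets are all assumptions of mine, and real matching would need to be far more careful (for instance, “snp” also matches “#snpout”).

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical keyword lists, for illustration only.
PARTY_TERMS = {
    "Labour": ("labour", "miliband"),
    "SNP": ("snp", "sturgeon"),
    "Conservative": ("tory", "tories", "conservative", "cameron"),
}

def mentions_by_hour(tweets):
    """tweets: iterable of (ISO timestamp, text) pairs.

    Returns {hour: Counter(party -> number of tweets mentioning it)}.
    """
    hourly = defaultdict(Counter)
    for stamp, text in tweets:
        hour = datetime.fromisoformat(stamp).hour
        lowered = text.lower()
        for party, terms in PARTY_TERMS.items():
            if any(term in lowered for term in terms):
                hourly[hour][party] += 1
    return hourly

def top_party_per_hour(hourly):
    """Pick the most mentioned party for each hour that has data."""
    return {hour: counts.most_common(1)[0][0]
            for hour, counts in sorted(hourly.items())}

# Hypothetical tweets, not real data.
tweets = [
    ("2015-04-27T09:05:00", "Labour pledge on stamp duty"),
    ("2015-04-27T09:40:00", "Miliband on housing"),
    ("2015-04-27T09:50:00", "SNP poll lead grows"),
    ("2015-04-27T10:10:00", "Labour and the Tories clash"),
    ("2015-04-27T10:20:00", "Labour manifesto"),
]

hourly = mentions_by_hour(tweets)
leaders = top_party_per_hour(hourly)
print(leaders)
```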


The zeitgeist is dominated by the TNS poll showing the SNP with 54% of the vote in Scotland. This poll does, however, hint at something interesting, something that won’t actually be resolved until the election itself.

In 1992 the Conservatives won the election against the polls. Later analysis concluded that this was due to it being “socially unacceptable” to be a Tory supporter, at that time; so when asked which party respondents were going to vote for, they either lied or said they were undecided.

Now, although the headline figure in the poll says 54% for the SNP, when you dig into the charts and data, you find two interesting things. One is that 29% of respondents said they were undecided. The second is that the question about voting intentions, which garnered the 54% support for the SNP, wasn’t a straight question along the lines of who they’d vote for; in fact, the question was:

“The next General Election for the Westminster Parliament will be held in May 2015. Which party do you intend to vote for in that election? (Respondents initially stating undecided/refused then asked: Which party would you be most inclined to vote for in a General Election for the Westminster Parliament?)”

So when asked, a percentage of people stated that they were undecided or refused to answer and only when “pushed” did they give an answer, some for the SNP and some for other parties (we must assume). It is this “final answer” that was taken and the 54% figure obtained.

Now these interviews were conducted face to face and it is arguable that it is not “socially acceptable” to support another party in Scotland at this time. Therefore, there is a possibility that the “1992 phenomenon” is at work here. However, we will have to wait and see until the election to find out. It’s just one of the many interesting questions that will be resolved when votes are actually cast.

Right, after that slight diversion, let’s look at what else was happening yesterday, and what the Twitterati had to say about it:

The Liberal Democrats say education funding will be a “red line” in any coalition negotiations, prompting “EDUCATION RED LINE” at number 467.

Labour says it would exempt first-time buyers from stamp duty on homes worth up to £300,000; this was not mentioned.

Nicola Sturgeon says Labour has been “bullied” into ruling out a coalition with the SNP, prompting “ED MILIBAND BULLIED” at 1,759.

A letter signed by 5,000 small businesses backs the Conservatives, getting “SMALL BUSINESS LETTER” to 248.

The Greens said they would take away the “right-to-buy”, prompting “END RIGHT BUY” at 369.

Well, that concludes this post. Until next time, keep crunching those numbers!

Posted in Social Media, Statistics, UKGeneralElection2015

UK General Election Day 22 Twitter Analysis

After yesterday’s detour to look at the parties’ manifestos, we return today to looking at the Twitter stream for Tuesday 21st April 2015, starting with the Hashtag Chart:


The SNP and Labour continue the trend of topping the chart. The election looks as if it may boil down to the choice between a Tory/LibDem coalition and a Labour/SNP coalition. However, as the Tories and the LibDems do not feature as highly (the Tories at 14, the LibDems at 19), this result is probably best explained by the fact that the election stream is still dominated by Scottish politics and the SNP and Labour are engaged in a knife fight there.

#MiliFandom makes an appearance at number 5 after an account started posting pics of Miliband’s face pasted onto other characters. #SNPOut is still showing strong growth at number 6. The NHS is never far from the electorate’s mind, occupying positions 8 and 9, whilst the ever popular #IndyRef tag closes us out at number 10. It doesn’t seem to matter how many times Sturgeon and the SNP say this election is not about independence, it’s always at the forefront of the “CyberNats’” minds.

The #STUC15 tag, about the Scottish Trades Union Congress, makes an appearance after Labour and the SNP both made popular speeches there.


The national news accounts are starting to grow in popularity as the electorate take more of an interest in the election and start to comment on interviews as they are happening.



Due to what the BBC would call a “slight technical hitch”, but what I’ll just call a “cock up”, there’s no data for the first part of the day. However, from what we do have, it’s pretty clear that Labour maintain their position as the party that most people are talking about, having again topped the posts in every hour.


The fact that Labour and the SNP dominate the chatter on the election stream is borne out by the graphic above, showing those tweets that are geocoded and the party that is the subject of the tweet.


Major saying that a Lab/SNP coalition would result in a “daily dose of blackmail” is the leading story in our trigrams chart, followed by teenage girls declaring themselves members of #milifandom.

Let’s see what other election news hit the headlines and how it played out on Twitter:

The LibDems launched their Scottish manifesto; “LAUNCHES GE2015 MANIFESTO” made it to 144 in the chart.

Labour says it would launch what it calls an “NHS rescue plan”, including a recruitment drive for 1,000 new nurses, prompting “FORTNIGHT RESCUE NHS” at 324.

Ed Miliband accuses David Cameron of putting the union at risk by “talking up” the SNP, leading to “TALKING UP SNP” at 1,292.

Nick Clegg says the Lib Dems would allow councils to charge 200% council tax on second homes in rural beauty spots, causing “TAX DOUBLE UNDER” at 1,549.

BBC Radio One’s Newsbeat stages an hour-long debate on health, education and immigration for 100 young adults, prompting “NEWSBEAT GE2015 DEBATE” at 1,523.

Conservative chairman Grant Shapps has said a Guardian story linking him with changes to Wikipedia pages is “the most bonkers story” of the campaign so far, leading to “GRANT SHAPPS EDITED” at 135.

Well, that’s it for this post. ‘Til next time, keep crunching those numbers!

Posted in Data Science, Social Media, Statistics, UKGeneralElection2015

2015 General Election Party Manifesto Analysis

Now that all the parties have their manifestos out, let’s have a look and see what a word and trigram frequency analysis shows us.
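For readers who want to try this on a manifesto themselves, here is a minimal Python sketch of word and trigram frequency counting. The tiny stop-word list, the tokenising rule and the choice to drop stop words before building trigrams are all assumptions for illustration; the charts in this post may have been produced differently.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in",
              "we", "will", "for", "is", "our"}

def tokens(text):
    """Lower-case words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]

def word_frequencies(text):
    """Count how often each remaining word occurs."""
    return Counter(tokens(text))

def trigram_frequencies(text):
    """Count each run of three consecutive remaining words."""
    words = tokens(text)
    return Counter(zip(words, words[1:], words[2:]))

# Hypothetical text, not taken from any manifesto.
sample = ("We will support working people. "
          "We will support working people and local jobs.")

words = word_frequencies(sample)
trigrams = trigram_frequencies(sample)
print(words.most_common(3))
print(trigrams.most_common(1))
```

One design point worth knowing: because punctuation is thrown away, this sketch lets trigrams run across sentence boundaries. Splitting into sentences first would avoid that at the cost of a little more code.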

The Conservatives


Looking at the most common three word collocations, it’s clear that the main message from the Conservatives is one of “steady as she goes”: re-elect us and it’ll be more of the same.


“People” are close to the top of the Conservative manifesto, when looking at the term frequency of the words used, along with “Support”, “Work” and “Plan”. “Help” also makes an appearance and “Tax” is never far away from a governing party’s thoughts.

The Labour Party


Labour seem quite introverted when you examine their trigrams, with the party featuring in 4 out of the top 10 places.


Like the Tories, when you examine the term frequencies of words used, we see that Labour focused on “people” and “work”, with “local” and “support” being popular too.

The LibDems


Much like the Tories, their coalition partners, the message from the LibDems is one of “steady as she goes”.


Their term frequencies, meanwhile, make a lot of “Support”, “Work” and “Local”.



The SNP

The main message from the SNP’s manifesto seems to be one of working across the UK, even though they are a party for whom only Scots can vote.


When you look at the frequency of words used, we see a picture of “Scotland” and party before “UK” and “people”. Having said that, this is a common theme amongst the nationalist parties, as we shall see.

Plaid Cymru


Looking at the trigrams for Plaid we see that, like the SNP, they are very focused on party and country.


The idea of nationalist parties being about country and party before people is further supported when we have a look at the term frequencies for the words used in the manifesto.



UKIP

Looking at the trigrams for UKIP, it can be seen that they demonstrate the characteristics of a nationalist party. Like the SNP and Plaid Cymru, they are obsessed by the idea of nationhood, in this case the nation of the UK outside of the EU.


Again, when we examine the word frequencies, we can see that UKIP value party and country above people, a nationalist party trait, as we’ve seen. Looking at this election’s crop of manifestos, there is evidence to support the idea that UKIP is the UK’s third nationalist party.

The Green Party


The message from the Green Party’s trigrams is that we all have to act for the common good.


Looking at the word frequencies, we can see that “People” are important to the Greens, as is doing things at a “Local” level.

Now that we’ve looked at the parties’ manifestos in general, let’s look at how well they did at talking about the issues that voters were interested in. According to a recent BBC survey, the top 5 topics that voters are interested in are the NHS, the economy, immigration, welfare and jobs.

To calculate how well each of the parties talked about the issues, I took the term frequency of each topic, then normalised the count by the number of non-stop-words (the words left once stop words are removed) in the party’s manifesto. The results are as follows:


The above graph shows how well the parties did on a topic by topic basis, but it’s hard to see who came out on top. To discover that, we can sum the normalised scores for each topic, like so:


Now we can see that the Tories talked more often about the issues that the voters were interested in, whilst the SNP talked least about them.
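The scoring described above can be sketched in a few lines of Python. Treat this as an approximation of the idea rather than the exact method: the stop-word list, the single-word topic terms and the sample text are all assumptions of mine.

```python
import re
from collections import Counter

# Illustrative stop words and topic terms (assumed, not the real lists).
STOP_WORDS = {"the", "and", "of", "to", "a", "in"}
TOPICS = ["nhs", "economy", "immigration", "welfare", "jobs"]

def topic_scores(manifesto_text):
    """Term frequency of each topic, divided by the number of
    non-stop-words in the manifesto."""
    words = [w for w in re.findall(r"[a-z]+", manifesto_text.lower())
             if w not in STOP_WORDS]
    if not words:
        return {topic: 0.0 for topic in TOPICS}
    counts = Counter(words)
    return {topic: counts[topic] / len(words) for topic in TOPICS}

def total_score(scores):
    """Sum the normalised scores across all topics."""
    return sum(scores.values())

# Hypothetical text, not taken from any manifesto.
sample = "The NHS and the economy need jobs and more jobs"

scores = topic_scores(sample)
total = total_score(scores)
print(scores)
print(f"total: {total:.3f}")
```

Normalising by manifesto length matters because the manifestos differ in size; without it, a longer document would score higher simply by containing more words.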

Well, that’s all for this post. Until next time, keep crunching those numbers!

Posted in Data Science, Social Media, Statistics, UKGeneralElection2015

UK General Election Day 20 Twitter Analysis

This is the election analysis for Sunday April 19 2015. Let’s get started with the hashtag chart:


Sunday mornings in the UK are given over to politics on the main TV channels, and as you can see from the tag chart, both the SNP and the Conservatives appeared on the main programme, The Marr Show.

Labour and the SNP lead the chart; this is in line with the trend we’ve been seeing over the last few days, whereby the chatter is all concentrated around these two parties. #RegisterToVote is still doing well ahead of today’s deadline to register for the election.

#VoteConservative gains in popularity and hits our chart at number 10. The polls have been pretty stagnant, with neither the Tories nor Labour managing to gain any sort of lead on the other outside the pollsters’ margin of error. The last couple of days have seen an upward trend in popularity for the Tories’ tag; it’ll be interesting to see if this trend continues and, if so, if it filters through to the polls.

#SNPOut is at number 7 in our chart and continues to grow in popularity. The ratio of #VoteSNP tweets to #SNPOut tweets is now 5:1, the lowest it’s ever been, and down from a high of 54:1.

Outside of the chart #IndyRef continues its popular trend at number 16, whilst #The45 slumps to 70 and #Trident is down at 56.


After appearing on the Marr Show, both Sturgeon’s and Cameron’s accounts are popular in the chart, as is the Marr Show account itself. The account of Stewart Hosie makes an appearance after he said, in a news interview, that the SNP would never vote for Trident.


The image of tweets by location continues to show Labour dominating the chatter, followed by the SNP, with a little splattering of the other parties mixed in. Again, the trend is for the SNP support to be concentrated in the Central Belt of Scotland.




As you can see from the charts above, the trend continues to be that Labour are the most talked about party in every hour of the day. But we have to look at the Zeitgeist below to find out what they were saying.


Urging people to register to vote, before today’s deadline, tops the chart; after that, all the chatter is about how Cameron refused to share a sofa with Sturgeon on the Marr Show.

This was not the most interesting part of the show from a data science viewpoint, however. As a data scientist, the most interesting part, for me, was when the host, Andrew Marr, quizzed the Prime Minister about the death of soldier David Clapson. David died last year of diabetic ketoacidosis. At the time of his death, his job seeker benefits had been stopped due to him missing mandatory meetings with his advisor. He had no electricity and his medication had to be refrigerated. At the time, the media spun this story as “Callous Government benefit sanctions resulted in the death of a soldier”.

This tragic story is interesting from a data science point of view. Firstly, the spin put on this by the media doesn’t fit the facts. Whilst it is true that benefits are sanctioned for missed meetings, Clapson would have been informed of this in advance. Secondly, in the UK utility companies will not cut off your electricity supply if you are deemed to be vulnerable. Having diabetes put Clapson firmly in this category. Thirdly, his sister said in an interview that he may have stopped taking the drugs because he had become depressed with this situation.

So despite a very easy fact check, pointing to a more probable reason for this poor man’s death, the BBC still decided to couch the question, to the PM, in terms of Clapson’s death being directly caused by the stopping of his benefits.

With the interview of the PM following directly after the interview with Sturgeon, and therefore sure to have a large SNP audience, is it possible that the BBC wanted to show that they were not biased towards the establishment, given what happened to the BBC’s Nick Robinson during the referendum debate? If this is not the case, what else would have prompted the BBC to have posed the question in such inflammatory terms? This question bears further research, but not by me, as it is outwith the scope of this Twitter analysis.

The second reason that this is interesting from a data science point of view, is that when you analyse the trigrams, you find that the plight of this soldier is not mentioned at all. The BBC gave ample ammunition to the political opponents of the government (mainly the SNP in this stream), but they seemed to have missed it; preferring instead to focus on Sturgeon, how well she performed, and the fact that Cameron would not share a sofa with her.

Now it’s time to look at the other election news and to see how it played out on Twitter.

David Cameron outlined Lloyds share sale plan, prompting “CHEAP LLOYDS SHARES” at 235. He also warned against SNP influence in UK government, which hit 1,276 with “SCOTLAND INFLUENCES UK”.

Nicola Sturgeon ruled out any deal with the Conservatives during her Andrew Marr Show appearance; “RULE OUT DEAL” makes it in at 143.

Lib Dem Vince Cable said it would be difficult to work with either Labour or the Conservatives, but they would; this did not register in the top 2,000 trigrams.

Labour focused on the NHS, saying the Conservatives would cut the number of nurses in England, prompting “AXE 2,000 NURSES” at 759.

Nicola Sturgeon said SNP MPs would be a “constructive” force at Westminster after the election, dismissing David Cameron’s claim that they would be “coming to Westminster to break up our country” – and a Labour claim that the Tories and SNP wanted each other to do well. This led to “COME WESTMINSTER CONTRIBUTE” at 268.

Well, that’s all for today. Until next time, keep crunching those numbers!

Posted in Data Science, Social Media, Statistics, UKGeneralElection2015