Advice to Data Science Noobs

Recently I was at Dev Day in Poland, where Chad Fowler gave a talk on The Passionate Programmer, 10 Years On. The talk was good, and at one point he spoke about advice a mentor had once given him: do these six things for success in your field. Chad then went on to talk about the six things.

His list was particular to him and not relevant here, but it got me thinking: what six things would I tell a noobie data scientist to learn? Well, I thought about it, and here are mine:

  1. Learn statistics
  2. Learn a statistics language (R, SPSS, SAS)
  3. Learn a scripting language, preferably Python
  4. Learn a machine learning library, preferably scikit-learn
  5. Learn an RDBMS
  6. Learn a document database

I hope this list helps you. Feel free to leave yours in the comments.

Posted in Data Science | 1 Comment

Why Everyone Should Calm Down About IndyRef2

Nicola Sturgeon announces a “possible” timeline and actions that “might” trigger a second referendum and everyone goes mad. The Separatists go mad because it’s another opportunity to see their dreams come to fruition, and the Unionists go mad because the last referendum cost £13.3 million and fully occupied the Scottish Government (SNP) for 4 years when they should have been focused on running the country and, well… it’s not even been a year since we had the last one.

Barely a month ago the SNP came in for sharp criticism because they refused to debate a second referendum at their conference; then, just last week, Sturgeon was targeted by her own Cybernats after welcoming the Queen to Scotland to open the new Borders rail line. The SNP know that they have a powerful weapon in the form of the Cybernats, but the Cybernats are fundamentalists when it comes to the question of independence and they will turn on anyone who they see as standing in their way, including the SNP leadership.

This is the context for the announcement regarding the timetable for a second referendum. Sturgeon has to settle her own troops before they openly rebel. Of course we’ll have to wait for the actual wording of the manifesto to see what she has to say, but it very much looks like a sop to the Cybernats – the wording of the announcement is couched very much in terms of “possible”, “might” and “maybe”. The fact that this announcement came the same weekend as the new Labour leader was announced shows us that it’s more a show of strength to Labour than it is a serious commitment to a second referendum.

So why should we calm down? Well, for the simple reason that the SNP were defeated on the fundamentals of independence and those have not changed; namely:

  1. The breadth and depth of the defeat. The SNP lost in every demographic except one, including the youngest voters, which put the lie to the claim that the SNP were “doing it for the next generation”. The next generation spoke, and they said we’re happy with the Union, thanks very much. On top of that, they lost in every constituency bar four. The SNP know these numbers and they know that they have a lot of work to do in order to win a second referendum.
  2. The European Question. The SNP want to claim that an “out” vote in the forthcoming UK-wide EU referendum would trigger another independence referendum, as Scotland should not be dragged out of the EU against the will of its citizens. However, the SNP must then deal with the issue that Scotland becoming independent would result in her falling out of the EU with no guarantee of an early re-entry or that it would be on such favourable terms as the UK negotiated.
  3. The Currency. Salmond famously asserted that there would be a currency union with the rest of the UK. The Westminster Government stated, categorically, that this would not be the case. The voters didn’t believe Salmond’s assertions and the lack of a credible “plan B” caused some to vote “no”. The question of what an independent Scotland will use for a currency still has to be answered.
  4. The Economy. The SNP, famously, asserted that Scotland pays in more to the UK than it gets out and that oil would be $130 a barrel. Well the SNP’s own figures subsequently showed that we only paid in more than we got out 3 times in the last 15 years, and the price of oil crashed. The SNP will have to come up with a new, credible, economic plan for an independent Scotland now that the voters have seen what would have happened under the old one.

The SNP leadership will have access to this information, and a lot more besides, and they’ll be in no hurry to have another referendum, so we should all just calm down about it and see it for what it is, an announcement designed as a show of strength to the new Labour leadership and an attempt to whip the Cybernats back into line.

Posted in Politics | Leave a comment

Died After Being Found Fit to Work. Really?

In this post I want to deal with this Guardian story (note: this story was also covered in other newspapers).


In this article the Guardian claim that 2,380 people died shortly after being declared fit to work by the Department for Work and Pensions (DWP). Okay, whilst every death is a tragedy for the individuals’ families, my initial reaction to this is: so what? People die shortly after having consumed breakfast; that doesn’t mean the one caused the other. To find out if this is actually a problem or not, we need to dig into the figures.

The source data for this article comes from this freedom of information release, made by the DWP. There are a couple of pieces of information contained in the release, but the figures we are going to analyse are contained in the first request for information, namely:

Information request 1
The total number of people who have died within a year of their work capability assessment since May 2010

The answer to which was given as:

Total number of individuals with a WCA decision between 1 May 2010 and 28 February 2013: 2,017,070
of which: Number who died within a year of that decision: 40,680

To find out if that figure of ~40K is high or not, we have to do the following things:

  1. Using the UK’s population pyramid for 2015, break the ~2M down by age group.
  2. Annualise that figure for the dates provided (assuming uniform distribution).
  3. Normalise the above rates for working age population.
  4. Find out the Age Standardised Mortality Rates (ASMR) for each of the above groups.
  5. Calculate the expected deaths in each age group.
  6. Sum those and compare to the annualised figures.
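Steps 4 to 6 of the list above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the age bands, claimant counts and death rates below are made-up placeholders, not the DWP or ONS figures, and the names `AgeGroup` and `expected_deaths` are my own.

```python
from dataclasses import dataclass


@dataclass
class AgeGroup:
    label: str
    claimants_per_year: float  # annualised claimants in this age band
    deaths_per_100k: float     # mortality rate for the band, per 100,000 per year


def expected_deaths(groups):
    """Sum, over the age bands, of claimants multiplied by the band's death rate."""
    return sum(g.claimants_per_year * g.deaths_per_100k / 100_000 for g in groups)


# PLACEHOLDER numbers, chosen only to make the arithmetic easy to follow.
groups = [
    AgeGroup("16-24", 100_000, 30.0),
    AgeGroup("25-49", 300_000, 100.0),
    AgeGroup("50-64", 200_000, 500.0),
]

total = expected_deaths(groups)
print(round(total))  # 30 + 300 + 1000 = 1330
```

Swapping in the real age breakdown and the real rates for each band gives the expected figure to compare against the annualised deaths.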

Phew, okay, let’s get started. Using the above link to the UK’s population pyramid, and assuming it’s not changed that much between 2013 and 2015, annualising the figures and then normalising for the working population, we can say that the ~2M figure breaks down, annually, as follows:


Using the above link to the ASMR from the Office for National Statistics (ONS), we can find the death rate per group and from that, calculate how many people, from each group, we would expect to die:


Summing each age group, we find that we’d expect 1,811 claimants to die per year. Our annualised death figures show that, in actuality, 10,539 claimants died, a variation of 8,728.
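For anyone who wants to check the sums, here is the comparison as a tiny Python snippet. The two input numbers are the ones quoted above; the ratio of observed to expected deaths is an extra figure I've added for scale, and it carries all the caveats discussed in the rest of this post.

```python
expected = 1_811   # expected deaths per year, from the age-group sums above
observed = 10_539  # annualised actual deaths among claimants

excess = observed - expected
ratio = observed / expected

print(excess)           # 8728
print(round(ratio, 1))  # 5.8
```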

The sub-text of the news article was clearly that the DWP test is unfair and is forcing people, “on death’s door”, back to work. On the face of it, these figures would support that, but before we come down on one side of this argument or the other, we have to look to see if we can account for the delta between expected and actual numbers in any other way.

Firstly, the DWP figures don’t account for what caused the death. The claimant could have finished the test and been hit by a bus on the way home; the resulting death was nothing to do with their claim. True, but then the ASMR rates take into account deaths from all causes, so we have that covered.

Next, the DWP figures show deaths after a decision was made, but they don’t say what that decision was. Some of these claimants would have continued to receive the benefit, or would have been moved onto other benefits; they were not all necessarily “forced back to work”.

There are also a couple of things that would have depressed our number of expected deaths. Firstly, we assumed that the claimant population mirrors the working population; it doesn’t. We know that the claimant population contains more older males, two categories (older and male) with increased death rates.

On top of this, we assumed that the claimant population mirrors the working population in terms of health (and so risk of death). This, clearly, is not true.

These things will have accounted for some of the ~8K “extra” deaths; how many? Well, we don’t know. I think we have to put a “health warning” on these figures. So, what can we say with confidence? Well, firstly we can say that more people die in the claimant population than in the working population; however, we need to do more research to discover if this difference is significant. The other thing we can say, with confidence, is that the source data does not back up the Guardian’s article.

Well that’s it for this post, ‘til next time, keep crunching those numbers!

Posted in Data Science, Statistics | Leave a comment

Always Check Behind the Headlines

So STV are carrying a news story saying 53% of Scots are now in favour of Independence. Hmm, thinks I, unlikely given the level of upfuckery conducted by the SNP around education and the NHS right now, not to mention the crash in oil price that would leave Scotland 7.2Bn poorer than we are now.

However, there are those dyed-in-the-wool separatists who either have cognitive dissonance around the SNP’s failings, or who don’t care; they’d rather live in abject poverty in an independent Scotland than wealthy in the UK. So, off I trot to look into the data.

Right enough, the headline is strictly true, 53% of those asked did say they’d vote for independence, buuuut, 3% are undecided and there’s a 3% error margin on top of that, so really this poll puts independence neck and neck with unionism, which is pretty much where some polls had us before the referendum, and we know how that turned out. Also, whilst 53% (+/- 3%) said they would vote for independence, only 50% said they wanted another poll within 5 years.
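To see where an error margin of about 3% comes from, here is a rough sketch using the textbook formula for a simple random sample at 95% confidence. The sample size of 1,000 is my assumption (the poll's actual sample size isn't given above), and `margin_of_error` is just a helper name I've made up.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


n = 1000      # ASSUMED sample size, typical for this kind of poll
p_yes = 0.53  # headline share saying they would vote for independence

moe = margin_of_error(p_yes, n)
low, high = p_yes - moe, p_yes + moe

print(round(moe * 100, 1))  # 3.1 (percentage points)
print(low < 0.5 < high)     # True: the interval straddles 50%
```

Under those assumptions the range runs from just under 50% to about 56%, which is why "neck and neck" is a fairer reading than the headline.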

On top of that, the Unionists respected the settled will of the Scottish people after the vote, whilst the Separatists totally ignored the result and carried on campaigning; so this poll is nothing to get worked up about, you would expect one side of a debate to make ground on the other when they are campaigning hard and the other side are not.

Even so, I still wanted to see the data, so I could see for myself what was actually asked and what was answered. Luckily STV supplied a link where I could download the data, and here’s what I got… I shit you not, that’s it. Umm, okay, in no way can that be construed as “the data”. So now you can colour me suspicious of the whole thing.

Posted in Data Science | Leave a comment

The Deadly Sins of Social Media Analysts

In case you haven’t heard, there’s a general election coming up here in the UK. This has caused a spike in the number of analyses I’m seeing from social media companies, a large number of them containing some horrific “sins against data science”. Here are a few of my favourites:

1. Thinking Social Media Matters.

Unless you are actually studying social media, i.e. your findings relate strictly to social media and you are not trying to project your insights onto the general population as a whole, then social media doesn’t matter a jot. It doesn’t matter because it is not a statistically significant sample. Your social media sample is, at least, “doubly self selecting”.

Firstly, it’s self selecting because it only includes that part of the population who use social media. Secondly, it’s self selecting because even amongst people on social media, your sample only contains those people who are talking about the subject you are interested in. Thus, when trying to gain insight into what the general population think about a particular subject, social media alone is useless.

This sin manifests itself in headlines such as “65% of people think UKIP will be a positive influence in government”. To be accurate, this headline would have to be rewritten as, “A proportion of the population of the UK are on social media and of those, a proportion are talking about the election. We took a statistically insignificant sample of those people and analysed their posts. Of those posts 65% of them indicated a belief that UKIP will be a positive influence in government.” Somehow, this isn’t as catchy a headline.
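A toy calculation shows how much damage the two layers of self selection can do. Every number here is invented purely for illustration, and `share_in_sample` is a hypothetical helper, not anything from a real library.

```python
def share_in_sample(groups):
    """Share of posters holding the view, after both self-selection filters.

    groups: list of (population_share, p_on_social_media,
                     p_posts_about_topic, holds_view)
    """
    weight_with_view = 0.0
    weight_total = 0.0
    for pop_share, p_social, p_posts, holds_view in groups:
        weight = pop_share * p_social * p_posts  # chance of ending up in the sample
        weight_total += weight
        if holds_view:
            weight_with_view += weight
    return weight_with_view / weight_total


# INVENTED numbers: 20% of the population hold the view and are keen posters;
# the other 80% mostly keep quiet about it.
groups = [
    (0.20, 0.60, 0.50, True),
    (0.80, 0.50, 0.10, False),
]

share = share_in_sample(groups)
print(round(share, 2))  # 0.6
```

In this made-up case a view held by 20% of the population shows up in 60% of the sample, simply because the people who hold it are more likely to be on social media and more likely to post about it.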

2. Thinking Size Matters

I was recently speaking at a conference and I got into a conversation with an attendee that went something like this… “I see you and I are doing the same kind of analysis, my analysis doesn’t agree with yours and since I have two orders of magnitude more data than you, you should stop saying the stuff you’re saying and agree with me.”

Being the polite and mild mannered chap that I am, I explained that I drew my data from the Twitter stream API which guarantees me a statistically significant sample of all of the tweets generated (even then it’s not perfect). So, when I say “12% of tweets show such and such”, I can say that with confidence, even though I’m not looking at all tweets, because I have a statistical sample.

The chap talking to me however, also takes tweets from the same feed, and from other feeds and from selected people who he deems to be “subject matter experts” and collects all that data into one large Bucket-O-Crap.

In short, he started with a statistically correct sample, then diluted down until the sample was worthless. He was fixated on the fact that his sample size was two orders of magnitude larger than mine, but lost sight of the fact that what he now had was garbage.
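Here is the same point as arithmetic, with invented numbers: a small random sample pooled with a far larger hand-picked collection. `pooled_share` is a made-up helper name.

```python
def pooled_share(batches):
    """Overall share after lumping batches together.

    batches: list of (n_tweets, share_showing_topic)
    """
    total = sum(n for n, _ in batches)
    hits = sum(n * share for n, share in batches)
    return hits / total


# INVENTED numbers for illustration only.
random_sample = (10_000, 0.12)  # a properly drawn sample of the stream
hand_picked = (990_000, 0.40)   # "subject matter experts" and other feeds

diluted = pooled_share([random_sample, hand_picked])
print(round(diluted, 4))  # 0.3972
```

The pooled figure is a hundred times bigger as a data set and tells you almost nothing except what the hand-picked accounts think.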

3. Using Concrete Numbers in Findings

This brings me on nicely to my next “sin”, and that is the use of concrete numbers in findings. I see lots of blog posts that say “x number of tweets were tagged #VoteMuppetParty”. No they weren’t. That is how many were in your sample; unless you forked out for the Firehose feed, that number is nowhere near a reflection of how many tweets were actually posted.

What you can legitimately say is that “30% of tweets were tagged #SuchAndSuch whilst 40% were tagged #ThisAndThat”. You can do this because your sample is statistically correct, so the ratios will be (near enough) the same in your sample as in the whole Twitter population.
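In code, that means reporting shares rather than raw counts. A minimal sketch, with a made-up sample of 100 tagged tweets and a hypothetical `tag_shares` helper:

```python
from collections import Counter


def tag_shares(tags):
    """Turn a list of hashtags (one per tweet in the sample) into shares of the sample."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


# MADE-UP sample: the counts mean nothing outside the sample, the shares do.
sample = ["#SuchAndSuch"] * 30 + ["#ThisAndThat"] * 40 + ["#Other"] * 30

shares = tag_shares(sample)
print(shares["#SuchAndSuch"], shares["#ThisAndThat"])  # 0.3 0.4
```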

“Wait a minute!” I hear you say, “I’ve read your blog, Gary and you do that, you use concrete numbers in this chart”:


It’s true, I do. Here I use concrete numbers because the difference between the number of posts per hour for the highest and lowest performing parties, means I have to use a logarithmic scale on the Y axis in order to show all the data. The first time I used this chart, I took trouble to explain that the numbers were there to show orders of magnitude differences only, and should not be taken as a concrete value.

4. Averaging Ordinal Data

This is my pet hate and you see it everywhere (even the BBC). So what do I mean by this? Well, let’s say we had a class of pupils all taking a test out of 100. We can count up their scores and divide by the number of pupils to get an average score for the test. We can do this because the data is numeric and so the difference between 1 and 2 is the same as the difference between 8 and 9.

Not so with ordinal data. Ordinal data implies some form of rank, but the difference between each point on the scale is not the same, nor is it the same for each person questioned. For example, if I say rate something between 1 and 5 where 1 is terrible and 5 is great, each person questioned has to divide the numbers between 1 and 5 and attach their own value to each. These will vary from person to person.

Even if I break down each of the numbers and attach a value myself, each person questioned will have a different idea of what each means. For example, if I say rate this from 1 to 5 where 1 is very bad, 2 is bad, 3 is neither good nor bad, 4 is good and 5 is very good, each person will have their own idea of what the difference between good and very good is. When dealing with numerical data, everyone knows the difference between 3 and 4 and it is the same for everyone.

For these reasons you cannot average ordinal data. Stop doing it!!

The easy way to remember it is not to let the numeric part fool you. Instead just use the word values you attached to the numbers. Now answer me this, what is the average of “Good” and “Very good”? It’s meaningless, right? Right! Well so is the average of 4 and 5 on an ordinal scale.
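If you can't average, what can you report? One common option (my suggestion here, not the only one) is the full distribution of answers, plus the median and the most frequent answer, all of which rely only on the order of the labels. A small sketch with made-up responses; `summarise` is a hypothetical helper:

```python
from collections import Counter

# The scale, in rank order. Only the ORDER is used, never the distance between labels.
SCALE = ["Very bad", "Bad", "Neither", "Good", "Very good"]


def summarise(responses):
    """Return (distribution, median label, most common label) for ordinal responses."""
    counts = Counter(responses)
    distribution = {label: counts.get(label, 0) for label in SCALE}
    ranked = sorted(responses, key=SCALE.index)
    median = ranked[(len(ranked) - 1) // 2]  # lower median, always a real label
    mode = max(SCALE, key=lambda label: counts.get(label, 0))
    return distribution, median, mode


# MADE-UP responses for illustration.
responses = ["Good", "Very good", "Bad", "Good", "Neither", "Very good", "Good"]

distribution, median, mode = summarise(responses)
print(distribution)
print(median, mode)  # Good Good
```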

Well that’s it for this post. If you are doing social media analysis, please make sure you don’t “sin against data science”, and until next time, keep crunching those numbers!

Posted in Data Science, Statistics | 2 Comments

UK General Election Day 30 Twitter Analysis

This is the analysis for Wednesday 29th April 2015.


There are two stand-out items on the chart today, the first being that #UKIP are in the number 3 spot, and the second is the continuing rise of #SNPOUT. The more the SNP refuse to rule out a second referendum, the stronger this hashtag grows. #VOTESNP vs #SNPOUT is now 4:1, down from 54:1 at the start of the campaign.


The thing that strikes you immediately is the staggering lead that the two SNP accounts have over everyone else; there seems to be little any of the other parties can do to halt the “Cult of the SNP”.




Labour continue to be the most talked about party, with the SNP second and the Tories third. To find out what was said, we’ll take a look at the trigrams.
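For anyone wondering what a trigram count involves, here is a minimal sketch: split each tweet into words, slide a three-word window along it, and count how often each window turns up. The tokenising rule and the example tweets are my own simplifications for illustration; a real pipeline would need more cleaning than this.

```python
import re
from collections import Counter


def trigrams(text):
    """Return every run of three consecutive words in the text, upper-cased."""
    words = re.findall(r"[a-z0-9#@']+", text.lower())
    return [" ".join(words[i:i + 3]).upper() for i in range(len(words) - 2)]


def top_trigrams(tweets, n=10):
    """Count trigrams across all tweets and return the n most common."""
    counts = Counter()
    for tweet in tweets:
        counts.update(trigrams(tweet))
    return counts.most_common(n)


# MADE-UP tweets for illustration.
tweets = [
    "Free school meals for all children",
    "Lib Dems pledge free school meals",
    "Tax rises law promised",
]

print(top_trigrams(tweets, 1))  # [('FREE SCHOOL MEALS', 2)]
```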


After polls showing that the SNP may win every seat in Scotland, the topic of conversation on the “CyberNats”-dominated stream is that the more seats they win, the stronger Scotland’s voice is.

Let’s see what else made the news yesterday:

The Conservatives promise a law guaranteeing no rise in income tax, national insurance or VAT before 2020, prompting “TAX RISES LAW” at 1,237.

But Labour say Tory plans would mean cuts to tax credits totalling £3.8 billion, bringing “CUTTING TAX CREDITS”, in at 1,118.

The Lib Dems pledge to offer free school meals to all children in England, pushing “FREE SCHOOL MEALS” into the number 35 spot.

Russell Brand released his interview with Ed Miliband. The Labour leader told the comedian he was wrong to say that voting was pointless, prompting “MILIBRAND INTERVIEW GE2015”, at 240.

UKIP leader Nigel Farage warned of an influx of Islamic extremists if Europe’s doors were opened to large numbers of people fleeing conflict zones, putting “NIGEL FARAGE WARNS” at 769.

Nick Clegg laid down another condition for considering a coalition – he wants a £12,500 personal tax free allowance, bringing “” to the 1,135 spot.

Well that’s all for this post, until next time, keep crunching those numbers!

Posted in Data Science, Politics, Social Media, Statistics | 1 Comment

UK General Election Day 29 Twitter Analysis

Welcome to the analysis for Tuesday 28th April 2015. Let’s jump straight in by looking at the hashtag chart:


As we enter the last stages of the election, all the parties are making their mark on the chart, with the exception of the Greens. The SNP are still well out in front; their expertise, honed on the independence referendum, shines in this campaign – on Twitter. The NHS also reappears on the chart, though it’s never really been far away, even when it’s not been present. The #SNPOUT tag is creeping up the chart too as the election progresses; the #VOTESNP vs #SNPOUT ratio is at 4.9:1, the lowest it’s been throughout the campaign.


Most of the usual suspects appear in today’s chart, with a couple of exceptions; those are @UTVELECTIONS which is an account that carried information about a TV debate in NI and @LIBBY_BROOKS, the Scotland correspondent for the Guardian Newspaper, who had a humorous exchange with Nicola Sturgeon, which the “CyberNats” seemed to enjoy.




Again Labour are the most talked about party, with the SNP second and the Tories third, whilst the number of people talking about UKIP challenges the number talking about the LibDems.


As you can see, the Twitter exchange between Libby Brooks and Nicola Sturgeon delighted the “CyberNats”, with them immediately announcing it the “best tweet of the election, possibly ever”. You can judge for yourself if there might be a little bit of hyperbole there:


Let’s see what else happened in the election news yesterday:

The Conservatives promise another 50,000 apprenticeships paid for by £200 million from Libor fines; this did not register with the Twitterati.

Labour announce a 10-point plan to reform the immigration system, prompting “GE2015 IMMIGRATION PLAN” at 822.

Lib Dems demand a stability budget within 50 days of the next government being formed as a red line for any post-election negotiations, pushing “STABILITY BUDGET WITHIN” into position 1,390.

The Green Party pledged to double child benefit to £40 a week, prompting “CHILD BENEFIT DOUBLE” at 1,411.

Well that’s it for this post, until next time, keep crunching those numbers!

Posted in Data Science, Social Media, Statistics, UKGeneralElection2015 | 1 Comment