Oscars 2015 Tweet Analysis – Part 1

Summary (tl;dr)

The main points of this post are as follows. As data scientists we don’t always get data in the shape we want it; when that happens we have to use the tools we have to make the most of it. Events on social media are transient and only hold an audience’s attention whilst they are running. Managers of an event’s own brand should publish information when posting is decelerating (the downward part of the acceleration curve), whilst managers of brands attending an event should publish when posting is accelerating (the upward part). All the code for this post can be found on Github at: https://github.com/garyshort/oscars2015Part1


The 2015 Oscars were held at the Dolby Theatre, Hollywood, on February 22nd 2015. Being a curious data scientist, I wanted to analyse the corresponding Twitter stream. To capture it, I used a low-powered Acer Aspire One ZG5, running the NodeJS build of Turnkey Linux.

I wrote the following service to capture Tweets tagged as #oscars or #oscars2015:

var util = require('util');
var twitter = require('twitter');
var twit = new twitter({
  consumer_key: 'YOUR KEY',
  consumer_secret: 'YOUR SECRET',
  access_token_key: 'YOUR TOKEN KEY',
  access_token_secret: 'YOUR TOKEN SECRET'
});

// Track both hashtags and write each tweet to stdout as it arrives.
twit.stream('statuses/filter', {track: '#oscars, #oscars2015'},
  function(stream) {
    stream.on('data', function(data) {
      console.log(data);
    });
  });

And the following upstart command to respawn the service, if it was cut off from the Twitter end, and to pipe the output to a file on the filestore:

#start when we have a filesystem and network
start on (local-filesystem and net-device-up IFACE!=lo)

#stop when the system stops
stop on runlevel [016]

#restart if stopped
respawn

#start monitoring
exec node /home/gary/oscars2015/TwitterListener.js >>

I ran the service from 20th February, until the morning of 23rd February 2015, and captured some 3.2 million tweets.

Analysis of Tweets

Well that’s what I meant to do… and if I’d done it right, I would have had 3.2 million tweets, in a file, one tweet to a line. Only I didn’t, I messed it up. In an error that one of my friends calls “cut and paste tart”, instead of console.log(data) above, I copied from a service I had been previously debugging and ended up with console.log(util.inspect(data)). Meaning that, instead of a nice file of tweets, one per line, what I had was a “pretty printed” view of what the tweets looked like compiled. In other words, a big pile of text that couldn’t be made much sense of.
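
To make the difference concrete, here is a minimal sketch of what each approach writes out. Strictly speaking, console.log formats an object in much the same way as util.inspect does, so the sketch uses JSON.stringify, which is one dependable way to get exactly one parseable record per line. The tweet object below is a made-up, heavily trimmed stand-in, and toLine is a hypothetical helper, not part of the original service.

```javascript
var util = require('util');

// Hypothetical, heavily trimmed stand-in for a tweet object from the stream.
var tweet = {
  created_at: 'Sun Feb 22 23:59:59 +0000 2015',
  text: 'And the winner is...\n#oscars',
  user: { screen_name: 'example', entities: { url: { urls: ['http://example.com'] } } }
};

// Serialise one tweet as a single line of JSON (newlines in the text are escaped).
function toLine(data) {
  return JSON.stringify(data);
}

var line = toLine(tweet);            // one line, round-trips through JSON.parse
var inspected = util.inspect(tweet); // human-readable dump, deeply nested values elided

console.log(line);
console.log(inspected);
```

The stringified line can be read back with JSON.parse, whereas the inspected form is meant for human eyes: it is not valid JSON, and with the default depth it replaces deeply nested values with a placeholder.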

Ah well, no point crying over spilt milk. Since there was no chance of asking the Academy to rerun the Oscars, just so I could try to capture the tweets again, there was nothing else for it but to see what I could salvage.

One of the first things I wanted to do was to see how many tweets were posted. To do that I needed to pull out the created date for each tweet from my, now useless, text corpus. The data scientist’s best friend for doing that task is grep. The following lines of code gave me a file with the creation timestamp for each tweet in the corpus:

grep -o "\(Fri\|Sat\|Sun\|Mon\) Feb \(20\|21\|22\|23\) [[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\} +0000 2015" results.js >> timestamps.txt
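
If you don’t have grep to hand, the same pattern can be sketched in JavaScript. Note that the + before 0000 is a literal character in grep’s basic regular expressions, but has to be escaped in a JavaScript regex. The extractTimestamps helper and the sample text are made up for illustration.

```javascript
// Same pattern as the grep command; "+" must be escaped in a JavaScript regex.
var TIMESTAMP =
  /(Fri|Sat|Sun|Mon) Feb (20|21|22|23) \d{2}:\d{2}:\d{2} \+0000 2015/g;

// Return every timestamp found in a blob of text, in order of appearance.
function extractTimestamps(text) {
  return text.match(TIMESTAMP) || [];
}

// Made-up fragment in the style of util.inspect output.
var sample =
  "{ created_at: 'Sun Feb 22 23:59:59 +0000 2015',\n" +
  "  text: 'Go #oscars' }\n" +
  "{ created_at: 'Mon Feb 23 00:00:01 +0000 2015',\n" +
  "  text: 'Wow #oscars2015' }\n";

var found = extractTimestamps(sample);
console.log(found.join('\n'));
```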

I could then pull this file down onto my dev box and have at it with C#. What I want to show is posts by day. So first thing I’m going to do is to build a list of DateTime objects, from the file of Twitter timestamps:

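// Requires System.Globalization and System.IO (plus System.Collections.Generic).
private static List<DateTime> ProcessDateTimes() {
    var dateTimes = new List<DateTime>();
    var path = @"your/path/here";
    using (var sr = File.OpenText(path)) {
        while (!sr.EndOfStream) {
            var line = sr.ReadLine();
            var dt = DateTime.ParseExact(
                line,
                "ddd MMM dd HH:mm:ss zzzz yyyy",
                CultureInfo.InvariantCulture,
                // Keep the value in UTC rather than converting to local time.
                DateTimeStyles.AdjustToUniversal);
            dateTimes.Add(dt);
        }
    }
    return dateTimes;
}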

With that done, I can group on the day and count the tweets:

private static string GetPostingDateCSV(List<DateTime> dateTimes) {
    var csvString = "Day,NumberOfPostings\n";
    // One CSV row per day of the month: day,count
    dateTimes.GroupBy(dt => dt.Day)
        .ToList()
        .ForEach(group => {
            csvString += group.Key.ToString() + ",";
            csvString += group.Count().ToString() + "\n";
        });
    return csvString;
}

Bang that output text into Excel, and I can make a pretty graph showing the posts by day:


The first thing to remember when looking at this graph is that we don’t get all the tweets published; we only get a statistically relevant sample, as provided by Twitter. What does this mean? Well, it means that things we can say about our model (the Tweets we do have) will be true of the population as a whole (all the Tweets that were posted over the same time period), but specific counts will not be correct. So, for example, it’s not true to say that 2.2 million Tweets were posted on the 23rd, but it is correct to say that 68% of Tweets were published on the 23rd; got it?
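
The arithmetic behind that percentage is just each day’s count divided by the total in the sample. A rough sketch follows. The split below is a placeholder built only from the rounded totals quoted in this post (2.2 million out of 3.2 million), not the real per-day counts, and shareByDay is a hypothetical helper.

```javascript
// Turn raw per-day counts into percentage shares of the whole sample.
function shareByDay(counts) {
  var days = Object.keys(counts);
  var total = days.reduce(function (sum, day) {
    return sum + counts[day];
  }, 0);
  var shares = {};
  days.forEach(function (day) {
    shares[day] = 100 * counts[day] / total;
  });
  return shares;
}

// Placeholder figures: only the rounded totals quoted in the text are used.
var counts = { '23rd': 2200000, 'all other days': 1000000 };
var shares = shareByDay(counts);
console.log(shares['23rd'].toFixed(2) + '%');
```

On those rounded figures the share for the 23rd comes out a little under 69%, in line with the figure quoted above. The share carries over to the population, while the raw counts do not.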

Wait a minute, the 23rd I hear you say? But the Oscars were held on the 22nd? True, and that leads us to the second thing we have to remember. The Oscars were held on the 22nd, Pacific Standard Time, whilst Tweets are recorded in Greenwich Mean Time (or UTC if you want to be more correct; but I’m a Brit, so I don’t. :-P). This means that our results are going to be skewed by 8 hours. I don’t intend to do anything about that for the purposes of this analysis, but if you were paying me to do this, and you cared about it, I would correct for that.
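
Should you want to make that correction, one way is to parse each timestamp as UTC, subtract eight hours, and group on the shifted day instead. This is only a sketch, assuming a fixed UTC-8 offset is good enough (PST was in force in late February). parseTwitterTimestamp and toPacificDay are hypothetical helper names.

```javascript
var MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

// Parse "Sun Feb 22 23:59:59 +0000 2015" into a Date; only +0000 is handled.
function parseTwitterTimestamp(stamp) {
  var m = /^\w{3} (\w{3}) (\d{2}) (\d{2}):(\d{2}):(\d{2}) \+0000 (\d{4})$/.exec(stamp);
  var month = m ? MONTHS.indexOf(m[1]) : -1;
  if (month < 0) { throw new Error('Unrecognised timestamp: ' + stamp); }
  return new Date(Date.UTC(+m[6], month, +m[2], +m[3], +m[4], +m[5]));
}

var PST_OFFSET_HOURS = -8; // assumption: fixed offset, no daylight saving

// Day of the month in Pacific Standard Time for a UTC timestamp string.
function toPacificDay(stamp) {
  var utc = parseTwitterTimestamp(stamp);
  var shifted = new Date(utc.getTime() + PST_OFFSET_HOURS * 3600 * 1000);
  return shifted.getUTCDate();
}

console.log(toPacificDay('Mon Feb 23 03:30:00 +0000 2015'));
console.log(toPacificDay('Mon Feb 23 09:00:00 +0000 2015'));
```

A tweet stamped 03:30 GMT on the 23rd lands on the 22nd once shifted, which is where it belongs from the point of view of the ceremony.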

Of course, another thing we have to think about, is the fact that I didn’t run the capture service for the full 24 hours in each of the represented days. I started it Friday and stopped it Monday morning (GMT). So we should also look to see what the number of posts per day is, when normalised over the number of hours that Tweets were captured for. To do this, we have to amend our C# code to group by day and by hour:

private static string GetPostsNorm(List<DateTime> dateTimes) {
    var csvString = "Day,NumberOfPostings\n";
    // One CSV row per captured hour, keyed as day~hour
    dateTimes.GroupBy(dt => new { dt.Day, dt.Hour })
        .ToList()
        .ForEach(group => {
            csvString +=
                group.Key.Day.ToString()
                + "~"
                + group.Key.Hour.ToString()
                + ",";

            csvString += group.Count().ToString() + "\n";
        });
    return csvString;
}
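
The grouping above gives a count per captured hour; the normalisation itself is then each day’s total divided by the number of hours in which tweets were actually captured on that day. A minimal sketch of that step, with made-up counts and a hypothetical postsPerCapturedHour helper:

```javascript
// hourly: array of { day, hour, count }, one entry per captured hour.
// Returns the average posts per captured hour for each day.
function postsPerCapturedHour(hourly) {
  var totals = {};
  hourly.forEach(function (row) {
    var t = totals[row.day] || (totals[row.day] = { posts: 0, hours: 0 });
    t.posts += row.count;
    t.hours += 1;
  });
  var result = {};
  Object.keys(totals).forEach(function (day) {
    result[day] = totals[day].posts / totals[day].hours;
  });
  return result;
}

// Made-up numbers, purely to show the shape of the calculation.
var hourly = [
  { day: 22, hour: 22, count: 100 },
  { day: 22, hour: 23, count: 300 },
  { day: 23, hour: 0, count: 900 },
  { day: 23, hour: 1, count: 1200 },
  { day: 23, hour: 2, count: 600 }
];
var normalised = postsPerCapturedHour(hourly);
console.log(normalised);
```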

Then we can graph that:


This brings even more into focus the idea that the “main event” is where all the action is, and everything before, and after, is just a sideshow. Worth bearing in mind if you are responsible for a brand’s social media output.

Let’s emphasise that by taking a look at a frequency distribution of the posts by hour by day:


As you can see from this graph, there are hardly any posts, on topic, in the run up to the “main event” and posts tail off very quickly thereafter, (the difference is around 300,000 posts per hour). In other words, if this were a brand you were managing, you have a very limited window in which you can connect with your audience, before they move on somewhere else.

Lastly, let’s look at the acceleration of the Tweets for 20 hours up to 10:00hrs (GMT) on February 23rd. Acceleration is just the first derivative of the velocity (posts per hour) graph above:


As acceleration and deceleration of posting can be viewed as waxing and waning interest in the topic, this further illustrates that interest in the topic spikes quickly as the event takes place and then plummets after the event finishes. If you are responsible for the social media of the brand of the event, then everything below the X axis represents falling interest in your brand, and demonstrates times when you should be injecting information in order to rekindle the interest of your audience.

If you are managing the brand of an event attendee, say a vendor at an expo, then the spikes above the X axis represent occasions when you should inject your brand message in order to get the maximum attention possible.
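
For reference, with hourly data the acceleration series is simply the change in posts per hour from one hour to the next. The sketch below uses made-up counts, not the captured data, and the acceleration function is a hypothetical helper.

```javascript
// velocity: posts per hour, in time order.
// Returns the hour-to-hour change (a discrete first derivative).
function acceleration(velocity) {
  var result = [];
  for (var i = 1; i < velocity.length; i++) {
    result.push(velocity[i] - velocity[i - 1]);
  }
  return result;
}

// Made-up hourly counts: a quick build-up, a peak, then a sharp drop.
var velocity = [100, 400, 1500, 1400, 300, 50];
var accel = acceleration(velocity);

// Positive values sit above the X axis (rising interest), negative ones below.
var rising = accel.filter(function (a) { return a > 0; }).length;
console.log(accel, rising);
```

Hours with a positive value are the ones an attending brand would target; hours with a negative value are where the event’s own brand would want to inject something new.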

Lessons Learned

From this analysis, we’ve learned that, as data scientists, we don’t always get data in the shape we want it, and it’s up to us to use the tools at our disposal to make the most of the data we are given; remembering, at all times, to stay within the realms of what is statistically truthful, so that what we say about our model can be projected onto the population as a whole.

We also learned, that when it comes to events, there is a very narrow window of opportunity for brand managers to push their message to the optimum audience; but by analysing post acceleration (waxing/waning interest in the topic), brand managers can inject information in order to prolong audience interest for as long as possible.

Brand managers, of brands attending events, should look to post their messages at times of increasing interest (times when the acceleration graph is above the X axis) for maximum exposure at events.

Look Ahead

Well that’s all for this post, where we looked at how many people were posting and at what hours; next time we’ll look at where those people were and how interest moved across the globe, from the event’s location in LA. Until next time, keep crunching those numbers! :-)

Installing HDInsight Emulator on Your Local Machine

In this video I demonstrate how to install the HDInsight Emulator on your local machine, via the Web Platform Installer, and how to run the wordcount example to test it’s installed properly.

Call with a recruiter…

You have to admire his honesty, if nothing else. :-)

Him: Hi, it’s Ben Dover here, this is just a courtesy call to find out a little about your skill set, so that I can help you get your next role.

Me: Cool, where’d you get my number?

Him: (Laughs) Oh don’t worry, it’s nothing sinister, it’s right here on your CV.

Me: So you have my CV in front of you?

Him: Sure do!

Me: And you’re phoning to find out about my skills?

Him: Sure am!

Me: Uh huh, so right under where it says “Contact Details”, where you got my phone number from, there’s a section titled “Skills”, that section pretty much details my, well… skills.

Him: Ah, yeah, no, totally read that, just it didn’t really mean that much to me so thought I’d give you a call.

Me: As a courtesy..?

Him: Riiiight.

Me: And you want to find me my next role?

Him: Yes, you’ve got it! That’s why I’m calling, as a courtesy…

Me: And terms like “Hadoop”, “Pig”, “hive” and “Python” don’t mean anything to you?

Him: Not a thing, that’s why I thought I’d call you.

Me: Yeah, I’m not sure you’re going to be able to help me, but thanks for your… courtesy.

Sorry if You Disagree, but…

… the UK is a ‘Christian Country’. Why do I mention this, you may well ask? Well, recently, David Cameron wrote an article for the Church Times, where he stated:

“I believe we should be more confident about our status as a Christian country, more ambitious about expanding the role of faith-based organisations, and, frankly, more evangelical about a faith that compels us to get out there and make a difference to people’s lives.”

Nothing too controversial there you might imagine; however, defining the UK as a ‘Christian country’ seems to have angered a section of the population, including a group who published an open letter in the Telegraph Newspaper.

Well I don’t really have a lot of time for either politicians or organised religion but, on this occasion, Cameron is correct: we are a ‘Christian country’. The reason I say this is twofold. Firstly, the Church of England is established in this country, and 26 of its bishops sit in the House of Lords, helping to pass the laws of the land.

Secondly, the majority of people, the majority of the time, self-identify as Christian.

Now you may not like these facts, but facts they are and they cannot be denied simply by putting your fingers in your ears and shouting “No we’re not! No we’re not! No we’re not!”

Speaking at DDD South West

I’m delighted to say I’ve been accepted to speak at DDD South West, where I’ll be delivering my talk on “Hadoop and Big Data for Microsoft Developers”. I hope I’ll see you there; if so, don’t forget to stop and say “hi!”.

The Day HDInsight Died (Again)

So here I am this morning, all ready to get my Data Science Fu on, I pop onto my local HDInsight installation, and it tells me:

Hadoop Not Running

Hmm, time to see if the services are running:

Hadoop Services Not Running

Nope, all stopped. Sigh, right, start them. When I try to do that, I get an error that tells me that there’s been a “log on failure”. A what now? Same log on worked yesterday. Hmm, worked yesterday but not today, I wonder if HDInsight installs a default user with the password set to expire?

Hadoop User Must Change Password Next Logon

Yes it does! I’m not sure if HDInsight installs this way, or if some policy applied on my work machine forces this, but unchecking the “User must change password at next logon” and checking “Password never expires”, should solve this problem.

Services Started

Services are up and running again…

Hadoop Is Working

And HDInsight is back!

So, if your local instance of HDInsight stops working, then check that the hadoop user password hasn’t expired.

How Alex Salmond Answers The Hard Questions on Scottish Independence

As you’ll know, at least hopefully you will, should you vote yes on September 18th, in the Independence Referendum, you won’t be voting for independence per se; what you’ll be voting for is to give Alex Salmond the right to negotiate the terms under which Scotland will eventually become independent.

Alex has set out his position on each of the major areas under negotiation; this is the right and proper thing to do, and it is to be welcomed. However, what he hasn’t done is to set out a “Plan B” if you will, or inform us what his “Red Lines” would be.

What do I mean by that? Well let’s look at the need for a “Plan B”. First, let’s take the issue of Europe. An independent Scotland will have to negotiate entry into the EU. Alex’s position is that we’ll be accepted and the process will be quick. Recently the EU President, a man in a position to have some idea about what it takes to expand the EU membership, said that, at best, the negotiations would be long and complicated.

Given that, when questioned on what would happen if Scotland were refused entry, or even what would happen to our trade if EU membership took, say, 5 years to complete; Alex should be able to say, “Well if my negotiating position isn’t entirely successful, or if I lose completely, an Independent Scotland would…”. Instead, he waves his hands and claims all will be well, much like the video.

Having dealt with that, let’s look at what I mean by “Red Lines”. To do that, let’s look at the situation with the currency. First Alex claimed that the pound was a “millstone” around the necks of the Scottish people, and he couldn’t wait to rid us of it. Once focus groups told him that ditching the pound was a barrier to people voting yes, that idea was immediately scrapped and we were told we’d be keeping the pound. Notice, as with all these things, Alex never says, this is my negotiating position, he just asserts that, “we will…”, like it’s a foregone conclusion.

However, the other political parties have stated that you can’t have a currency union without a political union, and since the SNP are not interested in the latter, an independent Scotland can’t have the former. By the way, despite what the SNP say, this is not the other parties bullying us Scots; it’s a principle followed by the EU too: if you want to be part of the Eurozone, then you have to be part of the political union. Since we know Alex wanted to join the Eurozone not that long ago, we know he’s not against a political union per se, just one with the rest of the UK, it seems.

So, Salmond’s position is we’ll have a currency union, the remainder of the UK’s position is we won’t, so what’s Salmond’s alternative, should his negotiating position fail? There isn’t one, it’s more hand waving a la the video and more assertion of “There will be a currency union”. That brings me back to the idea of a “Red Line”, if we can’t be in a currency union with the UK, and we can’t join the Euro, even supposing we wanted to, immediately, would that constitute a “Red Line”, a situation whereby Salmond would say, “to carry on under these circumstances would be so damaging to Scotland that we won’t continue with Independence”? I fear not, in fact my fear is that there are no “Red Lines”, my fear is that Salmond wants independence at any cost, and since he won’t tell us what, if anything, his “Red Lines” are, I fear I’m right.

So, before I go, just remember this. If you are voting yes in September, you are betting that Salmond wins every argument, and the result of every negotiation is that he gets his own way. When was the last time any politician achieved that? Hell, when was the last day you won every argument you took part in? The sad truth, that no one seems to be waking up to, is that if you vote yes in September, you have no idea what you’ll be getting, so be careful what you wish for.

I’m voting no, only because “hell no!” isn’t an option.

