Oscars 2015 Tweet Analysis – Part 1

Summary (tl;dr)

The main points of this post are: as data scientists we don’t always get data in the shape we want it, and when that happens we have to use the tools we have to make the most of it. Events on social media are transient and only hold an audience’s attention whilst they are running. Event brand managers should publish information on the acceleration down curve, whilst managers of brands attending events should publish information on the acceleration up curve. All the code for this post can be found on GitHub at: https://github.com/garyshort/oscars2015Part1

Introduction

The 2015 Oscars were held at the Dolby Theatre, Hollywood, on February 22nd 2015. Being a curious data scientist, I wanted to analyse the corresponding Twitter stream. To capture it, I used a low-powered Acer Aspire One ZG5 running the NodeJS build of Turnkey Linux.

I wrote the following service to capture Tweets tagged as #oscars or #oscars2015:

var util = require('util');
var twitter = require('twitter');

// Authenticate against the Twitter API
var twit = new twitter({
  consumer_key: 'YOUR KEY',
  consumer_secret: 'YOUR SECRET',
  access_token_key: 'YOUR TOKEN KEY',
  access_token_secret: 'YOUR TOKEN SECRET'
});

// Open a filtered stream and log each matching tweet as it arrives
twit.stream('statuses/filter', {track: '#oscars, #oscars2015'},
  function(stream) {
    stream.on('data', function(data) {
      console.log(data);
    });
});

And the following Upstart job to respawn the service if it was cut off from the Twitter end, and to pipe the output to a file on the filestore:

#start when we have a file system and network
start on (local-filesystems and net-device-up IFACE!=lo)

#stop when the system stops
stop on runlevel [016]

#restart if stopped
respawn

#start monitoring
exec node /home/gary/oscars2015/TwitterListener.js >> /home/gary/oscars2015/results.js

I ran the service from 20th February until the morning of 23rd February 2015, and captured some 3.2 million tweets.

Analysis of Tweets

Well, that’s what I meant to do… and if I’d done it right, I would have had 3.2 million tweets in a file, one tweet to a line. Only I didn’t; I messed it up. In an error that one of my friends calls being a “cut and paste tart”, instead of console.log(data) above, I copied from a service I had previously been debugging and ended up with console.log(util.inspect(data)). This meant that, instead of a nice file of tweets, one per line, what I had was a “pretty printed” view of each tweet object, spread over many lines. In other words, a big pile of text that couldn’t be made much sense of.

Ah well, no point crying over spilt milk. Since there was no chance of asking the Academy to rerun the Oscars, just so I could try to capture the tweets again, there was nothing else for it but to see what I could salvage.

One of the first things I wanted to do was to see how many tweets were posted. To do that, I needed to pull out the creation date of each tweet from my now useless text corpus. The data scientist’s best friend for that task is grep. The following command gave me a file with the creation timestamp of each tweet in the corpus:

grep -o "\(Fri\|Sat\|Sun\|Mon\) Feb \(20\|21\|22\|23\) [[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\} +0000 2015" results.js >> timestamps.txt

Each line of timestamps.txt now holds one timestamp in Twitter’s created_at format, for example: Sun Feb 22 19:00:00 +0000 2015. I could then pull this file down onto my dev box and have at it with C#. What I want to show is posts by day, so the first thing I’m going to do is build a list of DateTime objects from the file of Twitter timestamps:

// Requires System, System.Collections.Generic,
// System.Globalization and System.IO
private static List<DateTime> ProcessDateTimes()
{
    var dateTimes = new List<DateTime>();
    var path = @"your/path/here";
    using (var sr = File.OpenText(path))
    {
        while (!sr.EndOfStream)
        {
            // Each line is a Twitter created_at timestamp,
            // e.g. "Sun Feb 22 19:00:00 +0000 2015"
            var line = sr.ReadLine();
            var dt = DateTime.ParseExact(
                line,
                "ddd MMM dd HH:mm:ss zzzz yyyy",
                CultureInfo.InvariantCulture);
            dateTimes.Add(dt);
        }
    }
    return dateTimes;
}

With that done, I can group on the day and count the tweets:

private static string GetPostingDateCSV(List<DateTime> dateTimes)
{
    var csvString = "Day,NumberOfPostings\n";

    // Group the timestamps by day of the month and emit
    // one CSV row per day with the count of tweets
    dateTimes.GroupBy(dt => dt.Day)
        .ToList()
        .ForEach(group =>
        {
            csvString += group.Key.ToString() + ",";
            csvString += group.Count().ToString() + "\n";
        });
    return csvString;
}
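For completeness, here’s a minimal sketch of how those two methods might be wired together; the Main method and the output path are my assumptions, not part of the original code:

// Hypothetical driver: parse the timestamps, then write the
// day-by-day CSV somewhere Excel can pick it up
public static void Main()
{
    var dateTimes = ProcessDateTimes();
    var csv = GetPostingDateCSV(dateTimes);
    File.WriteAllText(@"your/path/postsByDay.csv", csv);
}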

Bang that output text into Excel, and I can make a pretty graph showing the posts by day:

[Figure: number of tweets posted per day]

The first thing to remember when looking at this graph is that we don’t get all the tweets published; we only get a statistically representative sample, as provided by Twitter. What does this mean? Well, it means that the things we can say about our model (the Tweets we do have) will be true of the population as a whole (all the Tweets posted over the same time period), but specific counts will not be correct. So, for example, it’s not true to say that 2.2 million Tweets were posted on the 23rd (that’s just 68% of our 3.2 million captured Tweets), but it is correct to say that 68% of all Tweets were published on the 23rd; got it?

Wait a minute, the 23rd I hear you say? But the Oscars were held on the 22nd? True, and that leads us to the second thing we have to remember. The Oscars were held on the 22nd, Pacific Standard Time, whilst Tweets are timestamped in Greenwich Mean Time (or UTC if you want to be more correct; but I’m a Brit, so I don’t). This means that our results are skewed by 8 hours. I don’t intend to do anything about that for the purposes of this analysis, but if you were paying me to do this, and you cared about it, I would correct for it.
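For what it’s worth, a minimal sketch of that correction, assuming the DateTimes from ProcessDateTimes are in UTC; the helper name is mine, not from the original code:

// Hypothetical helper: shift UTC timestamps to Pacific time so the
// hourly buckets line up with the ceremony’s local schedule
private static List<DateTime> ToPacificTime(List<DateTime> utcTimes)
{
    var pacific = TimeZoneInfo.FindSystemTimeZoneById("Pacific Standard Time");
    return utcTimes
        .Select(dt => TimeZoneInfo.ConvertTimeFromUtc(
            DateTime.SpecifyKind(dt, DateTimeKind.Utc), pacific))
        .ToList();
}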

Of course, another thing we have to think about is the fact that I didn’t run the capture service for the full 24 hours of each of the represented days: I started it on Friday and stopped it on Monday morning (GMT). So we should also look at the number of posts per day when normalised over the number of hours that Tweets were captured for. To do this, we amend our C# code to group by day and by hour:

private static string GetPostsNorm(List<DateTime> dateTimes)
{
    var csvString = "DayHour,NumberOfPostings\n";

    // Group by day AND hour, so each bucket is a single
    // captured hour; rows are keyed as "day~hour"
    dateTimes.GroupBy(dt => new { dt.Day, dt.Hour })
        .ToList()
        .ForEach(group =>
        {
            csvString +=
                group.Key.Day.ToString()
                + "~"
                + group.Key.Hour.ToString()
                + ",";

            csvString += group.Count().ToString() + "\n";
        });
    return csvString;
}
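As an aside, if you wanted a single normalised figure per day, rather than the raw hourly buckets, a hypothetical helper along these lines would do it (the method name and the rounding are my assumptions):

// Sketch: average posts per captured hour, for each day;
// divides each day’s total by the number of distinct hours
// in which tweets were actually captured that day
private static string GetPostsPerCapturedHour(List<DateTime> dateTimes)
{
    var csvString = "Day,PostsPerCapturedHour\n";
    dateTimes.GroupBy(dt => dt.Day)
        .ToList()
        .ForEach(group =>
        {
            var hoursCaptured = group.Select(dt => dt.Hour).Distinct().Count();
            var average = group.Count() / (double)hoursCaptured;
            csvString += group.Key + "," + average.ToString("F0") + "\n";
        });
    return csvString;
}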

Then we can graph the day-by-hour output:

[Figure: number of tweets posted, by day and hour]

This brings even more into focus the idea that the “main event” is where all the action is, and everything before and after is just a sideshow. Worth bearing in mind if you are responsible for a brand’s social media output.

Let’s emphasise that by taking a look at a frequency distribution of the posts by hour by day:

[Figure: frequency distribution of posts by hour, by day]

As you can see from this graph, there are hardly any on-topic posts in the run-up to the “main event”, and posts tail off very quickly thereafter (the difference is around 300,000 posts per hour). In other words, if this were a brand you were managing, you would have a very limited window in which to connect with your audience before they move on somewhere else.

Lastly, let’s look at the acceleration of the Tweets for the 20 hours up to 10:00hrs (GMT) on February 23rd. Acceleration is just the first derivative of the velocity (posts per hour) curve above:

[Figure: acceleration of posts per hour]
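On discrete hourly data, that derivative is just the difference between successive counts. A minimal sketch, assuming an hourly count list ordered by time (the helper is hypothetical, not from the original code):

// Sketch: acceleration as the first difference of hourly counts;
// positive values mean interest is waxing, negative means waning
private static List<int> GetAcceleration(List<int> hourlyCounts)
{
    var acceleration = new List<int>();
    for (var i = 1; i < hourlyCounts.Count; i++)
    {
        acceleration.Add(hourlyCounts[i] - hourlyCounts[i - 1]);
    }
    return acceleration;
}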

As acceleration and deceleration of posting can be viewed as waxing and waning interest in the topic, this further illustrates that interest spikes quickly as the event takes place and then plummets after it finishes. If you are responsible for the social media of the event’s own brand, then everything below the X axis represents falling interest in your brand, and marks the times when you should be injecting information in order to rekindle the interest of your audience.

If you are managing the brand of an event attendee, say a vendor at an expo, then the spikes above the X axis represent occasions when you should inject your brand message in order to get the maximum attention possible.

Lessons Learned

From this analysis, we’ve learned that – as data scientists – we don’t always get data in the shape we want it, and it’s up to us to use the tools at our disposal to make the most of the data we are given; remembering, at all times, to stay within the realms of what is statistically truthful, so that what we say about our model can be projected onto the population as a whole.

We also learned that, when it comes to events, there is a very narrow window of opportunity for brand managers to push their message to the optimum audience; but by analysing post acceleration (waxing and waning interest in the topic), brand managers can inject information in order to prolong audience interest for as long as possible.

Managers of brands attending events should look to post their messages at times of increasing interest (when the acceleration graph is above the X axis) for maximum exposure.

Look Ahead

Well, that’s all for this post, where we looked at how many people were posting and at what hours; next time we’ll look at where those people were, and how interest moved across the globe from the event’s location in LA. Until next time, keep crunching those numbers!
