Oscars 2015 Tweet Analysis – Part 2: Locations

Summary (tl;dr)

Looking at the location details from the tweets, we learned that New York and Los Angeles were the most prolific posting locations (accounting for more than 200K posts between them), and that the vast majority of locations had fewer than 100 posts. We also learned that the PowerMap add-in for Excel is an excellent tool for displaying geo-coded information. All of the code, and some of the data, from this post can be downloaded from my GitHub repo here: https://github.com/garyshort/oscars2015Part2

Introduction

In the first part of this analysis, we looked at how many people were posting and when; in this next part we’ll look at where they were posting from.

Twitter geolocates a tweet in a number of ways. Firstly, the user can "self-report" by stating where they are located; this information is contained in the Location property on the User object. The other way is to have your location appended to your Tweet by your client – if it supports it – and those properties live on the Tweet object: Place, Coordinates and Geo (deprecated). For the purposes of this analysis, we'll use the Location property. Each method has its flaws, but at least with this one we are using information that the User has given us freely as their stated location, and we're not going all "creepy stalker" on them by looking at the tweet location, which may have been appended to the Tweet by the client app without the User realising that function was enabled. (Go and check whether your client leaks your location with each Tweet without your knowledge; I'll wait here 'til you get back. :-))
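(As an aside, if you had a tweet as clean JSON – which, as you'll see shortly, I don't – pulling out the two kinds of location data is trivial. The snippet below is purely an illustrative sketch using Json.NET; it isn't part of the pipeline used in this post, and the field names simply mirror Twitter's user.location, place and coordinates properties.)

using System;
using Newtonsoft.Json.Linq;

class LocationFieldsSketch
{
    static void Main()
    {
        // A hypothetical, heavily trimmed tweet; only the
        // location-related fields are shown.
        var json = @"{
            ""user"": { ""location"": ""Los Angeles, CA"" },
            ""place"": null,
            ""coordinates"": null
        }";

        var tweet = JObject.Parse(json);

        // The user's self-reported location (what this analysis uses)...
        var selfReported = (string)tweet["user"]["location"];

        // ...versus the client-appended location, if the client added one.
        var place = tweet["place"];

        Console.WriteLine("Self-reported location: " + selfReported);
        Console.WriteLine("Client-appended place: " +
            (place.Type == JTokenType.Null ? "(none)" : place.ToString()));
    }
}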

Analysis

You’ll recall from the last post that I messed up the collection of the data; if you’ve forgotten, go back and read how I screwed up. This means I’ll have to pull the information from the file using sed. I’m going to do this in two parts; firstly, I’ll pull out the location information:

sed -n "s/.*location: \x27\(.*\)\x27\,/\1/p" results.js > locations.txt

Although I’ve searched for the line containing the location information, I’ve set up a capture group (the part in the escaped parens) to capture only the actual location, and it’s that captured text – referenced as “\1” in the replacement – that I write to the output file locations.txt.

Now, because not every user opts to specify a location, the second thing I’m going to do is to remove any blank lines from my locations file:

sed '/^$/d' locations.txt > locsNoBlanks.txt
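(If you’d rather stay in C# for this step, a rough equivalent of the two sed commands might look like the sketch below; it assumes the dump really does contain lines of the form location: 'some place',.)

using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class ExtractLocations
{
    static void Main()
    {
        // Same capture group idea as the sed command: keep only the
        // text between the quotes after "location:".
        var locationPattern = new Regex(@"location: '(.*)',");

        var locations = File.ReadLines("results.js")
            .Select(line => locationPattern.Match(line))
            .Where(match => match.Success)
            .Select(match => match.Groups[1].Value)
            .Where(location => location.Length > 0); // drop the blanks too

        File.WriteAllLines("locsNoBlanks.txt", locations);
    }
}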

Having done that, we can use some C# code to give us the top 10 locations by posting frequency:

private static string GetTop10Locations(List<string> data)
{
    // Group identical location strings, order by posting frequency,
    // then emit the top 10 as "location~count" lines.
    return data
        .GroupBy(location => location)
        .OrderByDescending(group => group.Count())
        .Take(10)
        .Aggregate(
            string.Empty,
            (acc, group) =>
                acc + String.Format(
                    "{0}~{1}\n",
                    group.Key,
                    group.Count()));
}
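To call it, we just read the cleaned file back in (this little snippet assumes the method above is in scope and that locsNoBlanks.txt is in the working directory):

// Assumes the GetTop10Locations method above is in scope.
var locations = File.ReadAllLines("locsNoBlanks.txt").ToList();
Console.Write(GetTop10Locations(locations));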

Take that output string, load it into Excel, and we can build the following graph:

[Chart: top 10 locations by number of posts]

As you can see, New York and California dominate the postings, which has a “truthy” feel to it, as California is the home of the movie industry and the East Coast is home to a lot of the print and broadcast media.

We can use similar code to reverse that ordering and take a look at the bottom posting locations. I’ve removed all the locations that have only one posting, as investigation shows those are mainly locations such as “My Bedroom”, “Here on Planet Earth”, etc.:

private static string GetBottom10Locations(List<string> data)
{
    // As above, but ordered ascending, with the single-post
    // locations ("My Bedroom", etc.) filtered out.
    return data
        .GroupBy(location => location)
        .OrderBy(group => group.Count())
        .Where(group => group.Count() > 1)
        .Take(10)
        .Aggregate(
            string.Empty,
            (acc, group) =>
                acc + String.Format(
                    "{0}~{1}\n",
                    group.Key,
                    group.Count()));
}

Loaded into Excel, this gives us the following graph:

[Chart: bottom 10 locations (with more than one post) by number of posts]

Now those graphs got me wondering about what the frequency distribution of the number of posts per location would look like. To graph that, first we need to grab all the locations:

private static void GetAllLocations(List<string> data)
{
    string path = @"C:\Users\Gary\Documents\BlogPostExamples\"
        + @"Oscars2015\Part2\locations.data";

    // Write every location and its post count to disk, rather than
    // building one big string in memory.
    using (StreamWriter sw = File.AppendText(path))
    {
        data
            .AsParallel()
            .GroupBy(location => location)
            .OrderByDescending(group => group.Count())
            .ToList()
            .ForEach(group => sw.Write(String.Format(
                "{0}\u00B1{1}\n",
                group.Key,
                group.Count())));
    }
}

It’s a bit of a beefy task, so I’ve asked C# to do the work in parallel, using all four cores on my i7, and to write the data to a file this time, instead of building the output in memory.

Loading the data into Excel, we can define a number of “bins”, or ranges, to fit our data to:

[Table: histogram bin ranges defined in Excel]

Then we use the Histogram function in the Analysis ToolPak to create our frequency distribution. We can take that data and graph it; as there’s a large disparity between the frequencies of the bins, I’ll use a log scale on the Y axis for clarity:

[Chart: frequency distribution of posts per location (log scale on the Y axis)]

As you can see, the vast majority of locations have 100 posts or fewer, with just 2 locations having between 50K and 100K posts; those were “New York, NY” and “Los Angeles, CA”.
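(As an aside, if you’d rather skip the Analysis ToolPak, the same binning can be sketched in a few lines of C#; the bin edges below are illustrative rather than the exact ones I used in Excel.)

// A rough, code-only stand-in for the Excel Histogram step.
// Bin edges here are illustrative, not the exact ones used above.
private static void PrintLocationHistogram(List<string> data)
{
    var postsPerLocation = data
        .GroupBy(location => location)
        .Select(group => group.Count())
        .ToList();

    var binEdges = new[] { 10, 100, 1000, 10000, 50000, 100000 };
    var previousEdge = 0;
    foreach (var edge in binEdges)
    {
        var count = postsPerLocation
            .Count(posts => posts > previousEdge && posts <= edge);
        Console.WriteLine("{0}-{1}: {2}", previousEdge + 1, edge, count);
        previousEdge = edge;
    }
}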

These numbers tell us a lot, but they are hard to relate to the locations themselves. Luckily, Excel now has an add-in called PowerMap, a nice utility for displaying geo-coded data on a map, which really helps us see the data in context:

[Map: heat map of North American locations with 1,000 or more posts]

For this visualisation I’ve taken all the locations that have 1,000 or more posts and displayed them; the map above shows those in North America. You can see our two hot spots in New York and Los Angeles.

Now, whilst that is more useful than the bare numbers, it doesn’t tell us about the various posts that were categorised together as being in “New York”, say; to do that we have to tell PowerMap to treat the location as a category attribute. When we do that, we can get a good idea of how many different locations were categorised as “New York” by the application:

[Chart: stacked bars showing the individual locations grouped under each parent category]

In this graphic it’s easy to see the number of disparate locations that were categorised into each “Parent”. For example “NY,NY”, “New York, NY” and “NYC” were all categorised as being “New York”.

Summary

I think we’ll leave it there for this post. Today we learned that there were two major locations interested in the Oscars: “New York” and “Los Angeles”; we learned that the vast majority of locations recorded have fewer than 100 posts originating from them; and we learned that the PowerMap add-in for Excel is an excellent tool for visualising geo-coded data.

Looking Ahead

In the next post, we’ll use some computational linguistics techniques to establish what were the key pieces of information being exchanged in the tweets sent. Until then, keep crunching those numbers! :-)


Oscars 2015 Tweet Analysis – Part 1

Summary (tl;dr)

The main points of this post are: as data scientists we don’t always get data in the shape we want it, and when that happens we have to use the tools we have to make the most of it. Events on social media are transient and only hold an audience’s attention whilst they are running. Event brand managers should publish information when the acceleration curve is falling (below the X axis), whilst managers of brands attending events should publish when it is rising (above the X axis). All the code for this post can be found on GitHub at: https://github.com/garyshort/oscars2015Part1

Introduction

The 2015 Oscars were held at the Dolby Theatre, Hollywood, on February 22nd 2015. Being a curious data scientist, I wanted to analyse the corresponding Twitter stream. To capture it, I used a low-powered Acer Aspire One ZG5 running the NodeJS build of TurnKey Linux.

I wrote the following service to capture Tweets tagged as #oscars or #oscars2015:

var util = require('util');
var twitter = require('twitter');
   
var twit = new twitter({
  consumer_key: 'YOUR KEY',
  consumer_secret: 'YOUR SECRET',
  access_token_key: 'YOUR TOKEN KEY',
  access_token_secret: 'YOUR TOKEN SECRET'
});

twit.stream('statuses/filter', {track: '#oscars, #oscars2015'}, 
  function(stream) {
    stream.on('data', function(data) {
      console.log(data);
    });
});

And the following Upstart job to respawn the service, if it was cut off from the Twitter end, and to redirect the output to a file on the filestore:

#start when we have a file system and network
start on (local-filesystem and net-device-up IFACE!=lo)

#stop when the system stops
stop on runlevel [016]

#restart if stopped
respawn

#start monitoring
exec node /home/gary/oscars2015/TwitterListener.js >> /home/gary/oscars2015/results.js

I ran the service from the 20th of February until the morning of the 23rd of February 2015, and captured some 3.2 million tweets.

Analysis of Tweets

Well, that’s what I meant to do… and if I’d done it right, I would have had 3.2 million tweets in a file, one tweet to a line. Only I didn’t; I messed it up. In an error that one of my friends calls being a “cut and paste tart”, instead of console.log(data) above, I copied from a service I had previously been debugging and ended up with console.log(util.inspect(data)). Meaning that, instead of a nice file of tweets, one per line, what I had was a “pretty printed” view of the inspected tweet objects. In other words, a big pile of text that couldn’t be made much sense of.

Ah well, no point crying over spilt milk. Since there was no chance of asking the Academy to rerun the Oscars, just so I could try to capture the tweets again, there was nothing else for it but to see what I could salvage.

One of the first things I wanted to do was to see how many tweets were posted. To do that I needed to pull out the created date for each tweet from my now useless text corpus. The data scientist’s best friend for that task is grep. The following command gave me a file with the creation timestamp for each tweet in the corpus:

grep -o "\(Fri\|Sat\|Sun\|Mon\) Feb \(20\|21\|22\|23\) [[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\} +0000 2015" results.js >> timestamps.txt

I could then pull this file down onto my dev box and have at it with C#. What I want to show is posts by day, so the first thing I’m going to do is build a list of DateTime objects from the file of Twitter timestamps:

private static List<DateTime> ProcessDateTimes()
{
    var dateTimes = new List<DateTime>();
    var path = @"your/path/here";

    // Read the grep output line by line and parse each Twitter
    // timestamp of the form "Mon Feb 23 02:15:07 +0000 2015".
    using (var sr = File.OpenText(path))
    {
        while (!sr.EndOfStream) {
            var line = sr.ReadLine();
            var dt = DateTime.ParseExact(
                line, 
                "ddd MMM dd HH:mm:ss zzzz yyyy", 
                CultureInfo.InvariantCulture);
            dateTimes.Add(dt);
        }
    }
    return dateTimes;
}

With that done, I can group on the day and count the tweets:

private static string GetPostingDateCSV(List<DateTime> dateTimes)
{
    var csvString = "Day,NumberOfPostings\n";
    dateTimes.GroupBy(dt => dt.Day)
        .ToList()
        .ForEach(group =>
        {
            csvString += group.Key.ToString() + ",";
            csvString += group.Count().ToString() + "\n";
        });
    return csvString;
}
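Wiring the two methods together is a couple of lines (the output file name here is just an example):

// Assumes the two methods above; the CSV file name is illustrative.
var dateTimes = ProcessDateTimes();
File.WriteAllText("postsByDay.csv", GetPostingDateCSV(dateTimes));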

Bang that output text into Excel, and I can make a pretty graph showing the posts by day:

[Chart: number of postings by day]

The first thing to remember when looking at this graph is that we don’t get all the tweets published; we only get a statistically representative sample, as provided by Twitter. What does this mean? Well, it means that things we can say about our model (the Tweets we do have) will be true of the population as a whole (all the Tweets that were posted over the same time period), but specific figures will not be correct. So, for example, it’s not true to say that 2.2 million Tweets were posted on the 23rd, but it is correct to say that 68% of Tweets were published on the 23rd; got it?

Wait a minute, the 23rd, I hear you say? But the Oscars were held on the 22nd? True, and that leads us to the second thing we have to remember. The Oscars were held on the 22nd Pacific Standard Time, whilst Tweets are recorded in Greenwich Mean Time (or UTC if you want to be more correct; but I’m a Brit, so I don’t :-P). This means that our results are going to be skewed by 8 hours. I don’t intend to do anything about that for the purposes of this analysis, but if you were paying me to do this, and you cared about it, I would correct for it.
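(For what it’s worth, the correction itself is cheap; here’s a sketch, assuming the parsed timestamps are treated as UTC and you’re on Windows, where the time zone id is “Pacific Standard Time”.)

// Shift each timestamp from UTC to Pacific Time before grouping.
// Assumes the input DateTimes are UTC, which Twitter's created_at
// values are; the Windows zone "Pacific Standard Time" handles DST.
private static List<DateTime> ToPacificTime(List<DateTime> utcDateTimes)
{
    var pacific = TimeZoneInfo.FindSystemTimeZoneById("Pacific Standard Time");
    return utcDateTimes
        .Select(utc => TimeZoneInfo.ConvertTimeFromUtc(
            DateTime.SpecifyKind(utc, DateTimeKind.Utc), pacific))
        .ToList();
}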

Of course, another thing we have to think about is the fact that I didn’t run the capture service for the full 24 hours of each of the represented days. I started it on Friday and stopped it on Monday morning (GMT). So we should also look at the number of posts per day when normalised over the number of hours that Tweets were captured for. To do this, we amend our C# code to group by day and by hour:

private static string GetPostsNorm(List<DateTime> dateTimes)
{
    // Group by day *and* hour so we can compare like-for-like periods.
    var csvString = "DayHour,NumberOfPostings\n";
    dateTimes.GroupBy(dt => new { dt.Day, dt.Hour })
        .ToList()
        .ForEach(group =>
        {
            csvString +=
                group.Key.Day.ToString()
                + "~"
                + group.Key.Hour.ToString()
                + ",";

            csvString += group.Count().ToString() + "\n";
        });
    return csvString;
}

Then we can graph that:

[Chart: number of postings by day and hour]

This brings even more into focus the idea that the “main event” is where all the action is, and everything before and after is just a sideshow. Worth bearing in mind if you are responsible for a brand’s social media output.

Let’s emphasise that by taking a look at a frequency distribution of the posts by hour by day:

[Chart: frequency distribution of posts by hour, by day]

As you can see from this graph, there are hardly any on-topic posts in the run-up to the “main event”, and posts tail off very quickly thereafter (the difference is around 300,000 posts per hour). In other words, if this were a brand you were managing, you would have a very limited window in which to connect with your audience before they move on somewhere else.

Lastly, let’s look at the acceleration of the Tweets for the 20 hours up to 10:00hrs (GMT) on February 23rd. Acceleration is just the first derivative of the velocity (posts per hour) graph above:

[Chart: acceleration of posts per hour over the 20 hours to 10:00 GMT on February 23rd]
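(For the curious, the “first derivative” here is nothing fancier than differencing consecutive hourly counts; a minimal sketch, assuming the hourly totals are already in chronological order, looks like this.)

// Acceleration as the hour-on-hour change in posting velocity.
// Assumes hourlyCounts is ordered chronologically.
private static List<int> GetAcceleration(List<int> hourlyCounts)
{
    return hourlyCounts
        .Skip(1)
        .Select((count, index) => count - hourlyCounts[index])
        .ToList();
}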

As acceleration and deceleration of posting can be viewed as waxing and waning interest in the topic, this further illustrates that interest in the topic spikes quickly as the event takes place and then plummets after the event finishes. If you are responsible for the social media of the event’s own brand, then everything below the X axis represents falling interest in your brand, and marks the times when you should be injecting information in order to rekindle the interest of your audience.

If you are managing the brand of an event attendee, say a vendor at an expo, then the spikes above the X axis represent occasions when you should inject your brand message in order to get the maximum attention possible.

Lessons Learned

From this analysis we’ve learned that – as data scientists – we don’t always get data in the shape we want it, and it’s up to us to use the tools at our disposal to make the most of the data we are given; remembering, at all times, to stay within the realms of what is statistically truthful, so that what we say about our model can be projected onto the population as a whole.

We also learned that, when it comes to events, there is a very narrow window of opportunity for brand managers to push their message to the optimum audience; but by analysing post acceleration (waxing/waning interest in the topic), brand managers can inject information in order to prolong audience interest for as long as possible.

Brand managers of brands attending events should look to post their messages at times of increasing interest (times when the acceleration graph is above the X axis) for maximum exposure at events.

Look Ahead

Well, that’s all for this post, where we looked at how many people were posting and at what hours; next time we’ll look at where those people were and how interest moved across the globe from the event’s location in LA. Until next time, keep crunching those numbers! :-)


Installing HDInsight Emulator on Your Local Machine

In this video I demonstrate how to install the HDInsight Emulator on your local machine via the Web Platform Installer, and how to run the wordcount example to test that it’s installed properly.


Call with a recruiter…

You have to admire his honesty, if nothing else. :-)

Him: Hi, it’s Ben Dover here, this is just a courtesy call to find out a little about your skill set, so that I can help you get your next role.

Me: Cool, where’d you get my number?

Him: (Laughs) Oh don’t worry, it’s nothing sinister, it’s right here on your CV.

Me: So you have my CV in front of you?

Him: Sure do!

Me: And you’re phoning to find out about my skills?

Him: Sure am!

Me: Uh huh, so right under where it says “Contact Details”, where you got my phone number from, there’s a section titled “Skills”, that section pretty much details my, well… skills.

Him: Ah, yeah, no, totally read that, just it didn’t really mean that much to me so thought I’d give you a call.

Me: As a courtesy..?

Him: Riiiight.

Me: And you want to find me my next role?

Him: Yes, you’ve got it! That’s why I’m calling, as a courtesy…

Me: And terms like “Hadoop”, “Pig”, “hive” and “Python” don’t mean anything to you?

Him: Not a thing, that’s why I thought I’d call you.

Me: Yeah, I’m not sure you’re going to be able to help me, but thanks for your… courtesy.


Sorry if You Disagree, but…

… the UK is a ‘Christian Country’. Why do I mention this, you may well ask? Well, recently, David Cameron wrote an article for the Church Times, where he stated:

“I believe we should be more confident about our status as a Christian country, more ambitious about expanding the role of faith-based organisations, and, frankly, more evangelical about a faith that compels us to get out there and make a difference to people’s lives.”

Nothing too controversial there you might imagine; however, defining the UK as a ‘Christian country’ seems to have angered a section of the population, including a group who published an open letter in the Telegraph Newspaper.

Well, I don’t really have a lot of time for either politicians or organised religion but, on this occasion, Cameron is correct: we are a ‘Christian country’. The reason I say this is twofold. Firstly, the Church of England is established in this country, and 26 of its bishops sit in the House of Lords, helping to pass the laws of the land.

Secondly, the majority of people, the majority of the time, self-identify as Christian.

Now you may not like these facts, but facts they are and they cannot be denied simply by putting your fingers in your ears and shouting “No we’re not! No we’re not! No we’re not!”


Speaking at DDD South West

I’m delighted to say I’ve been accepted to speak at DDD South West, where I’ll be delivering my talk on “Hadoop and Big Data for Microsoft Developers”. I hope I’ll see you there; if so, don’t forget to stop and say “hi!”.


The Day HDInsight Died (Again)

So here I am this morning, all ready to get my Data Science Fu on; I pop onto my local HDInsight installation, and it tells me:

[Screenshot: Hadoop not running]

Hmm, time to see if the services are running:

[Screenshot: Hadoop services not running]

Nope, all stopped. Sigh. Right, start them. When I try to do that, I get an error telling me there’s been a “log on failure”. A what now? The same log on worked yesterday. Hmm, worked yesterday but not today; I wonder if HDInsight installs a default user with the password set to expire?

[Screenshot: hadoop user set to “User must change password at next logon”]

Yes it does! I’m not sure if HDInsight installs this way, or if some policy applied to my work machine forces this, but unchecking “User must change password at next logon” and checking “Password never expires” should solve the problem.

[Screenshot: services started]

Services are up and running again…

[Screenshot: Hadoop is working]

And HDInsight is back!

So, if your local instance of HDInsight stops working, then check that the hadoop user password hasn’t expired.
