Oscars 2015 Tweet Analysis – Part 2: Locations

Summary (tl;dr)

Looking at the location details from the tweets, we learned that New York and Los Angeles were the most prolific posting locations (accounting for more than 200K posts between them), and that the vast majority of locations had fewer than 100 posts. We also learned that the PowerMap addin for Excel is an excellent tool for displaying geo-coded information. All of the code, and some of the data, from this post can be downloaded from my GitHub repo here: https://github.com/garyshort/oscars2015Part2

Introduction

In the first part of this analysis, we looked at how many people were posting and when; in this next part we’ll look at where they were posting from.

Twitter geolocates a tweet in a number of ways. Firstly, the user can “self-orientate” by stating where they are located; this information is contained in the Location property on the User object. The other way is to have your location appended to your tweet by your client, if it supports it; those properties live on the Tweet object and are: Place, Coordinates and Geo (deprecated). For the purposes of this analysis, we’ll use the Location property. Each method has its flaws, but at least with this one we are using information that the user has given us freely as their stated location, and we’re not going all “creepy stalker” on them by looking at the tweet location, which may have been appended to the tweet by the client app without the user realising that function was enabled. (Go and check whether your client leaks your location with each tweet without your knowledge; I’ll wait here ’til you get back. :))
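To make that distinction concrete, here’s a rough sketch of the two shapes involved. These are illustrative POCOs I’ve made up for this post, not the actual Twitter API types:

```csharp
using System;

// Illustrative sketch only -- minimal shapes for the two geo sources
// described above, not the real Twitter API classes.
public class TwitterUser
{
    // Free-text and self-reported: might be "Los Angeles, CA",
    // "My Bedroom", or empty.
    public string Location { get; set; }
}

public class Tweet
{
    public TwitterUser User { get; set; }

    // Client-appended coordinates; null unless the user's app has
    // location sharing switched on.
    public double[] Coordinates { get; set; }
}

public class GeoSketch
{
    public static void Main()
    {
        var tweet = new Tweet
        {
            User = new TwitterUser { Location = "New York, NY" },
            Coordinates = null // nothing appended by the client
        };

        // This analysis reads only the self-reported string.
        Console.WriteLine(tweet.User.Location);
    }
}
```

Everything in this series works off that self-reported `Location` string only.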

Analysis

You’ll recall from the last post that I messed up the collection of the data; if you’ve forgotten, go back and read how I screwed up. This means I’ll have to pull the information from the file using sed. I’m going to do this in two parts; firstly, I’ll pull out the location information:

sed -n "s/.*location: \x27\(.*\)\x27\,/\1/p" results.js > locations.txt

Although I’ve matched the whole line containing the location information, I’ve set up a capture group (the part in parentheses) to capture only the actual location, and it’s that part I pipe to the output file locations.txt, via the “\1” backreference.

Now, because not every user opts to specify a location, the second thing I’m going to do is to remove any blank lines from my locations file:

sed '/^$/d' locations.txt > locsNoBlanks.txt
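If you’d rather stay in one language, the two sed passes can be sketched as a single C# pass. This is a rough equivalent, assuming results.js really does hold lines of the `location: '…',` shape the sed pattern targets:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public class LocationExtractor
{
    // Matches lines like:  location: 'New York, NY',
    // and captures the value between the single quotes.
    static readonly Regex Pattern = new Regex(@"location: '(.*)',");

    public static IEnumerable<string> Extract(IEnumerable<string> lines)
    {
        return lines
            .Select(line => Pattern.Match(line))
            .Where(m => m.Success)
            .Select(m => m.Groups[1].Value)
            .Where(loc => loc.Length > 0); // second pass: drop blanks
    }

    public static void Main()
    {
        File.WriteAllLines(
            "locsNoBlanks.txt",
            Extract(File.ReadLines("results.js")));
    }
}
```

Same output file, no intermediate locations.txt needed.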

Having done that, we can use some C# code to give us the top 10 locations by posting frequency:

private static string GetTop10Locations(List<string> data)
{
    return data
        .GroupBy(location => location)
        .OrderByDescending(group => group.Count())
        .Take(10)
        .Aggregate(
            string.Empty,
            (acc, group) => acc + String.Format(
                "{0}~{1}\n",
                group.Key,
                group.Count()));
}
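Wiring that up is simple enough; here’s a minimal driver, assuming the cleaned file from the previous step sits in the working directory (the method is repeated here so the snippet compiles on its own):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class TopLocations
{
    // Same grouping logic as GetTop10Locations above.
    public static string GetTop10Locations(List<string> data)
    {
        return data
            .GroupBy(location => location)
            .OrderByDescending(group => group.Count())
            .Take(10)
            .Aggregate(
                string.Empty,
                (acc, group) => acc + String.Format(
                    "{0}~{1}\n",
                    group.Key,
                    group.Count()));
    }

    public static void Main()
    {
        var data = File.ReadAllLines("locsNoBlanks.txt").ToList();

        // One "location~count" pair per line, ready to paste into Excel.
        Console.Write(GetTop10Locations(data));
    }
}
```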

Take that output string, load it into Excel, and we can build the following graph:

[Image: bar chart of the top 10 posting locations]

As you can see, New York and California dominate the postings, which has a “truthy” feel to it, as California is the home of the movie industry and the East Coast is home to a lot of the print and broadcast media.

We can use similar code to reverse that graph and take a look at the bottom posting locations. I’ve removed all the locations that have only one posting, as investigation shows that those are mainly locations such as “My Bedroom”, “Here on Planet Earth”, etc.:

private static string GetBottom10Locations(List<string> data)
{
    return data
        .GroupBy(location => location)
        .Where(group => group.Count() > 1)
        .OrderBy(group => group.Count())
        .Take(10)
        .Aggregate(
            string.Empty,
            (acc, group) => acc + String.Format(
                "{0}~{1}\n",
                group.Key,
                group.Count()));
}

Loaded into Excel, this gives us the following graph:

[Image: bar chart of the bottom 10 posting locations]

Now those graphs got me wondering about what the frequency distribution of the number of posts per location would look like. To graph that, first we need to grab all the locations:

private static void GetAllLocations(List<string> data)
{
    string path = @"C:\Users\Gary\Documents\BlogPostExamples\"
        + @"Oscars2015\Part2\locations.data";

    // "using" flushes and closes the writer for us.
    using (StreamWriter sw = File.AppendText(path))
    {
        data
            .AsParallel()
            .GroupBy(location => location)
            .OrderByDescending(group => group.Count())
            .ToList()
            .ForEach(group => sw.Write(String.Format(
                "{0}\u00B1{1}\n",
                group.Key,
                group.Count())));
    }
}

It’s a bit of a beefy task, so I’ve asked C# to do the work in parallel, using all four cores on my i7, and to write the data to a file this time instead of building the string in memory.

Loading the data into Excel, we can define a number of “bins”, or ranges, to fit our data to:

[Image: the bin ranges defined in Excel]

Then we use the Histogram function in the Analysis ToolPak to create our frequency distribution. We can take that data and graph it; as there’s a large disparity between the frequencies of the bins, I’ll use a log scale on the Y axis for clarity:

[Image: log-scale histogram of posts per location]

As you can see, the vast majority of locations have 100 posts or fewer, with just 2 locations having between 50K and 100K posts; those were “New York, NY” and “Los Angeles, CA”.
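If you’d rather not round-trip through Excel, the same binning is a few lines of C#. A sketch, with made-up counts and bin edges standing in for the real locations.data values:

```csharp
using System;
using System.Linq;

public class HistogramSketch
{
    // Count how many locations fall into each bin. upperEdges are
    // inclusive upper bounds, with one open-ended bin on top.
    public static int[] BinCounts(int[] postCounts, int[] upperEdges)
    {
        var freq = new int[upperEdges.Length + 1];
        foreach (var count in postCounts)
        {
            int bin = 0;
            while (bin < upperEdges.Length && count > upperEdges[bin])
                bin++;
            freq[bin]++;
        }
        return freq;
    }

    public static void Main()
    {
        // Made-up posts-per-location counts for illustration.
        int[] counts = { 3, 7, 12, 150, 980, 45, 2, 62000, 98000 };
        int[] edges = { 100, 1000, 10000, 50000 };

        Console.WriteLine(string.Join(", ", BinCounts(counts, edges)));
    }
}
```

The log-scale trick still applies when you chart the result: the first bin dwarfs the rest.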

These numbers tell us a lot, but they are hard to relate to the locations themselves. Luckily, Excel now has an addin called PowerMap, a nice utility for displaying geo-coded data on a map, which really helps us see the data in context:

[Image: PowerMap heat map of North American posting locations]

On this map I’ve taken all the locations that have 1,000 or more posts and displayed them; the map above shows those in North America. You can see our two hot spots in New York and Los Angeles.

Now, whilst that is more useful than the bare numbers, it doesn’t tell us about the various posts that were categorised together as being in, say, “New York”; to do that we have to tell PowerMap to treat the location as a category attribute. When we do that, we can get a good idea of how many different locations were categorised as “New York” by the application:

[Image: PowerMap stacked bar chart of locations grouped by parent category]

In this graphic it’s easy to see the number of disparate locations that were categorised into each “Parent”. For example “NY,NY”, “New York, NY” and “NYC” were all categorised as being “New York”.

Summary

I think we’ll leave it there for this post. Today we learned that there were two major locations interested in the Oscars: “New York” and “Los Angeles”; we learned that the vast majority of recorded locations have fewer than 100 posts originating from them; and we learned that the PowerMap addin for Excel is an excellent tool for visualising geo-coded data.

Looking Ahead

In the next post, we’ll use some computational linguistics techniques to establish what the key pieces of information being exchanged in the tweets were. Until then, keep crunching those numbers! :)
