Text Analysis of GE17 Manifestos

I had a quick look at the manifestos of the main parties today, so I thought I’d jot down a few remarks here.

So the first thing I did was to remove all the stop words and then run a frequency distribution across the remaining text, which yielded the following result:
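
For anyone wanting to reproduce this step, here is a minimal sketch in Python using only the standard library. It is not the code used for this post: the stop-word list, the `word_frequencies` helper and the sample text are all made up for illustration, and a real analysis would use a much fuller stop-word list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a
# fuller list, for example one shipped with an NLP library.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "we", "will", "our", "for", "is"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into words, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Made-up sample text standing in for a manifesto.
sample = "We will invest in the NHS. We will invest in schools and in the NHS."
freqs = word_frequencies(sample)

# Show the most frequent remaining words and their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Swapping `sample` for the full text of a manifesto gives the same kind of ranked word list described above.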








That done, I ran a quick trigram collocation analysis across the text. This finds groups of three words that appear next to each other more often than you would expect by accident, or just from the nature of the English language. Having found these groupings, I then took a frequency distribution over them to find the most frequent three-word groups, which can give us a good feel for what ideas are important to the authors of the manifestos. The results are below:
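
As a rough illustration of the idea, the sketch below scores each three-word sequence with pointwise mutual information (PMI), keeps those that occur together more often than their individual word frequencies would suggest, and then ranks the survivors by how often they appear. The choice of PMI, the `trigram_collocations` helper, the `min_pmi` threshold and the sample text are all assumptions made for illustration; the post does not say which association measure or library was used.

```python
import math
import re
from collections import Counter


def trigram_collocations(text, min_pmi=0.0):
    """Return a Counter of trigrams whose PMI exceeds min_pmi, valued by frequency."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    unigrams = Counter(tokens)
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    total_trigrams = sum(trigrams.values())

    kept = Counter()
    for (w1, w2, w3), count in trigrams.items():
        # Probability of seeing the three words together ...
        p_joint = count / total_trigrams
        # ... versus the probability if the words were independent of each other.
        p_indep = (unigrams[w1] / n) * (unigrams[w2] / n) * (unigrams[w3] / n)
        if math.log2(p_joint / p_indep) > min_pmi:
            kept[(w1, w2, w3)] = count
    return kept


# Made-up sample text standing in for a manifesto.
sample = (
    "National Health Service funding. The National Health Service matters. "
    "Fund the National Health Service."
)
collocations = trigram_collocations(sample)

# Rank the surviving trigrams by frequency, most common first.
for trigram, count in collocations.most_common(3):
    print(" ".join(trigram), count)
```

On a text this small almost every trigram passes the PMI filter, so the threshold only starts to matter on a full-length document, where a higher `min_pmi` would weed out accidental pairings of common words.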








That’s it for now; I’ll post again if I get time to do a little more analysis.

Posted in Data Science

Thoughts on The 2017 UK General Election YouGov Model

I was recently asked to comment on the YouGov model that showed that the Conservative Party may fall short of an overall majority in the upcoming election. As my reply grew longer than I had intended, I decided to post it here too, for information.

So here are my thoughts on the issue, in no particular order…

1. I don’t have enough evidence to process the news from YouGov properly. Despite how the media brand this, it’s not a poll, it’s a model, and that’s a different thing altogether. The model also comes with a “health warning” from YouGov, saying the margin of error is high. Now, a high margin of error is amplified by the first past the post system here. In other words, YouGov are saying: here’s what we think the election will look like, but it could look very different. The same model was used to correctly predict the result of the Brexit referendum, but that was a binary choice (vote in or out), whereas elections are choices made along a political spectrum from left to right, which should make the model less accurate for elections. However, there have been far more elections than referendums, so the opportunity to apply corrective data is greater, and that will tend to make the model more accurate. See what I mean? There’s just not enough evidence to process this properly; all we can say for certain is that the model is 1 for 1 right now.

2. You must also bear in mind that, in modern elections, you can pretty much ignore the polls. No, I’m serious. Polls in the modern era are finished until they work out a way to deal with how elections are conducted now. Let me explain by first describing how campaigns are run now. Right now it is like a “phony war”: the campaigns are segmenting their voter files along two axes. The first is vote for us / vote for the opposition, and the second is likelihood of voting. Next, they are A/B testing persuasion messages for each group in those quadrants. You don’t see this unless you are targeted by the ads on social media (this is a whole other issue, currently being looked at by the Information Commissioner’s Office). After A/B testing is complete, in the last 72 hours of the campaign, they’ll blast out these tested messages on social media and other platforms, literally spending millions. That will make a massive difference to the end result and, guess what, no poll in the world will catch that, because they can’t collect, process and analyse the data fast enough.

3. Whoever wins the election, it isn’t going to make that much difference to (what we know about) Brexit. Labour are committed to Brexit; in fact I can’t see how you can be a democrat and not be. Like it or not, the country voted out, so out we must go. There’s only one party (the LibDems) wavering on that, and they are nowhere in the polls. Now maybe Labour has a different view of what “out” looks like, but since the Tories haven’t told us anything about Brexit and neither has Labour, we don’t really know how it will be different, if at all.

4. Then there’s the “Scottish Question”, which further complicates matters. It has become clear that the “once in a generation” promise made during the 2014 referendum campaign was more of a “get the vote out” tactic than an actual promise, and in fact the SNP are wedded to a neverendum strategy, as here we are just two years down the line facing indyref2. This is deeply destabilizing for the Scottish economy, as illustrated by slower job growth and declining inward investment figures. Scotland holds ~10% of the UK parliamentary seats and the two major parties have said no to indyref2… until a couple of days ago. As the polls narrow, Corbyn sees an opportunity for a “progressive alliance” between Labour and the SNP with enough votes to form a coalition government, and he’s begun giving interviews hinting at this. Now that leaves the majority of Scottish voters, who voted “No” in the referendum, with only one party to vote for in order to save Scotland from this neverendum, and that’s the Tories. Many of us, me included, will hold our noses, put country first, and vote Tory on June 8th. This will not be enough to defeat the SNP, but we hope it will give them pause for thought and make them realise that we don’t want another referendum; we want them to get on with governing the country. If we can take 4-5 seats off the SNP then we hope that will be enough, but as a side effect, it also helps the Tories in Westminster.

In summary then, I’d say everything you’ve seen, and will see, up to the last 72 hours is fairly meaningless and the polls will not help you. If you want to know who’ll win the election, ignore the polls and jump on social media trackers like https://trading.co.uk/generalelection/ during the last 3-4 days. 🙂

Posted in Data Science, Politics

Google DeepMind patient app legality questioned

The head of the Department of Health’s National Data Guardian (NDG) has criticised the NHS for the deal it struck with Google’s DeepMind over sharing patient data.


Posted in Uncategorized

Will Democratizing AI be The Cancer at The Heart of Future Enterprise?

Am I alone in thinking that the push by Microsoft (and others) to “Democratize AI” represents a threat to business equal to that of, say, the Y2K Bug?

Here’s why I think it is. A while ago I was attending a conference and I saw an AzureML talk. The presenter did an amazing job: they were really engaging, and the talk was deep, technical, and… utterly flawed. During the talk the presenter took Likert scale data and clustered it using the K-Means algorithm. The audience were enthralled.

Now for the non-data scientists in the audience, here’s a quick catch-up. You’ll see Likert scales used most often in surveys; you know, the kinds of questions that ask: “Where 1 is strongly disagree, 2 is disagree, 3 is neither agree nor disagree, 4 is agree and 5 is strongly agree, how strongly do you agree with the following statement…”. This makes it ordinal data: data which is categorical in nature, with categories that have an order but no fixed distance between them.

However, because the answers are recorded numerically – 1 through 5 – when the data comes to be analysed it looks like interval data (numbers that you can do arithmetic with). Now, when that data is analysed by an experienced data scientist, there’s no problem: they recognize the trap and apply techniques suited to ordinal data, such as Spearman’s rank correlation.

The presenter, however, although a very experienced database person, had little to no experience as a data scientist and immediately fell into the trap, either by seeing numbers and assuming the data was interval in nature, or by not knowing that, in the K-Means algorithm, the “Means” part relates to averages. The data being ordinal in nature meant that the presenter was not entitled to do arithmetic on it.
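
To make the trap concrete, here is a small sketch in Python with made-up survey responses (not data from the talk). The labels 1 to 5 only encode an order, so any other order-preserving set of labels is just as valid. The hypothetical `recode` mapping below relabels “strongly agree” as 10 instead of 5; the comparison of group means flips, while the comparison of medians, which relies on order alone, does not. K-Means centroids are means, so they inherit the same arbitrariness.

```python
from statistics import mean, median

# Made-up Likert responses for two groups of respondents (1 = strongly disagree,
# 5 = strongly agree).
group_a = [3, 3, 3, 3, 3]
group_b = [1, 1, 2, 5, 5]

# A hypothetical relabelling that keeps the order of the categories intact
# but changes the spacing between them.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}


def summarise(responses):
    """Return the (mean, median) of a list of coded responses."""
    return mean(responses), median(responses)


a_mean, a_median = summarise(group_a)
b_mean, b_median = summarise(group_b)
a_mean_re, a_median_re = summarise([recode[r] for r in group_a])
b_mean_re, b_median_re = summarise([recode[r] for r in group_b])

# With the usual 1-5 coding, group A has the higher mean ...
print("original means:", a_mean, b_mean)
# ... but after an equally valid relabelling, group B does.
print("recoded means: ", a_mean_re, b_mean_re)
# The medians tell the same story under both codings.
print("medians:       ", (a_median, b_median), (a_median_re, b_median_re))
```

Any conclusion that changes when the labels are swapped for an equally legitimate set is an artefact of the coding and says nothing about the respondents.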

Now this wouldn’t be a real problem if it were not for two issues. Firstly, the very nature of “Democratizing AI” means that the majority of analysis is going to be carried out by people without professional training in the future. Secondly, tools like AzureML can’t tell that what you are doing is nonsense, because they don’t have the context of how the data was captured. You feed them numbers and ask them to do arithmetic; they do that arithmetic and give you an answer.

Going forward, the danger will be that untrained people will take that answer and run with it, not having the training to realize that the answer is utterly meaningless in the context of the problem.

To solve this problem, enterprises need to understand that no matter how simple Microsoft make data science appear, to operate it safely you need trained staff. Also, Microsoft and other companies need to think about how they can make these tools “safer”; something like compilers and linters for data science needs to be developed.

The bottom line is, “Democratizing AI” should mean LESS training is required to use these tools, not that NO training is required.

Posted in Data Science

Yesterday in Data Science March 12th 2017

Following my post about logistic regressions, Ryan got in touch about one bit of building logistic regression models that I didn’t cover in much detail – interpreting regression coefficients. This post will hopefully help Ryan (and others) out. @SteffLocke This was
The post How to go about interpreting regression cofficients appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.
More details at… http://feedproxy.google.com/~r/RBloggers/~3/r6eQu42S844/

Focus for books on R tend to be highly focused on either statisticians or programmers. There is a dearth of material to assist those in typically less quantitative field access the powerful tools in the R ecosystem. Enter Text Analysis with R for Students of Literature. I haven’t done a deep read of the book, […]
More details at… http://feedproxy.google.com/~r/RBloggers/~3/t7GZ9GZ46A4/

Recently, I read a post regarding a sentiment analysis of Mr Warren Buffett’s annual shareholder letters in the past 40 years written by Michael Toth. In this post, only five of the annual shareholder letters showed negative net sentiment scores, whereas a majority of the letters (88%) displayed a positive net sentiment score. Toth noted […]
More details at… http://feedproxy.google.com/~r/RBloggers/~3/xa89N8oIGxk/

There’s a handy new function in R 3.4.0 for anyone interested in data about CRAN packages. It’s not documented, but it’s pretty simple: tools::CRAN_package_db() returns a data frame with one row for every package on CRAN and 65 columns of data on those packages, as shown below.
    > names(tools::CRAN_package_db())
     [1] "Package"                 "Version"                 "Priority"
     [4] "Depends"                 "Imports"                 "LinkingTo"
     [7] "Suggests"                "Enhances"                "License"
    [10] "License_is_FOSS"         "License_restricts_use"   "OS_type"
    [13] "Archs"                   "MD5sum"                  "NeedsCompilation"
    [16] "Additional_repositories" "Author"                  "Authors@R"
    [19] "Biarch"                  "BugReports"              "BuildKeepEmpty"
    [22] "BuildManual"             "BuildResaveData"         "BuildVignettes"
    [25] "Built"                   "ByteCompile"             "Classification/ACM"
    [28] "Classification/ACM-2012" "Classification/JEL"      "Classification/MSC"
    [31] "Classification/MSC-2010" "Collate"                 "Collate.unix"
    [34] "Collate.windows"         "Contact"                 "Copyright"
    [37] "Date"                    "Description"…
More details at… http://feedproxy.google.com/~r/RBloggers/~3/Qknl1yY37PE/

This is part of a new series of articles: once or twice a month, we post previous articles that were very popular when first published. These articles are at least 6 months old but no more than 12 months old. The previous digest in this series was posted here a while back.
20 Great Blogs Posted in the last 12
More details at… http://www.datasciencecentral.com/xn/detail/6448529:BlogPost:561994

Posted in Data Science Digest

Yesterday in Azure Cloud – March 12th 2017

The built-in geo-replication feature has been generally available to SQL Database customers since 2014. During this time one of the most common customer requests has been about supporting transparent failover with automatic activation. Today we are happy to announce a public preview of auto-failover groups that extends geo-replication with the following additional capabilities:
More details at… https://azure.microsoft.com/blog/azure-sql-database-now-supports-transparent-geographic-failover-of-multiple-databases-featuring-automatic-activation/

Automatically scaling out or scaling in applications to handle the demands of your business is an essential element of the cloud strategy. Azure’s Autoscale service empowers you to automatically scale your compute and App Service workloads based on user-defined rules regarding metric conditions, time/date schedules, or both.
More details at… https://azure.microsoft.com/blog/manage-your-business-needs-with-new-enhancements-in-azure-autoscale/

Broad support for regulatory compliance and ongoing innovation are at the core of Microsoft’s commitment to enabling U.S. government missions with a complete, trusted, and secure cloud platform.
More details at… https://azure.microsoft.com/blog/azure-government-the-most-secure-compliant-cloud-for-defense-with-new-compliance-and-service-offerings/

App Service on Linux (Preview) enables developers to run their cloud apps natively on Linux Docker Containers. It makes it easier to migrate existing apps hosted on a Linux platform elsewhere
More details at… https://azure.microsoft.com/blog/see-whats-new-for-azure-app-service-on-linux-preview/

Application Insights has new tools to empower your development team to better understand how customers use your web apps. These tools are available as a preview today in Application Insights in the
More details at… https://azure.microsoft.com/blog/new-tools-for-understanding-user-behavior-with-application-insights/

Azure DevTest Labs is a commercial Azure service that enables IT admins to create a cost-controlled self-service for developers and testers to quickly create environments in Azure, while minimizing waste and optimizing cost. We announced the service GA last May, and never stop exploring more opportunities to build solutions that solve our customers’ real problems in various scenarios. Today, as Microsoft Build 2017 is happening now in Seattle, I would like to take this moment with you to look back at all the key functionalities we’ve shipped since the Connect() conference last November, and explain how they can help you in various scenarios.
More details at… https://azure.microsoft.com/blog/azure-devtest-labs-updates-at-build-2017/

We are excited to announce the general availability of Application Insights Profiler for Azure App Service.
More details at… https://azure.microsoft.com/blog/application-insights-profiler/

Posted in Azure

Microsoft is Building Literate Machines

…now, the company’s leading AI experts are working on systems that can do something even more complex: Read passages of text and answer questions about them.


Posted in Uncategorized | 7 Comments