Data & AI Digest for 2018-05-02

I’ve been in the process of transferring my blog (along with creating a personal website) to blogdown, which is hosted on Github Pages. The new blog, or rather, the continuation of this blog, will be at, and it went live today. I’ll be cross-posting here for a while, at least until Tal gets my […]
More details at…

Hi there! I was trying out some ways to simulate the movements of animals (or other organisms) while taking habitat suitability into account. To do this, I used my previous eWalk model as the underlying process to simulate random or directional walks. This model is based on a Brownian / Ornstein–Uhlenbeck process. You can find more about the eWalk model here! Today, I will add one more element to these movement simulations. In this case, we will take into account the habitat or environmental preferences of the simulated species, to perform a simulation like this: First, we will create a raster layer as a random environmental variable, for example tree cover. library (raster) library (dismo) tc
More details at…
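
The idea in the item above, a walk whose steps are biased by habitat suitability, can be sketched in a few lines. This is only an illustration under stated assumptions: it is a simple lattice walk in Python, not the author's eWalk model (which is R-based and built on a Brownian / Ornstein–Uhlenbeck process), and the names `make_habitat` and `habitat_walk` are hypothetical.

```python
import random


def make_habitat(rows, cols, rng):
    """Random suitability grid in [0, 1], standing in for e.g. tree cover."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


def habitat_walk(habitat, start, n_steps, rng):
    """Lattice walk: each step moves to one of the four neighbouring cells,
    chosen with probability proportional to that cell's suitability."""
    rows, cols = len(habitat), len(habitat[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    r, c = start
    path = [(r, c)]
    for _ in range(n_steps):
        # Keep only neighbours that stay inside the grid.
        candidates = [(r + dr, c + dc) for dr, dc in moves
                      if 0 <= r + dr < rows and 0 <= c + dc < cols]
        # Tiny offset avoids an all-zero weight vector.
        weights = [habitat[i][j] + 1e-9 for i, j in candidates]
        r, c = rng.choices(candidates, weights=weights, k=1)[0]
        path.append((r, c))
    return path


rng = random.Random(42)          # fixed seed so the run is repeatable
habitat = make_habitat(20, 20, rng)
path = habitat_walk(habitat, (10, 10), 100, rng)
print(path[:5])
```

Cells with higher suitability attract more steps, so over many steps the track tends to linger in "good" habitat; a continuous-space model would replace the four fixed moves with random displacements.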

Summary: Our starting assumption that sequence problems (language, speech, and others) are the natural domain of RNNs is being challenged. Temporal Convolutional Nets (TCNs), which are our workhorse CNNs with a few new features, are outperforming RNNs on major applications today. It looks like RNNs may well be history.

More details at…
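
The summary above does not say what the "few new features" are. In the TCN literature they are usually causal and dilated convolutions, and that is an assumption brought in from outside this digest. A minimal pure-Python sketch of one such layer (the function name `causal_dilated_conv1d` is hypothetical):

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Positions before the start of the sequence count as zero, so the
    output at time t never depends on inputs later than t (causality).
    """
    y = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                total += w * x[idx]
        y.append(total)
    return y


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
out = causal_dilated_conv1d(signal, kernel=[0.5, 0.5], dilation=2)
print(out)  # [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
```

Stacking such layers with growing dilation (1, 2, 4, ...) lets the receptive field grow quickly with depth, which is how a convolutional model can cover long sequences without recurrence.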

Posted in Data & AI Digest | Tagged

Data & AI Digest for 2018-04-27

Today, we are excited to announce the availability of the OS Disk Swap capability for VMs using Managed Disks. Until now, this capability was only available for Unmanaged Disks. With this capability, it becomes very easy to restore a previous backup of the OS Disk or swap out the OS Disk for VM troubleshooting without having to delete the VM.
More details at…

Today we’re sharing the public preview of per disk metrics for all Managed & Unmanaged Disks. This enables you to closely monitor and make the right disk selection to suit your application usage…
More details at…

We illustrate pattern recognition techniques applied to an interesting mathematical problem: The representation of a number in non-conventional systems, generalizing the familiar base-2 or base-10 systems. The emphasis is on data science rather than mathematical theory, and the style is that of a tutorial, requiring minimum knowledge in mathematics or statistics. However, some off-the-beaten-path, state-of-the-art number theory research is discussed here, in a way that is accessible to…
More details at…
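
One common way to generalize base-2 or base-10 is to allow a non-integer base and generate digits greedily. The sketch below assumes that reading; the excerpt does not say which numeration systems the article actually covers, and the names `greedy_digits` and `from_digits` are hypothetical.

```python
import math


def greedy_digits(x, base, n_digits):
    """First n_digits of the greedy expansion of x in [0, 1) in a base > 1.

    For an integer base this is the usual digit expansion; for a
    non-integer base it is the greedy (beta) expansion.
    """
    if not 0 <= x < 1 or base <= 1:
        raise ValueError("need 0 <= x < 1 and base > 1")
    digits = []
    for _ in range(n_digits):
        x *= base
        d = math.floor(x)   # next digit
        digits.append(d)
        x -= d              # remainder stays in [0, 1)
    return digits


def from_digits(digits, base):
    """Rebuild the (truncated) value sum_k d_k * base**-(k+1)."""
    return sum(d * base ** -(k + 1) for k, d in enumerate(digits))


binary = greedy_digits(0.625, 2, 4)
golden = (1 + math.sqrt(5)) / 2
phi_digits = greedy_digits(0.625, golden, 40)
approx = from_digits(phi_digits, golden)
print(binary)  # [1, 0, 1, 0]
print(approx)
```

In base 2 the digits of 0.625 are the familiar 0.101; in the golden-ratio base the digits are still only 0 and 1, but the pattern is different, which is the kind of structure a pattern-recognition approach can explore.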

Posted in Data & AI Digest | Tagged

Data & AI Digest for 2018-04-26

Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. Azure Cosmos DB enables you to elastically and independently scale throughput and storage across any number of Azure’s geographic regions with a single click. It offers throughput, latency, availability, and consistency guarantees with comprehensive service level agreements (SLAs), a feature that no other database service can offer.
More details at…

A quick note for displaying R htmlwidgets in Jupyter notebooks without requiring pandoc – there may be a more native way but this acts as a workaround in the meantime if not: PS and from the other side, using reticulate for Python powered Shiny apps.
More details at…

Our onboarding
that ensure that packages contributed by the community undergo a
transparent, constructive, non adversarial and open review process, take
place in the issue tracker of a GitHub repository. Development of the
packages we onboard also takes place in the open, most often in GitHub

Therefore, when wanting to get data about our onboarding system for
giving a data-driven overview, my mission was to extract data from
GitHub and git repositories, and to put it into nice rectangles (as
defined by Jenny
Bryan) ready for
analysis. You might call that the first step of a “tidy git analysis”
using the term coined by Simon
So, how did I collect data?

A side-note about GitHub

In the following, I’ll mention repositories. All of them are git
repositories, which means they’re folders under version control, where
roughly speaking, all changes are saved via commits and their messages (more
or less) describing what’s been changed in the commit. Now, on top of
that these repositories live on GitHub which means they get to enjoy
some infrastructure such as issue trackers, milestones, starring by
admirers, etc. If that ecosystem is brand new to you, I recommend
reading this book, especially its big
picture chapter.

Package review processes: weaving the threads

Each package submission is an issue thread in our onboarding repository,
see an example
here. The first
comment in that issue is the submission itself, followed by many
comments by the editor, reviewers and authors. On top of all the data
that’s saved there, mostly text data, we have a private
Airtable workspace where we have a table of
reviewers and their reviews, with direct links to the issue comments
that are reviews.

Getting issue threads

Unsurprisingly, the first step here was to “get issue threads”. What do
I mean? I wanted a table of all issue threads, one line per comment,
with columns indicating the time at which something was written, and
columns digesting the data from the issue itself, e.g. guessing the role
of the commenter from other information: the first user of the issue
is the “author”.
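
The target table described above, one row per comment with a guessed role, can be sketched as follows. This is an illustration in Python rather than the author's R/ghql/jq workflow, and the input layout (`number`, `comments`, `user`, `created_at`) is a hypothetical simplification, not the GitHub API V4 schema.

```python
def flatten_issue(issue):
    """Turn one issue thread into a list of rows, one per comment.

    The role is guessed from the thread itself: whoever wrote the first
    comment (the submission) is labelled "author", everyone else "other".
    """
    author = issue["comments"][0]["user"]
    rows = []
    for comment in issue["comments"]:
        rows.append({
            "issue": issue["number"],
            "user": comment["user"],
            "created_at": comment["created_at"],
            "role": "author" if comment["user"] == author else "other",
        })
    return rows


# Made-up thread, only to show the shape of the output.
thread = {
    "number": 1,
    "comments": [
        {"user": "alice", "created_at": "2018-01-01T10:00:00Z"},
        {"user": "bob", "created_at": "2018-01-02T09:30:00Z"},
        {"user": "alice", "created_at": "2018-01-03T14:15:00Z"},
    ],
}
rows = flatten_issue(thread)
for row in rows:
    print(row)
```

Mapping this function over all issue numbers and concatenating the results gives the single rectangle; distinguishing editors from reviewers would need extra information, such as the reviewer table mentioned earlier.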

I used to use GitHub API V3 and then heard about GitHub API
V4 which blew my mind. As if I
weren’t impressed enough by the mere existence of this API and its

I discovered the rOpenSci ghql
package allows one to interact
with such an API and that its docs actually use GitHub API V4 as an

Carl Boettiger told me about his way to rectangle JSON
using jq, a language for
processing JSON, via a dedicated rOpenSci package,
I have nothing against GitHub API V3 and
gh and purrr workflows, but I was
curious and really enjoyed learning these new tools and writing this
code. I had written gh/purrr code for getting the same information
and it felt clumsier, but it might just be because I wasn’t
perfectionist enough when writing it! I managed to write the correct
GitHub V4 API query to get just what I needed by using its online
explorer. I then succeeded
in transforming the JSON output into a rectangle by reading Carl’s post
but also by taking advantage of another online explorer, jq
play where I pasted my output via
writeClipboard. That’s nearly always the way I learn about query
tools: using some sort of explorer and then pasting the code into a
script. When I am more experienced, I can skip the explorer part.

The first function I wrote was one for getting the issue number of the
last onboarding issue, because then I looped/mapped over all issues.

# function to get number of last issue
More details at…

On March the 17th I had the honor to give a keynote talk about rOpenSci’s package onboarding system at the satRday conference in Cape Town, entitled “Our package reviews in review: introducing and analyzing rOpenSci onbo…
More details at…

[Disclaimer: I received this book by Coryn Bailer-Jones for a review in the International Statistical Review and intend to submit a revised version of this post as my review. As usual, book reviews on the ‘Og reflect my own definitely personal and highly subjective views on the topic!] It is always a bit of […]
More details at…

Data Dictionary to Meta Data III is the third and final blog devoted to demonstrating the automation of meta data creation for the American Community Survey 2012-2016 household data set, using a published data dictionary. DDMDI was a teaser to show how Python could be used to generate R statements that…
More details at…

Posted in Data & AI Digest | Tagged

Markov, Trump and Counter Radicalization

Here’s a video of a talk I did at DDD12; it’s more of a functional demo of an idea, really.

Posted in Uncategorized | Leave a comment

Text Analysis of GE17 Manifestos

I had a quick look at the manifestos of the main parties today, so I thought I’d jot down a few remarks here.

So the first thing I did was to remove all the stop words and then run a frequency distribution across the remaining text, which yielded the following result:








That done, I ran a quick trigram collocation across the text. This finds groups of three words that have a low probability of being next to each other by accident, or just by the nature of the English language. Having found these groupings, I then took a frequency distribution over them to find the most frequent three-word groups; this can help us get a good feel for what ideas are important to the authors of the manifestos. The results are below:
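
For readers who want to try the same two steps (stop-word removal with a frequency distribution, then trigram collocations), here is a minimal standard-library sketch. It is not the code used for this post: the stop-word list is a tiny hypothetical one, the sample text is made up rather than taken from any manifesto, and the pointwise-mutual-information score is one common way to measure "unlikely to be adjacent by chance".

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "we", "will",
              "for", "our", "is"}


def tokenize(text):
    """Lower-case the text, split into words, drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]


def trigram_pmi(tokens, min_count=2):
    """Map each trigram seen at least min_count times to (count, PMI).

    PMI compares the observed trigram probability with what independent
    words would give; a high value means the words stick together.
    """
    unigrams = Counter(tokens)
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    n_uni = len(tokens)
    n_tri = max(len(tokens) - 2, 1)
    scored = {}
    for tri, count in trigrams.items():
        if count < min_count:
            continue
        p_tri = count / n_tri
        p_indep = math.prod(unigrams[w] / n_uni for w in tri)
        scored[tri] = (count, math.log2(p_tri / p_indep))
    return scored


# Made-up sample text, not from any manifesto.
text = ("We will invest in the national health service. "
        "The national health service is for everyone. "
        "Strong economy and a strong national health service.")
tokens = tokenize(text)
freq = Counter(tokens)                 # frequency distribution
collocations = trigram_pmi(tokens)     # frequent, strongly associated trigrams
print(freq.most_common(3))
print(collocations)
```

Ranking the surviving trigrams by count then gives the "most frequent three-word groups" step described above.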








That’s it for now, I’ll post again if I get time to do a little more analysis.

Posted in Data Science | Tagged | Leave a comment

Thoughts on The 2017 UK General Election YouGov Model

I was recently asked to comment on the YouGov model that showed that the Conservative Party may fall short of an overall majority in the upcoming election. As my reply grew longer than I had intended, I decided to post it here too, for information.

So here are my thoughts on the issue, in no particular order…

1. I don’t have enough evidence to process the news from YouGov properly. Despite how the media brand this, it’s not a poll, it’s a model, and that’s a different thing altogether. The model also comes with a “health warning” from YouGov, saying the margin of error is high. Now, a high margin of error is amplified by the first-past-the-post system here. In other words, YouGov are saying: here’s what we think the election will look like, but it could look very different. The same model was used to correctly predict the result of the Brexit referendum, but that was a binary choice (vote in or out), whereas elections are choices made along a political spectrum from left to right; this should make the model less accurate for elections. However, there have been far more elections than referendums, so the opportunity to apply corrective data is greater, which will tend to make the model more accurate. See what I mean? There’s just not enough evidence to process this properly; all we can say for certain is that the model is 1 for 1 right now.

2. You must also bear in mind that in modern elections, you can pretty much ignore the polls. No, I’m serious. Polls in the modern era are finished until they work out a way to deal with how elections are conducted now. Let me explain by first demonstrating how campaigns are run now. Right now it is like a “phony war”: the campaigns are segregating their voter files along two axes; the first is vote for us / vote for the opposition, and the second is likelihood of voting. Next, they are A/B testing persuasion messages for each group in those quadrants. You don’t see this unless you are targeted by the ads on social media (this is a whole other issue currently being looked at by the Information Commissioner’s Office). After A/B testing is complete, in the last 72 hours of the campaign, they’ll blast out these tested messages on social media and other platforms, literally spending millions. That will make a massive difference to the end result and, guess what, no poll in the world will catch it, because the pollsters can’t collect, process and analyse the data fast enough.

3. Whoever wins the election, it isn’t going to make that much difference to (what we know about) Brexit. Labour are committed to Brexit; in fact, I can’t see how you can be a democrat and not be. Like it or not, the country voted out, so out we must go. There’s only one party (the LibDems) wavering on that, and they are nowhere in the polls. Now maybe Labour has a different view of what “out” looks like, but since the Tories haven’t told us anything about Brexit and neither has Labour, we don’t really know how it will be different, if at all.

4. Then there’s the “Scottish Question”, which further complicates matters. It has become clear that the “once in a generation” promise made during the 2014 referendum campaign was more of a “get the vote out” tactic than an actual promise, and in fact the SNP are wedded to a neverendum strategy, as here we are just two years down the line facing indyref2. This is deeply destabilizing for the Scottish economy, as illustrated by slower job growth and declining inward investment figures. Scotland holds ~10% of the UK parliamentary seats and the two major parties have said no to indyref2… until a couple of days ago. As the polls narrow, Corbyn sees an opportunity for a “progressive alliance” between Labour and the SNP with enough votes to form a coalition government, and he’s begun giving interviews hinting at this. Now that leaves the majority of Scottish voters, who voted “No” in the referendum, with only one party to vote for in order to save Scotland from this neverendum, and that’s the Tories. Many of us, me included, will hold our noses and put country first, and vote Tory on June 8th. This will not be enough to defeat the SNP, but we hope it will give them pause for thought and make them realise that we don’t want another referendum; we want them to get on with governing the country. If we can take 4-5 seats off the SNP then we hope that will be enough, but as a side effect, it also helps the Tories in Westminster.

In summary then, I’d say everything you’ve seen, and will see, up to the last 72 hours is fairly meaningless and the polls will not help you. If you want to know who’ll win the election, ignore the polls and jump on social media trackers like during the last 3-4 days. 🙂

Posted in Data Science, Politics | Leave a comment

Google DeepMind patient app legality questioned

The head of the Department of Health’s National Data Guardian (NDG) has criticised the NHS for the deal it struck with Google’s DeepMind over sharing patient data.

Posted in Uncategorized | Leave a comment