Data & AI Digest for 2018-08-14

Keep current on what’s happening in Azure, including what’s now in preview, generally available, news & updates, and more.
More details at…

We continue to expand the Azure Marketplace ecosystem. From July 1 to 15, 98 new offers successfully met the onboarding criteria and went live.
More details at…

You can accelerate your cloud migration using intelligent migration assessment services like Azure Migrate. Azure Migrate is a generally available service, offered at no additional charge, that helps you plan your migration to Azure.
More details at…

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes. The graph is using a log-log scale (so things are very compressed). But data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below. Notice on the … Continue reading data.table is Really Good at Sorting
More details at…

Introduction As part of our effort to provide users ways to replicate our analyses and improve their performance in fantasy football, we are continuously looking at ways we can improve[…]
The post 2018 Update of ffanalytics R Package appeared first on Fantasy Football Analytics.
More details at…

The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an…
More details at…

I love Arkham Horror: The Card Game. I love it more than I really should; it’s ridiculously fun. It’s a cooperative card game where you build a deck representing a character in the Cthulhu mythos universe, and with that deck you play scenarios in a narrative campaign where you grapple with the horrors of the mythos. Your actions in (and between) scenarios have repercussions for how the campaign plays out, changing the story, and you use experience points accumulated in the scenarios to upgrade your deck with better cards.
More details at…

Model interpretability is critical to businesses. If you want to use high performance models (GLM, RF, GBM, Deep Learning, H2O, Keras, xgboost, etc), you need to learn how to explain them. With machine learning interpretability growing in importance, s…
More details at…

Posted in Data & AI Digest | Tagged

Data & AI Digest for 2018-08-13


A Deluge of Content Over the last two decades, the accessibility of media has increased dramatically. One of the main reasons for this surge in available content is the evolution of online streaming platforms. Services like Netflix, Hulu, Amazon Prime, and others enable access for millions of consumers to a seemingly countless number of movies […]
More details at…

Principal component analysis (PCA) is a dimensionality reduction technique which might come in handy when building a predictive model or in the exploratory phase of your data analysis. It is often the case that, just when it would be most handy, you have forgotten it exists, but let’s neglect this aspect for now 😉

I decided to write this post mainly for two reasons:

I wanted to put some order in my mind regarding the terminology used, and to complain about a few things.
I wanted to try to use PCA in a meaningful example.

What does it do?
There are a lot of ways to try to explain what PCA does and there are a lot of good explanations online. I strongly suggest you google around a bit on the topic.
PCA looks for a new reference system to describe your data. This new reference system is designed to maximize the variance of the data along the new axes. The first principal component accounts for as much variance as possible, the second for as much of the remaining variance as possible while being orthogonal to the first, and so on. PCA transforms a set of (typically) correlated variables into a set of uncorrelated variables called principal components. The hope is that a smaller number of PCs can be used to summarise the whole dataset. Note that PCs are linear combinations of the original variables.
The procedure simply boils down to the following steps:

Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).
Calculate the covariance matrix of the data.
Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.
Keep only the N largest eigenvalues according to one of the many criteria available in the literature.
Project your data in the new frame of reference by multiplying your data matrix by a matrix whose columns are the N eigenvectors associated with the N largest eigenvalues.
Use the projected data (very confusingly called “scores”) as your new variables for further analysis.
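The six steps above can be sketched in a few lines of Python. This is a minimal, dependency-free illustration rather than the code used for the analysis in this post: the toy data, the function names and the use of power iteration with deflation (standing in for a proper eigen-solver from a numerical library) are all assumptions made for the example.

```python
import math
import random


def standardize(rows):
    """Step 1: centre every column and scale it to unit variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def covariance_matrix(centred):
    """Step 2: sample covariance matrix of data whose columns have mean 0."""
    n, p = len(centred), len(centred[0])
    return [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def top_eigenpairs(matrix, k, iterations=500, seed=0):
    """Steps 3-4: the k largest eigenvalues with their eigenvectors.

    Power iteration with deflation; only meant for small symmetric
    matrices such as the toy example below.
    """
    rng = random.Random(seed)
    p = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        for _ in range(iterations):
            nxt = [sum(a * b for a, b in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient: the eigenvalue that belongs to vec.
        value = sum(
            v * sum(a * b for a, b in zip(row, vec)) for v, row in zip(vec, work)
        )
        pairs.append((value, vec))
        # Deflate: remove the component just found before the next search.
        work = [
            [work[i][j] - value * vec[i] * vec[j] for j in range(p)]
            for i in range(p)
        ]
    return pairs


def project(centred, eigenvectors):
    """Steps 5-6: scores = data matrix times the matrix of eigenvectors."""
    return [
        [sum(x * w for x, w in zip(row, vec)) for vec in eigenvectors]
        for row in centred
    ]


# Toy data: the second column is roughly twice the first, the third is noise.
data = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.1],
    [3.0, 6.2, 0.9],
    [4.0, 7.8, 0.3],
    [5.0, 10.1, 0.7],
    [6.0, 12.0, 0.2],
]
scaled = standardize(data)
pairs = top_eigenpairs(covariance_matrix(scaled), k=2)
scores = project(scaled, [vec for _, vec in pairs])
print([round(value, 3) for value, _ in pairs])
```

Because two of the three toy columns are almost perfectly correlated, two components should carry nearly all of the variance, which is the situation where PCA pays off.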

Everything clear? No? Ok don’t worry, it takes a bit of time to sink in. Check the practical example below.
Use case: why should you use it?
The best use cases I can think of off the top of my head are the following:

Redundant data: you have gathered a lot of data on a system and you suspect that most of that data is not necessary to describe it. Example: a point moving on a straight line with some noise, say along the X axis, but you have somehow also collected data on the Y and Z axes. Note that PCA doesn’t get rid of the redundant data, it simply constructs new variables that summarise the original data.

Noise in your data: the data you have collected is noisy and there are some correlations that might annoy you later.
The data is heavily redundant and correlated. You think out of the 300 variables you have, 3 might be enough to describe your system.

In all these 3 cases PCA might help you get closer to your goal of improving the results of your analysis.
Use case: when should you use it?
PCA is a technique used in the exploratory phase of data analysis. Personally, I would use it only in one of the 3 circumstances above, at the beginning of my analysis, after having confirmed through the basic descriptive statistics tools and some field knowledge that I am indeed in one of the use cases mentioned. I would also probably use it when trying to improve a model that relies on independence of the variables.
Not use case: when should you not use it?
PCA is useless when your data is mostly independent or, I would argue, when you have very few variables. Personally, I would not use it when a clean and direct interpretation of the model is required either. PCA is not a black box, but its output is a) difficult to explain to non-technical people and b) in general, not easy to interpret.
Example of application: the Olivetti faces dataset
The Olivetti faces dataset is a collection of 64×64 pixel greyscale images of 40 faces, with 10 images for each face. It is available in Scikit-learn and it is a good example dataset for showing what PCA can do. I think it is a better cheap example than the iris dataset since it can show you why you would use PCA.
First, we need to download the data using Python and Scikit-Learn.
Then, we need to shape the data in a way that is acceptable for our analysis, that is, features (also known as variables) should be columns, and samples (also known as observations) should be the rows. In our case the data matrix should have size 400×4096 (40 faces × 10 images, each with 64 × 64 = 4096 pixels). Row 1 would represent image 1, row 2 image 2, and so on… To achieve this, I used this simple R script with a for loop
The images in the dataset are all head shots with mostly no angle. This is an example of the first face (label 0)

A very simple thing to do would be to look at what the average face looks like. We can do this by using dplyr, grouping by label and then taking the average of each pixel. Below you can see the average face for labels 0, 1 and 2
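The reshaping and the group-by-label average can be sketched as follows. This is a Python stand-in for the R/dplyr step described above, not the original script; the tiny 2×2 "images" and the names flatten and average_by_label are made up for the illustration.

```python
from collections import defaultdict


def flatten(image):
    """Turn a 2-D image (list of pixel rows) into one row of the data matrix."""
    return [pixel for line in image for pixel in line]


def average_by_label(rows, labels):
    """Return {label: per-pixel mean of all rows carrying that label}."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(pixel) / len(members) for pixel in zip(*members)]
        for label, members in grouped.items()
    }


# Three made-up 2x2 "images": two of face 0 and one of face 1.
images = [
    [[0, 2], [4, 6]],
    [[2, 4], [6, 8]],
    [[10, 10], [10, 10]],
]
labels = [0, 0, 1]

matrix = [flatten(image) for image in images]  # one row per image
average_faces = average_by_label(matrix, labels)
print(average_faces[0])
```

With the real dataset each row would hold 4096 pixel values, and an average face can be reshaped back to 64×64 for plotting.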

Then it is time to actually apply PCA to the dataset
Since the covariance matrix has dimensions 4096×4096 and it is real and symmetric, we get 4096 eigenvalues and 4096 corresponding eigenvectors. I decided (after a few trials) to keep only the first 20 which account roughly for 76.5 % of the total variance in the dataset. As you can see from the plot below, the magnitude of the eigenvalues goes down pretty quickly.

A few notes:

The eigenvalues of the covariance matrix represent the variance of the principal components.
Total variance is preserved after a rotation (i.e. what PCA is at its core).
The variable D_new contains the new variables in the reduced feature space. Note that by using only 20 variables (the 20 principal components) we can capture 76.5 % of the total variance in the dataset. This is quite remarkable, since it means we can probably avoid using all 4096 pixels for every image. This is good news and it means that applying PCA in this case might be useful! If I had needed 4000 principal components to capture a decent amount of variance, that would have been a hint that PCA probably wasn’t the right tool for the job. This point is perhaps trivial, but particularly important. One must not forget what each tool is designed to accomplish.
There are several methods for choosing how many principal components should be kept. I simply chose 20 since it seems to work out fine. In a real application you might want to research a more scientific criterion or make a few trials.
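One common criterion is to keep the smallest number of components whose eigenvalues add up to a chosen share of the total variance. A minimal sketch, assuming the eigenvalues are already available as a plain list (the function name and the example values are hypothetical, not taken from the faces dataset):

```python
def components_needed(eigenvalues, target=0.765):
    """Smallest number of components whose cumulative variance share
    reaches `target` (a fraction between 0 and 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)


# Made-up eigenvalues that decay quickly, as in the plot described above.
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
print(components_needed(example_eigenvalues, target=0.65))
```

Since each eigenvalue is the variance of its principal component and the total variance is preserved, the running sum divided by the total is exactly the share of variance explained so far.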

The eigenfaces
What is an eigenface? The eigenvectors of the covariance matrix are the eigenfaces of our dataset. These face images form a set of basis features that can be linearly combined to reconstruct images in the dataset. They are the basic characteristics of a face (a face contained in this dataset of course) and every face in this dataset can be considered to be a combination of these standard faces. Of course this concept can be applied also to images other than faces. Have a look at the first 4 (and the 20th) eigenfaces. Note that the 20th eigenface is a bit less detailed than the first 4. If you were to look at the 4000th eigenface, you would find that it is almost white noise.

Then we can project the average face onto the eigenvector space. We can do this also for a new image and use this new data, for instance, to do a classification task.
This is the plot we obtain

As you can see, the 3 average faces rank quite differently in the eigenvectors’ space. We could use the projections of a new image to perform a classification task with an algorithm trained on a database of images transformed according to PCA. By using only a few principal components we could save computational time and speed up our face recognition (or other) algorithm.
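As an illustration of that idea, the sketch below projects a new image onto a handful of eigenvectors and assigns it to the class whose projected average is closest. The nearest-average rule is only one simple choice of classifier, picked here for brevity; the function names and the toy numbers are assumptions, not part of the analysis above.

```python
import math


def project(vector, mean, components):
    """Centre `vector` with the training mean and project it onto
    each eigenvector in `components`."""
    centred = [v - m for v, m in zip(vector, mean)]
    return [sum(c * w for c, w in zip(centred, comp)) for comp in components]


def nearest_label(scores, class_scores):
    """Label whose projected average is closest (Euclidean distance)."""
    return min(
        class_scores, key=lambda label: math.dist(scores, class_scores[label])
    )


# Toy setup: 3 "pixels", 2 retained eigenvectors, 2 known faces.
mean_face = [0.0, 0.0, 0.0]
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
projected_averages = {"face_a": [0.0, 0.0], "face_b": [3.0, 5.0]}

new_scores = project([3.0, 4.0, 5.0], mean_face, eigenvectors)
print(nearest_label(new_scores, projected_averages))
```

The comparison happens in the reduced space, so each distance costs as many operations as there are retained components instead of one per pixel.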
Using R’s built-in prcomp function
For the first few times, I strongly suggest doing PCA manually, as I did above, in order to get a firm grasp of what you are actually doing with your data. Then, once you are familiar with the procedure, you can simply use R’s built-in prcomp function (or any other that you like), which, however, has the downside of calculating all the eigenvalues and eigenvectors and can therefore be a bit slow on a large dataset. Below you can find how it can be done.

I hope this post was useful.
More details at…

IRA Tweet Data
You may have heard that two researchers at Clemson University analyzed almost 3 million tweets from the Internet Research Agency (IRA) – a “Russian troll factory”. In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

files […] %>%
  arrange(desc(tf_idf))

tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf
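For readers unfamiliar with the statistic being plotted, here is a small Python sketch of tf-idf under its usual definition: term frequency within a document multiplied by the natural log of (number of documents / number of documents containing the term). Treating each account category as one "document" is an assumption based on the grouping used in the plot; the tokens below are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a document name to its list of tokens.
    Returns {document name: {word: tf-idf score}}."""
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for c in counts.values() for word in c)
    result = {}
    for name, c in counts.items():
        total = sum(c.values())
        result[name] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        }
    return result


# Invented tokens for two hypothetical account categories.
toy_docs = {
    "category_a": ["news", "local", "news", "weather"],
    "category_b": ["news", "vote", "vote", "rally"],
}
scores = tf_idf(toy_docs)
print(sorted(scores["category_b"], key=scores["category_b"].get, reverse=True))
```

A word that appears in every document gets a score of zero, which is why tf-idf surfaces the terms that set one category apart from the others.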

But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package, topicmodels. The only issue is that LDA requires a document term matrix. But we can easily convert our wordcounts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that LDA is a random technique, so we set a random number seed, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I’m going to try having it extract 6 topics. We can see how well they line up with the account types.

tweets_dtm <- wordcounts %>%
  cast_dtm(account_category, word, n)

library(topicmodels)
tweets_lda […] %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()

Based on these plots, I’d say the topics line up very well with the account categories, showing, in order: news feed, left troll, fear monger, right troll, hash gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.
More details at…

In this post, I discuss the development of the Enterprise AI business case through a framework of four quadrants. According to Gartner: “The mindset shift required for AI can lead to ‘cultural anxiety’ because it calls for a deep change in behaviors and ways of thinking”. Deployment of AI in an Enterprise is complex and multi-disciplinary. Hence, this framework is evolutionary. The vendors and initiatives listed are included to…
More details at…


Data & AI Digest for 2018-05-02

I’ve been in the process of transferring my blog (along with creating a personal website) to blogdown, which is hosted on Github Pages. The new blog, or rather, the continuation of this blog, will be at, and it went live today. I’ll be cross-posting here for a while, at least until Tal gets my […]
More details at…

Hi there! I was trying out some ways to simulate animal (or other organism) movements while taking habitat suitability into account. To do this, I used my previous eWalk model as the underlying process to simulate random or directional walks. This model is based on a Brownian / Ornstein–Uhlenbeck process. You can find more about the eWalk model here! Today, I will add one more element to these movement simulations. In this case, we will take into account the habitat or environmental preferences of the simulated species, to perform a simulation like this: First, we will create a raster layer as a random environmental variable, for example tree cover. library(raster) library(dismo) tc
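To give a rough feel for what "taking habitat into account" can mean, here is a toy Python sketch of a random walk on a grid in which each step favours neighbouring cells with higher suitability. It is not the eWalk model and does not use the raster or dismo packages; the grid, the function name and the weighting rule are all assumptions made for the illustration.

```python
import random


def habitat_walk(suitability, start, steps, seed=0):
    """Random walk on a grid where the next cell is drawn with
    probability proportional to its suitability value."""
    rng = random.Random(seed)
    n_rows, n_cols = len(suitability), len(suitability[0])
    row, col = start
    path = [(row, col)]
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(steps):
        options = [
            (row + dr, col + dc)
            for dr, dc in moves
            if 0 <= row + dr < n_rows and 0 <= col + dc < n_cols
        ]
        weights = [suitability[r][c] for r, c in options]
        if sum(weights) == 0:
            break  # every neighbouring cell is unsuitable: stop moving
        row, col = rng.choices(options, weights=weights, k=1)[0]
        path.append((row, col))
    return path


# Made-up "tree cover" grid: the right-hand column is open ground (0).
tree_cover = [
    [1.0, 0.8, 0.0],
    [0.9, 1.0, 0.0],
    [0.7, 0.6, 0.0],
]
track = habitat_walk(tree_cover, start=(0, 0), steps=50)
print(len(track), track[-1])
```

Cells with zero suitability are never entered, and the seed makes a run repeatable, which helps when comparing tracks across different habitat layers.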
More details at…

Summary: Our starting assumption that sequence problems (language, speech, and others) are the natural domain of RNNs is being challenged. Temporal Convolutional Nets (TCNs), which are our workhorse CNNs with a few new features, are outperforming RNNs on major applications today. It looks like RNNs may well be history.

More details at…


Data & AI Digest for 2018-04-27

Today, we are excited to announce the availability of the OS Disk Swap capability for VMs using Managed Disks. Until now, this capability was only available for Unmanaged Disks. With this capability, it becomes very easy to restore a previous backup of the OS Disk or swap out the OS Disk for VM troubleshooting without having to delete the VM.
More details at…

Today we’re sharing the public preview of per disk metrics for all Managed & Unmanaged Disks. This enables you to closely monitor and make the right disk selection to suit your application usage…
More details at…

We illustrate pattern recognition techniques applied to an interesting mathematical problem: The representation of a number in non-conventional systems, generalizing the familiar base-2 or base-10 systems. The emphasis is on data science rather than mathematical theory, and the style is that of a tutorial, requiring minimum knowledge in mathematics or statistics. However, some off-the-beaten-path, state-of-the-art number theory research is discussed here, in a way that is accessible to…
More details at…


Data & AI Digest for 2018-04-26

Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. Azure Cosmos DB enables you to elastically and independently scale throughput and storage across any number of Azure’s geographic regions with a single click. It offers throughput, latency, availability, and consistency guarantees with comprehensive service level agreements (SLAs), a feature that no other database service can offer.
More details at…

A quick note for displaying R htmlwidgets in Jupyter notebooks without requiring pandoc – there may be a more native way but this acts as a workaround in the meantime if not: PS and from the other side, using reticulate for Python powered Shiny apps.
More details at…

Our onboarding reviews,
that ensure that packages contributed by the community undergo a
transparent, constructive, non adversarial and open review process, take
place in the issue tracker of a GitHub repository. Development of the
packages we onboard also takes place in the open, most often in GitHub
repositories.

Therefore, when wanting to get data about our onboarding system for
giving a data-driven overview, my mission was to extract data from
GitHub and git repositories, and to put it into nice rectangles (as
defined by Jenny
Bryan) ready for
analysis. You might call that the first step of a “tidy git analysis”
using the term coined by Simon […].

So, how did I collect data?

A side-note about GitHub

In the following, I’ll mention repositories. All of them are git
repositories, which means they’re folders under version control, where
roughly said all changes are saved via commits and their messages (more
or less) describing what’s been changed in the commit. Now, on top of
that these repositories live on GitHub which means they get to enjoy
some infrastructure such as issue trackers, milestones, starring by
admirers, etc. If that ecosystem is brand new to you, I recommend
reading this book, especially its big
picture chapter.

Package review processes: weaving the threads

Each package submission is an issue thread in our onboarding repository,
see an example
here. The first
comment in that issue is the submission itself, followed by many
comments by the editor, reviewers and authors. On top of all the data
that’s saved there, mostly text data, we have a private
Airtable workspace where we have a table of
reviewers and their reviews, with direct links to the issue comments
that are reviews.

Getting issue threads

Unsurprisingly, the first step here was to “get issue threads”. What do
I mean? I wanted a table of all issue threads, one line per comment,
with columns indicating the time at which something was written, and
columns digesting the data from the issue itself, e.g. guessing the role
of the commenter from other information: the first user of the issue
is the “author”.
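The target shape, one row per comment with a guessed role, can be sketched in Python as below. The nested input structure and its field names (number, comments, user, created_at) are hypothetical stand-ins for whatever the API returns, and the users and timestamps are invented.

```python
def thread_rows(issue):
    """Flatten one issue thread into a list of rows, one per comment.

    The first commenter is taken to be the package author; everybody
    else is labelled "other" in this simplified version.
    """
    comments = issue["comments"]
    author = comments[0]["user"] if comments else None
    return [
        {
            "issue": issue["number"],
            "user": comment["user"],
            "created_at": comment["created_at"],
            "role": "author" if comment["user"] == author else "other",
        }
        for comment in comments
    ]


# Invented thread: user names and timestamps are placeholders.
toy_issue = {
    "number": 1,
    "comments": [
        {"user": "alice", "created_at": "2018-01-01T10:00:00Z"},
        {"user": "bob", "created_at": "2018-01-02T09:30:00Z"},
        {"user": "alice", "created_at": "2018-01-03T14:15:00Z"},
    ],
}
for row in thread_rows(toy_issue):
    print(row)
```

Mapping this function over all issues and concatenating the results gives the rectangle described above; further roles (editor, reviewer) would need extra information such as the reviewers table.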

I used to use GitHub API V3 and then heard about GitHub API
V4 which blew my mind. As if I
weren’t impressed enough by the mere existence of this API and its

I discovered the rOpenSci ghql
package allows one to interact
with such an API and that its docs actually use GitHub API V4 as an

Carl Boettiger told me about his way to rectangle JSON
using jq, a language for
processing JSON, via a dedicated rOpenSci package,
I have nothing against GitHub API V3 and
gh and purrr workflows, but I was
curious and really enjoyed learning these new tools and writing this
code. I had written gh/purrr code for getting the same information
and it felt clumsier, but it might just be because I wasn’t
perfectionist enough when writing it! I managed to write the correct
GitHub V4 API query to get just what I needed by using its online
explorer. I then succeeded
in transforming the JSON output into a rectangle by reading Carl’s post
but also by taking advantage of another online explorer, jq
play where I pasted my output via
writeClipboard. That’s nearly always the way I learn about query
tools: using some sort of explorer and then pasting the code into a
script. When I am more experienced, I can skip the explorer part.

The first function I wrote was one for getting the issue number of the
last onboarding issue, because then I looped/mapped over all issues.

# function to get number of last issue
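As a rough Python illustration of this step (not the R function from the post), the sketch below builds a GitHub API V4 (GraphQL) request body asking for the most recently created issue of a repository and extracts the number from a response. No request is actually sent: the owner and repository names are placeholders, and the sample response is hand-written with a made-up issue number.

```python
import json

# GraphQL query for the most recently created issue of one repository.
LAST_ISSUE_QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    issues(first: 1, orderBy: {field: CREATED_AT, direction: DESC}) {
      nodes { number }
    }
  }
}
"""


def build_payload(owner, name):
    """JSON body to POST to https://api.github.com/graphql."""
    return json.dumps(
        {"query": LAST_ISSUE_QUERY, "variables": {"owner": owner, "name": name}}
    )


def last_issue_number(response_text):
    """Extract the issue number from the JSON text of the API response."""
    nodes = json.loads(response_text)["data"]["repository"]["issues"]["nodes"]
    return nodes[0]["number"] if nodes else None


# Hand-written stand-in for a real response; the number is made up.
sample_response = (
    '{"data": {"repository": {"issues": {"nodes": [{"number": 123}]}}}}'
)
payload = build_payload("some-owner", "some-repo")
print(last_issue_number(sample_response))
```

Once the last issue number is known, looping or mapping over 1 to that number covers every thread, as described above.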
More details at…

On March the 17th I had the honor to give a keynote talk about rOpenSci’s package onboarding system at the satRday conference in Cape Town, entitled “Our package reviews in review: introducing and analyzing rOpenSci onbo…
More details at…

[Disclaimer: I received this book by Coryn Bailer-Jones for a review in the International Statistical Review and intend to submit a revised version of this post as my review. As usual, book reviews on the ‘Og reflect my own definitely personal and highly subjective views on the topic!] It is always a bit of […]
More details at…

Data Dictionary to Meta Data III is the third and final blog devoted to demonstrating the automation of meta data creation for the American Community Survey 2012-2016 household data set, using a published data dictionary. DDMDI was a teaser to show how Python could be used to generate R statements that…
More details at…


Markov, Trump and Counter Radicalization

Here’s a video of a talk I did at DDD12. It’s more a functional demo of an idea really.

Posted in Uncategorized

Text Analysis of GE17 Manifestos

I had a quick look at the manifestos of the main parties today, so I thought I’d jot down a few remarks here.

So the first thing I did was to remove all the stop words and then run a frequency distribution across the remaining text, which yielded the following result:
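A minimal Python sketch of that first step is shown below. The stop-word list is a tiny hand-picked set for illustration only (a real analysis would use a fuller list from a text-processing library), and the sample sentence is invented rather than quoted from any manifesto.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "we", "will", "for", "our", "is",
}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lower-case the text, drop stop words and count what remains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


sample = "We will invest in the NHS and the economy, and we will protect the NHS."
print(word_frequencies(sample).most_common(3))
```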








That done, I ran a quick trigram collocation across the text. This finds groups of three words which have a low probability of being next to each other by accident, or just by the nature of the English language. Having found groupings of words, I then took a frequency distribution over them and found the most frequent three-word groups. This can help us get a good feel for what ideas are important to the authors of the manifestos. The results are below:








That’s it for now; I’ll post again if I get time to do a little more analysis.

Posted in Data Science