6 Nations – Scotland vs England 2020

So the bookies give England a 75% probability of winning their #6nations match against Scotland tomorrow. Hmm, it may well be worth a punt on Scotland at those odds. England's win/loss record at Murrayfield is 50/50, and last week's scoreline against France flattered England: May's two individual tries made the score more respectable than the performance deserved. I've done the analysis and England only have a 3% edge. That said, they outscore Scotland in every area except the back of the scrum, fullback and the right wing – and that's concerning. It's even more reason to lament the loss of Russell. Hastings is the better (read: more consistent) fly half, but with only a 3% edge to overcome, the brilliance that Russell could have brought to the game may well have won it for us. In conclusion, I think England will win (sadly), but it's nowhere near as likely as the bookies think.
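The arithmetic behind "worth a punt" can be sketched in a few lines. Two assumptions for illustration: the bookies' 75% on England implies decimal odds of about 4.0 on Scotland (ignoring the overround), and "a 3% edge" is read as my model giving England 53%, so Scotland 47%:

```python
# Assumed numbers, not from the bookies' actual prices:
bookie_p_scotland = 0.25   # 1 - 0.75, overround ignored
my_p_scotland = 0.47       # reading "3% edge" as England 53 / Scotland 47

decimal_odds = 1 / bookie_p_scotland            # 4.0: stake returned x4 on a win
ev_per_unit = my_p_scotland * decimal_odds - 1  # expected profit per unit staked

print(round(ev_per_unit, 2))  # positive means a value bet
```

Under those assumptions the expected profit is 0.88 units per unit staked, which is why long odds on a not-so-unlikely underdog can be a value bet even when the favourite is still more likely to win.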

Posted in Data Science, Statistics | Tagged , | Leave a comment

Three Pointer Robot!

Dunk! Dunk!


Posted in Technology, Uncategorized | Leave a comment

Digital Analytical Exchange programme – Now open for bids

Updates from the Scottish Government’s Digital Directorate
— Read on blogs.gov.scot/digital/2019/01/24/analytical-exchange-programme-now-open-for-bids/

Posted in Uncategorized | Leave a comment

Data & AI Digest for 2018-10-06

At rOpenSci we are developing a suite of packages that expose powerful graphics and imaging libraries in R. Our latest addition is av – a new package for working with audio/video, based on the FFmpeg AV libraries. This ambitious new project will become the video counterpart of the magick package, which we use for working with images.


The package can be installed directly from CRAN and includes a test function av_demo() which generates a demo video from random histograms.

Why AV in R?

One popular application is animating graphics by combining a sequence of graphics into a video. The animation and gganimate packages have many great examples. However, until now these packages had to shell out to external software (such as the ffmpeg command-line program) to generate the video. This process is inefficient and error-prone, and requires that the correct version of the external software is installed on the user/server machines, which is often not the case.

The av package takes away this technical burden. It uses the same libraries as FFmpeg, however because we interface directly to the C API, there is no need to shell out or install utilities. Everything we need is linked into the R package, which means that if the package is installed, it always works.

FFmpeg provides a full-featured video editing library, and now that the core package is in place, we can take things a step further. For example you can already enhance an animation with an audio track (to narrate what is going on or show off your karaoke skills) or apply one of the 100+ built-in video filters. In future versions we also want to add things like screen capturing and reading raw video frames and audio samples for analysis in R.

Create video from images

The av_encode_video() function converts a set of images (png, jpeg, etc.) into a video file with a custom container format, codec, fps, and filters. The video format is determined from the file extension (mp4, mkv, flv, gif). The av package supports all popular codecs and muxers (codecs compress the raw audio/video; a muxer is the container format that interleaves one or more audio and video streams into a file).

# Create some PNG images of random histograms
png("input%03d.png", width = 1280, height = 720, res = 108)
for (i in 1:10) {
  hist(rnorm(100), main = paste("Frame", i))
}
dev.off()
More details at…

Here are some conferences focused on R taking place in the next few months: Oct 26: Nor’eastR Conference (Providence, RI). A one-day R conference, organized by grassroots R community members in the Northeastern US. Oct 27: SatRdays Belgrade (Serbia). Another city joins the popular SatRdays series of one-day, community-led conferences. Nov 7: EARL Seattle (Seattle, WA). The EARL London conference goes on a Stateside road trip, with a 1-day stop in Seattle. Nov 8-9: DC R Conference (Washington, DC). From the same organizers as the fantastic New York R Conference, this DC-based outpost will feature speakers including Mara Averick,…
More details at…

This started out as a “hey, I wonder…” sort of thing, but as usual, they tend to end up as interesting voyages into the deepest depths of code, so I thought I’d write it up and share. Shoutout to @coolbutuseless…Continue Reading →
More details at…

Mango Solutions are pleased to be exhibiting at ACoP 9 on 7-10 October at the Loews Coronado Bay Resort near San Diego, CA, and welcome our customers and conference attendees to booth 27, where we will be showcasing our PK/PD model-informed drug development tool, Navigator Workbench. Navigator Workbench provides a powerful, validated platform for PK/PD model development, execution, evaluation and reporting tasks. Alongside Mango’s ModSpace product – a proven document, model and code repository which supports cross-functional teams…
More details at…

Posted in Data & AI Digest | Tagged | Leave a comment

Data & AI Digest for 2018-09-26

In my recent conversations with customers, they have shared the security challenges they are facing on-premises. These challenges include recruiting and retaining security experts, quickly responding to an increasing number of threats, and ensuring that their security policies are meeting their compliance requirements.
More details at…

We are excited to preview a set of Azure Resource Manager Application Program Interfaces (ARM APIs) to view cost and usage information in the context of a management group for Enterprise Customers. Azure customers can utilize management groups today to place subscriptions into containers for organization within a defined business hierarchy.
More details at…

This week at Microsoft Ignite 2018, we are excited to announce seven new features in Azure Stream Analytics (ASA).
More details at…

Today, we are excited to announce the preview of the new Azure HDInsight Management SDK. This preview SDK brings support for new languages and can be used to easily manage your HDInsight clusters.
More details at…

Today I am excited to announce the private preview of Azure VM Image Builder, a service which allows users to have an image building pipeline in Azure. Creating standardized virtual machine (VM) images allow organizations to migrate to the cloud and ensure consistency in the deployments. Users commonly want VMs to include predefined security and configuration settings as well as application software they own.
More details at…

We are happy to share the release of additions and enhancements to Bing Custom Search. Bing Custom Search is an easy-to-use, ad-free search solution that enables users to build a search experience and query content on their specific site, or across a hand-picked set of websites or domains.
More details at…

Today we are extending Try Cosmos DB to 30 days free! Try Cosmos DB allows anyone to play with Azure Cosmos DB, with no Azure sign-up required and at no charge, for 30 days, with the ability to renew an unlimited number of times.
More details at…

SQL Information Protection policy can now be managed in Azure Security Center. The policy management enables centrally defining customized classification and labeling policies that will be applied across all databases on your tenant.
More details at…

Azure Stack features a growing independent software vendor (ISV) community that operates across a broad spectrum of environments, empowering you to create compelling and powerful solutions.
More details at…

Earlier this year, we released the public preview of Azure Serial Console for Virtual Machines (VMs). Today we are announcing the general availability of Serial Console and we’ve added numerous features and enhanced performance to make Serial Console an even better tool for systems administrators, IT administrators, DevOps engineers, VM administrators, and systems engineers.
More details at…

Recently I began to look further into Time Series (TS). During the course of my Master’s degree, I used the forecast package quite a bit (thanks to Prof. Hyndman), and TS got my attention. So, after reading lots of publications about everything you can imagine about TS, I came across one publication from Prof. Eamonn, of the University of … Continue reading Time Series with Matrix Profile
More details at…

Here is the course link.
Course Description
Linear regression serves as a workhorse of statistics, but cannot handle some types of complex data. A generalized linear model (GLM) expands upon linear regression to include non-normal distributions inc…
More details at…

Package developers relaxed a bit in August; only 160 new packages went to CRAN that month. Here are my “Top 40” picks organized into seven categories: Data, Machine Learning, Science, Statistics, Time Series, Utilities, and Visualization.


nsapi v0.1.1: Provides an interface to the Nederlandse Spoorwegen (Dutch Railways) API, allowing users to download current departure times, disruptions and engineering work, the station list, and travel recommendations from station to station. There is a vignette.

repec v0.1.0: Provides utilities for accessing RePEc (Research Papers in Economics) through a RESTful API. You can request an access code and get detailed information here.

rfacebookstat v1.8.3: Implements an interface to the Facebook Marketing API, allowing users to load data by campaigns, ads, ad sets, and insights.

UCSCXenaTools v0.2.4: Provides access to data sets from UCSC Xena data hubs, which are a collection of UCSC-hosted public databases.

ZipRadius v1.0.1: Generates a data frame of US zip codes and their distance to the given zip code, when given a starting zip code and a radius in miles. Also includes functions for use with choroplethrZip, which are detailed in the vignette.

Machine Learning

dials v0.0.1: Provides tools for creating model parameters that cannot be directly estimated from the data. There is a vignette.

tosca v0.1-2: Provides a framework for statistical analysis in content analysis. See the vignette for details.

tsmap v0.3.1: Implements the Matrix Profile concept for classification.


DSAIRM v0.4.0: Provides a collection of Shiny apps that implement dynamical systems simulations to explore within-host immune response scenarios. See the package Tutorial.

epiflows v0.2.0: Provides functions and classes designed to handle and visualize epidemiological flows between locations, as well as a statistical method for predicting disease spread from flow data initially described in Dorigatti et al. (2017). For more information, see the RECON toolkit for outbreak analysis. There is an Overview and a vignette on Data Preparation.

fieldRS v0.1.1: Provides functions for remote-sensing field work using best practices suggested by Olofsson et al. (2014). See the vignette for details.

Rnmr1D v1.2.1: Provides functions to perform the complete processing of proton nuclear magnetic resonance spectra from the free induction decay raw data. For details see Jacob et al. (2017) and the vignette.


bcaboot v0.2-1: Provides functions to compute bootstrap confidence intervals in an almost automatic fashion. See the vignette.

bivariate v0.2.2: Contains convenience functions for constructing and plotting bivariate probability distributions. See the vignette for details.

DesignLibrary v0.1.1: Provides a simple interface to build designs and allow users to compare performance of a given design across a range of combinations of parameters, such as effect size, sample size, and assignment probabilities. Look here for more information.

doremi v0.1.0: Provides functions to fit the dynamics of a regulated system experiencing exogenous inputs using differential equations and linear mixed-effects regressions to estimate the characteristic parameters of the equation. See the vignette.

eikosograms v0.1.1: An eikosogram (probability picture, from the ancient Greek εἰκός – likely or probable) divides the unit square into rectangular regions whose areas, sides, and widths represent various probabilities associated with the values of one or more categorical variates. For a discussion of the eikosogram and its superiority to Venn diagrams in teaching probability, see Cherry and Oldford (2003), and for a discussion of its value in exploring conditional independence structure and its relation to graphical and log-linear models, see Oldford (2003). There is an Introduction and vignettes on Data Analysis and Independence Relations.

localIV v0.1.0: Provides functions to estimate marginal treatment effects using local instrumental variables. See Heckman et al. (2006) and Zhou and Xie (2018) for background.

merlin v0.0.1: Provides functions to fit linear, non-linear, and user-defined mixed effects regression models following the framework developed by Crowther (2017). See the vignette for details.

MRFcov v1.0.35: Provides functions to approximate node interaction parameters of Markov Random Fields graphical networks. The general methods are described in Clark et al. (2018). There are vignettes on Preparing Datasets, Gaussian and Poisson Fields, and an example using Bird parasite data.

SCPME v1.0: Provides functions to estimate a penalized precision matrix via an augmented ADMM algorithm as described in Molstad and Rothman (2018). There is a Tutorial and a vignette describing Algorithm Details.

survxai v0.2.0: Contains functions for creating a unified representation of survival models, which can be further processed by various survival explainers. There are vignettes on Local explanations, global explanations, comparing models, and on a custom prediction function.

Time Series

hpiR v0.2.0: Provides functions to compute house price indexes and series, and evaluate index goodness based on accuracy, volatility and revision statistics. For the background on model construction, see Case and Quigley (1991), and for hedonic pricing models, see Bourassa et al. (2006). There is an introduction to the package and a vignette on Classes.

STMotif v0.1.1: Provides functions to identify motifs (previously identified sub-sequences) in spatial-time series. There are vignettes on motif discovery, examples, candidate generation, and candidate validation.

trawl v0.2.1: Contains functions for simulating and estimating integer-valued trawl processes as described in Veraart (2018), and for simulating random vectors from the bivariate negative binomial and the bi- and trivariate logarithmic series distributions. There is a vignette on trawl processes, and another on the binomial distributions.


arkdb v0.0.3: Provides functions for exporting tables from relational database connections into compressed text files, and streaming those text files back into a database without requiring the whole table to fit in working memory. See the vignette for a tutorial.

aws.kms v0.1.2: Implements an interface to AWS Key Management Service, a cloud service for managing encryption keys. See the README for details.

DatapackageR v0.15.3: Provides a framework to help construct R data packages in a reproducible manner. It maintains data provenance by turning the data-processing scripts into package vignettes, as well as enforcing documentation and version checking of included data objects. There is a Guide to using the package, and a vignette on YAML configuration.

hedgehog v0.1: Enables users to test properties of their programs against randomly generated input, providing far superior test coverage compared to unit testing. There is a general tutorial and a description of the Hedgehog state machine.

jsonstat v0.0.2: Implements an interface to JSON-stat, a simple, lightweight ‘JSON’ format for data dissemination. There is a short quickstart guide.

nseval v0.4: Provides an API for Lazy and Non-Standard Evaluation with facilities to capture, inspect, manipulate, and create lazy values (promises), “…” lists, and active calls. See README.

runner v0.1.0: Provides running functions (windowed, rolling, cumulative) with varying window size and missing handling options for R vectors. See the vignette for details.

RTest v1.1.9.0: Provides an XML-based testing framework for automated component tests of R packages developed for a regulatory environment. There is a short vignette.

sparkbq v0.1.0: Extends sparklyr by providing integration with Google BigQuery. It supports direct import/export from/to BigQuery, as well as intermediate data extraction from Google Cloud Storage. See README.

vapour v0.1.0: Provides low-level access to GDAL, the Geospatial Data Abstraction Library. There is a vignette.


mapdeck v0.1.0: Provides a mechanism to plot interactive maps using Mapbox GL, a JavaScript library for interactive maps, and Deck.gl, a JavaScript library which uses WebGL for visualizing large data sets. The vignette explains how to use the package.

rayshader v0.5.1: Provides functions that use a combination of raytracing, spherical texture mapping, lambertian reflectance, and ambient occlusion to produce hillshades of elevation matrices. Includes water-detection and layering functions, programmable color palette generation, built-in textures, 2D and 3D plotting options, and more. See README for details and examples.

sigmajs v0.1.1: Provides an interface to the sigma.js graph-visualization library, including animations, plugins, and shiny proxies. There is a brief Get Started Guide, and vignettes on Animation, Buttons, Coloring by Cluster, Dynamic graphs, igraph & gexf, Layout, Plugins, Settings, Shiny, and Crosstalk.

survsup v0.0.1: Implements functions to plot survival curves. The vignette provides examples.

tidybayes v1.0.1: Provides functions for composing data and extracting, manipulating, and visualizing posterior draws from Bayesian models (JAGS, Stan, rstanarm, brms, MCMCglmm, coda, …) in a tidy data format. There is a vignette on Using tidy data with Bayesian Models, and vignettes for brms and rstanarm models.
More details at…

Posted in Data & AI Digest | Tagged | Leave a comment

Dataset Search: Google launches new search engine to help scientists find datasets – The Verge

Google is launching a new service for scientists, journalists, and anyone else looking to track down data online. It’s called Dataset Search, and it will hopefully unify the fragmented world of open data repositories.
— Read on www.theverge.com/2018/9/5/17822562/google-dataset-search-service-scholar-scientific-journal-open-data-access

Posted in Data Science, Uncategorized | Leave a comment

Data & AI Digest for 2018-09-05

Blue Bikes is a bicycle-sharing system in Boston, Massachusetts. The bike-sharing program started on 28 July 2011 and lets individuals hire a bike on a short-term basis for a fee: you borrow a bike from one dock station and return it to another after use. […]
Related Post
Analysis of Los Angeles Crime with R
Mapping the Prevalence of Alzheimer Disease Mortality in the USA
Animating the Goals of the World Cup: Comparing the old vs. new gganimate and tweenr API
Machine Learning Results in R: one plot to rule them all! (Part 1 – Classification Models)
Seaborn Categorical Plots in Python

More details at…

You might have read my blog post analyzing the social weather of … based on a text analysis of GitHub issues. I extracted text out of Markdown-formatted threads with regular expressions. I basically hammered away at the issues using tools I was familiar with until it worked! Now I know there’s a much better and cleaner way, which I’ll present in this note. Read on if you want to extract insights about text, code, links, etc. from R Markdown reports, Hugo website sources, GitHub issues… without writing messy and smelly code!

Introduction to Markdown rendering and parsing

This note will appear to you, dear reader, as an html page, either here on ropensci.org or on R-Bloggers, but I’m writing it as an R Markdown document, using Markdown syntax. I’ll knit it to Markdown, and then Hugo’s Markdown processor, Blackfriday, will transform it to html. Elements such as # blabla thus get transformed to <h1>blabla</h1>. Awesome!

The rendering of Markdown to html or XML can also be used as a way to parse it, which is what the spelling package does in order to identify the text of R Markdown files, so that only the text, not the code, gets spell-checked. I had an aha moment when seeing this spelling strategy: why did I ever use regex to parse Markdown for text analysis?! Transforming it to XML first, and then using XPath, would be much cleaner!
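The XML-then-XPath idea sketched here is language-agnostic. Below is a minimal, self-contained Python illustration in which a hand-written XML tree stands in for the parser's output; the element names (paragraph, code_block, text) are assumptions made to mimic commonmark-style output, not its exact schema:

```python
import xml.etree.ElementTree as ET

# Toy stand-in for Markdown parsed to XML: prose lives in <paragraph>
# nodes, code in <code_block> nodes (names assumed for illustration).
doc = ET.fromstring("""
<document>
  <paragraph><text>We just released a new version.</text></paragraph>
  <code_block>install.packages("rgbif")</code_block>
  <paragraph><text>First, install and load taxize.</text></paragraph>
</document>
""")

# XPath-style queries replace regex hacks: pull prose and code separately.
prose = [t.text for t in doc.findall(".//paragraph/text")]
code = [c.text for c in doc.findall(".//code_block")]
```

The point is the structure: once the document is a tree, "give me all the text but none of the code" is one query rather than a pile of regular expressions.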

As a side-note, realizing how to simplify my old code made me think of Jenny Bryan’s inspiring useR! keynote talk about code smells. I asked her whether code full of regular expressions instead of dedicated parsing tools was a code smell. Sadly it doesn’t have a specific name, but she confirmed my feeling that not using dedicated, purpose-built tools might mean you’ll end up “re-inventing all of that logic yourself, in hacky way.” If you have code falling under the definition below, maybe try to re-factor and if needed get

It’s that feeling when you want to do something that sounds simple but
instead your code is like 10 stack overflow snippets slapped together
that you could never explain to another human what they do 😰

— Dr. Alison Hill (@apreshill), August 2018

From Markdown to XML

In this note I’ll use my local fork of rOpenSci’s website source, and
use all the Markdown sources of blog posts as example data. The chunk
below is therefore not portable, sorry about that.

roblog %>%
  commonmark::markdown_xml(extensions = TRUE) %>%
  xml2::read_xml()

See what it gives me for one post.


## {xml_document}
## [1] <paragraph>\n We just released a new version of \n …
## [2] <paragraph>\n First, install and load taxize\ …
## [3] <code_block>install.packages("rgbif")\n</code_block>
## [4] <code_block>library(taxize)\n</code_block>
## [5] <heading>\n New things\n</heading>
## [6] <heading>\n New functions: class2tree\n</heading>
More details at…

If you want to do statistical analysis or machine learning with data in SQL Server, you can of course extract the data from SQL Server and then analyze it in R or Python. But a better way is to run R or Python within the database, using Microsoft ML Services in SQL Server 2017. Why? It’s faster. Not only do you get to use the SQL Server instance (which is likely to be faster than your local machine), but it also means you no longer have to transport the data over a network, which is likely to be the biggest…
More details at…

Week 1 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!
The post Gold-Mining W1 (2018) appeared first on Fantasy Football Analytics.
More details at…

Posted in Data & AI Digest | Tagged | Leave a comment

Data & AI Digest for 2018-08-29

The post Microsoft 365 is the smartest place to store your content appeared first on The AI Blog.
More details at…

The post AI and fish farming: High-tech help for a sushi and sashimi favorite in Japan appeared first on The AI Blog.
More details at…

Digitization of healthcare information, EHR systems, precision and personalized medicine, health information exchange, consumer health, Internet of Medical Things (IoMT), and other major trends affecting healthcare are accelerating this data growth rate.
More details at…

The videos from the NYC R conference have been published, and there are so many great talks there to explore. I highly recommend checking them out: you’ll find a wealth of interesting R applications, informative deep dives on using R (and a few other applications as well), and some very entertaining deliveries. In this post, I wanted to highlight a couple of talks in particular. The talk by Jonathan Hersh (Chapman University), Applying Deep Learning to Satellite Images to Estimate Violence in Syria and Poverty in Mexico is both fascinating and a real technical achievement. In the first part of…
More details at…

Hi everyone! In this series we are going to work with a gene called the Mitochondrial Control Region. This series will be about getting some insights into it from sequence analysis. We will go through the entire process, from getting the data to reaching some conclusions and, more importantly, raising some questions. All of the bash […]
More details at…

Welcome to Part Two of the three-part tutorial series on proteomics data analysis. The ultimate goal of this exercise is to identify proteins whose abundance is different between the drug-resistant cells and the control. In other words, we are looking for a list of differentially regulated proteins that may shed light on how cells escape […]
Related Post
Clean Your Data in Seconds with This R Function
Hands-on Tutorial on Python Data Processing Library Pandas – Part 2
Hands-on Tutorial on Python Data Processing Library Pandas – Part 1
Using R with MonetDB
Recording and Measuring Your Musical Progress with R
More details at…

A collection of some commonly used and some newly developed methods for the visualization of outcomes in oncology studies includes Kaplan-Meier curves, forest plots, funnel plots, violin plots, waterfall plots, spider plots, swimmer plots, heatmaps, circos plots, transit map diagrams and network analysis diagrams (reviewed here). Previous articles in this blog presented an introduction to … Continue reading Visualization of Tumor Response – Spider Plots
More details at…

Posted in Data & AI Digest | Tagged | Leave a comment

Data & AI Digest for 2018-08-14

Keep current on what’s happening in Azure, including what’s now in preview, generally available, news & updates, and more.
More details at…

We continue to expand the Azure Marketplace ecosystem. From July 1 to 15, 98 new offers successfully met the onboarding criteria and went live.
More details at…

You can accelerate your cloud migration using intelligent migration assessment services like Azure Migrate. Azure Migrate is a generally available service, offered at no additional charge, that helps you plan your migration to Azure.
More details at…

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes. The graph is using a log-log scale (so things are very compressed). But data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below. Notice on the … Continue reading data.table is Really Good at Sorting
More details at…

Introduction As part of our effort to provide users ways to replicate our analyses and improve their performance in fantasy football, we are continuously looking at ways we can improve[…]
The post 2018 Update of ffanalytics R Package appeared first on Fantasy Football Analytics.
More details at…

The development version of splashr now support authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an… Continue reading →
More details at…

I love Arkham Horror: The Card Game. I love it more than I really should; it’s ridiculously fun. It’s a cooperative card game where you build a deck representing a character in the Cthulhu mythos universe, and with that deck you play scenarios in a narrative campaign where you grapple with the horrors of the mythos. Your actions in (and between) scenarios have repercussions for how the campaign plays out, changing the story, and you use experience points accumulated in the scenarios to upgrade your deck with better cards.
More details at…

Model interpretability is critical to businesses. If you want to use high performance models (GLM, RF, GBM, Deep Learning, H2O, Keras, xgboost, etc), you need to learn how to explain them. With machine learning interpretability growing in importance, s…
More details at…

Posted in Data & AI Digest | Tagged | Leave a comment

Data & AI Digest for 2018-08-13

More details at…

A Deluge of Content Over the last two decades, the accessibility of media has increased dramatically. One of the main reasons for this surge in available content is the evolution of online streaming platforms. Services like Netflix, Hulu, Amazon Prime, and others enable access for millions of consumers to a seemingly countless number of movies […]
More details at…

Principal component analysis (PCA) is a dimensionality-reduction technique which might come in handy when building a predictive model or in the exploratory phase of your data analysis. It is often the case that when it would be most handy you have forgotten it exists, but let’s neglect this aspect for now 😉

I decided to write this post mainly for two reasons:

I had to bring some order to my thinking about the terminology used, and complain about a few things.
I wanted to try to use PCA in a meaningful example.

What does it do?
There are a lot of ways to explain what PCA does, and there are a lot of good explanations online. I highly suggest you google around a bit on the topic.
PCA looks for a new reference system to describe your data, designed in such a way as to maximize the variance of the data along the new axes. The first principal component accounts for as much variance as possible, as does the second, and so on. PCA transforms a set of (typically) correlated variables into a set of uncorrelated variables called principal components. By design, each principal component accounts for as much of the remaining variance as possible. The hope is that a few PCs can be used to summarise the whole dataset. Note that the PCs are linear combinations of the original variables.
The procedure boils down to the following steps:

Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).
Calculate the covariance matrix of the data.
Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.
Choose only the first N biggest eigenvalues according to one of the many criteria available in the literature.
Project your data into the new frame of reference by multiplying your data matrix by a matrix whose columns are the N eigenvectors associated with the N biggest eigenvalues.
Use the projected data (very confusingly called “scores”) as your new variables for further analysis.
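For concreteness, the steps above can be sketched with NumPy on toy data. This is a generic illustration of the recipe, not the code used later for the faces dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                             # 200 samples, 3 variables
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # make two variables correlated

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # 1. scale (normalize) the data
C = np.cov(Xs, rowvar=False)                # 2. covariance matrix
evals, evecs = np.linalg.eigh(C)            # 3. eigenvalues and eigenvectors (loadings)
order = np.argsort(evals)[::-1]             # biggest eigenvalues first
evals, evecs = evals[order], evecs[:, order]

N = 2                                       # 4. keep the first N components
scores = Xs @ evecs[:, :N]                  # 5./6. projected data ("scores")
explained = evals[:N].sum() / evals.sum()   # share of total variance retained
```

Because two of the three toy variables are nearly copies of each other, the first two components capture almost all of the variance, which is exactly the situation where PCA pays off.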

Everything clear? No? Ok don’t worry, it takes a bit of time to sink in. Check the practical example below.
Use case: why should you use it?
The best use cases I can find at the top of my mind are the following:

Redundant data: you have gathered a lot of data on a system and you suspect most of it is not necessary to describe it. Example: a point moving along a straight line (say, the X axis) with some noise, but you have somehow also collected data on the Y and Z axes. Note that PCA doesn’t get rid of the redundant data; it simply constructs new variables that summarise the original data.

Noise in your data: the data you have collected is noisy and there are some correlations that might annoy you later.
The data is heavily redundant and correlated. You think out of the 300 variables you have, 3 might be enough to describe your system.

In all three cases, PCA might help you get closer to your goal of improving the results of your analysis.
Use case: when should you use it?
PCA is a technique used in the exploratory phase of data analysis. Personally, I would use it only in one of the three circumstances above, at the beginning of my analysis, after having confirmed through basic descriptive statistics and some field knowledge that I am indeed in one of those use cases. I would probably also use it when trying to improve a model that relies on independence of the variables.
Not use case: when should you not use it?
PCA is of little use when your variables are mostly independent or, I would argue, when you have very few of them. Personally, I would also avoid it when a clean, direct interpretation of the model is required: PCA is not a black box, but its output is a) difficult to explain to non-technical people and b) in general not easy to interpret.
Example of application: the Olivetti faces dataset
The Olivetti faces dataset is a collection of 64×64-pixel greyscale images of 40 faces, with 10 images per face. It is available in Scikit-learn, and it is a good example dataset for showing what PCA can do. I think it is a better cheap example than the iris dataset, since it can show you why you would use PCA.
First, we need to download the data using Python and Scikit-Learn.
Then, we need to shape the data in a way that is acceptable for our analysis: features (also known as variables) should be the columns, and samples (also known as observations) should be the rows. In our case the data matrix should have size 400×4096. Row 1 would represent image 1, row 2 image 2, and so on… To achieve this, I used this simple R script with a for loop.
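For comparison, the same reshaping in Python is a one-liner; here a random array stands in for the downloaded faces (40 subjects, 10 images each):

```python
import numpy as np

# Stand-in for the Olivetti data: 400 images of 64x64 greyscale pixels
images = np.random.rand(400, 64, 64)

# Samples as rows, features (pixels) as columns
D = images.reshape(400, 64 * 64)
print(D.shape)  # (400, 4096)
```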
The images in the dataset are all head shots with mostly no angle. This is an example of the first face (label 0)

A very simple thing to do is to look at what the average face looks like. We can do this using dplyr, grouping by label and then taking the average of each pixel. Below you can see the average face for labels 0, 1 and 2.

Then it is time to actually apply PCA to the dataset.
Since the covariance matrix has dimensions 4096×4096 and is real and symmetric, we get 4096 eigenvalues and 4096 corresponding eigenvectors. I decided (after a few trials) to keep only the first 20, which account for roughly 76.5% of the total variance in the dataset. As you can see from the plot below, the magnitude of the eigenvalues drops off pretty quickly.

A few notes:

The eigenvalues of the covariance matrix represent the variance of the principal components.
Total variance is preserved after a rotation (i.e. what PCA is at its core).
The variable D_new contains the new variables in the reduced feature space. Note that by using only 20 variables (the 20 principal components) we can capture 76.5% of the total variance in the dataset. This is quite remarkable, since it means we can probably avoid using all 4096 pixels for every image. This is good news: it suggests that applying PCA in this case might be useful! If I had to use 4000 principal components to capture a decent amount of variance, that would have been a hint that PCA probably wasn't the right tool for the job. This point is perhaps trivial, but particularly important. One must not forget what each tool is designed to accomplish.
There are several methods for choosing how many principal components to keep. I simply chose 20 since it seems to work out fine. In a real application you might want to research a more scientific criterion or run a few trials.
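One such criterion, keeping the smallest number of components that reaches a target share of the total variance, can be sketched with made-up eigenvalues:

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, already sorted descending
eigvals = np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.3, 0.2])

explained = eigvals / eigvals.sum()   # variance share of each component
cumulative = np.cumsum(explained)

# Smallest k whose components capture at least 75% of the total variance
target = 0.75
k = int(np.searchsorted(cumulative, target) + 1)
print(k)  # 2
```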

The eigenfaces
What is an eigenface? The eigenvectors of the covariance matrix are the eigenfaces of our dataset. These face images form a set of basis features that can be linearly combined to reconstruct images in the dataset. They are the basic characteristics of a face (a face contained in this dataset, of course), and every face in the dataset can be considered a combination of these standard faces. Of course, this concept can also be applied to images other than faces. Have a look at the first 4 (and the 20th) eigenfaces. Note that the 20th eigenface is a bit less detailed than the first 4. If you were to look at the 4000th eigenface, you would find that it is almost white noise.
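The "linear combination of standard faces" idea is easy to check on random stand-in data: projecting onto all the eigenvectors and combining them back reproduces the original exactly, while keeping only a few gives a lossy approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 30))          # stand-in for the image matrix
mean = X.mean(axis=0)
Xc = X - mean

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]   # "eigenfaces", biggest first

# All 30 components: exact reconstruction
full = mean + (Xc @ eigvecs) @ eigvecs.T
print(np.allclose(full, X))            # True

# Only 5 components: a lossy approximation
approx = mean + (Xc @ eigvecs[:, :5]) @ eigvecs[:, :5].T
print(np.allclose(approx, X))          # False
```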

Then we can project the average face onto the eigenvector space. We can do the same for a new image and use the resulting data, for instance, for a classification task.
This is the plot we obtain

As you can see, the 3 average faces rank quite differently in the eigenvectors’ space. We could use the projections of a new image to perform a classification task on an algorithm trained on a database of images transformed according to PCA. By using only a few principal components we could save computational time and speed up our face recognition (or other) algorithm.
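A minimal sketch of that last idea, with every name and number invented for illustration: project a new sample using the stored mean and eigenvectors, then classify it by nearest neighbour among the projected training images.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend training set, already projected onto 5 principal components
train_scores = rng.normal(size=(20, 5))
labels = np.array([i % 2 for i in range(20)])   # two hypothetical classes

def classify(sample, mean, eigvecs, k, train_scores, labels):
    """Project a raw sample onto the first k eigenvectors, then 1-NN."""
    scores = (sample - mean) @ eigvecs[:, :k]
    distances = np.linalg.norm(train_scores - scores, axis=1)
    return labels[np.argmin(distances)]

mean = np.zeros(40)                                   # stored from training
eigvecs = np.linalg.qr(rng.normal(size=(40, 40)))[0]  # stand-in orthonormal basis
pred = classify(rng.normal(size=40), mean, eigvecs, 5, train_scores, labels)
print(pred in (0, 1))  # True
```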
Using R built-in prcomp function
For the first few times, I would highly suggest doing PCA manually, as I did above, in order to get a firm grasp of what you are actually doing to your data. Then, once you are familiar with the procedure, you can simply use R's built-in prcomp function (or any other that you like), which, however, has the downside of calculating all the eigenvalues and eigenvectors and can therefore be a bit slow on a large dataset. Below you can find how it can be done.
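As an aside, prcomp does not eigendecompose the covariance matrix at all: it runs a singular value decomposition of the centred data matrix, which is numerically more stable. The same route in NumPy, on toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
Xc = X - X.mean(axis=0)

# prcomp-style: SVD of the centred data, no covariance matrix needed
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                      # identical to projecting: Xc @ Vt.T
variances = s**2 / (len(X) - 1)     # the covariance eigenvalues

print(np.allclose(scores, Xc @ Vt.T))  # True
```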

I hope this post was useful.
More details at…

IRA Tweet Data
You may have heard that two researchers at Clemson University analyzed almost 3 million tweets from the Internet Research Agency (IRA) – a “Russian troll factory”. In partnership with FiveThirtyEight, they made all of their data available on GitHub. So of course, I had to read the files into R, which I was able to do with this code:

files %>%
  arrange(desc(tf_idf))

tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf
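tf-idf itself is simple enough to compute by hand: term frequency times the log of inverse document frequency. A toy sketch with invented word counts:

```python
import math

# Invented word counts per "document" (account category)
counts = {
    "left_troll":  {"resist": 4, "trump": 2},
    "right_troll": {"maga": 3, "trump": 5},
}
n_docs = len(counts)

def tf_idf(word, doc):
    tf = counts[doc][word] / sum(counts[doc].values())
    df = sum(word in words for words in counts.values())
    return tf * math.log(n_docs / df)

print(tf_idf("trump", "right_troll"))      # 0.0: appears in every document
print(tf_idf("resist", "left_troll") > 0)  # True: distinctive word
```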

But another method of examining terms and topics in a set of documents is Latent Dirichlet Allocation (LDA), which can be conducted using the R package topicmodels. The only issue is that LDA requires a document-term matrix, but we can easily convert our word-counts dataset into a DTM with the cast_dtm function from tidytext. Then we run our LDA with topicmodels. Note that LDA is a random technique, so we set a random number seed, and we specify how many topics we want the LDA to extract (k). Since there are 6 account types (plus 1 unknown), I'm going to try having it extract 6 topics. We can see how well they line up with the account types.

tweets_dtm %>%
  cast_dtm(account_category, word, n)

library(topicmodels)

tweets_lda %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()
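The document-term matrix that cast_dtm produces is just a count matrix, with one row per document and one column per term; a toy Python sketch with invented tweets:

```python
from collections import Counter

docs = {
    "news_feed":   "election results election turnout".split(),
    "right_troll": "trump election maga".split(),
}

vocab = sorted({w for words in docs.values() for w in words})
dtm = [[Counter(words)[term] for term in vocab] for words in docs.values()]

print(vocab)  # ['election', 'maga', 'results', 'trump', 'turnout']
print(dtm)    # [[2, 0, 1, 0, 1], [1, 1, 0, 1, 0]]
```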

Based on these plots, I’d say the topics line up very well with the account categories, showing, in order: news feed, left troll, fear monger, right troll, hash gamer, and commercial. One interesting observation, though, is that Trump is a top term in 5 of the 6 topics.
More details at…

In this post, I discuss the development of the Enterprise AI business case through a framework of four quadrants. According to Gartner: “The mindset shift required for AI can lead to ‘cultural anxiety’ because it calls for a deep change in behaviors and ways of thinking”. Deployment of AI in an Enterprise is complex and multi-disciplinary; hence, this framework is evolutionary. The vendors and initiatives listed are included to…
More details at…

Posted in Data & AI Digest | Tagged | Leave a comment