Data & AI Digest for 2018-04-26

Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. Azure Cosmos DB enables you to elastically and independently scale throughput and storage across any number of Azure’s geographic regions with a single click. It offers throughput, latency, availability, and consistency guarantees with comprehensive service level agreements (SLAs), a feature that no other database service can offer.
More details at…

A quick note for displaying R htmlwidgets in Jupyter notebooks without requiring pandoc – there may be a more native way but this acts as a workaround in the meantime if not: PS and from the other side, using reticulate for Python powered Shiny apps.
More details at…

Our onboarding
that ensure that packages contributed by the community undergo a
transparent, constructive, non adversarial and open review process, take
place in the issue tracker of a GitHub repository. Development of the
packages we onboard also takes place in the open, most often in GitHub

Therefore, when wanting to get data about our onboarding system for
giving a data-driven overview, my mission was to extract data from
GitHub and git repositories, and to put it into nice rectangles (as
defined by Jenny
Bryan) ready for
analysis. You might call that the first step of a “tidy git analysis”
using the term coined by Simon
So, how did I collect data?

A side-note about GitHub

In the following, I’ll mention repositories. All of them are git
repositories, which means they’re folders under version control, where
roughly said all changes are saved via commits and their messages (more
or less) describing what’s been changed in the commit. Now, on top of
that these repositories live on GitHub which means they get to enjoy
some infratructure such as issue trackers, milestones, starring by
admirers, etc. If that ecosystem is brand new to you, I recommend
reading this book, especially its big
picture chapter.

Package review processes: weaving the threads

Each package submission is an issue thread in our onboarding repository,
see an example
here. The first
comment in that issue is the submission itself, followed by many
comments by the editor, reviewers and authors. On top of all the data
that’s saved there, mostly text data, we have a private
Airtable workspace where we have a table of
reviewers and their reviews, with direct links to the issue comments
that are reviews.

Getting issue threads

Unsurprisingly, the first step here was to “get issue threads”. What do
I mean? I wanted a table of all issue threads, one line per comment,
with columns indicating the time at which something was written, and
columns digesting the data from the issue itself, e.g. guessing the role
from the commenter from other information: the first user of the issue
is the “author”.

I used to use GitHub API V3 and then heard about GitHub API
V4 which blew my mind. As if I
weren’t impressed enough by the mere existence of this API and its

I discovered the rOpenSci ghql
package allows one to interact
with such an API and that its docs actually use GitHub API V4 as an

Carl Boettiger told me about his way to rectangle JSON
using jq, a language for
processing JSON, via a dedicated rOpenSci package,
I have nothing against GitHub API V3 and
gh and purrr workflows, but I was
curious and really enjoyed learning these new tools and writing this
code. I had written a gh/purrr code for getting the same information
and it felt clumsier, but it might just be because I wasn’t
perfectionist enough when writing it! I achieved writing the correct
GitHub V4 API query to get just what I needed by using its online
explorer. I then succeeded
in transforming the JSON output into a rectangle by reading Carl’s post
but also by taking advantage of another online explorer, jq
play where I pasted my output via
writeClipboard. That’s nearly always the way I learn about query
tools: using some sort of explorer and then pasting the code into a
script. When I am more experienced, I can skip the explorer part.

The first function I wrote was one for getting the issue number of the
last onboarding issue, because then I looped/mapped over all issues.

# function to get number of last issue
More details at…

On March the 17th I had the honor to give a keynote talk about rOpenSci’s package onboarding system at the satRday conference in Cape Town, entitled “Our package reviews in review: introducing and analyzing rOpenSci onbo…
More details at…

[Disclaimer: I received this book of Coryn Bailer-Jones for a review in the International Statistical Review and intend to submit a revised version of this post as my review. As usual, book reviews on the ‘Og are reflecting my own definitely personal and highly subjective views on the topic!] It is always a bit of […]
More details at…

Data Dictionary to Meta Data III is the third and final blog devoted to demonstrating the automation of meta data creation for the American Community Survey 2012-2016 household data set, using a published data dictionary. DDMDI was a teaser to show how Python could be used to generate R statements that…
More details at…

This entry was posted in Data & AI Digest and tagged . Bookmark the permalink.