Hiya! Week number two, research research research.

I scheduled the first interviews (next week with Elizabeth Yuu, week after with Seldo Voss from Netlify, outreach to RKI via Bernhard).

I also finished the first iteration of interview questions, and will be testing them on Elizabeth. Let’s see how that goes!

I also read through some articles + blog posts that I found while googling last week. Most of them turned out to be pretty irrelevant, but one piqued my interest:

Great Expectations

Down with Pipeline debt / Introducing Great Expectations

“Introducing Great Expectations” is a piece from 2018 that presents an open-source framework for what they call “pipeline tests”. A pipeline test is a set of assertions + statistical tests that are run on datasets during data processing (“the pipeline”). This ensures that data conforms to one’s expectations (hence the framework’s name).
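To give a flavor of what a pipeline test looks like in code: with the pandas-flavored API (as far as I understand it from the docs; the data here is made up), declaring and checking an expectation goes roughly like this:

import great_expectations as ge
import pandas as pd

# Wrap a plain DataFrame so it gains the expect_* assertion methods
df = ge.from_pandas(pd.DataFrame({"salary": [52000, 54500, 58000]}))

# A pipeline test: all salaries should fall within the given range
result = df.expect_column_values_to_be_between("salary", min_value=50000, max_value=60000)
print(result["success"])  # -> True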

And it seems like people like it! While the company behind it started out as “Superconductive Health” (former webpage) and offered consulting to the life science industry, they’ve since rebranded to “Superconductive” and are focusing their efforts on Great Expectations. They’ve brought in $61M in VC funding from the likes of Tiger Global and Index Ventures, without a single dollar in product revenue. Wow, markets are really bonkers at the moment...

Anyway, excuse the startuppy tangent here. Point is: Great Expectations is a well-liked tool, and its open-source version has seen some adoption across the tech industry, with the product page listing GitHub, ING, and Delivery Hero among its users. Most importantly: even our bachelor’s project almost started using it!

Great Expectations is about checking data quality, and it’s one take on tackling the same problems I’m trying to solve with my thesis. When I looked into it for the bachelor’s project, though, I found it very powerful but clunky to use. Looking more into Great Expectations will definitely be a focus for me next week, as it’s very important prior art for my research.

Building a Prototype

This week, I also built a first small prototype that I’ll use during the interviews.

It’s a document of the (imaginary) “Economic Bureau Data Team”, a government unit dedicated to supporting economic decision-making through data research.

see for yourself at https://deepnote.com/project/Economic-Bureau-Data-Docs-XFuwjUVoQbCVF9Xa-6hTPA/%2Fnotebook.ipynb/#25946ad3f57b4935b2f28488766ed62a

Most of the notebook is irrelevant to my studies, but there’s one code cell that I’d like to discuss with interview participants:

describe(income.yearly_brutto).between(50000, 60000).normal(alpha=.05).histogram()

Here, income is a pandas DataFrame, and yearly_brutto is a column that contains salary numbers. With the .between(50000, 60000) statement, we’re asserting that all values in that column fall within the given range. With the .normal(alpha=.05) statement, we’re running a normaltest on the column, and .histogram() simply displays the distribution, as you can see in the screenshot below.

But wait, the screenshot contains something else!

[Screenshot: the notebook cell’s output, showing a histogram of the column and a warning message]

Because the data wasn’t in the expected range, the cell prints a warning. For now, this warning lives in the notebook, but it could potentially also send out a notice via email or Slack!
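Hooking up Slack, for instance, wouldn’t be much work. A sketch using Slack’s incoming webhooks (the webhook URL is a placeholder, of course):

import requests

def notify_slack(message):
    # An incoming-webhook URL (placeholder; a real one comes from Slack's app settings)
    webhook_url = "https://hooks.slack.com/services/T000/B000/XXXX"
    requests.post(webhook_url, json={"text": message})

notify_slack("yearly_brutto: values outside the expected range [50000, 60000]")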

I suspect that writing such a describe statement is a pretty low-effort way of expressing assumptions & expectations about a dataset (see the sketch below for how little machinery it needs). It has a significantly lower barrier to entry than something like Great Expectations, and I imagine that a company like NetCheck could realistically adopt it. But we’ll see about that ^^
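For the curious: the wrapper behind that describe statement doesn’t need much machinery. Here’s a minimal sketch of how it could work (illustrative only, not the actual prototype code):

import warnings
from scipy import stats

class describe:
    """Fluent, chainable checks on a pandas Series."""

    def __init__(self, series):
        self.series = series

    def between(self, low, high):
        # Warn (instead of crashing the notebook) if values fall outside the range
        if not self.series.between(low, high).all():
            warnings.warn(f"'{self.series.name}': values outside [{low}, {high}]")
        return self  # returning self is what makes the chaining work

    def normal(self, alpha=0.05):
        # D'Agostino-Pearson normality test; warn if normality is rejected
        _, p = stats.normaltest(self.series)
        if p < alpha:
            warnings.warn(f"'{self.series.name}': not normally distributed (p={p:.3f})")
        return self

    def histogram(self):
        # Plot the distribution inline (pandas plots via matplotlib)
        self.series.plot.hist()
        return self

Every method returns self, which is exactly what lets the checks compose into the one-liner above.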