https://www.youtube.com/watch?v=8Tw2TLUpQn8
“continuous data validation”, “automatic data validation”, “establish trust in your data for your stakeholders”
this assumes a very engineering-heavy org. definitely too much for NC!
q: what is a Great Expectations backend? there’s some state-keeping file at 11’40
prefect seems very smart, they’re basically a pythonic DSL for DAGs
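rough sketch of what that pythonic DSL looks like (prefect 1.x style, from memory; the task names are just illustrative):

    from prefect import task, Flow

    @task
    def extract():
        return list(range(10))

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    # the DAG is built by calling tasks inside a Flow context, like normal python
    with Flow("example-etl") as flow:
        rows = extract()
        load(rows)

    flow.run()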
https://www.youtube.com/watch?v=uM9DB2ca8T8
examples are always for relatively simple problems (e.g. an increase in 0-passenger rides), but they’re useful for a lot of use cases
show data docs
terminology:
“expectation suite”: a set of tests on data
great_expectations init
creates a “data context”
“data source” lives in a data context, and is a database connection or S3 connection (wherever the data lives); created with great_expectations datasource new
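rough sketch of what that looks like on disk and in python (v2-era API, from memory; presumably this folder is the state-keeping file from the question above): init creates a great_expectations/ folder holding great_expectations.yml plus expectations/ and uncommitted/ subfolders, and that folder is the data context:

    from great_expectations.data_context import DataContext

    # loads great_expectations/great_expectations.yml from the project root
    context = DataContext()
    print(context.list_datasources())              # datasources added with great_expectations datasource new
    print(context.list_expectation_suite_names())  # suites saved so far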
they make the analogy “test suite” / “expectation suite”
at the moment, there’s a jupyter notebook to create the expectations - but a point-and-click UI is planned
jupyter notebook:
there’s a big glossary of available expectations.
idea: write expectation suite against “gold standard data”, and then run it against new data
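rough idea of that workflow in code (old pandas-style API, from memory; the csv names and columns are made up, reusing the taxi example):

    import great_expectations as ge

    # build the suite interactively against known-good data
    gold = ge.read_csv("gold_standard_rides.csv")
    gold.expect_column_values_to_not_be_null("passenger_count")
    gold.expect_column_values_to_be_between("passenger_count", min_value=1, max_value=6)
    suite = gold.get_expectation_suite()

    # later: run the same suite against a fresh batch
    new_batch = ge.read_csv("todays_rides.csv")
    results = new_batch.validate(expectation_suite=suite)
    print(results["success"])  # overall pass/fail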
there’s a “mostly” argument, which adds tolerance (e.g. “I expect most of the data to be in range”)
Interesting thought: is there a way to see which rows tend to be the ones covered by “mostly”, i.e. the rows that actually violate the expectation?
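one way to get at that (sketch, from memory; same made-up csv as above): ask the expectation for a detailed result_format, which returns the offending rows alongside the mostly-tolerated pass:

    import great_expectations as ge

    batch = ge.read_csv("todays_rides.csv")
    result = batch.expect_column_values_to_be_between(
        "passenger_count", min_value=1, max_value=6,
        mostly=0.95,               # tolerate up to 5% of rows out of range
        result_format="COMPLETE",  # also return the unexpected values and row indices
    )
    print(result["result"]["unexpected_index_list"])  # which rows fall under “mostly”
    print(result["result"]["unexpected_list"])        # and their values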
great expectations generates data docs, which are basically a browsable list of known facts about the data
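if memory serves, data docs can be rebuilt and opened straight from the data context (v2-era API, sketch):

    from great_expectations.data_context import DataContext

    context = DataContext()
    context.build_data_docs()  # regenerate the static HTML site from suites + validation results
    context.open_data_docs()   # open it in the browser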