https://www.youtube.com/watch?v=8Tw2TLUpQn8
“continuous data validation”, “automatic data validation”, “establish trust in your data for your stakeholders”
this assumes a very engineering-heavy org. definitely too much for NC!
q: what is a Great Expectations backend? there’s some state-keeping file at 11’40
prefect seems very smart, they’re basically a pythonic DSL for DAGs
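rough sketch of what that pythonic DSL looks like (prefect 1.x style, from memory; the task names are just illustrative):

    from prefect import task, Flow

    @task
    def extract():
        return list(range(10))

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    # the DAG is built by calling tasks inside a Flow context, like normal python
    with Flow("example-etl") as flow:
        rows = extract()
        load(rows)

    flow.run()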
https://www.youtube.com/watch?v=uM9DB2ca8T8
examples are always for relatively simple problems (e.g. an increase in 0-passenger rides), but they’re useful for a lot of use cases
show data docs
terminology:
“expectation suite”: a set of tests on data
great_expectations init
creates a “data context”
“data source” lives in a data context, and is a database connection or S3 connection (wherever the data lives); created with great_expectations datasource new
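rough sketch of what that looks like on disk and in python (v2-era API, from memory; presumably this folder is the state-keeping file from the question above): init creates a great_expectations/ folder holding great_expectations.yml plus expectations/ and uncommitted/ subfolders, and that folder is the data context:

    from great_expectations.data_context import DataContext

    # loads great_expectations/great_expectations.yml from the project root
    context = DataContext()
    print(context.list_datasources())              # datasources added with great_expectations datasource new
    print(context.list_expectation_suite_names())  # suites saved so far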
they make the analogy “test suite” / “expectation suite”
at the moment, there’s a jupyter notebook to create the expectations - but a point-and-click UI is planned
jupyter notebook:
there’s a big glossary of available expectations.
idea: write expectation suite against “gold standard data”, and then run it against new data
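rough idea of that workflow in code (old pandas-style API, from memory; the csv names and columns are made up, reusing the taxi example):

    import great_expectations as ge

    # build the suite interactively against known-good data
    gold = ge.read_csv("gold_standard_rides.csv")
    gold.expect_column_values_to_not_be_null("passenger_count")
    gold.expect_column_values_to_be_between("passenger_count", min_value=1, max_value=6)
    suite = gold.get_expectation_suite()

    # later: run the same suite against a fresh batch
    new_batch = ge.read_csv("todays_rides.csv")
    results = new_batch.validate(expectation_suite=suite)
    print(results["success"])  # overall pass/fail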
there’s a “mostly” argument, which adds tolerance (e.g. “I expect most of the data to be in range”)
Interesting thought: is there a way to see which rows tend to be the ones covered by “mostly”, i.e. the rows that actually violate the expectation?
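one way to get at that (sketch, from memory; same made-up csv as above): ask the expectation for a detailed result_format, which returns the offending rows alongside the mostly-tolerated pass:

    import great_expectations as ge

    batch = ge.read_csv("todays_rides.csv")
    result = batch.expect_column_values_to_be_between(
        "passenger_count", min_value=1, max_value=6,
        mostly=0.95,               # tolerate up to 5% of rows out of range
        result_format="COMPLETE",  # also return the unexpected values and row indices
    )
    print(result["result"]["unexpected_index_list"])  # which rows fall under “mostly”
    print(result["result"]["unexpected_list"])        # and their values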
great expectations generates data docs, which are basically a browsable list of known facts about the data
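if memory serves, data docs can be rebuilt and opened straight from the data context (v2-era API, sketch):

    from great_expectations.data_context import DataContext

    context = DataContext()
    context.build_data_docs()  # regenerate the static HTML site from suites + validation results
    context.open_data_docs()   # open it in the browser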