Got a great question in my talk’s Q&A at #AutomationGuild 2020:

How to create test data for #BigData projects?

Generally there are three types of ‘test data management’ you want to focus on:

1. Mocks / stubs
2. Synthetic (generated) test data
3. Masked production data

For big data, the most important one is masked production data.

You will need synthetic data too, but on its own it will not be enough to tell whether the model is working properly.

So make an effort to get masked production data to have greater confidence in your data pipeline and data models.
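
For anyone wondering what masking can look like in practice, here is a minimal Python sketch. It assumes the sensitive columns are `name` and `email` and uses a keyed hash (HMAC-SHA256) so the same input always maps to the same token, which keeps joins and group-bys working. The key, the field names, and the sample rows are all hypothetical and made up for illustration; this is one possible approach, not the only way to mask data.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; a real key would live outside the codebase.
MASKING_KEY = b"example-masking-key"

# Hypothetical column names treated as sensitive.
SENSITIVE_FIELDS = {"name", "email"}


def mask_value(value: str) -> str:
    """Replace a value with a keyed hash so equal inputs map to equal tokens."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened token, purely for readability


def mask_record(record: dict) -> dict:
    """Return a copy of the record with only the sensitive fields masked."""
    return {
        key: mask_value(str(val)) if key in SENSITIVE_FIELDS else val
        for key, val in record.items()
    }


# Made-up rows standing in for production data.
production_rows = [
    {"name": "Alice", "email": "alice@example.com", "amount": 120.5},
    {"name": "Alice", "email": "alice@example.com", "amount": 80.0},
]
masked_rows = [mask_record(row) for row in production_rows]
```

Non-sensitive values such as `amount` pass through untouched, so the distributions your pipeline and models depend on stay realistic while the identifying fields are replaced.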

#QsDaily #BigData #TestData #Testing