Data quality in big data
One of the biggest challenges in #BigData projects is ensuring the ‘quality’ of data.
Debugging ‘anomalies’ across #DataPipelines can be a nightmare. Certain teams end up spending more time debugging than actually coding new features.
That’s usually because there are no automated data quality checks in place to catch issues, so teams have to trace anomalies back through huge amounts of data and complex ETL processes.
With other development projects, the behavior is more predictable because inputs into the system are homogeneous.
For #BigData projects, there is no guarantee that the data we ingest, or the way it gets processed, will always be accurate: the input is NOT homogeneous.
The solution to this: have #automated data quality checks running in #production across the data pipeline.
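To make that concrete, here is a minimal sketch in plain Python of what an automated check at one pipeline stage could look like. Everything in it is an assumption for illustration: the helper names (`not_null`, `in_range`, `run_checks`), the column names, and the sample batch are hypothetical, and a real pipeline would more likely use a dedicated data quality framework.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Check = Callable[[Record], bool]


@dataclass
class CheckResult:
    name: str
    passed: bool
    failed_rows: int


def not_null(column: str) -> Check:
    """Hypothetical rule: the column must be present and not None."""
    return lambda record: record.get(column) is not None


def in_range(column: str, low: float, high: float) -> Check:
    """Hypothetical rule: the column must be a number within [low, high]."""
    def check(record: Record) -> bool:
        value = record.get(column)
        return isinstance(value, (int, float)) and low <= value <= high
    return check


def run_checks(records: List[Record], checks: Dict[str, Check]) -> List[CheckResult]:
    """Apply every named check to every record and count the failures."""
    results = []
    for name, predicate in checks.items():
        failed = sum(1 for record in records if not predicate(record))
        results.append(CheckResult(name, failed == 0, failed))
    return results


# Made-up batch standing in for the output of one ETL stage.
batch = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -3.0},     # negative amount
    {"order_id": None, "amount": 10.0},  # missing key
]

checks = {
    "order_id_not_null": not_null("order_id"),
    "amount_in_range": in_range("amount", 0, 10_000),
}

results = run_checks(batch, checks)
for result in results:
    status = "OK" if result.passed else "FAILED"
    print(f"{result.name}: {status} ({result.failed_rows} bad rows)")
```

Running something like this after each stage, and alerting or halting on failures, is one way to catch an anomaly where it first appears instead of tracing it back through the whole pipeline later.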
#RedefiningSoftwareQuality #BigData #Testing #Automation