Data Quality
Data quality matters whenever we ingest data. Incomplete or incorrect data can lead a machine learning algorithm to make more false predictions. Data issues can also cost us opportunities to monetize the data and erode the business's confidence in it.
In Sparkflows, users can build a workflow with nodes such as Summary and Correlation to learn more about a dataset.
Sample Dataset: http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/
Example:
Workflow
The workflow below profiles the data. It:
- Reads data from a sample dataset.
- Summarizes the numeric fields.
- Computes the correlation between fields in the dataset.
- Lets you verify the quality of the data in the Sparkflows Data Quality tab.
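To make the profiling steps above concrete outside the Sparkflows UI, here is a minimal sketch in plain Python. It is not how Sparkflows implements its Summary or Correlation nodes; it only illustrates the same ideas (read, summarize, correlate, count missing values). The inline sample data and the column names `emp_id`, `age`, and `salary` are hypothetical and do not come from the sample dataset linked above.

```python
import csv
import io
import math
import statistics

# Hypothetical inline data; the column names are illustrative assumptions.
SAMPLE_CSV = """emp_id,age,salary
1,30,50000
2,40,70000
3,50,90000
4,,60000
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def to_float(cell):
    """Return the cell as a float, or None if it is blank or not numeric."""
    try:
        return float((cell or "").strip())
    except ValueError:
        return None


def summarize(rows, name):
    """Basic summary of one numeric column, including a missing-value count."""
    values = [v for v in (to_float(r.get(name)) for r in rows) if v is not None]
    summary = {"count": len(values), "missing": len(rows) - len(values)}
    if values:
        summary["mean"] = statistics.mean(values)
        summary["stddev"] = statistics.stdev(values) if len(values) > 1 else 0.0
        summary["min"] = min(values)
        summary["max"] = max(values)
    return summary


def pearson(rows, col_a, col_b):
    """Pearson correlation over rows where both columns are numeric."""
    pairs = [(to_float(r.get(col_a)), to_float(r.get(col_b))) for r in rows]
    pairs = [(a, b) for a, b in pairs if a is not None and b is not None]
    if len(pairs) < 2:
        return None
    mean_a = statistics.mean(a for a, _ in pairs)
    mean_b = statistics.mean(b for _, b in pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None  # correlation is undefined for a constant column
    return cov / math.sqrt(var_a * var_b)


rows = read_rows(SAMPLE_CSV)
age_summary = summarize(rows, "age")
salary_summary = summarize(rows, "salary")
age_salary_corr = pearson(rows, "age", "salary")

if __name__ == "__main__":
    print("age:", age_summary)
    print("salary:", salary_summary)
    print("corr(age, salary):", age_salary_corr)
```

The `missing` count in each summary is the kind of signal a data quality check looks for: a column with many blank or non-numeric cells deserves attention before it feeds a model.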






