Data Cleaning

This workflow cleans the input data. It does the following:

  • Handles null values
  • Replaces flag values such as N/Y with 0/1

Workflow

Below is the workflow. It does the following:

  • Reads data from a dataset
  • Handles null values by imputing a constant value in the specified columns
  • Converts strings to integer indexes
  • Converts Gender to integer values
  • Replaces Gender and Family values with 0/1
  • Prints the first few records

[Image: DataCleaning]

Reading from Dataset

The DatasetCSV processor reads the input dataset file and creates a DataFrame from it.

Processor Output

[Image: DataCleaning]
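
The reading step can be illustrated outside the tool with a minimal plain-Python sketch. It is not the DatasetCSV processor's implementation: the sample data, the column names (id, gender, family, income), and the helper read_rows are all hypothetical, and the sketch assumes that empty cells should be treated as nulls.

```python
import csv
import io

# Hypothetical sample data; the real dataset's columns are not given in the text.
SAMPLE_CSV = """id,gender,family,income
1,M,Y,52000
2,F,N,
3,,Y,61000
"""

def read_rows(text):
    """Parse CSV text into a list of dicts, mapping empty cells to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {column: (value if value != "" else None) for column, value in row.items()}
        for row in reader
    ]

rows = read_rows(SAMPLE_CSV)
```

Every value is still a string at this point (or None for an empty cell). Type conversion is left to the later steps.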

Handling Null Values

The ReplaceMissingValueWithConstant processor handles null values by imputing a user-supplied constant in each of the specified columns.

Processor Configuration

[Image: DataCleaning]

Processor Output

[Image: Capture4.PNG]
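
Constant imputation as described above can be sketched in plain Python. This is only an illustration of the idea, not the processor's code: the function name replace_missing_with_constant, the column names, and the constants ("Unknown", 0) are assumptions chosen for the example.

```python
def replace_missing_with_constant(rows, constants):
    """Return new rows in which None is replaced by a per-column constant.

    `constants` maps a column name to the value used for missing entries.
    Columns not listed in `constants` are left untouched.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for column, constant in constants.items():
            if column in new_row and new_row[column] is None:
                new_row[column] = constant
        cleaned.append(new_row)
    return cleaned

# Hypothetical records with missing values
rows = [
    {"gender": "M", "income": 52000},
    {"gender": None, "income": None},
]
filled = replace_missing_with_constant(rows, {"gender": "Unknown", "income": 0})
```

The function returns copies rather than editing rows in place, so the original data stays available for comparison with the cleaned result.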

Convert Strings to Integer Indexes

The StringIndexer processor encodes a string-typed column as a column of label indices.

Processor Configuration

[Image: DataCleaning]

Processor Output

[Images: Capture6.PNG, Capture7.PNG, Capture8.PNG]
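
Label indexing can be sketched in plain Python as follows. The sketch assumes the processor behaves like Spark ML's StringIndexer with its default ordering, in which the most frequent label receives index 0. The text does not confirm this. The helper names, the alphabetical tie-break, and the sample city values are hypothetical, and the sketch emits plain integers.

```python
from collections import Counter

def fit_string_index(values):
    """Build a label -> index mapping, most frequent label first.

    Ties are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def index_column(values, mapping):
    """Replace each string with its index from the fitted mapping."""
    return [mapping[value] for value in values]

# Hypothetical string column
cities = ["Paris", "Rome", "Paris", "Oslo", "Paris", "Rome"]
mapping = fit_string_index(cities)      # {"Paris": 0, "Rome": 1, "Oslo": 2}
indexed = index_column(cities, mapping)  # [0, 1, 0, 2, 0, 1]
```

Fitting and applying are kept separate so that a mapping learned on one dataset can be reused on another. A label not seen during fitting raises a KeyError in this sketch.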

Convert Gender to Integer Values

The CaseWhen processor sets the value of a column based on conditions, as shown below:

Processor Configuration

[Image: DataCleaning]

Processor Output

[Image: Capture10.PNG]
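
Conditional assignment of this kind can be sketched in plain Python. The sketch assumes the usual CASE WHEN semantics, where the first matching condition wins and a default applies otherwise. The function case_when, the labels "Male" and "Female", and the codes 1, 0, and -1 are hypothetical. The actual conditions are in the processor configuration.

```python
def case_when(value, cases, default=None):
    """Return the result of the first (predicate, result) pair whose predicate matches."""
    for predicate, result in cases:
        if predicate(value):
            return result
    return default  # no condition matched

# Hypothetical conditions for a gender column
gender_cases = [
    (lambda v: v == "Male", 1),
    (lambda v: v == "Female", 0),
]

genders = ["Male", "Female", "Other", "Male"]
encoded = [case_when(g, gender_cases, default=-1) for g in genders]  # [1, 0, -1, 1]
```

An explicit default makes unexpected values visible in the output rather than silently turning them into nulls.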

Replace Gender and Family with 0/1

The FindAndReplaceUsingRegexMultiple processor replaces values that match regular-expression patterns in the specified columns, as shown below:

Processor Configuration

[Image: DataCleaning]

Processor Output

[Image: DataCleaning]
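
Regex-based replacement across several columns can be sketched in plain Python with re.sub. This illustrates the technique only and is not the processor's implementation. The function find_and_replace, the rule format, and the M/F and Y/N patterns are assumptions made for the example.

```python
import re

def find_and_replace(rows, rules):
    """Apply regex replacements per column.

    `rules` maps a column name to a list of (pattern, replacement) pairs,
    which are applied in order with re.sub. Non-string values are skipped.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for column, replacements in rules.items():
            value = new_row.get(column)
            if not isinstance(value, str):
                continue
            for pattern, replacement in replacements:
                value = re.sub(pattern, replacement, value)
            new_row[column] = value
        result.append(new_row)
    return result

# Hypothetical rules: anchored patterns so only exact flag values are replaced
rules = {
    "gender": [(r"^M$", "1"), (r"^F$", "0")],
    "family": [(r"^Y$", "1"), (r"^N$", "0")],
}
rows = [{"gender": "M", "family": "N"}, {"gender": "F", "family": "Y"}]
replaced = find_and_replace(rows, rules)
```

The results are still strings ("0" and "1"). Converting them to integers would be a separate cast. Anchoring each pattern with ^ and $ prevents partial matches inside longer values.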

Printing the Results

The final step prints the first few records to the screen.
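
Previewing only the first few records can be sketched in plain Python as below. The helper first_records, its default of five rows, and the generated sample rows are hypothetical. The text does not state how many records the workflow prints.

```python
from itertools import islice

def first_records(rows, n=5):
    """Return at most the first n records without consuming the rest of the input."""
    return list(islice(rows, n))

# Hypothetical cleaned records
rows = [{"id": i, "gender": i % 2} for i in range(20)]

for record in first_records(rows, 3):
    print(record)
```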