Titanic Data Cleaning/Wrangling

This workflow shows how to clean and wrangle the Titanic dataset with Sparkflows.

Workflow

This workflow performs the following steps:

  • Reads the Titanic dataset
  • Drops rows containing null values
  • Filters out the rows for which Age has not been specified
  • Changes the data type of the Age column to integer
  • Keeps only the rows for female passengers with Age > 30
  • Prints the results
titanic-data-cleaning

Reading Titanic dataset

The DatasetStructured processor reads the dataset named Titanic Data and creates a DataFrame from it. The dataset must already have been defined in Fire with the Dataset feature, and its data can reside in HDFS, Hive, or a similar source.

Processor Output

titanic-data-cleaning
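To make the reading step concrete outside of Sparkflows, here is a minimal plain-Python sketch. It is not how DatasetStructured is implemented; it only illustrates turning delimited text into records. The sample rows are made up to mimic a few Titanic-style columns, and `read_rows` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample rows that only mimic a Titanic-style column layout.
SAMPLE_CSV = """PassengerId,Name,Sex,Age
1,Passenger A,male,22
2,Passenger B,female,38
3,Passenger C,female,
"""

def read_rows(text):
    """Parse CSV text into a list of dicts, one per row (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(SAMPLE_CSV)
for row in rows:
    print(row)
```

Note that every value arrives as a string here, including Age, and a missing Age shows up as an empty string. That is the situation the later filtering and casting steps deal with.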

Dropping the rows with null values

The DropRowsWithNull processor drops every row that contains a null value.

Processor Configuration

titanic-data-cleaning

Processor Output

titanic-data-cleaning
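The effect of dropping rows with null values can be sketched in plain Python as follows. This is an illustration of the idea only, not the DropRowsWithNull implementation; it assumes nulls are represented as `None`, and both the sample rows and the helper name `drop_rows_with_null` are hypothetical.

```python
def drop_rows_with_null(rows):
    """Keep only the rows in which no column value is None."""
    return [row for row in rows if all(value is not None for value in row.values())]

# Made-up sample rows; None stands in for a null value.
rows = [
    {"Name": "Passenger A", "Sex": "male", "Age": "22"},
    {"Name": "Passenger B", "Sex": None, "Age": "38"},
    {"Name": None, "Sex": "female", "Age": "41"},
]

cleaned = drop_rows_with_null(rows)
print(cleaned)
```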

Filter by string length

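The FilterByStringLength processor keeps only the rows in which the string length of the selected column falls within the configured range. Here it is used to filter out the rows for which Age has not been specified.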

Processor Configuration

titanic-data-cleaning

Processor Output

titanic-data-cleaning
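A length-based filter of this kind might look like the sketch below. It assumes that an unspecified Age is stored as an empty string and that the length bounds are inclusive; neither detail is confirmed by the processor's documentation here. The helper name, parameters, and sample rows are hypothetical.

```python
def filter_by_string_length(rows, column, min_len, max_len):
    """Keep rows whose value in `column` has a length in [min_len, max_len].

    Inclusive bounds are an assumption made for this sketch.
    """
    return [row for row in rows if min_len <= len(row[column]) <= max_len]

# Made-up sample rows; the empty string represents an unspecified Age.
rows = [
    {"Name": "Passenger A", "Age": "22"},
    {"Name": "Passenger B", "Age": ""},
    {"Name": "Passenger C", "Age": "4"},
]

# A minimum length of 1 removes the row whose Age is empty.
with_age = filter_by_string_length(rows, "Age", 1, 3)
print(with_age)
```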

Convert Age to Integer

The CastColumnType processor converts the Age column to the integer type.

Processor Configuration

titanic-data-cleaning

Processor Output

titanic-data-cleaning
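As a rough illustration of the cast, the sketch below converts a string Age column to integers in plain Python. It truncates fractional ages by going through `float` first, which is an assumption of this sketch and may differ from how CastColumnType treats such values. The helper name and sample rows are hypothetical.

```python
def cast_column_to_int(rows, column):
    """Return new rows with `column` converted from a numeric string to int.

    Fractional values are truncated toward zero (an assumption of this sketch).
    """
    return [{**row, column: int(float(row[column]))} for row in rows]

# Made-up sample rows with Age stored as strings.
rows = [
    {"Name": "Passenger A", "Age": "22"},
    {"Name": "Passenger B", "Age": "28.5"},
]

typed = cast_column_to_int(rows, "Age")
print(typed)
```

Building new dicts instead of mutating the input keeps the original rows available for comparison.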

Get Rows of Interest

The RowFilter processor filters the data based on the provided conditions, as shown below:

Processor Configuration

titanic-data-cleaning

Processor Output

titanic-data-cleaning
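The condition used in this workflow (Age > 30 and female) can be expressed in plain Python as in the sketch below. It assumes Age is already an integer and that the Sex column holds the lowercase string `female`; the helper name and sample rows are hypothetical.

```python
def keep_female_over_30(rows):
    """Keep rows where Age is greater than 30 and Sex is 'female'."""
    return [row for row in rows if row["Age"] > 30 and row["Sex"] == "female"]

# Made-up sample rows with Age already cast to int.
rows = [
    {"Name": "Passenger A", "Sex": "male", "Age": 45},
    {"Name": "Passenger B", "Sex": "female", "Age": 38},
    {"Name": "Passenger C", "Sex": "female", "Age": 30},
    {"Name": "Passenger D", "Sex": "female", "Age": 19},
]

selected = keep_female_over_30(rows)
print(selected)
```

The comparison is strict, so a passenger aged exactly 30 is excluded.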

Printing the results

The final step prints the first few records to the screen.
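Printing only the first few records can be sketched as below. The default of five rows is an arbitrary choice for this example, not a documented Sparkflows setting, and `show_first` is a hypothetical helper.

```python
from itertools import islice

def show_first(rows, n=5):
    """Print and return at most the first n rows (n=5 is an arbitrary default)."""
    first = list(islice(rows, n))
    for row in first:
        print(row)
    return first

# Made-up sample rows.
rows = [{"PassengerId": i, "Age": 30 + i} for i in range(1, 8)]

shown = show_first(rows)
```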