Data Analysis (Part-I): Data Wrangling

Read Time: 5 min

Let’s take first step towards journey of learning..

For any manipulation, we need to understand the data first. Basically, data is of two types:

Unstructured: A form of data, where data points are not in particular order and it’s not possible to extract information from that directly such as raw text data, image data (where we need to deal with pixels).

Structured: Structured data is mostly in excel sheet form which contains rows and columns from where it can be processed easily. e.g. Titanic dataset which contains information in form of rows and columns.

So Here I am considering “data analysis” for structured data first. For unstructured data, some extra steps need to be followed.

Data Analysis is a process in which data is prepared for modelling or make it ready to apply machine learning model. So every kind of process from importing the data to analysis of features, comes under data analysis.

Data analysis is comprised of many levels:

  • Data Wrangling
  • Data Visualization
  • Feature Engineering

I will try to cover data wrangling in first part of this tutorial.

So first, let’s check what is data wrangling all about?

It is a process of converting data in such a manner that it can be readily used for further process.

Usual steps to be followed for data wrangling:

  • Identify missing values and handling them
  • Check the data format and its correction if needed
  • Data Normalization
  • Data Encoding

Here I have used Titanic dataset for data analysis. Also, some datasets need more processing, that is purely based on use case available. In titanic dataset, we need to figure out if a passenger is survived or not given other features like Age, Sex, Fare.

Titanic Problem can be referred from Kaggle through the link below:
https://www.kaggle.com/c/titanic

First import the dataset into dataframe:

import pandas as pd
import numpy as np
df = pd.read_csv("train.csv")
df.head()

Let’s start with identifying missing values

Missing values can be present in many forms, like “null”, “NaN” or blank (no value is present) or “?” So we need to check as follows:

missing_data = df.isnull()

mising_data = df.isna()    # only one is used at one time

This will give you output as True/False. If you want to know that how many values are missing from a particular column, use sum() function as follows:

missing_data = df.isnull().sum()

Output will be like this:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Let’s check how can we deal with these missing values

If missing values are not replaced or removed, it will cause to absurd results. Now question arises, should I remove it or replace it? If I replace it, what value I needed for replacement.

Let’s see..

Removal of column

First, It depends on data that how many values are missing. If missing values consist of large amount of data then remove the whole column. For example, Titanic dataset “Cabin” column contains 687 values, so it will be better to remove that column.

df = df.drop(['Cabin'],axis=1)

Now come to replacement..

Again, I would suggest you to check context of the data for replacement first, then these methods can be applied:

  1. Replace it by mean
  2. Replace it by frequency
  3. Replace it with other functions

Here check the “Age” column, Here we can replace missing values with mean of the values remaining in that column.

df.Age = df.Age.fillna(df.Age.mean())

But “Embarked” column, contains 2 missing values, so first check the values in this column.

df['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64

Oh! these column does not have numeric value, it contains 3 categories: “S”, “Q”,”C”. So I checked, which category has most occured values:

Okay! So it contains “S” 644 times. So let’s replace ‘nan’ value, with string “S”.

Note: Here “nan” is not a string, so to replace it we need to use “np.nan”

df.Embarked.replace(np.nan, 'S', inplace = True)

Now let’s check the data type of each column

df.dtypes

We need to check the data type because if a column has a category variable it’s dtype should of “obj” type. If it contains numerical values in the column, it will have dtype as “int”/”float”

Normalization

Normalization is done to bring the data to a common scale, without distorting differences in the ranges of values. When Data columns have a finite range, model converges efficiently. So for normalization we generally use, MinMaxScaler function.

from sklearn import preprocessing
# Create x, where x the 'Fare' column's values as floats
x = df[['Fare']].values.astype(float)

# Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()

# Create an object to transform the data to fit minmax processor
x_scaled = min_max_scaler.fit_transform(x)

# Run the normalizer on the dataframe
df['Fare'] = pd.DataFrame(x_scaled)
df.Fare.head()
0    0.014151
1    0.139136
2    0.015469
3    0.103644
4    0.015713
Name: Fare, dtype: float64

Data Encoding

Before feeding the data into any machine learning model, we need to convert categorical data into numerical form so that data can be processed.

There are 3 ways possible for encoding the data:

  • One Hot Encoding
  • Label Encoding
  • Creating Dummy Variables

One Hot Encoding:

If category column has 4 categories then using one hot encoding, 1 is assigned to one category and other classes are assigned as 0. One hot encoding is mostly used in processing the image data. (Will explain in detail in later posts)

Label Encoding:

Label encoding assigns numbers to each category. For example, in this titanic dataset, in “Sex” column, there are 2 classes: Male and Female

So Label encoding assigns ‘1’ to male and ‘2’ to Female values. In df[‘Name’], as it seems that almost all values are different, so using label encoding each unique no will be assigned to each name.

from sklearn.preprocessing import LabelEncoder
df.Sex = LabelEncoder().fit_transform(df.Sex.astype(str))

Dummy Variable:

Using dummy variable, we can create new columns for each class. It works similar to one hot encoding.

Here in df[‘Embarked’], it contains 3 categories, “S”, “Q”,”C” . So at one time, one class is assigned one and others zero and using this 3 new columns were created.

dummy_variable_1 = pd.get_dummies(df["Embarked"])
dummy_variable_1.head()

C	Q	S
0	0	0	1
1	1	0	0
2	0	0	1
3	0	0	1
4	0	0	1
dummy_variable_1.rename(columns={'Embark-type':'C', 'Embark-type':'Q','Embark-type':'S'}, inplace=True)

# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "fuel-type" from "df"
df.drop("Embarked", axis = 1, inplace=True)
df.head()

At the end, I have removed “Name” and “Ticket” columns too as they were not useful for classification.

You can check the full code with the training dataset on my github repository through the mentioned link:
https://github.com/Appiiee/Data-analysis-part-I

So till here, I have covered basic data wrangling techniques. Let me know if you have any queries.