
Handling missing data in Python
The problem of missing data is quite common in data mining. One of the first difficulties a researcher faces when analyzing results is an incomplete dataset and the presence of errors. This generally happens because whoever collects the data misinterprets its structure, makes accidental mistakes during data entry, deliberately omits certain values, or relies on an encoding tool that introduces errors.
There is no single technique or methodology for dealing with missing data; each situation is a case in itself. In general, it is advisable to test the survey instrument with pilot surveys to study its strengths and weaknesses, so that the weaknesses can be corrected and omitted answers prevented. If, despite all these precautions, the problem persists, the quantity and distribution of the missing data, that is, the structure of the data and the nature of the variables involved, will be the only basis on which to make decisions.
A missing value, then, is a value that contains no content or is non-existent. Missing values may arise from a number of occurrences:
- Errors in the creation of the dataset; the values have been entered incorrectly, leaving empty cells
- The dataset contains fields that have been created automatically, with cells that do not contain values
- A malfunction of the encoding tool
- The result of an impossible calculation
If a variable contains missing values, Python cannot apply certain functions to it correctly: many operations either fail or silently return a missing result. For this reason, the missing values must be processed in advance.
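To make the problem concrete, here is a minimal sketch (with made-up numbers) showing how a single missing value propagates through a common NumPy function:

import numpy as np

values = np.array([60.0, 55.0, np.nan, 70.0])
print(np.mean(values))     # nan: one missing value poisons the whole result
print(np.nanmean(values))  # about 61.67: the NaN-aware variant skips the gap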
When a Database Management System (DBMS) identifies empty cells, it can behave unpredictably; for example, it may insert special characters to indicate that something has gone wrong during data encoding. If we analyze the Excel file that we are using as a data source, we can see that, in the two columns presented previously, the anomalies appear as question marks, as shown in the following screenshot:

It is precisely these values that create problems, and we must treat them appropriately. Missing values of any type of variable are indicated by the NA code, which stands for not available. The Not a Number (NaN) code, on the other hand, indicates invalid numeric values, such as the result of dividing zero by zero.
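The following sketch shows how a NaN value can arise in practice and why it must be tested with a dedicated function (np.errstate is used here only to silence the expected warning):

import numpy as np

# NaN typically arises from an invalid numeric operation, such as 0/0
with np.errstate(invalid='ignore'):
    bad = np.float64(0.0) / np.float64(0.0)

print(bad)            # nan
print(bad == bad)     # False: NaN never compares equal, even to itself
print(np.isnan(bad))  # True: the proper way to test for NaN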
The first thing to do is replace these placeholder values with NaN. To do this, we can use the pandas replace() function, along with the np.nan value, as follows:
import numpy as np

# Replace every '?' placeholder with NaN so that pandas treats it as missing
DataNew = Data.replace('?', np.nan)
The replace() function replaces a given set of values with other values; the values of the DataFrame are replaced dynamically and a new DataFrame is returned. The np.nan value is a special floating-point value that is defined in NumPy.
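As a minimal illustration of how this works (using a tiny, hypothetical DataFrame rather than our dataset, with the output shown approximately):

import numpy as np
import pandas as pd

demo = pd.DataFrame({'ca': ['0', '3', '?'],
                     'hal': ['3', '?', '7']})

# Every '?' becomes NaN; the original demo DataFrame is left untouched
print(demo.replace('?', np.nan))
#     ca  hal
# 0    0    3
# 1    3  NaN
# 2  NaN    7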
Let's see what has changed in the DataFrame:
print(DataNew.info())
The following results are returned:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302 entries, 0 to 301
Data columns (total 14 columns):
age 302 non-null int64
sex 302 non-null int64
cp 302 non-null int64
trestbps 302 non-null int64
chol 302 non-null int64
fbs 302 non-null int64
restecg 302 non-null int64
thalach 302 non-null int64
exang 302 non-null int64
oldpeak 302 non-null float64
slope 302 non-null int64
ca 298 non-null float64
hal 300 non-null float64
HeartDisease 302 non-null int64
dtypes: float64(3), int64(11)
memory usage: 33.1 KB
None
Amazing! Now, all the variables are numeric. All of the variables contain 302 values, except for the variables ca and hal, which have 298 and 300 non-null values, respectively. This is because the info() method counts only non-null values, so the NaN entries are omitted; the describe() function excludes them in the same way. To confirm what we have just said, we will use the describe() function once again:
print(DataNew.describe())
The following results are returned:

We have confirmed that the two variables under scrutiny are now numeric (float64) and contain 298 and 300 values, respectively. We want more: explicit proof of the presence of NaN values. To get this, we can use the isnull() function, as follows:
print(DataNew.isnull().sum())
The isnull() function detects missing values in an array-like object. It takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-like arrays). The sum() function then adds up the values for each column; since each True counts as 1, the result is the number of missing entries per column. In the following code block, we can see the results:
age 0
sex 0
cp 0
trestbps 0
chol 0
fbs 0
restecg 0
thalach 0
exang 0
oldpeak 0
slope 0
ca 4
hal 2
HeartDisease 0
dtype: int64
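To see the mechanics of this count, here is a minimal, self-contained sketch with hypothetical data: isnull() produces a Boolean mask, and sum() counts the True entries column by column (output shown approximately):

import numpy as np
import pandas as pd

demo = pd.DataFrame({'ca': [0.0, np.nan, 3.0],
                     'hal': [3.0, 7.0, np.nan]})

mask = demo.isnull()  # True where a value is missing
print(mask.sum())     # column-wise count of the True (missing) values
# ca     1
# hal    1
# dtype: int64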
Now everything is clear: 4 NaN values for the ca variable and 2 NaN values for the hal variable have been identified. What do we do now? To fix the missing data, there are several options available (a brief sketch of each follows the list):
- Replace the values with constant values
- Fill the values in from other columns
- Transform the data with functions
- Delete rows
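As a rough sketch of what each option looks like in pandas (assuming the DataNew DataFrame from earlier; the constant 0 and the ca/hal pairing are arbitrary examples, not recommendations):

# 1. Replace the missing values with a constant
option1 = DataNew.fillna(0)

# 2. Fill a column's gaps in from another column (hypothetical pairing)
option2 = DataNew.copy()
option2['ca'] = option2['ca'].fillna(option2['hal'])

# 3. Transform the data with a function, e.g. impute each column's mean
option3 = DataNew.fillna(DataNew.mean(numeric_only=True))

# 4. Delete the rows in which NaN values are present
option4 = DataNew.dropna()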
For now, we will take the simplest option: removing the rows in which NaN values are present. It should be noted that this choice is extremely important, because a poor choice can compromise the quality of the results. In this case, since only a few rows out of the total are affected, we can consider their removal negligible. In general, it is necessary to evaluate the effect that this operation has on the data. To remove the NaN values, we will use the pandas dropna() function, as follows:
# Drop every row that contains at least one NaN value
DataNew = DataNew.dropna()
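By default, dropna() removes a row if any of its columns contains NaN. Since only ca and hal contain missing values here, an equivalent but more explicit call would restrict the check to those two columns using the standard subset argument (a sketch, not required here):

DataNew = DataNew.dropna(subset=['ca', 'hal'])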
Let's see how this operation changes the DataFrame:
print(DataNew.info())
The following results are returned:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 296 entries, 0 to 300
Data columns (total 14 columns):
age 296 non-null int64
sex 296 non-null int64
cp 296 non-null int64
trestbps 296 non-null int64
chol 296 non-null int64
fbs 296 non-null int64
restecg 296 non-null int64
thalach 296 non-null int64
exang 296 non-null int64
oldpeak 296 non-null float64
slope 296 non-null int64
ca 296 non-null float64
hal 296 non-null float64
HeartDisease 296 non-null int64
dtypes: float64(3), int64(11)
memory usage: 34.7 KB
None
Please note that all of the variables now have the same number of instances (296). Let's check whether any NaN values are still present by using the following command:
print(DataNew.isnull().sum())
The following results are returned:
age 0
sex 0
cp 0
trestbps 0
chol 0
fbs 0
restecg 0
thalach 0
exang 0
oldpeak 0
slope 0
ca 0
hal 0
HeartDisease 0
dtype: int64
NaN values are no longer present. We can therefore proceed with data analysis.
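To recap, the whole cleaning step of this section boils down to a few lines (assuming the Data DataFrame has been loaded as before):

import numpy as np

DataNew = Data.replace('?', np.nan)       # mark the '?' placeholders as missing
DataNew = DataNew.dropna()                # remove the six affected rows
assert DataNew.isnull().sum().sum() == 0  # no missing values remain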