Classification Project
In this blog post, I would like to walk through my experience of building a classification model and some of the methods and tools that I used.
The Project
The goal of this project is to help the Tanzanian Government, which struggles with providing clean water to its citizens, predict the condition of water points.
The Method

To complete this project, I made use of the OSEMN process: obtaining, scrubbing, exploring, and modeling the data, and finally interpreting the results.
The data is provided by the Tanzanian Government and covers waterpoints inside the country. The dataset contains nearly fifty thousand rows of individual waterpoints, grouped into three categories: 'functional', 'non-functional', and 'functional needs repair'. Each row has forty columns of information about the waterpoint, such as location, pump type, and well altitude.
In the scrubbing and exploring stages of the process, I dropped several columns that contain the same information as other columns. I also dropped several columns with information I believed to be useless for our purpose, such as the funder, the date the row was recorded, and who recorded it.
I also had to drop a few columns that contained too many unique discrete values; creating dummy variables for them would have overwhelmed the model. These include the waterpoint name and the scheme name. A rough sketch of this cleanup is shown below.
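Here is a minimal sketch of that cleanup, assuming the raw data sits in a pandas DataFrame; the file name and the exact column names ('funder', 'date_recorded', 'recorded_by', 'wpt_name', 'scheme_name') are my assumptions about how the dataset is laid out:

```python
import pandas as pd

df = pd.read_csv('water_points.csv')  # hypothetical file name

# Columns judged redundant or irrelevant for this purpose
redundant_cols = ['funder', 'date_recorded', 'recorded_by']
# Columns with too many unique values to dummy-encode sensibly
high_cardinality_cols = ['wpt_name', 'scheme_name']

df = df.drop(columns=redundant_cols + high_cardinality_cols)

# One-hot encode the remaining categorical columns
df = pd.get_dummies(df)
```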
Missing Data
Unfortunately, the rows missing the construction year and the population make up nearly half of the dataset, so I had to drop those two columns. For all the other columns with missing and placeholder values, I decided to use a tool from Scikit-learn called KNNImputer.
The KNN imputer looks at each row's K nearest neighbors and fills in each missing value with the average of those neighbors' values in the corresponding column.
```python
import pandas as pd
from sklearn.impute import KNNImputer

# X_train and X_test come from an earlier train-test split
imputer = KNNImputer(n_neighbors=5)
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
```
Building The Models
My approach to the project was to build several types of classifiers and compare their performance metrics (a rough sketch of the comparison loop follows the list). The models that I chose to build were:
- Logistic Regression
- K-Nearest Neighbour
- Decision Tree
- Random Forest
- eXtreme Gradient Boosting (XGBoost)
- Support Vector Machine (SVM)
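Here is a minimal sketch of that comparison loop, assuming the X_train/X_test split from the imputation step and string labels in y_train/y_test; the models use default hyperparameters here, not the tuned values discussed below:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score

# Encode the three string labels as integers (XGBoost in particular expects this)
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'K-Nearest Neighbour': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(),
    'SVM': SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train_enc)
    y_pred = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test_enc, y_pred):.3f}, "
          f"macro F1 = {f1_score(y_test_enc, y_pred, average='macro'):.3f}")
```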
GridSearchCV
While building the models, I made heavy use of GridSearchCV from the Scikit-learn library, a tool that, given a dictionary of parameters, searches for the combination of values that gives the best model performance. An example of the code I used to search for the parameters and check the best ones is provided below.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_param_grid = {
    'n_estimators': [30, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [6, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 3, 6]
}

rf_clf = RandomForestClassifier()
rf_clf_gridsearch = GridSearchCV(estimator=rf_clf, param_grid=rf_param_grid,
                                 scoring='accuracy', cv=5)
rf_clf_gridsearch.fit(X, y)
rf_clf_gridsearch.best_params_
```

Best Parameters: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 100}
Since GridSearchCV builds a separate model for every combination of parameters, it can take a massive amount of computation time. You therefore have to choose carefully which parameters, and which values for each, to pass into it.
In the example above for the Random Forest classifier, I chose to search over five different parameters with twelve values in total. Those values give 2 × 2 × 2 × 3 × 3 = 72 parameter combinations; with five-fold cross-validation on top, GridSearchCV had to build and compare three hundred and sixty models.
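If you want to sanity-check that count, scikit-learn's ParameterGrid enumerates the candidate combinations from the same dictionary (this reuses the rf_param_grid defined above):

```python
from sklearn.model_selection import ParameterGrid

n_candidates = len(ParameterGrid(rf_param_grid))  # 2 * 2 * 2 * 3 * 3 = 72
print(n_candidates * 5)                           # 72 candidates x 5 CV folds = 360 model fits
```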
The parameters I chose to search over for the Random Forest classifier include the split criterion: 'gini', which measures Gini impurity, and 'entropy', which measures information gain. I also searched for the best max_depth between a depth of 6 and no limit (no limit is the scikit-learn default). min_samples_split and min_samples_leaf are similar parameters: min_samples_leaf is the minimum number of samples required at a leaf node, while min_samples_split is the minimum number of samples required to split an internal node.
Comparing And Choosing The Models
When comparing models, the main metrics that I used were accuracy and F1 score.
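As a minimal sketch of how such metrics can be computed (the names clf, X_test, and y_test stand in for whichever fitted model and held-out split is being evaluated):

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # per-class precision, recall, and F1
```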


As you can see in the table and graph above, the best-performing classifiers were XGBoost and Random Forest. The accuracy and F1 scores for the two classifiers were nearly identical, with a slight edge to XGBoost.

Every classifier struggled with detecting the 'functional needs repair' category. As seen in the confusion matrix to the left, the true positive rate for the 'functional needs repair' category is much lower than for the other two categories, at 0.55. Many of the waterpoints in this category are predicted as functional. The F1 score for this category also saw a massive drop-off between the training and testing data, as seen below.

Therefore, I also considered choosing SVM, which had the best recall score for the 'functional needs repair' category at 0.65. A high recall score for a category means the model detects more of that type, but at the same time it increases the false positive rate; in this case, functional waterpoints would be classified as needing repairs. Being able to repair waterpoints before they break is important, but a high false positive rate for the 'functional needs repair' and 'non-functional' categories is also far from ideal: it would waste manpower and budget sending people out to fix waterpoints that do not need repairs.
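For readers who want to reproduce this kind of per-class inspection, here is a minimal sketch, assuming y_test and a model's predictions y_pred from earlier; normalizing the confusion matrix by row puts each category's recall on the diagonal:

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes; normalize='true'
# divides each row by its total, so the diagonal shows per-class recall
cm = confusion_matrix(y_test, y_pred, normalize='true')
print(cm.round(2))
```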
In the end, I decided on a compromise and chose the Random Forest classifier, which had a better recall score for the 'functional needs repair' category than XGBoost and much higher accuracy and F1 scores than SVM.
Interpreting The Final Model
After settling on the Random Forest classifier as the final model, I took a look at the twenty most important features.
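A minimal sketch of how these importances can be pulled out of the tuned model from the grid search above (assuming X is the feature DataFrame it was fit on):

```python
import pandas as pd

# Extract the tuned Random Forest found by the grid search
best_rf = rf_clf_gridsearch.best_estimator_

# Pair each importance with its column name and keep the top twenty
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))
```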

The most important feature for the classifier is the gps_height column, which is the altitude at which the well is located. It seems that the average altitude of functional waterpoints is higher than that of non-functional ones. A possible explanation might be that more rural regions are located at lower altitudes.
The reason the next three features are important seems obvious: they are all dummy variables created from the quantity column, with the values dry, enough, and insufficient. If a waterpoint is dry or holds an insufficient quantity of water, it is very likely non-functional.
The payment type of 'never pay' and the permit column being in the top twenty important features shows that some waterpoints are being taken care of better than others. Other features in the top twenty include scheme types and extraction types; it may be worth investigating these further to see which schemes and methods of extraction the government should avoid in the future. Several district codes are also on the list. Some districts seem to contain more non-functional waterpoints than others, which I believe warrants a closer inspection to find out why.