Analysis and Prediction of Dementia in Patients

Time period:

Start Date: November 2021
End Date: February 2022

Background

This was a collaboration with my friend Anshuman. I was sitting at home with a heap of free time and Data Science knowledge I had no use for when he approached me with his fourth-year project: the analysis and prediction of dementia in patients, as the title says.

The Problem Statement

We had to read a database of patient records, pass it through a model or a combination of models, and predict whether a given patient had dementia based on the input features.
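To make that concrete, here is a minimal sketch of the kind of pipeline this describes. The file name dementia.csv and the target column Group are placeholders, since the dataset isn't named here:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# "dementia.csv" and the "Group" column are hypothetical names,
# standing in for the actual patient database.
df = pd.read_csv("dementia.csv")
X = df.drop(columns=["Group"])   # patient features
y = df["Group"]                  # label: demented vs. non-demented

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```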

Important Hurdles

The main hurdle was that we had to consistently beat the obvious baseline models (logistic regression, k-means and a decision tree classifier) in accuracy, so we had to find a combination of models that outperformed them. Starting out, I was not even sure this was possible, as decision trees were seriously effective on this data (the model sometimes hit 92-95% accuracy).
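A quick cross-validated comparison of those baselines might look like the sketch below, reusing X and y from the earlier snippet. k-means is a clustering algorithm, so scoring it as a classifier needs an extra cluster-to-label mapping step that I've left out:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

baselines = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}
for name, clf in baselines.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```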

The Problem Solving Process

We ended up brute-forcing it.

After analyzing the data with various graphs and correlational visuals, we had a good idea of what each column meant and how the features related to one another. Anshuman also did his share of research on dementia and was very helpful in answering my questions about the disease and about whether a given piece of data was relevant. After this initial process, we tried our hand at many models, but no single model seemed to work.
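The correlational visuals were along the lines of a simple heatmap over the numeric columns; the exact details depend on the dataset, so treat this as a sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between numeric columns, to see
# which features relate to which.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Feature correlations")
plt.tight_layout()
plt.show()
```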

This is when we started testing model combinations. We knew about 6 basic models, and pairing each one with each other (including with itself) gave us 36 different combinations. I wrote a Python script that looped through all of these combinations and reported whenever a combination made significant progress over the previous top contender. But even while running model combinations, we could not find a clear winner over the Random Forest algorithm. The only contender to the Random Forest classifier was the Random Forest + Random Forest combination, and that seemed like a waste of time.
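A sketch of that kind of loop, using scikit-learn's VotingClassifier as one plausible way to run two models together. The six base models here are stand-ins rather than the exact list we used:

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(random_state=42),
    "rf": RandomForestClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "nb": GaussianNB(),
    "svm": SVC(probability=True),  # probability=True enables soft voting
}

best_score, best_pair = 0.0, None
# product(..., repeat=2) gives all 6 x 6 = 36 ordered pairs, RF+RF included.
for (name_a, a), (name_b, b) in product(models.items(), repeat=2):
    combo = VotingClassifier(
        estimators=[(f"{name_a}_1", a), (f"{name_b}_2", b)], voting="soft"
    )
    score = cross_val_score(combo, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_pair = score, (name_a, name_b)
        print(f"New best: {name_a} + {name_b} -> {score:.3f}")
```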

We made many other changes after that. We learned how to stack models on top of one another, and then brute-forced all the permutations and combinations of all 6 models. At one point we were running code for 5-7 minutes straight because we were testing over 200 different model-combination iterations.
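scikit-learn's StackingClassifier does the heavy lifting for this. A sketch of brute-forcing every ordered pair of distinct base models (from the dict above) with a logistic regression on top; deeper stacks just add more loop levels:

```python
from itertools import permutations
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

results = {}
# Every ordered pair of distinct base models, each feeding a
# logistic-regression final estimator.
for (name_a, a), (name_b, b) in permutations(models.items(), 2):
    stack = StackingClassifier(
        estimators=[(name_a, a), (name_b, b)],
        final_estimator=LogisticRegression(max_iter=1000),
    )
    results[(name_a, name_b)] = cross_val_score(stack, X, y, cv=5).mean()

best = max(results.items(), key=lambda kv: kv[1])
print("Best stack:", best)
```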

After sitting on this problem for another couple of weeks, Anshuman came to me with a new model he had seen mentioned on the internet, called XGBoost. I immediately looked up the documentation, added XGBoost to the list of models we were running, and we had a winner. XGBoost consistently beat the Random Forest classifier, and the XGBoost + Random Forest combination had no contenders at all.
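A sketch of that winning pairing, here combined as a stack with a simple final estimator (whether the two were voted or stacked is a detail this write-up glosses over):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Assumes y is integer-encoded (e.g. via sklearn's LabelEncoder),
# which XGBoost requires for classification targets.
xgb_rf = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(eval_metric="logloss")),
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
print("XGBoost + Random Forest:", cross_val_score(xgb_rf, X, y, cv=5).mean())
```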


Resources

Links