The human factors challenge invites teams to identify patterns between population density and COVID-19 cases and to recognize factors that could help predict hotspots of disease spread. Safe Planet AI, created by the Fuzzy Cats team, is a data-driven tool that supports decision making. It combines the explainability of a genetic fuzzy system with the accuracy and efficiency of a neural network.
Identifies factors that could help predict hotspots of disease spread
The inputs to the Fuzzy Inference System (FIS) are an initial set of factors collected by the team and hypothesized to affect disease spread. A genetic algorithm trains the membership functions of these inputs and the rules of the system. Because FIS rules are written in linguistic form, the system is explainable: by analyzing the trained rules, the factors most important to disease spread can be identified.
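To make the linguistic-rule idea concrete, here is a minimal sketch of how a single fuzzy rule could be evaluated. The variable names, value ranges, and triangular membership functions below are illustrative assumptions, not the team's trained system.

```python
# A minimal, illustrative sketch of evaluating one fuzzy rule.
# Variable names, ranges, and membership-function shapes are hypothetical.
import numpy as np

def tri_mf(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Example membership functions for two hypothetical inputs (roughly 0-1 normalized).
def density_high(x): return tri_mf(x, 0.5, 1.0, 1.5)    # "population density is high"
def income_low(x):   return tri_mf(x, -0.5, 0.0, 0.5)   # "median household income is low"

# Firing strength of the rule
# "IF density is high AND income is low THEN cases are high",
# using min() as the AND operator (a common Mamdani-style choice).
density, income = 0.8, 0.2                    # normalized readings for one county
firing_strength = min(density_high(density), income_low(income))
print(f"rule firing strength = {firing_strength:.2f}")
```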
Highly accurate prediction of new daily COVID-19 cases
To further strengthen confidence in the results, an artificial neural network is trained on the same data as the Genetic Fuzzy System, with 55 inputs and 1 output, where the output is the number of cases of the virus. Although an artificial neural network is not explainable, it helps corroborate the results of the genetic fuzzy system.
Correlate various environmental factors with the spread of the disease at the city level and predict current cases of COVID-19 in that particular area
This tool uses daily information about a city to predict its current number of COVID-19 cases. The underlying algorithm is trained on indirectly relevant data that can have an impact on the spread of the disease.
The team chose this challenge because it offered a chance to apply their experience with fuzzy logic systems and other machine learning methods. As engineers and data lovers, the team members followed COVID-19 news closely, always thinking about the underlying factors that caused the disease to spread so rapidly in some hotspots but not others. The challenge gave us the perfect opportunity to examine the factors we had hypothesized about and to find patterns among many inputs that only machine learning could uncover in a reasonable amount of time.
In developing the genetic fuzzy system, the team first examined open source data, and especially satellite data, that might be a factor in identifying hotspots of disease spread. To initially build the model, the focus was on counties in the United States, where we knew we would have access to the inputs we were interested in. Once the data was processed and the inputs were determined, the fuzzy inference system was coded in MATLAB and trained using a genetic algorithm. Due to time constraints, the genetic algorithm was unable to complete a sufficient training cycle. In the future, once the genetic algorithm finishes training the fuzzy system, the team will analyze the trained rules to identify the factors most important in predicting hotspots of disease spread. An example of such a rule follows, along with a sketch after it of how a genetic algorithm can tune this kind of system.
IF
population is high and
population density is high and
county area is small and
median household income is low and
people going out is high and
people aged 65 and above is medium and
unemployment rate is high and
number of hospital beds is low and
fatality rate is high for the state and
the number of ventilators is low
THEN
The number of cases in the county will be high.
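As a rough, hypothetical sketch of the training step described above, the following toy genetic algorithm tunes a parameter vector (standing in for membership-function parameters) against observed case counts. The encoding, fitness function, placeholder FIS evaluation, and GA settings are all assumptions for illustration, not the team's MATLAB implementation.

```python
# Toy genetic algorithm tuning parameters against case counts.
# Everything here (encoding, fitness, data) is an illustrative stand-in.
import numpy as np

rng = np.random.default_rng(0)

N_PARAMS = 6          # e.g. (a, b, c) feet/peaks of two triangular membership functions
POP_SIZE, GENERATIONS, MUT_STD = 30, 50, 0.1

def predict_cases(params, X):
    """Placeholder for FIS evaluation: a real version would fuzzify X,
    fire the rule base, and defuzzify to a case estimate."""
    return X @ params[:X.shape[1]]

def fitness(params, X, y):
    return -np.mean((predict_cases(params, X) - y) ** 2)   # negative MSE

# Toy data standing in for county-level inputs and observed case counts.
X = rng.random((100, N_PARAMS))
y = X @ np.linspace(1, 2, N_PARAMS) + rng.normal(0, 0.1, 100)

population = rng.random((POP_SIZE, N_PARAMS))
for _ in range(GENERATIONS):
    scores = np.array([fitness(ind, X, y) for ind in population])
    parents = population[np.argsort(scores)[-POP_SIZE // 2:]]      # selection
    children = parents[rng.integers(0, len(parents), POP_SIZE - len(parents))]
    children = children + rng.normal(0, MUT_STD, children.shape)   # mutation
    population = np.vstack([parents, children])

best = population[np.argmax([fitness(ind, X, y) for ind in population])]
print("best parameter vector:", np.round(best, 2))
```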
It is clear that no AI model will make itself work; we are still at least 20 years from technology where AI can find data, train itself, and deploy to the relevant demographic on its own. The point is, "AI is only as good as the data we can provide."
For this challenge, we wanted to use the most extensive and well-documented data available to us while also achieving fine granularity with the AI module. Because of those constraints, we stuck with US county-level data, as it provides fine detail and its fields can serve as independent variables to the system. The data that goes into this network is taken from (this link), under the name JHU Data.
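A hedged sketch of how such county-level data might be reduced to numerical independent variables is shown below. The toy frame stands in for the JHU county-level file referenced above (in practice it would be loaded with pandas' read_csv), and the column names are hypothetical placeholders.

```python
# Hypothetical preparation of county-level data: keep numeric columns as
# independent variables and split off the case counts as the target.
import pandas as pd

df = pd.DataFrame({
    "county": ["A", "B", "C"],                 # categorical, dropped for now
    "population": [100_000, 250_000, 30_000],
    "population_density": [520.0, 1300.0, 45.0],
    "confirmed_cases": [340, 1250, 12],        # hypothetical target column
})

numeric = df.select_dtypes(include="number")   # numerical variables only
y = numeric.pop("confirmed_cases")             # target: case counts
X = numeric                                    # independent variables
print(X.shape, y.shape)
```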
We built the neural network incrementally with different numbers of variables and tuned each network as well as we could in the time available. The architectures we tried included various activation functions for each layer, different numbers of hidden layers (2-5) as the dataset grew, different loss functions depending on the dataset and the purpose of the network (regression or classification), and more.
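The kind of architecture sweep described above could look like the following sketch; the layer widths and activation choices are assumptions for illustration only.

```python
# Illustrative builder for sweeping hidden-layer counts and activations.
# Widths and choices here are placeholders, not the final tuned model.
import torch.nn as nn

def build_mlp(n_inputs=55, hidden=(128, 64), activation=nn.ReLU):
    layers = []
    prev = n_inputs
    for width in hidden:
        layers += [nn.Linear(prev, width), activation()]
        prev = width
    layers.append(nn.Linear(prev, 1))     # single regression output
    return nn.Sequential(*layers)

# e.g. sweep 2-5 hidden layers with a couple of activation functions
candidates = [build_mlp(hidden=(128,) * n, activation=act)
              for n in range(2, 6) for act in (nn.ReLU, nn.Tanh)]
```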
However, as we all know, artificial neural networks are "black boxes", and it is very difficult to make sense of the final model and understand the input-output relationship. To make our model explainable, we tried the Fuzzy Cascade System, whose parameters were tuned with a genetic algorithm used for supervised learning. The idea was to compare the results and see whether we could make high-accuracy predictions with an explainable AI that also gives us an inside view of the model. The results for the Genetic Fuzzy System are inconclusive for now; we will continue working on it beyond the scope of the hackathon. We think this is the right direction to proceed in, as we want to make AI modules as explainable as we can.
Getting back to the neural network: we trained it using 55 independent numerical variables. We used only the numerical variables for now, but for the next milestone we are thinking of adding the categorical data as well, which should further improve the model. All 55 variables were normalized to keep the loss values within a numerically manageable range.
After many iterations, this is the neural network architecture that we feel is best optimized within the timeframe we had. Further work can of course be done to tune the hyper-parameters, but for the hackathon we are going with this model.
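For illustration, here is a minimal end-to-end training sketch consistent with the description above (55 normalized inputs, one regression output, SmoothL1Loss). The layer sizes, learning rate, and epoch count are assumptions, and synthetic data stands in for the real county features.

```python
# Minimal training sketch: 55 normalized inputs -> 1 case-count output.
# Hyper-parameters and data below are illustrative placeholders.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
X = rng.random((500, 55))                        # stand-in for county features
y = X.sum(axis=1, keepdims=True) * 100.0         # stand-in for case counts

# Normalize each of the 55 variables (zero mean, unit variance).
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(55, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
criterion = nn.SmoothL1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X_t), y_t)
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.4f}")
```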
1. SmoothL1Loss:
Why SmoothL1Loss and not RMSELoss? SmoothL1Loss (a Huber-style loss) behaves quadratically for small residuals and linearly for large ones, so a handful of counties with very large case counts do not dominate the loss the way they would with a squared-error loss such as RMSE.
When to use it? It is a good fit for regression problems whose targets contain large outliers, which is exactly the situation with county-level case counts.
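A quick numeric illustration of the difference, assuming PyTorch's SmoothL1Loss with its default settings: a single badly predicted sample inflates a squared-error loss far more than it inflates SmoothL1.

```python
# SmoothL1Loss grows linearly for large residuals, MSELoss quadratically,
# so one outlier dominates the squared-error loss far more.
import torch
import torch.nn as nn

pred   = torch.tensor([10.0, 20.0, 30.0, 1000.0])   # last prediction is far off
target = torch.tensor([11.0, 19.0, 31.0,   40.0])

print("SmoothL1:", nn.SmoothL1Loss()(pred, target).item())   # grows ~linearly with the outlier
print("MSE     :", nn.MSELoss()(pred, target).item())        # dominated by the outlier
```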
2. R-squared (R²):
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
R-squared = Explained Variation / Total Variation
Definition: It is the percentage of the response variable variation that is explained by a linear model.
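Written out explicitly, the R-squared computation used for evaluation looks like the following sketch; the numbers are toy values.

```python
# R-squared = 1 - (unexplained variation / total variation)
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation
    return 1.0 - ss_res / ss_tot

y_true = np.array([120., 340., 15., 980., 60.])        # toy case counts
y_pred = np.array([130., 310., 25., 950., 70.])
print(f"R^2 = {r_squared(y_true, y_pred):.3f}")
```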
Training Error: Plotting Error against Training Epochs
Training Plot: Plotting the Predicted Values against Ground Truth
Testing Plot: Plotting the Predicted Values against Ground Truth
In the Training Plot, the trained model predicts values with near-perfect accuracy compared to the ground truth. However, when results like these present themselves, our instinct as data engineers is that the model has definitely been overfitted and has developed a bias toward the training data.
The Testing Plot, however, confirms that the model is neither overfitted nor biased: it exhibits the same order of accuracy it showed during training. Further analysis of the plot suggests that the model falls apart at higher prediction values, for heavily infected counties such as those in New York or California. It could be argued that we have less training data for counties with such high case counts, since most of the available data comes from counties whose case numbers are small compared to those of the most heavily infected counties.
The next milestone we set for ourselves was to convert this neural network from regression to binary classification, i.e. not to predict the number of infected cases a county would have, but to detect whether that county would become a hotspot. We wanted to achieve this within the timeframe of the hackathon, but optimizing the network and trying various approaches used up the allocated time, so this will be the milestone we take on next. We are glad we took the time to optimize the network and to understand everything, as it will make progress on the next milestones much faster.
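As a sketch of that planned change, only the final output, the loss, and the labels need to differ from the regression setup. The hotspot threshold and layer sizes below are hypothetical placeholders, since this milestone has not been implemented yet.

```python
# Sketch of switching the network from regression to binary "hotspot" classification.
# Threshold and sizes are placeholders; this is a direction, not the implemented model.
import torch
import torch.nn as nn

HOTSPOT_THRESHOLD = 500            # hypothetical case count defining a hotspot

model = nn.Sequential(
    nn.Linear(55, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),              # single logit instead of a case-count estimate
)
criterion = nn.BCEWithLogitsLoss() # replaces SmoothL1Loss for classification

# Labels derived from the same case counts used for regression.
cases = torch.tensor([[120.], [2300.], [40.], [870.]])
labels = (cases > HOTSPOT_THRESHOLD).float()
features = torch.randn(4, 55)      # stand-in for the 55 normalized inputs

loss = criterion(model(features), labels)
loss.backward()
print("classification loss:", loss.item())
```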
First, a correlation matrix is computed to quantify how important the identified time-series variables are. Then, Principal Component Analysis is applied to filter the information, keeping only the variables with a high contribution to the evolution of the disease. Finally, three methods are used to obtain a prediction of the number of people currently infected with COVID-19:
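As a hedged sketch of the preprocessing steps just described (before the prediction methods themselves), a correlation matrix can be computed over the time-series variables and PCA used to keep only the highest-contribution components; the column names, data, and variance threshold below are illustrative assumptions.

```python
# Correlation matrix over daily city variables, then PCA to keep the
# components that explain most of the variance. Toy data and names only.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((60, 4)),                                   # 60 days of toy data
    columns=["mobility", "air_quality", "temperature", "searches"],
)

corr = df.corr()                        # how strongly the variables co-move
print(corr.round(2))

pca = PCA(n_components=0.95)            # keep components explaining 95% of variance
reduced = pca.fit_transform(df.values)
print("retained components:", pca.n_components_)
```

Note that PCA components are linear mixtures of the original variables, so judging which variables contribute most still requires inspecting the component loadings.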
This tool has been tested on New York City using data from February 1 to March 31, 2020. The inputs considered were:
In the future, more variables will be incorporated, such as stock data, prices of different products, number of events carried out in the city, etc.
Nevertheless, with the data considered, the results have been impressive:
Additional Tools: MATLAB, Python, Kepler, Jupyter, Tableau, PyTorch, Keras, TensorFlow
"Our ultimate aim is to develop a user friendly software tool accessible via the web on a laptop or mobile which can easily be deployed and used by decision makers at the federal, state, county or city level."
Johns Hopkins University (accessed via the C3.ai API):
Esri: COVID-19 case data for US counties with respect to different features
Environmental Protection Agency (EPA): Outdoor Air Quality Data