The human factors challenge invites teams to identify patterns between population density and COVID-19 cases and to recognize factors that could help predict hotspots of disease spread. Safe Planet AI, created by the Fuzzy Cats team, is a data-driven tool that supports decision making. It combines the explainability of a genetic fuzzy system with the accuracy and efficiency of a neural network.
Identifies factors that could help predict hotspots of disease spread
The inputs to the Fuzzy Inference System (FIS) are an initial set of factors collected by the team and hypothesized to affect disease spread. A genetic algorithm trains the membership functions of these inputs and the rules of the system. Because FIS rules are written in linguistic form, the system is explainable: by analyzing the trained rules, the factors most important to disease spread can be identified.
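To make the linguistic-rule idea concrete, here is a minimal sketch of how a single fuzzy rule could be evaluated. The variable names, value ranges, and triangular membership functions below are illustrative assumptions, not the team's trained system.

```python
# A minimal, illustrative sketch of evaluating one fuzzy rule.
# Variable names, ranges, and membership-function shapes are hypothetical.
import numpy as np

def tri_mf(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Example membership functions for two hypothetical inputs (roughly 0-1 normalized).
def density_high(x): return tri_mf(x, 0.5, 1.0, 1.5)    # "population density is high"
def income_low(x):   return tri_mf(x, -0.5, 0.0, 0.5)   # "median household income is low"

# Firing strength of the rule
# "IF density is high AND income is low THEN cases are high",
# using min() as the AND operator (a common Mamdani-style choice).
density, income = 0.8, 0.2                    # normalized readings for one county
firing_strength = min(density_high(density), income_low(income))
print(f"rule firing strength = {firing_strength:.2f}")
```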
Highly accurate prediction of new daily COVID-19 cases
To further strengthen confidence in the results, an artificial neural network is trained on the same data as the Genetic Fuzzy System, with 55 inputs and 1 output, where the output is the number of cases of the virus. Although an artificial neural network is not explainable, it helps corroborate the results of the genetic fuzzy system.
Correlate various environmental factors with the spread of the disease at the city level and predict current cases of COVID-19 in that particular area
This tool uses daily information about a city to predict its current number of COVID-19 cases. The underlying algorithm is trained on indirectly relevant data that can have an impact on the spread of the disease.
The team chose this challenge because it offered a chance to apply their experience with fuzzy logic systems and other machine learning methods. As engineers and data lovers, the team members followed COVID-19 news closely, always thinking about the underlying factors that caused the disease to spread so rapidly in some hotspots but not others. The challenge gave us the perfect opportunity to examine the factors we had hypothesized about and to find patterns among many inputs that only machine learning could uncover in a reasonable amount of time.
In developing the genetic fuzzy system, the team first examined open source data, and especially satellite data, that might be a factor in identifying hotspots of disease spread. To initially build the model, the focus was on counties in the United States, where we knew we would have access to the inputs we were interested in. Once the data was processed and the inputs were determined, the fuzzy inference system was coded in MATLAB and trained using a genetic algorithm. Due to time constraints, the genetic algorithm was unable to complete a sufficient training cycle. In the future, once the genetic algorithm finishes training the fuzzy system, the team will analyze the trained rules to identify the factors most important in predicting hotspots of disease spread. An example of such a rule follows, along with a sketch after it of how a genetic algorithm can tune this kind of system.
IF
population is high and
population density is high and
county area is small and
median household income is low and
people going out is high and
people aged 65 and above is medium and
unemployment rate is high and
number of hospital beds is low and
fatality rate is high for the state and
the number of ventilators is low
THEN
The number of cases in the county will be high.
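As a rough, hypothetical sketch of the training step described above, the following toy genetic algorithm tunes a parameter vector (standing in for membership-function parameters) against observed case counts. The encoding, fitness function, placeholder FIS evaluation, and GA settings are all assumptions for illustration, not the team's MATLAB implementation.

```python
# Toy genetic algorithm tuning parameters against case counts.
# Everything here (encoding, fitness, data) is an illustrative stand-in.
import numpy as np

rng = np.random.default_rng(0)

N_PARAMS = 6          # e.g. (a, b, c) feet/peaks of two triangular membership functions
POP_SIZE, GENERATIONS, MUT_STD = 30, 50, 0.1

def predict_cases(params, X):
    """Placeholder for FIS evaluation: a real version would fuzzify X,
    fire the rule base, and defuzzify to a case estimate."""
    return X @ params[:X.shape[1]]

def fitness(params, X, y):
    return -np.mean((predict_cases(params, X) - y) ** 2)   # negative MSE

# Toy data standing in for county-level inputs and observed case counts.
X = rng.random((100, N_PARAMS))
y = X @ np.linspace(1, 2, N_PARAMS) + rng.normal(0, 0.1, 100)

population = rng.random((POP_SIZE, N_PARAMS))
for _ in range(GENERATIONS):
    scores = np.array([fitness(ind, X, y) for ind in population])
    parents = population[np.argsort(scores)[-POP_SIZE // 2:]]      # selection
    children = parents[rng.integers(0, len(parents), POP_SIZE - len(parents))]
    children = children + rng.normal(0, MUT_STD, children.shape)   # mutation
    population = np.vstack([parents, children])

best = population[np.argmax([fitness(ind, X, y) for ind in population])]
print("best parameter vector:", np.round(best, 2))
```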
It is clear that no AI model will make itself work; we are still at least 20 years from technology where AI can find data, train itself, and deploy to the relevant demographic on its own. The point is, "AI is only as good as the data we can provide."
For this challenge, we wanted to use the most extensive and well-documented data available to us while also achieving fine granularity with the AI module. Because of those constraints, we stuck with US county-level data, as it provides fine detail and its fields can serve as independent variables to the system. The data that goes into this network is taken from (this link), under the name JHU Data.
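A hedged sketch of how such county-level data might be reduced to numerical independent variables is shown below. The toy frame stands in for the JHU county-level file referenced above (in practice it would be loaded with pandas' read_csv), and the column names are hypothetical placeholders.

```python
# Hypothetical preparation of county-level data: keep numeric columns as
# independent variables and split off the case counts as the target.
import pandas as pd

df = pd.DataFrame({
    "county": ["A", "B", "C"],                 # categorical, dropped for now
    "population": [100_000, 250_000, 30_000],
    "population_density": [520.0, 1300.0, 45.0],
    "confirmed_cases": [340, 1250, 12],        # hypothetical target column
})

numeric = df.select_dtypes(include="number")   # numerical variables only
y = numeric.pop("confirmed_cases")             # target: case counts
X = numeric                                    # independent variables
print(X.shape, y.shape)
```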
We built the neural network incrementally with different numbers of variables and tuned each network as well as we could in the time available. The architectures we tried included various activation functions for each layer, different numbers of hidden layers (2-5) as the dataset grew, different loss functions depending on the dataset and the purpose of the network (regression or classification), and more.
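The kind of architecture sweep described above could look like the following sketch; the layer widths and activation choices are assumptions for illustration only.

```python
# Illustrative builder for sweeping hidden-layer counts and activations.
# Widths and choices here are placeholders, not the final tuned model.
import torch.nn as nn

def build_mlp(n_inputs=55, hidden=(128, 64), activation=nn.ReLU):
    layers = []
    prev = n_inputs
    for width in hidden:
        layers += [nn.Linear(prev, width), activation()]
        prev = width
    layers.append(nn.Linear(prev, 1))     # single regression output
    return nn.Sequential(*layers)

# e.g. sweep 2-5 hidden layers with a couple of activation functions
candidates = [build_mlp(hidden=(128,) * n, activation=act)
              for n in range(2, 6) for act in (nn.ReLU, nn.Tanh)]
```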
However, as we all know, artificial neural networks are "black boxes", and it is very difficult to make sense of the final model and understand the input-output relationship. To make our model explainable, we tried the Fuzzy Cascade System, whose parameters were tuned with a genetic algorithm used for supervised learning. The idea was to compare the results and see whether we could make high-accuracy predictions with an explainable AI that also gives us an inside view of the model. The results for the Genetic Fuzzy System are inconclusive for now; we will continue working on it beyond the scope of the hackathon. We think this is the right direction to proceed in, as we want to make AI modules as explainable as we can.
Getting back to the neural network: we trained it using 55 independent numerical variables. We used only the numerical variables for now, but for the next milestone we are thinking of adding the categorical data as well, which should further improve the model. All 55 variables were normalized to keep the loss values within a numerically manageable range.
After many iterations, this is the neural network architecture that we feel is best optimized within the timeframe we had. Further work can of course be done to tune the hyper-parameters, but for the hackathon we are going with this model.
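For illustration, here is a minimal end-to-end training sketch consistent with the description above (55 normalized inputs, one regression output, SmoothL1Loss). The layer sizes, learning rate, and epoch count are assumptions, and synthetic data stands in for the real county features.

```python
# Minimal training sketch: 55 normalized inputs -> 1 case-count output.
# Hyper-parameters and data below are illustrative placeholders.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
X = rng.random((500, 55))                        # stand-in for county features
y = X.sum(axis=1, keepdims=True) * 100.0         # stand-in for case counts

# Normalize each of the 55 variables (zero mean, unit variance).
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(55, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
criterion = nn.SmoothL1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X_t), y_t)
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.4f}")
```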
1. SmoothL1Loss:
Why SmoothL1Loss and not RMSELoss? SmoothL1Loss (a Huber-style loss) behaves quadratically for small residuals and linearly for large ones, so a handful of counties with very large case counts do not dominate the loss the way they would with a squared-error loss such as RMSE.
When to use it? It is a good fit for regression problems whose targets contain large outliers, which is exactly the situation with county-level case counts.
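A quick numeric illustration of the difference, assuming PyTorch's SmoothL1Loss with its default settings: a single badly predicted sample inflates a squared-error loss far more than it inflates SmoothL1.

```python
# SmoothL1Loss grows linearly for large residuals, MSELoss quadratically,
# so one outlier dominates the squared-error loss far more.
import torch
import torch.nn as nn

pred   = torch.tensor([10.0, 20.0, 30.0, 1000.0])   # last prediction is far off
target = torch.tensor([11.0, 19.0, 31.0,   40.0])

print("SmoothL1:", nn.SmoothL1Loss()(pred, target).item())   # grows ~linearly with the outlier
print("MSE     :", nn.MSELoss()(pred, target).item())        # dominated by the outlier
```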
2. R-squared (R²):
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
R-squared = Explained Variation / Total Variation
Definition: It is the percentage of the response variable variation that is explained by a linear model.
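Written out explicitly, the R-squared computation used for evaluation looks like the following sketch; the numbers are toy values.

```python
# R-squared = 1 - (unexplained variation / total variation)
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation
    return 1.0 - ss_res / ss_tot

y_true = np.array([120., 340., 15., 980., 60.])        # toy case counts
y_pred = np.array([130., 310., 25., 950., 70.])
print(f"R^2 = {r_squared(y_true, y_pred):.3f}")
```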
Training Error: Plotting Error against Training Epochs
Training Plot: Plotting the Predicted Values against Ground Truth
Testing Plot: Plotting the Predicted Values against Ground Truth
In the Training Plot, the trained model predicts values with near-perfect accuracy compared to the ground truth. However, when results like these present themselves, our instinct as data engineers is that the model has definitely been overfitted and has developed a bias toward the training data.
The Testing Plot, however, confirms that the model is neither overfitted nor biased: it exhibits the same order of accuracy it showed during training. Further analysis of the plot suggests that the model falls apart at higher prediction values, for heavily infected counties such as those in New York or California. It could be argued that we have less training data for counties with such high case counts, since most of the available data comes from counties whose case numbers are small compared to those of the most heavily infected counties.
The next milestone we set for ourselves was to convert this neural network from regression to binary classification, i.e. not to predict the number of infected cases a county would have, but to detect whether that county would become a hotspot. We wanted to achieve this within the timeframe of the hackathon, but optimizing the network and trying various approaches used up the allocated time, so this will be the milestone we take on next. We are glad we took the time to optimize the network and to understand everything, as it will make progress on the next milestones much faster.
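As a sketch of that planned change, only the final output, the loss, and the labels need to differ from the regression setup. The hotspot threshold and layer sizes below are hypothetical placeholders, since this milestone has not been implemented yet.

```python
# Sketch of switching the network from regression to binary "hotspot" classification.
# Threshold and sizes are placeholders; this is a direction, not the implemented model.
import torch
import torch.nn as nn

HOTSPOT_THRESHOLD = 500            # hypothetical case count defining a hotspot

model = nn.Sequential(
    nn.Linear(55, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),              # single logit instead of a case-count estimate
)
criterion = nn.BCEWithLogitsLoss() # replaces SmoothL1Loss for classification

# Labels derived from the same case counts used for regression.
cases = torch.tensor([[120.], [2300.], [40.], [870.]])
labels = (cases > HOTSPOT_THRESHOLD).float()
features = torch.randn(4, 55)      # stand-in for the 55 normalized inputs

loss = criterion(model(features), labels)
loss.backward()
print("classification loss:", loss.item())
```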
First, a correlation matrix is computed to quantify how important the identified time-series variables are. Then, Principal Component Analysis is applied to filter the information, keeping only the variables with a high contribution to the evolution of the disease. Finally, three methods are used to obtain a prediction of the number of people currently infected with COVID-19:
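As a hedged sketch of the preprocessing steps just described (before the prediction methods themselves), a correlation matrix can be computed over the time-series variables and PCA used to keep only the highest-contribution components; the column names, data, and variance threshold below are illustrative assumptions.

```python
# Correlation matrix over daily city variables, then PCA to keep the
# components that explain most of the variance. Toy data and names only.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((60, 4)),                                   # 60 days of toy data
    columns=["mobility", "air_quality", "temperature", "searches"],
)

corr = df.corr()                        # how strongly the variables co-move
print(corr.round(2))

pca = PCA(n_components=0.95)            # keep components explaining 95% of variance
reduced = pca.fit_transform(df.values)
print("retained components:", pca.n_components_)
```

Note that PCA components are linear mixtures of the original variables, so judging which variables contribute most still requires inspecting the component loadings.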
This tool has been tested on New York City using data from February 1 to March 31, 2020. The inputs considered were:
In the future, more variables will be incorporated, such as stock data, prices of different products, number of events carried out in the city, etc.
Nevertheless, with the data considered, the results have been impressive:
Additional Tools: MATLAB, Python, Kepler, Jupyter, Tableau, PyTorch, Keras, TensorFlow
"Our ultimate aim is to develop a user friendly software tool accessible via the web on a laptop or mobile which can easily be deployed and used by decision makers at the federal, state, county or city level."
Johns Hopkins University (accessed via the C3.ai API):
Esri: COVID-19 case data for US counties with respect to different features
Environmental Protection Agency (EPA): Outdoor Air Quality Data