Our team was asked to identify a pattern between population density and COVID-19 cases and to identify factors that could aid in predicting future hotspots. We took into account factors such as average household income, population density, GDP, percent of GDP spent on education, and geography (water-locked or landlocked). We used RStudio to run four comprehensive models (see slides) and assess how well each variable predicts COVID-19 case and death counts. Based on the accuracy of the line of best fit (which was generated from the researchers’ data), we identified which factors were relevant and provided reasoning to explain the model’s results for each factor. If any of our factors were deemed relevant to predicting COVID-19 case or death counts, we could predict future hotspots by stating that countries affected by that factor have a high chance of developing hotspots.
Our team consists of students from India (Pune), Texas (Austin), California (San Jose and Tracy), and Indiana (Indianapolis). During our initial discussions, we realized that the COVID-19 pandemic had affected each of our communities differently, and we quickly became interested in finding out which factors specifically contributed to the spread of the disease in our communities. We then expanded our initial question to ask which factors affected the spread of the disease across countries, because working at the country level let us consider a broader range of factors such as GDP, average air quality index, and more. We chose the Human Factors challenge because we hoped to identify which factors were most relevant in tracking COVID-19 spread and how those factors could be used to predict further spread. Specifically, we hope our research contributes to preventing the spread of COVID-19: if we can identify the significant reasons behind the spread, governments can respond more effectively to mitigate further exposure to the virus.
During our very first team meeting, we decided to break the prompt into task-focused questions.
By asking such questions, we quickly established a professional and guided workflow. We divided ourselves into three distinct teams (Data Collection, Coding, and Presentation) to accomplish different aspects of our project. The Coding team watched the “Intro to Data Exploration” video by Bea Hernandez, the Presentation team attended the “How to Pitch a Winning Hackathon Solution” boot-camp by Paloma Lecheta, and the Data Collection team took the “How to Find Space Resources on the Internet” boot-camp by Alexandre Belloni Alves. The teams depended on one another: the Coding team could not develop its models without data from the Data Collection team, and the Presentation team could not highlight and explain the results without the models from the Coding team. During the last hours of the project, we came together, completed a comprehensive review of our work, and created the slideshow to explain our results and suggestions for future application.
We used NASA’s “EARTHDATA” data pathfinder and one of NASA’s air quality information pages to help guide which variables we considered when analyzing correlation with COVID-19 cases. NASA’s air quality information page was instrumental in allowing us to track the air quality in each country.
We used the programming language R in RStudio to write a program that applies multiple regression to the data we gathered and determines the best-fit lines for predicting COVID-19 cases and deaths. From the possible contributing factors, we created every possible combination of factors and thus every possible multiple regression line for the given data. The program assigned each regression line a score called Mallows’ Cp; a lower Cp value indicates a model that fits the data well without carrying unnecessary factors. The best model is the line with the lowest Cp value in which the coefficients of all input variables are significantly different from 0 (that is, the P-value of every coefficient in that line is less than 0.05). This is the line that we would choose to model the data, and the input factors present in it are therefore the factors that can help predict COVID-19 cases and deaths in a particular place, and in turn future hotspots for coronavirus.
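The selection procedure described above can be sketched as follows. This is a minimal illustration written in Python rather than R, and it is not the program the team ran. The data are synthetic, the factor names (`gdp`, `density`, `education`) are placeholders, and the helper functions `solve`, `sse`, and `best_subsets` are hypothetical names introduced only for this example. The sketch ranks every combination of factors by Mallows' Cp, computed as SSE_p / s² − n + 2p, where s² is the residual variance of the model containing all factors and p counts the coefficients including the intercept. It omits the P-value check on individual coefficients.

```python
from itertools import combinations
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sse(columns, y):
    """Residual sum of squares of a least-squares fit of y on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))


def best_subsets(predictors, y):
    """Return (Cp, factor names) for every non-empty subset of factors, lowest Cp first."""
    n = len(y)
    names = list(predictors)
    full_params = len(names) + 1  # all factors plus the intercept
    sigma2 = sse([predictors[name] for name in names], y) / (n - full_params)
    results = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            p = size + 1  # coefficients in this model, including the intercept
            cp = sse([predictors[name] for name in subset], y) / sigma2 - n + 2 * p
            results.append((cp, subset))
    return sorted(results)


# Synthetic data (NOT the team's dataset): the response depends only on "gdp".
random.seed(0)
n = 40
data = {
    "gdp": [random.uniform(0, 10) for _ in range(n)],
    "density": [random.uniform(0, 10) for _ in range(n)],
    "education": [random.uniform(0, 10) for _ in range(n)],
}
y = [3.0 * g + random.gauss(0, 1) for g in data["gdp"]]

ranked = best_subsets(data, y)
for cp, subset in ranked:
    print(f"Cp = {cp:8.2f}  factors = {', '.join(subset)}")
```

On this synthetic data, every model that leaves out `gdp` receives a very large Cp, so the lowest-ranked models all include it. In R, the `leaps` package offers this kind of exhaustive search, though the write-up does not say which package was used.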
One problem we encountered was difficulty extracting data from certain websites and datasets due to complex presentation formats, incomplete sources, and outdated values. This forced our researchers to expand beyond the provided space agency data and sort through other publications and official reports. A potential problem that we expected to face was that our varying time zones would hamper our collaborative work sessions. However, we developed an efficient time management system in which the researchers worked independently and later shared their progress at a time that worked for everyone (11 AM PST), and the presenters and the coding team worked synchronously after the researchers completed their data collection. In fact, our situation enabled us to work longer hours on the project, which allowed our researchers to analyze more factors and our programmers to set up more models.
As for achievements, our final conclusion was that only one variable had a strong correlation with COVID-19 cases and deaths: GDP. All other variables (air quality, population density, percent of GDP spent on education, average household income, and whether the country was landlocked or water-locked) had little to no correlation, because at any given value of these variables, both high and low numbers of COVID-19 cases and deaths are present. However, the GDP result could be a consequence of the United States being a major outlier: its GDP and its COVID-19 cases and deaths are far greater than those of the other countries. Another possible reason for GDP’s positive correlation with COVID-19 cases and deaths is that poorer countries may not have the means to report COVID-19 cases and deaths, thus making those countries appear healthier.
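The outlier concern can be illustrated with a small sketch. The numbers below are invented for the illustration and are not the project's data. The function `pearson` is a helper written for this example, and the variable names are placeholders. The sketch computes the Pearson correlation for a set of points with essentially no relationship, then again after adding a single point that is far larger on both axes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented values in arbitrary units: no real relationship between the two.
gdp = [1, 2, 3, 4, 5, 6, 7, 8]
cases = [5, 3, 6, 2, 7, 3, 5, 4]

r_without = pearson(gdp, cases)

# Add one hypothetical country that is far larger on both axes.
r_with = pearson(gdp + [100], cases + [100])

print(f"correlation without the outlier: {r_without:.3f}")
print(f"correlation with the outlier:    {r_with:.3f}")
```

A single extreme point is enough to turn a near-zero correlation into a very strong one, which is why refitting the models without the United States would be a useful check on the GDP result.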
The main conclusion that we can draw from our methods is that no matter how developed a country is, whether it is a first or a third world country, it has the potential to be the hotspot for a deadly disease. It is important for all countries to remain vigilant to this threat, as no country is immune to it. Our project suggests that most of the pre-existing factors we examined have little to no correlation with the spread of COVID-19; rather, researchers should look into the responses of individual countries and identify which actions were beneficial in halting the spread of the virus. We discussed potential ideas for expansion and application of our models in our slideshow.
https://docs.google.com/presentation/d/1TXGcASLA9dv66edLJqUsPsf4pxz1GkJloRp15Yc844I/edit?usp=sharing
NASA Source:
From this source, we navigated to the Public Health tab, where we found this link: https://airquality.gsfc.nasa.gov/health
On this page, we found this link, which consolidated NASA air quality measurements taken from space into an interactive data program:
https://www.stateofglobalair.org/data/#/air/plot
Other Sources: