Our project addresses SDG 11, Sustainable Cities and Communities. It predicts potential COVID-19 hotspots at the county level across the entire United States. This supports the goal because cities and communities can only be sustainable if they plan in advance to limit the impact of COVID-19. The second part of our project predicts how many infections a country could expect if another pandemic occurs. The SDGs are mostly targeted at the year 2030, and we cannot be certain that another major pandemic will not affect the entire world before then. By comparing a country’s expected future characteristics with those of countries today, we can show a graph of what that country could expect from a future pandemic, helping it respond as efficiently as possible. Using county-level data, we visualize potential COVID-19 hotspots; using global data, we match a country’s future characteristics against present-day countries to predict the expected outcomes. COVID-19 has had a mostly negative impact on SDG 11, and the data from both of our models will aid the sustainability of cities and communities by letting governments plan more adequately for the future.
We chose this challenge because we feel it is important to understand how COVID-19 has affected progress on the SDGs set by the United Nations, and because the question is particularly relevant today. We want the United Nations to succeed with its goals. To help, we have to measure the impact of COVID-19 and work out how to counteract it so that the goals stay on track. Our approach was to build our own metric for locating COVID-19 hotspots in the United States and to display it as a choropleth map. We used space agency data on the amount of NO2 in the air because, according to NASA, NO2 levels have dropped drastically since stay-at-home orders began. We also used light pollution data because, with less activity, light pollution has decreased as well. The light pollution distribution also gave us an initial idea of where the major metropolitan regions are, which offers a first look at potential COVID-19 hotspots without any further analysis. Thinking further, we realized that we did not have to limit ourselves to counties in the United States and could build a second model covering the countries of the entire world. After more brainstorming, we decided that if another pandemic happens in the future, users could input the characteristics of their country that influence the spread of a pandemic. We then match this input with an existing country that is fighting COVID-19, so that someone in the future can see whether their country is at risk and how quickly or slowly the pandemic would spread there.
This directly supports the SDGs because the goals have to be sustainable for years into the future. Predicting how a pandemic could affect a given country helps that nation protect itself and stay on track to reach its goals.
Our team developed two different models to tackle our goal.
Model 1 uses the Julia programming language for visualization and a Python script for prediction. It predicts two quantities, a country’s growth in pandemic cases over time and its NO2 levels over time, from key factors that the user inputs: population density, hospital beds, senior population, tourism per year, and fraction of the population below the poverty line. We collected datasets for these key factors for each present-day country from reliable and credible sources to minimize discrepancies. Our Python script takes the user’s input and uses a nearest-neighbors approach to determine which current country best matches the inputted key factors. Each country is scored against the input using Euclidean distance and our defined weights, and the lowest-scoring countries are the input’s nearest neighbors. The weights express how much each key factor contributes to the spread of COVID-19 and, subsequently, to the decrease in pollution. We acknowledge that using fewer data points than desired (countries only) could leave the weights slightly off. In the future, we hope to gather data for states and cities to define the weights better.
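The matching step described above can be sketched as follows. This is a minimal illustration, not our actual script: the factor names, the country profiles, and the weights are made-up placeholders, and the sketch assumes every factor has already been scaled to the range 0 to 1 so that no single factor dominates the distance.

```python
import math

# Key factors from the write-up; the identifier spellings are hypothetical.
FACTORS = (
    "population_density",
    "hospital_beds",
    "senior_population",
    "tourism_per_year",
    "poverty_fraction",
)

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two factor profiles."""
    return math.sqrt(sum(weights[f] * (a[f] - b[f]) ** 2 for f in FACTORS))

def nearest_countries(query, countries, weights, k=1):
    """Return the names of the k countries closest to the query profile."""
    ranked = sorted(
        countries,
        key=lambda name: weighted_distance(query, countries[name], weights),
    )
    return ranked[:k]

# Placeholder profiles (already scaled to 0..1); not real country data.
countries = {
    "Country A": dict(zip(FACTORS, (0.9, 0.2, 0.3, 0.8, 0.1))),
    "Country B": dict(zip(FACTORS, (0.2, 0.7, 0.6, 0.3, 0.2))),
    "Country C": dict(zip(FACTORS, (0.5, 0.5, 0.1, 0.1, 0.6))),
}

# Placeholder weights: population density counts double in this example.
weights = dict(zip(FACTORS, (2.0, 1.0, 1.0, 1.0, 1.0)))

query = dict(zip(FACTORS, (0.25, 0.6, 0.5, 0.3, 0.25)))
best_match = nearest_countries(query, countries, weights, k=1)
print(best_match)
```

Once the closest country is known, its recorded case curve and NO2 series can be handed to the visualization step as the expected outcome for the user's input.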
Julia is well suited to visualization, and we used it to program interactive graphs that let users see how the curve has grown since the start of the pandemic. We hope that if another pandemic were to occur, a country could input its data and we would provide a visualization of the most likely pandemic growth and NO2 levels, based on similarities in key factors. In Julia, we created three interactive graphs: COVID-19 cases over time, new COVID-19 cases vs. total cases over time, and NO2 pollution level over time. Each graph has a bar that lets the user see how the graph changes over elapsed time.
Early on, we ran into issues working on Julia code together: because Julia is a less common language, some of our teammates did not have access to it. We found a tool called mybinder and installed Julia into its kernel so that all of us could easily access the code. We also had trouble finding adequate data that would produce meaningful visualizations. Because we could not find standardized data at the city level, we decided to use countries for this model. As a team, we worked tirelessly to find datasets that reflected our key factors, as well as the NASA air pollution data. We are proud of what we accomplished and glad that one of our team members was able to teach us a new programming language that helped emphasize visualization.
Model 2
Model 2 predicts the number of expected COVID-19 cases for a given date. First, we acquired a comprehensive list of cumulative COVID-19 cases per day for each county in the US, from 1/22/20 to 5/29/20. We fit the data up to 5/25/20 with a polynomial regression model to predict the number of cases on a future date. Because we wanted the error between actual and expected cases to be as small as possible, we sought to minimize the Root Mean Squared Error (RMSE). Each county was therefore fit with polynomial functions of degree 0 through 20, and the fit with the smallest RMSE was chosen. We then used the model to predict the number of cases per county for May 26th through 29th and compared the results with the actual number of cases. We observed that the deviation between actual and expected cases grew as the number of days predicted increased, so we shortened the prediction window for the sake of accuracy. Finally, we created a choropleth map that displays these predictions for all counties across the United States. The results appear to be in line with recent trends: hotspots appear in the New York and California regions, while much of the Midwest is relatively sparse.
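The degree search described above can be sketched as follows. Two things here are assumptions rather than a record of our scripts. First, the sketch uses NumPy's `Polynomial.fit` in place of Scikit-Learn to keep the example short. Second, it measures RMSE on a few held-out final days, since the RMSE on the fitted data itself can only shrink as the degree grows and would always favor the highest degree. The function names and the synthetic case counts are illustrative only.

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def best_polynomial(days, cases, holdout=4, max_degree=20):
    """Fit degrees 0..max_degree on all but the last `holdout` days and
    return (degree, holdout_rmse, fitted_polynomial) for the best degree."""
    days = np.asarray(days, dtype=float)
    cases = np.asarray(cases, dtype=float)
    train_x, test_x = days[:-holdout], days[-holdout:]
    train_y, test_y = cases[:-holdout], cases[-holdout:]

    best = None
    for degree in range(max_degree + 1):
        # Polynomial.fit rescales the x values internally, which keeps
        # higher-degree fits numerically stable.
        poly = np.polynomial.Polynomial.fit(train_x, train_y, degree)
        score = rmse(test_y, poly(test_x))
        if best is None or score < best[1]:
            best = (degree, score, poly)
    return best

# Synthetic cumulative counts for one hypothetical county (not real data).
days = np.arange(60)
cases = 3.0 * days**2 + 5.0 * days + 10.0

degree, score, poly = best_polynomial(days, cases, holdout=4, max_degree=6)
next_day_estimate = float(poly(60))
print(degree, score, next_day_estimate)
```

In this sketch, a `holdout` of 4 mirrors the four comparison days (May 26th through 29th) described above. Running the function once per county would give the per-county predictions that feed the choropleth map.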
During the development of this model, we considered including other important factors such as age, population density, income level, and mobility patterns. However, with the exception of mobility, all of these categories were relatively static numbers and would not account for fluctuations in the number of COVID-19 cases per county. Given the challenging nature of collecting such variable data during a pandemic, we chose to focus solely on the daily county records of COVID-19 cases for the time being, since those were the most recent, relevant statistics available.
Perhaps the most interesting aspect of this project was the model creation itself. All scripts were written in Python. Data extraction was performed with Python’s Pandas library, and polynomial regression was done with Python’s machine learning library Scikit-Learn. Our team members were amazed at how much classroom-level statistics we could apply to such a real-world problem, and we learned many useful programming techniques along the way.
https://docs.google.com/presentation/d/1umv9s3uhsNXq0pN2QmC1dP6tJ7ghFIZHQWdUz1ML7cU/edit?usp=sharing