As data analysts, we want to use our skills to help our local and global communities, so for this project we chose to explore an issue related to climate change. As Oregonians, wildfires are of recurring relevance to our lives. After decades of increasingly frequent wildfires, Oregon saw one of its worst fire seasons in 2021. With average temperatures continuing to rise locally and around the globe, the trend shows no signs of slowing. We found a well-suited dataset for exploring this topic: it comes from Kaggle and includes an SQLite file and various CSV files containing data on 1.88 million US wildfires over 23 years.
Questions we hope to answer in the coming weeks:
• Have wildfires in Oregon increased in frequency?
• Have Oregon wildfires increased in size over time?
• Has Oregon wildfire season changed over time?
• Using machine learning, can we predict how wildfires may change over the next 5 years?
• Can we determine which areas of the state are at higher risk?
We will communicate primarily through Slack over the next few weeks, with check-ins on Saturdays and Sundays as well as during class time. For the first segment, Nina explored the raw data and cleaned it up for use, Alex explored possible machine learning options, and Jovan created the repository and documented the process.
After importing libraries and the SQLite file into a Jupyter Notebook and creating an engine to talk to the database, we performed the following steps (see the sketch after this list):
- Inspected the names of the tables and printed the columns to see what data we have to work with.
- Filtered the data to show only fire data from Oregon. This left us with 61,088 rows of data.
- Determined the data types of the columns.
- Created a dataframe of Oregon fires with only the columns relevant to our inquiry.
- Checked the counts of the columns to see if any have missing data, then dropped those rows. This left us with 60,751 rows of data.
- Explored the value counts of the column containing causes of fires in Oregon.
- Explored the value counts of the column containing counties where fires are reported.
- Converted the Julian discovery dates and containment dates into a readable format, and dropped the Julian versions from the dataframe.
- Calculated the number of days each fire burned.
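A minimal sketch of these steps with pandas and SQLAlchemy, assuming the Kaggle file FPA_FOD_20170508.sqlite and the column names from its published Fires table (adjust if your copy differs):

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

# Connect to the SQLite database (file name assumes the Kaggle download).
engine = create_engine("sqlite:///FPA_FOD_20170508.sqlite")

# Inspect the table names to see what data we have to work with.
inspector = inspect(engine)
print(inspector.get_table_names())

# Keep only Oregon fires and the columns relevant to our questions.
query = """
    SELECT FIRE_YEAR, STAT_CAUSE_DESCR, FIRE_SIZE, FIRE_SIZE_CLASS,
           DISCOVERY_DATE, CONT_DATE, FIPS_NAME
    FROM Fires
    WHERE STATE = 'OR'
"""
or_fires = pd.read_sql(query, engine).dropna()

# Convert the Julian dates to a readable format and drop the Julian versions.
or_fires["DISCOVERY"] = pd.to_datetime(or_fires["DISCOVERY_DATE"], unit="D", origin="julian")
or_fires["CONTAINMENT"] = pd.to_datetime(or_fires["CONT_DATE"], unit="D", origin="julian")
or_fires = or_fires.drop(columns=["DISCOVERY_DATE", "CONT_DATE"])

# Number of days each fire burned.
or_fires["BURN_DAYS"] = (or_fires["CONTAINMENT"] - or_fires["DISCOVERY"]).dt.days
```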
Once the data has been cleaned, machine learning models can be run on some of the numerical data.
The fires dataframe was merged with precipitation and temperature dataframes grouped by year and county. The final dataframe for machine learning included county code, year, average fire size, average discovery month, average containment month, average fire duration, average precipitation, and average temperature (832 rows × 6 columns).
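A minimal, self-contained sketch of the merge and group-by pattern, using toy stand-ins for the fire and weather tables (the real column names and values differ):

```python
import pandas as pd

# Toy stand-ins for the per-fire table and the county/year weather table.
fires = pd.DataFrame({
    "county": ["Lane", "Lane", "Deschutes"],
    "year": [2014, 2014, 2015],
    "fire_size": [10.0, 250.0, 3.5],
    "duration_days": [1, 14, 2],
})
weather = pd.DataFrame({
    "county": ["Lane", "Deschutes"],
    "year": [2014, 2015],
    "avg_precip": [46.1, 11.7],
    "avg_temp": [52.3, 48.9],
})

# Average the fire metrics per county and year, then merge in the weather averages.
fires_grouped = (fires.groupby(["county", "year"])
                      .agg(avg_fire_size=("fire_size", "mean"),
                           avg_duration=("duration_days", "mean"))
                      .reset_index())
ml_df = fires_grouped.merge(weather, on=["county", "year"])
print(ml_df)
```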
We tested a simple multiple linear regression and several tuned random forest regression models. We also attempted a time series VAR model, but it did not pass Granger's causality test.
Multiple linear regression
Method:
- Split into train and test sets.
- Run LinearRegression()
- Fit and test the model.
Accuracy: R² = -0.235
Conclusion: Not a good model; a negative R² means it performs worse than simply predicting the mean.
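A minimal sketch of that workflow with scikit-learn, run here on synthetic data standing in for the 832-row county/year dataframe:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in features (precipitation, temperature, months, duration, ...)
# and target (average fire size); values are random, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(832, 6))
y = rng.normal(size=832)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R2:", r2_score(y_test, model.predict(X_test)))
```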
Random forest regression
Method:
- Split into train and test sets.
- Scale the data (because we have very large values)
- Run RandomForestRegressor()
- Fit and test the model.
- Tune with GridSearchCV. Best parameters: {'criterion': 'mse', 'max_depth': 7, 'max_features': 'sqrt', 'n_estimators': 500}
- Tune multiple times with RandomizedSearchCV. Final parameters: {'n_estimators': 415, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None}
Accuracy:
- No tuning: 21.86%, R² = 0.193
- GridSearchCV: 38.6%, R² = 0.215
- RandomizedSearchCV: 47.41%, R² = 0.258
Conclusion: Not a good model; even after tuning, the accuracy and R² scores are too low.
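A minimal sketch of the random forest workflow with scaling and a RandomizedSearchCV tune, again on synthetic stand-in data and with an illustrative parameter space (not the exact grid we searched):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 832-row county/year dataframe (toy values only).
rng = np.random.default_rng(42)
X = rng.normal(size=(832, 6))
y = rng.normal(size=832)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features (the real data had very large values such as acres burned).
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Randomized search over an illustrative hyperparameter space.
param_dist = {
    "n_estimators": [100, 200, 300, 415, 500],
    "max_depth": [None, 3, 5, 7],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                            param_dist, n_iter=20, cv=3, random_state=42)
search.fit(X_train_s, y_train)

print("Best params:", search.best_params_)
print("Test R2:", search.best_estimator_.score(X_test_s, y_test))
```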
VAR time series
Method:
- Test causation using Granger’s causality test
- Perform cointegration test
- Split the series into training and testing data
- Check for stationarity and make the time series stationary
- Select the order (p) of VAR model
- Train the VAR model of selected order (p)
- Check for serial correlation of residuals (errors) using Durbin Watson statistic
- Forecast with the trained VAR model
- Invert the transformation to get the real forecast
- Plot of forecast vs actuals
- Evaluate the forecast
Accuracy: Did not pass Granger's causality test.
Conclusion: We could not use this model because the data did not pass Granger's causality test.
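A minimal sketch of the causality check that ruled this model out, using statsmodels on synthetic series in place of the real county/year data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic stand-ins for two yearly series (e.g. average fire size and
# average temperature); the real test used the county/year dataframe.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "avg_fire_size": rng.normal(size=40),
    "avg_temp": rng.normal(size=40),
})

# Tests whether avg_temp Granger-causes avg_fire_size at lags 1..3.
# If the p-values stay above 0.05 at every lag, the null hypothesis of
# "no causality" is not rejected, which is why we abandoned the VAR approach.
results = grangercausalitytests(data[["avg_fire_size", "avg_temp"]], maxlag=3)
```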
To improve the random forest regression analysis, we should have encoded the year and the county code (dummy or simple encoding) to improve accuracy. Even so, these were not the best models or data to use. What we were really after was a forecasting model that would help us predict fire outcomes into the future using a form of time series analysis, and the best data for that is geographic data (MODIS) paired with weather and vegetation data via Google Earth Engine. MODIS includes a product, the Global Daily Fire Location Product, that would be well suited to this analysis. We thought VAR would be the best model for this purpose, but the Autoregressive Integrated Moving Average (ARIMA) model would actually be a better fit: it combines an autoregressive (AR) component with a moving average (MA) component, a weighted linear combination of past values and past forecast errors, to produce its predictions. Global Fire Season Severity Analysis and Forecasting by Ferreira et al. (2019) outlines methods for building a fire prediction model, and future model building for Oregon could use similar methods, including ARIMA.
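As a pointer for that future work, a minimal ARIMA sketch with statsmodels on a synthetic yearly series; the order (1, 1, 1) is a placeholder, not a tuned choice:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic yearly series standing in for a fire-severity metric such as
# total acres burned per year over the 1992-2015 span of the dataset.
rng = np.random.default_rng(42)
years = pd.period_range("1992", "2015", freq="Y")
series = pd.Series(rng.normal(loc=100, scale=20, size=len(years)).cumsum(), index=years)

# Fit ARIMA(p, d, q); the order would normally be chosen from ACF/PACF plots
# or an information criterion such as AIC.
model = ARIMA(series, order=(1, 1, 1)).fit()

# Forecast five years ahead (e.g. toward our 2027 prediction question).
print(model.forecast(steps=5))
```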
A future project could include a categorical analysis using neural networks to predict final fire size class (A, B, C, D, E, F, or G) from user-entered parameters, including current temperature, humidity, county, month of the year, acres burned so far, and duration so far. MODIS data could also be incorporated. Unfortunately, we did not have the time to complete this model.
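A rough sketch of what such a classifier could look like with scikit-learn's MLPClassifier, using synthetic inputs and labels (purely illustrative; the project did not build this model):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: six numeric inputs (temperature, humidity, county code,
# month, acres burned so far, duration so far) and a fire size class A-G.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 6))
y = rng.choice(list("ABCDEFG"), size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler().fit(X_train)

# Small feed-forward network as a starting point; layer sizes are arbitrary.
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
clf.fit(scaler.transform(X_train), y_train)
print("Test accuracy:", clf.score(scaler.transform(X_test), y_test))
```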
We will be using PostgreSQL running in AWS RDS.
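Connecting to that database from the notebook could look like the following SQLAlchemy sketch; the hostname, credentials, and database name are placeholders, not the project's actual values:

```python
from sqlalchemy import create_engine

# All connection details below are placeholders.
db_url = (
    "postgresql://db_user:db_password"
    "@oregon-fires.abc123xyz.us-west-2.rds.amazonaws.com:5432/oregon_fires"
)
engine = create_engine(db_url)

# Example: push the cleaned Oregon fires dataframe into the database.
# or_fires.to_sql("fires", engine, if_exists="replace", index=False)
```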
Here is the ERD containing the graphical representation of table relationships:
Our presentation will likely contain the following:
• Plot of fire frequency
• Plot of fire size
• Plot of fire season
• Plots of 2027 predictions
• Plot of high-risk areas
• The results of our machine learning tests
Link to presentation: Google Slides
Link to Heroku app: https://oregonfires.herokuapp.com/
Team: Alex Dallman, Jovan Humphrey, Nina Q
Sources: https://www.kaggle.com/datasets/rtatman/188-million-us-wildfires
https://www.ncdc.noaa.gov/cag/county/mapping/35/pcp/200506/1/value
https://data.oregon.gov/Natural-Resources/Oregon-counties-map/djry-8qn8