In this notebook, we analyze the performance of the machine learning model that forecasts the amount of wind power in the California energy grid.
The use case we are interested in for this workbook is finding the time window of a specified width, within the next 26 hours, that contains the most wind energy in the grid. It is worth noting that the machine learning model was not trained for this task directly. Instead, it forecasts a time series of wind power from which we need to deduce the location of these windows.
Before diving deep into model validation, let’s clarify some details about the model. It is a weighted ensemble of 8 tree-based models: 2 are based on Cubist, a boosted regression model, 3 are based on XGBoost, and the remaining 3 are random forests. These model types were chosen for their ability to handle the full set of 188 features and for their good training performance in R.
Each model has been trained on about 143k data points. The model parameters have been selected after hyperparameter tuning, subject to 4-fold time series cross-validation. The number of folds was constrained by training performance.
A weighted average was chosen for ensembling the models, over potentially more accurate methods like stacking, due to performance constraints. The weights were chosen after assessing 20 different weight combinations generated with a Latin hypercube experimental design.
In summary, the modeling process was heavily constrained by performance considerations and available project time.
When building the model, we withheld 20% of the data for validation. We purposely withheld the most recent data so as not to leak future information that the model would not have access to in a real-time production scenario. We now use these most recent data to validate the model.
Let’s load the required libraries and the data!
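The library() calls themselves are not preserved in this extract; the sketch below loads the packages implied by the attach messages that follow (the exact set and order are assumptions).
library(tidymodels)
library(zoo)
library(tidyverse)
library(modeltime.ensemble)  # assumption: this is what pulls in modeltime.resample
library(timetk)              # assumption: provides plot_time_series() used below
library(plotly)
library(lubridate)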
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom 0.7.1 ✓ recipes 0.1.15.9000
## ✓ dials 0.0.9.9000 ✓ rsample 0.0.8.9000
## ✓ dplyr 1.0.2 ✓ tibble 3.0.4
## ✓ ggplot2 3.3.2 ✓ tidyr 1.1.2
## ✓ infer 0.5.3 ✓ tune 0.1.2.9000
## ✓ modeldata 0.1.0 ✓ workflows 0.2.1.9000
## ✓ parsnip 0.1.4.9000 ✓ yardstick 0.0.7
## ✓ purrr 0.3.4
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ✓ stringr 1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x stringr::fixed() masks recipes::fixed()
## x dplyr::lag() masks stats::lag()
## x readr::spec() masks yardstick::spec()
## Loading required package: modeltime.resample
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
orig_data <- readRDS('./data/processed/comb_data_rev8.rds')
validation_samples <- round(nrow(orig_data)*0.2,0)
print(paste('Number of validation samples: ', validation_samples))
## [1] "Number of validation samples: 47765"
Now, let’s extract the validation set!
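The extraction code is not shown in this extract. A minimal sketch, assuming the validation set is simply the most recent validation_samples rows (consistent with the hold-out strategy described above) and is stored in data:
# Assumption: the validation set is the most recent 20% of rows
data <- orig_data %>% slice_tail(n = validation_samples)
head(data)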
## # A tibble: 6 x 188
## Time Wind `0_wind_speed_m… `0_temp_c` `1_wind_speed_m…
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-06-26 01:30:00 2714 6.38 20.3 35.4
## 2 2020-06-26 01:35:00 2687 6.46 20.2 36.2
## 3 2020-06-26 01:40:00 2685 6.54 20.1 36.9
## 4 2020-06-26 01:45:00 2660 6.62 20.1 37.7
## 5 2020-06-26 01:50:00 2666 6.70 20.0 38.4
## 6 2020-06-26 01:55:00 2688 6.78 19.9 39.2
## # … with 183 more variables: `1_temp_c` <dbl>, `2_wind_speed_ms` <dbl>,
## # `3_wind_speed_ms` <dbl>, `4_wind_speed_ms` <dbl>, `8_temp_c` <dbl>,
## # `0_wind` <dbl>, `1_wind` <dbl>, `2_wind` <dbl>, `3_wind` <dbl>,
## # `4_wind` <dbl>, `0_wind_speed_ms_lag1` <dbl>, `0_temp_c_lag1` <dbl>,
## # `1_wind_speed_ms_lag1` <dbl>, `1_temp_c_lag1` <dbl>,
## # `2_wind_speed_ms_lag1` <dbl>, `3_wind_speed_ms_lag1` <dbl>,
## # `4_wind_speed_ms_lag1` <dbl>, `8_temp_c_lag1` <dbl>, `0_wind_lag1` <dbl>,
## # `1_wind_lag1` <dbl>, `2_wind_lag1` <dbl>, `3_wind_lag1` <dbl>,
## # `4_wind_lag1` <dbl>, `0_wind_speed_ms_lag2` <dbl>, `0_temp_c_lag2` <dbl>,
## # `1_wind_speed_ms_lag2` <dbl>, `1_temp_c_lag2` <dbl>,
## # `2_wind_speed_ms_lag2` <dbl>, `3_wind_speed_ms_lag2` <dbl>,
## # `4_wind_speed_ms_lag2` <dbl>, `8_temp_c_lag2` <dbl>, `0_wind_lag2` <dbl>,
## # `1_wind_lag2` <dbl>, `2_wind_lag2` <dbl>, `3_wind_lag2` <dbl>,
## # `4_wind_lag2` <dbl>, `0_wind_speed_ms_lag3` <dbl>, `0_temp_c_lag3` <dbl>,
## # `1_wind_speed_ms_lag3` <dbl>, `1_temp_c_lag3` <dbl>,
## # `2_wind_speed_ms_lag3` <dbl>, `3_wind_speed_ms_lag3` <dbl>,
## # `4_wind_speed_ms_lag3` <dbl>, `8_temp_c_lag3` <dbl>, `0_wind_lag3` <dbl>,
## # `1_wind_lag3` <dbl>, `2_wind_lag3` <dbl>, `3_wind_lag3` <dbl>,
## # `4_wind_lag3` <dbl>, `0_wind_speed_ms_lag4` <dbl>, `0_temp_c_lag4` <dbl>,
## # `1_wind_speed_ms_lag4` <dbl>, `1_temp_c_lag4` <dbl>,
## # `2_wind_speed_ms_lag4` <dbl>, `3_wind_speed_ms_lag4` <dbl>,
## # `4_wind_speed_ms_lag4` <dbl>, `8_temp_c_lag4` <dbl>, `0_wind_lag4` <dbl>,
## # `1_wind_lag4` <dbl>, `2_wind_lag4` <dbl>, `3_wind_lag4` <dbl>,
## # `4_wind_lag4` <dbl>, `0_wind_speed_ms_lag5` <dbl>, `0_temp_c_lag5` <dbl>,
## # `1_wind_speed_ms_lag5` <dbl>, `1_temp_c_lag5` <dbl>,
## # `2_wind_speed_ms_lag5` <dbl>, `3_wind_speed_ms_lag5` <dbl>,
## # `4_wind_speed_ms_lag5` <dbl>, `8_temp_c_lag5` <dbl>, `0_wind_lag5` <dbl>,
## # `1_wind_lag5` <dbl>, `2_wind_lag5` <dbl>, `3_wind_lag5` <dbl>,
## # `4_wind_lag5` <dbl>, `0_wind_speed_ms_lag9` <dbl>, `0_temp_c_lag9` <dbl>,
## # `1_wind_speed_ms_lag9` <dbl>, `1_temp_c_lag9` <dbl>,
## # `2_wind_speed_ms_lag9` <dbl>, `3_wind_speed_ms_lag9` <dbl>,
## # `4_wind_speed_ms_lag9` <dbl>, `8_temp_c_lag9` <dbl>, `0_wind_lag9` <dbl>,
## # `1_wind_lag9` <dbl>, `2_wind_lag9` <dbl>, `3_wind_lag9` <dbl>,
## # `4_wind_lag9` <dbl>, `0_wind_speed_ms_lag276` <dbl>,
## # `0_temp_c_lag276` <dbl>, `1_wind_speed_ms_lag276` <dbl>,
## # `1_temp_c_lag276` <dbl>, `2_wind_speed_ms_lag276` <dbl>,
## # `3_wind_speed_ms_lag276` <dbl>, `3_temp_c_lag276` <dbl>,
## # `4_wind_speed_ms_lag276` <dbl>, `8_temp_c_lag276` <dbl>,
## # `0_wind_lag276` <dbl>, `1_wind_lag276` <dbl>, `2_wind_lag276` <dbl>, …
Finally, let’s load the model!
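The loading code is not shown here. A minimal sketch, assuming the fitted ensemble was serialized to an RDS file; the file path is hypothetical:
# Hypothetical path; the actual model artifact location is not shown in this extract
model <- readRDS('./models/wind_ensemble_rev8.rds')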
As noted above, we want to only consider data for the next 26 hours. The fundamental reason for this is that we expect to always be able to create all feature data 26 hours in advance, for example from weather forecasts. Since our model forecasts 5-min data, we need to convert 26 hours into the appropriate number of samples.
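The conversion itself is a one-liner (not preserved in this extract); a minimal sketch that reproduces the value printed below:
# 26 hours at 12 five-minute samples per hour
forecast_period_samples <- 26 * 12
forecast_period_samples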
## [1] 312
If we had a lot of resources, we could evaluate the model performance for every window of the above length in the validation dataset. Unfortunately, our model is quite slow at making predictions, so we need to find a subset of the validation set that is representative and for which predictions do not take too long.
After some experimentation (outside of this notebook), it became clear that predicting 40 26-hour time frames can be done within 1.5 hours, an acceptable runtime for this analysis.
In the next cell, we repeatedly draw 40 start times for the validation subset. We stop drawing once the 26-hour time frames following the start times no longer overlap. This keeps the samples reasonably independent.
A weakness of this approach is that we may unintentionally oversample or undersample certain scenarios. For example, we may bias our validation towards certain months of the year, between which wind conditions typically differ.
n_samples <- 40
# Draw until non-overlapping samples
set.seed(42)
sample_nxs <- NULL
while(is_null(sample_nxs) | min(diff(sort(sample_nxs))) < forecast_period_samples){
sample_nxs <- sample(nrow(data), n_samples, replace=F)
}
## Warning in min(diff(sort(sample_nxs))): no non-missing arguments to min;
## returning Inf
## [1] 1160 26259 36297 7858 4155 34398 8389 17768 20321 16363 44236 660
## [13] 24847 23340 43415 28760 16752 21410 29670 31928 3301 26667 39883 37670
## [25] 6802 35636 21861 17123 45904 4853 13016 12051 38701 40264 9886 47165
## [37] 11695 5969 19404 8947
Before analyzing model performance for all samples, let’s do it for a single sample only. This gives us the space to think about performance metrics in detail without spending a lot of time waiting for the model to produce predictions.
First, we need to determine the width of the prediction window. Reminder: This is the width of the window with the most wind energy in the grid within the 26-hour time frame. Let’s assume we wanted to find the 30-min window with the most wind energy within the 26-hour time frame.
prediction_window_width_h <- 0.5
prediction_window_samples <- prediction_window_width_h*12
prediction_window_samples
## [1] 6
Since our model forecasts in 5-min intervals, the 30-min window width is equal to 6 samples.
Next, let’s select an arbitrary 26-hour window and plot the actual, observed amount of wind power in the California grid.
samples <- data %>% slice(1160:(1160 + forecast_period_samples))
samples %>% plot_time_series(Time,
Wind,
.smooth = F,
.title = 'Actual Wind Power in California Grid',
.y_lab = 'Power in MW')
windows <- samples %>% select(Wind) %>% rollsum(prediction_window_samples)
samples_with_windows <- samples %>%
  add_column(sum_window = windows %>% append(rep(NaN, nrow(samples) - nrow(windows))))
samples_with_windows %>%
pivot_longer(cols=c(Wind, sum_window)) %>%
plot_time_series(Time,
value,
.color_var = name,
.smooth = F,
.title = '30-min Rolling Window of Actual Wind Power in California Grid',
.y_lab = 'Power in MW/Energy in MW6h')
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Now, let’s calculate the top 3 windows with the most wind energy in the grid. Using the top 3 gives us a clearer picture of the model’s performance down the line than comparing against a single actual value alone.
For every window, we calculate the starting point and the amount of energy included. Since we want to calculate energy in MWh, we need to divide the rolling sum of the 5-min actual generation by 12.
top_3_actual <- order(windows, decreasing = T) %>% head(3)
top_3_actual_energy <- windows[top_3_actual] / 12
print(top_3_actual)
## [1] 1 2 3
## [1] "2020-06-30 02:05:00 UTC" "2020-06-30 02:10:00 UTC"
## [3] "2020-06-30 02:15:00 UTC"
## [1] 1795.000 1792.750 1789.167
Now, in accordance with our work above, let’s use the model to predict the energy in the grid for the same time interval.
predictions <- modeltime_table(model) %>%
modeltime_forecast(
new_data = samples,
actual_data = samples
)
Let’s calculate the predictions and their rolling windows and plot them alongside the actual data.
samples_predicted <- predictions %>%
filter(.key=='prediction') %>%
select(.value) %>%
pull()
windows_predicted <- samples_predicted %>% rollsum(prediction_window_samples)
samples_with_windows_predicted <- samples_with_windows %>%
  add_column(sum_window_predicted = windows_predicted %>%
               append(rep(NaN, nrow(samples) - length(windows_predicted)))) %>%
  add_column(Wind_predicted = samples_predicted)
samples_with_windows_predicted %>%
pivot_longer(cols=c(Wind, sum_window, Wind_predicted, sum_window_predicted)) %>%
plot_time_series(Time,
value,
.color_var = name,
.smooth = F,
.title = '30-min Rolling Window of Wind Power in California Grid',
.y_lab = 'Power in MW/Energy in MW6h')
Let’s find the top 3 predicted window positions as well as the energy they contain.
top_3_predicted <- order(windows_predicted, decreasing = T) %>% head(3)
top_3_predicted_energy <- windows_predicted[top_3_predicted] / 12
actual_energy_top_3_predicted <- windows[top_3_predicted] / 12
print(top_3_predicted)
## [1] 189 188 187
## [1] "2020-06-30 17:45:00 UTC" "2020-06-30 17:40:00 UTC"
## [3] "2020-06-30 17:35:00 UTC"
## [1] 1566.144 1565.394 1563.965
## [1] 1611.750 1624.000 1629.667
This particular example already exposes a flaw of the model: what matters in the domain application at hand is the relative size of the peaks, but during training the model minimized the mean absolute error between predictions and actual data. That objective may well be a poor proxy for the peak-identification problem we are facing here. Nevertheless, let’s work with what we have and turn the visual analysis above into formulas, so we can assess the performance numerically.
We would like to understand the performance of the model both with regard to the peak windows it selects and with regard to the energy it over- or underpredicts in those windows. Thus, we analyze the performance of the model in several ways: the time offset between the predicted and the actual peak (in hours, and relative to the window width), the difference between the predicted peak energy and the actual peak energy (in MWh, and relative to the actual peak energy), and the difference between the energy actually contained in the predicted window and the energy in the actual peak window (again in absolute and relative terms).
Let’s put these thoughts into code. For all _h variables, positive numbers mean that the predicted peak is after the actual peak, and vice versa. Similarly, for all _MWh variables, a positive value means that the model overpredicts the actual value, and vice versa.
prediction_error_h <- (top_3_predicted-top_3_actual)/12
prediction_error_h_relative <- prediction_error_h/prediction_window_width_h
# top_3_*_energy values are already in MWh (divided by 12 above), so no further conversion is needed
prediction_error_MWh <- top_3_predicted_energy - top_3_actual_energy
prediction_error_MWh_relative <- prediction_error_MWh / top_3_actual_energy
prediction_error_MWh_actual_energy <- actual_energy_top_3_predicted - top_3_actual_energy
prediction_error_MWh_relative_actual_energy <- prediction_error_MWh_actual_energy / top_3_actual_energy
print('prediction_error_h')
## [1] "prediction_error_h"
## [1] 15.67 15.50 15.33
## [1] "prediction_error_h_relative [%]"
## [1] 3133.33 3100.00 3066.67
## [1] "prediction_error_MWh"
## [1] -228.86 -227.36 -225.20
## [1] "prediction_error_MWh_relative [%]"
## [1] -12.75 -12.68 -12.59
## [1] "prediction_error_MWh_actual_energy"
## [1] -183.25 -168.75 -159.50
## [1] "prediction_error_MWh_relative_actual_energy [%]"
## [1] -10.21 -9.41 -8.91
The metrics above capture what we have already understood from the chart: our predicted peak is about 15 1/2 hours after the actual peak (see prediction_error_h). As prediction_error_MWh_relative_actual_energy shows, if we assume the peak is where the model predicted it, we miss out on about 10% of the energy contained in the actual peak.
As prediction_error_MWh_relative shows, the model also underpredicts the energy in the peak, even at its own chosen peak location. This backs up what we see in the plot above: at least for the inspected sample, the model tends to underpredict the actual wind power in the grid. One may ask why that is the case.
As noted, the model was only trained on past data, and the validation data is the most recent data. It is conceivable that more wind capacity has been added to the grid in the most recent period, invalidating some of the model’s assumptions about how strongly each feature affects the overall wind energy amount. There are several techniques to correct for this, for example training on more recent data or overweighting recent samples relative to older ones (sketched below). All of this tuning work is outside the scope of this workbook.
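Although the retraining itself is out of scope, here is a minimal sketch of how such recency weights could be constructed; recency_weights and half_life are hypothetical names and the decay rate is an assumption, not something used in the original training.
# Hypothetical helper: exponentially decaying sample weights that favor recent rows.
# half_life is expressed in samples; 12 * 24 * 30 corresponds to roughly one month of 5-min data.
recency_weights <- function(n, half_life = 12 * 24 * 30) {
  0.5 ^ ((n - seq_len(n)) / half_life)
}
# e.g. weights for ~143k chronologically ordered training rows; the most recent row gets weight 1
w <- recency_weights(143000)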
Now that we know how to assess a single sample, let’s repeat the assessment for all 40 samples and multiple window widths.
We choose a set of window widths, ranging from half an hour to several hours, stored in prediction_window_width_hs.
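The exact vector of widths is not preserved in this extract. The sketch below defines prediction_window_width_hs (the variable the loop below iterates over) with placeholder values; only the 0.5 h and 1 h widths are confirmed by the results, and the later discussion references widths both below and above 6 hours.
# Placeholder values (assumption); only 0.5 and 1 are confirmed by the outputs below
prediction_window_width_hs <- c(0.5, 1, 2, 3, 4, 6, 8)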
The next cell runs for about 1.5 hrs and calculates the validation metrics developed above for all 40 samples.
if (!file.exists('./data/processed/prediction_stats.rds')) {
prediction_stats <- tibble(
window_width_h = numeric(),
window_start = integer(),
choice_rank = integer(),
time_predicted = as.POSIXct(NA),
error_h = numeric(),
error_h_relative = numeric(),
error_MWh = numeric(),
error_MWh_relative = numeric(),
error_MWh_actual_energy = numeric(),
error_MWh_relative_actual_energy = numeric()
)
progress_ct <- 0
print('Starting calculation...')
for (sample_nx in sample_nxs) {
samples <- data %>% slice(eval(sample_nx):eval(sample_nx + forecast_period_samples))
predictions <- modeltime_table(model) %>%
modeltime_forecast(
new_data = data %>% slice(eval(sample_nx):eval(sample_nx + forecast_period_samples)),
actual_data = data %>% slice(eval(sample_nx):eval(sample_nx + forecast_period_samples))
)
for (prediction_window_width_h in prediction_window_width_hs) {
prediction_window_samples <- prediction_window_width_h * 12
windows <- samples %>%
select(Wind) %>%
rollsum(prediction_window_samples)
top_3_actual <- order(windows, decreasing = T) %>% head(3)
top_3_actual_energy <- windows[top_3_actual]
samples_predicted <- predictions %>%
filter(.key == 'prediction') %>%
select(.value)
windows_predicted <- samples_predicted %>% rollsum(prediction_window_samples)
top_3_predicted <- order(windows_predicted, decreasing = T) %>% head(3)
top_3_predicted_energy <- windows_predicted[top_3_predicted]
actual_energy_top_3_predicted <- windows[top_3_predicted]
prediction_error_h <- (top_3_predicted - top_3_actual) * 5 / 60
prediction_error_h_relative <- prediction_error_h / prediction_window_width_h
prediction_error_MWh <- (top_3_predicted_energy - top_3_actual_energy) / 12
prediction_error_MWh_relative <- prediction_error_MWh / (top_3_actual_energy / 12)
prediction_error_MWh_actual_energy <- (actual_energy_top_3_predicted - top_3_actual_energy) / 12
prediction_error_MWh_relative_actual_energy <- prediction_error_MWh_actual_energy / (top_3_actual_energy / 12)
prediction_stats <- prediction_stats %>%
add_row(
window_width_h = rep(prediction_window_width_h, 3),
window_start = rep(sample_nx, 3),
choice_rank = c(1, 2, 3),
time_predicted = data %>%
slice(sample_nx + top_3_predicted) %>%
select(Time) %>%
pull(),
error_h = prediction_error_h,
error_h_relative = prediction_error_h_relative,
error_MWh = prediction_error_MWh,
error_MWh_relative = prediction_error_MWh_relative,
error_MWh_actual_energy = prediction_error_MWh_actual_energy,
error_MWh_relative_actual_energy = prediction_error_MWh_relative_actual_energy
)
}
progress_ct <- progress_ct + 1
print(paste(progress_ct, '/', length(sample_nxs)))
}
saveRDS(prediction_stats, './data/processed/prediction_stats.rds')
} else {
print('File prediction_stats.rds already exists, so skipped the long-running assessment cell.')
}
## [1] "File prediction_stats.rds already exists, so skipped the long-running assessment cell."
Let’s look at the validation metrics:
## # A tibble: 6 x 10
## window_width_h window_start choice_rank time_predicted error_h
## <dbl> <int> <dbl> <dttm> <dbl>
## 1 0.5 1160 1 2020-06-30 17:50:00 15.7
## 2 0.5 1160 2 2020-06-30 17:45:00 15.5
## 3 0.5 1160 3 2020-06-30 17:40:00 15.3
## 4 1 1160 1 2020-06-30 17:25:00 15.2
## 5 1 1160 2 2020-06-30 17:30:00 15.2
## 6 1 1160 3 2020-06-30 17:35:00 15.2
## # … with 5 more variables: error_h_relative <dbl>, error_MWh <dbl>,
## # error_MWh_relative <dbl>, error_MWh_actual_energy <dbl>,
## # error_MWh_relative_actual_energy <dbl>
We can now begin to slice and dice the validation metrics and get an idea of how good our model is at determining wind energy peaks in the California grid.
fig <- plot_ly(x = prediction_stats %>%
filter(choice_rank == 1) %>%
select(window_width_h) %>%
pull(),
y = prediction_stats %>%
filter(choice_rank == 1) %>%
select(error_h) %>%
pull(),
type = 'box')
fig <- fig %>%
layout(title = 'Peak Time Prediction Error by Window Width',
xaxis = list(title = 'Peak Window Width in h'),
yaxis = list(title = 'Prediction Error in Hours'))
fig
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
The good news first: for all window widths, the median error is close to zero, which means the model does not systematically place the peak too early or too late. Furthermore, about 50% of all sampled predictions fall between -2.5 hours (predicted peak too early) and 1.3 hours (predicted peak too late) for window widths below 6 hours, with an even narrower spread for wider windows. This does not look too bad (but also not too good) for a 26-hour assessment window. To find out how good (or bad) these numbers really are, they need to be put in context with the energy that we miss out on; more on that below.
Now to the not-so-good news: there are many outliers, and some predictions look simply off. In fact, 50% of sampled predictions fall outside the bounds given above, an interval of 3.8 hours. While 3.8 hours is not a particularly tight interval, having 50% of samples outside of it is definitely not great. There is a chance, however, that those predictions simply land on another large peak, so we need to look more closely at the impact on the energy we lose.
Before moving on, let’s look at a histogram of the data to get a better picture of the distribution behind the boxplot above. We arbitrarily choose the 1-hr window width.
fig <- plot_ly(x = prediction_stats %>%
filter(choice_rank == 1, window_width_h == 1.0) %>%
select(error_h) %>%
pull(),
type = 'histogram',
nbinsx = 20)
fig <- fig %>%
layout(title = 'Peak Time Prediction Error Histogram for 1-hr Window',
xaxis = list(title = 'Peak Time Prediction Error in h'),
yaxis = list(title = 'Count'))
fig
Now, finally, let’s look at the errors in terms of energy lost.
The chart below shows the relative energy error between the predicted and the actual peak, by window width.
fig <- plot_ly(x = prediction_stats %>%
filter(choice_rank == 1) %>%
select(window_width_h) %>%
pull(),
y = prediction_stats %>%
filter(choice_rank == 1) %>%
select(error_MWh_relative_actual_energy) %>%
pull(),
type = 'box')
fig <- fig %>%
layout(title = 'Relative Energy Error Between Predicted and Actual Peak by Window Width',
width = 800,
xaxis = list(title = 'Peak Window Width in h'),
yaxis = list(title = 'Relative Energy Error Between Predicted and Actual Peak', tickformat = '%'))
fig
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
The first thing that jumps out when looking at the chart is that the relative energy error is never greater than 0%. This is a logical consequence of what we are assessing: by definition, the actual energy contained in the window where our model places the peak cannot be larger than the energy in the actual peak window.
The median of all prediction errors is always at least -8%, which does not seem too bad overall. In practice, this means that about 50% of all predictions miss no more than 8% of the energy contained in the actual peak. Looking at the lower quartile of the boxplot, 75% of all predictions miss the energy in the actual peak by no more than 14%.
Another reassuring (although logical) trend is that the greater the peak window width gets, the smaller the prediction error gets.
For completeness, let’s once again plot a histogram to shine a light on the distribution. We pick the 1-hr peak window width for consistency with the histogram above.
fig <- plot_ly(x = prediction_stats %>%
filter(choice_rank == 1, window_width_h == 1.0) %>%
select(error_MWh_relative_actual_energy) %>%
pull(),
type = 'histogram',
nbinsx = 20)
fig <- fig %>%
layout(title = 'Peak Relative Energy Error Histogram for 1-hr Window',
xaxis = list(title = 'Peak Relative Energy Error', tickformat = '%'),
yaxis = list(title = 'Count'))
fig
Now that we have looked at the peak time and the peak energy prediction performance separately, let’s see if we can find any interesting pattern when we look at them together.
fig <- plot_ly(x = prediction_stats %>%
filter(choice_rank == 1) %>%
select(error_h) %>%
pull(),
y = prediction_stats %>%
filter(choice_rank == 1) %>%
select(error_MWh_relative_actual_energy) %>%
pull(),
marker = list(color = prediction_stats %>%
filter(choice_rank == 1) %>%
select(window_width_h) %>%
pull(),
colorscale = 'Bluered',
colorbar = list(
title = 'Window width in h'
)),
mode = 'markers',
type = 'scatter')
fig <- fig %>%
layout(title = 'Peak Time Error vs. Relative Energy Error',
xaxis = list(title = 'Peak Time Prediction Error in h'),
yaxis = list(title = 'Relative Energy Error Between Predicted and Actual Peak', tickformat = '%'))
fig
However, we need to be aware that conclusions drawn from this chart may be biased by the starting points of the 26-hour validation windows. In this context, it is worth examining whether the start time of a validation window has an impact on model performance. Let’s evaluate whether it influences the peak energy prediction error.
start_hours <- data %>%
slice(prediction_stats %>%
filter(choice_rank == 1) %>%
select(window_start) %>%
pull()) %>%
select(Time) %>%
pull() %>%
hour()
fig <- plot_ly(x = start_hours,
y = prediction_stats %>%
filter(choice_rank == 1) %>%
select(error_MWh_relative_actual_energy) %>%
pull(),
type = 'box')
fig <- fig %>%
layout(title = 'Relative Energy Error vs. Validation Window Start Time',
xaxis = list(title = 'Validation Window Start Time (Hour of Day)'),
yaxis = list(title = 'Relative Energy Error Between Predicted and Actual Peak', tickformat = '%'))
fig
fig <- plot_ly(x = start_hours,
type = 'histogram',
nbinsx = 24)
fig <- fig %>%
layout(title = 'Sample Sizes for Relative Energy Error vs. Validation Window Start Time Plot',
xaxis = list(title = 'Validation Window Start Time (Hour of Day)'),
yaxis = list(title = 'Count'),
width = 800)
fig
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
At first glance, the 22 hour bin contains a similar number of samples as the other bins. However, its 14 samples correspond to only 2 distinct starting points, so there is a chance that we randomly selected 2 atypical days for this bin. The 23 hour bin has just a single sample and shows similarly odd performance (see the boxplot above). To conclude, we should assume that our algorithm may perform worse when a validation window starts after 10 pm.
We have talked before about assessing the performance of the model for predicting the actual energy in the peak. Now, let us actually carry out that assessment.
What we want to know is: given the peak time suggested by the model, by how many MWh does the model’s peak energy prediction differ from the energy in the actual data’s peak? As always, let’s plot and investigate!
fig <- plot_ly(
x = prediction_stats %>%
filter(choice_rank == 1) %>%
select(window_width_h) %>%
pull(),
y = prediction_stats %>%
filter(choice_rank == 1) %>%
select(error_MWh) %>%
pull(),
type = 'box')
fig <- fig %>%
layout(title = 'Absolute Energy Error Between Predicted and Actual Peak',
xaxis = list(title = 'Peak Window Width in h'),
yaxis = list(title = 'Absolute Energy Error in MWh'),
width = 600)
fig
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
The plot clearly shows that our model tends to underpredict the actual peak, although it overpredicts in some cases. Not surprisingly, the underprediction grows as the peak window widens. This is not of particular importance for our mission of identifying the time window of a specified width within the next 26 hours that contains the most wind energy, but it is relevant if one would like to use the model to predict the actual peak energy.
There are other ways of assessing the model performance in this regard. During training, metrics such as mean absolute error, root mean squared error, and R squared were evaluated for both in-sample and out-of-sample data. Those metrics are not within the scope of this notebook, which takes a more domain-specific approach to evaluating the model’s performance on the peak prediction problem.
Based on the analysis above, we can now sketch out some ideas for improving the model.
The model in itself only forecasts the time series of wind energy in the California grid. However, the quantity that we are actually interested in is the location of the peaks. We could refactor our model to predict those locations specifically.
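As an illustration only, here is a minimal sketch of how such a target could be derived from the wind time series; peak_offset is a hypothetical helper, and the 26-hour horizon and 30-min width are taken from the examples above.
# Hypothetical helper: for each row, the offset (in 5-min samples) to the start of the
# `width`-sample window with the most wind energy within the following `horizon` samples.
# rollsum() comes from zoo, which is already attached above.
peak_offset <- function(wind, horizon = 312, width = 6) {
  sapply(seq_len(length(wind) - horizon), function(i) {
    which.max(rollsum(wind[i:(i + horizon)], width))
  })
}
# e.g. a direct training target for the validation series
target <- peak_offset(data$Wind)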
We have seen that the model does not perform well on the validation dataset, since it underestimates the amount of wind energy contained in the peaks of the energy time series. While training the model (outside of this notebook), we did not pay attention to how the model’s error differs between more recent and older data. In this context, several things could be worth investigating: whether the error grows for the most recent data, whether retraining with more recent data included helps, and whether overweighting recent samples during training (as sketched earlier) improves performance.
As described in the “About the Model” section at the beginning, the development of the model was subject to great computational constraints. The modeling and iteration process could benefit from reducing the number and complexity of models.
Finally, a model is only as good as the features it has available. Given that we have ensembled 8 tuned models over all 188 features, we may have reached a point where we simply cannot extract more information from those features. Thus, to improve model performance, we may need additional features that bring new information to the model. In the context of wind power in the California grid, we may want to differentiate between more than 5 clusters. We may also want to take information about curtailment of energy production into account.