# Predictive Modelling and Beyond {#sec-08-predictions}
```{r}
#| echo: false
library(ggplot2)
book_theme <- theme_minimal() +
  theme(plot.title = element_text(face = "bold"))
ggplot2::theme_set(book_theme)
```
:::::: solutionbox
:::: solutionbox-header
::: solutionbox-icon
:::
Learning Objectives
::::
::: solutionbox-body
- Use model predictions to make informed decisions on new or unseen
data
- Apply time series analysis to forecast the Socio-Demographic Index
(SDI)
- Estimate Years Lived with Disability (YLDs) using mixed-effects
models
:::
::::::
\
In the dynamic landscape of public health, the ability to forecast and
predict future trends is essential for effective decision-making and the
implementation of timely interventions. This chapter provides an
overview of **predictive modelling**\index{Predictive modelling},
focusing on the challenges of applying predictions to new data, the use
of **time series analysis**\index{Time series}, and **mixed
models**\index{Mixed models} to anticipate the trajectory of infectious
diseases and health metrics\index{Health metrics}. By exploring the
underlying patterns and analysing historical data, we estimate the
disease burden and evaluate the impact of interventions on population
health. We demonstrate how time series analysis and mixed
models\index{Mixed models} can be tailored to address specific research
questions, such as predicting disease outbreaks, estimating disease
burden, and assessing the impact of interventions on population health.
## Predictions About the Future
The previous chapters estimated future outcomes by applying various
types of statistical and machine learning techniques. Here we look at
how to use those models to make predictions on new data.
When making predictions on new data, it is important to consider the
**generalisability** of the model. A model that performs well on the
training data may not necessarily generalise well to new, unseen data.
To evaluate model performance we used **cross-validation**, but other
techniques exist, such as **hold-out validation**, **k-fold
cross-validation**, **leave-one-out cross-validation**, and
**bootstrapping**. These methods assess the model's ability to make
accurate predictions on data not used during training, providing a more
reliable estimate of its performance in real-world scenarios.
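As a minimal sketch of hold-out validation on a synthetic dataset (the
variables `x`, `y`, and the linear model below are purely illustrative,
not the book's Dengue data), we can split the rows into a training and a
test set and compare in-sample and out-of-sample error:

```{r}
# Hold-out validation sketch: fit on 80% of rows, evaluate on the held-out 20%
set.seed(42)
n <- 200
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)
d <- data.frame(x = x, y = y)

idx   <- sample(seq_len(n), size = 0.8 * n)  # indices of the training rows
train <- d[idx, ]
test  <- d[-idx, ]

fit <- lm(y ~ x, data = train)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(train$y, predict(fit, train))  # in-sample error
rmse(test$y, predict(fit, test))    # out-of-sample error
```

A test RMSE far above the training RMSE would signal poor
generalisability; here both should sit near the noise level of 0.5.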
Once the model has been trained and evaluated, it is used to make
predictions on new data. This process involves applying the model to the
new data to generate forecasts or estimates of the outcome of interest.
For example, in the case of infectious diseases, predictive models can
be used to forecast the trajectory of an outbreak, estimate the number
of cases, or evaluate the impact of interventions on disease
transmission. By leveraging historical data and the insights gained from
the model, we can effectively evaluate the model performance and make
informed decisions based on the predictions.
## Example: Dengue Test Predictions for 2017-2021
**Predictive modelling**\index{Predictive modelling} is a powerful tool
for forecasting future trends. Here we test the Dengue\index{Dengue}
model built with the `{mlr3}` meta-package in @sec-07-packages. The
model was trained on data from 1990 to 2016 and is now tested on the
years 2017 to 2021.
First we load the original data and name it `old_data`.
```{r}
library(tidyverse)
library(data.table)
```
```{r}
old_data <- hmsidwR::infectious_diseases %>%
  arrange(year) %>%
  filter(cause_name == "Dengue",
         year <= 2017,
         !location_name %in% c("Eswatini", "Lesotho")) %>%
  drop_na() %>%
  group_by(location_id) %>%
  select(-location_name, -cause_name)
```
Then, we load Dengue data from 2017 to 2021 and name it `new_data`.
```{r}
new_data <- hmsidwR::infectious_diseases %>%
  arrange(year) %>%
  filter(cause_name == "Dengue",
         year >= 2017,
         !location_name %in% c("Eswatini", "Lesotho")) %>%
  drop_na() %>%
  group_by(location_id) %>%
  select(-location_name, -cause_name)
```
```{r}
#| echo: false
#| eval: true
rr1 <- readRDS("data/model-data/rr1.rds")
rr2 <- readRDS("data/model-data/rr2.rds")
```
In the next step, we use the trained models from @sec-07-packages to
make predictions on the new data. The predictions are stored in
`new_pred_regr.cv_glmnet` and `new_pred_regr.xgboost`. The resampling
results, `rr1$learners` and `rr2$learners`, contain the trained models;
in particular, the model from the first of the 5 cross-validation folds
is used for the predictions.
```{r}
new_pred_regr.cv_glmnet <-
  rr1$learners[[1]]$predict_newdata(new_data,
                                    task = rr1$task)
```
This object contains the predictions for the new data, and we can access
the response data by:
```
new_pred_regr.cv_glmnet$`data`$response
```
With the support of `data.table::as.data.table()` we can inspect the
data and add a new column `ape` containing the
`Absolute Percentage Error (APE)`\index{Absolute Percentage Error (APE)}.
The first 6 rows of the data are shown below:
```{r}
data.table::as.data.table(new_pred_regr.cv_glmnet) %>%
  mutate(ape = round(abs(truth - response) / truth, 3) * 100) %>%
  head()
```
We then calculate several metrics to evaluate the model's performance
on new data: the **Mean Absolute Percent Error
(MAPE)**\index{Mean Absolute Percent Error (MAPE)}, the **Mean Squared
Error (MSE)**\index{Mean Squared Error (MSE)}, and the **Root Mean
Squared Error (RMSE)**\index{Root Mean Squared Error (RMSE)}. The MAPE
is a measure of relative prediction accuracy\index{Accuracy}, while the
MSE and RMSE measure the average squared difference between predicted
and actual values.
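These three metrics can also be written as small standalone helper
functions operating on plain vectors (a generic sketch, independent of
the `{mlr3}` prediction objects; the example values are made up):

```{r}
# Error metrics from vectors of observed (truth) and predicted (response) values
mape <- function(truth, response) mean(abs(truth - response) / truth) * 100
mse  <- function(truth, response) mean((truth - response)^2)
rmse <- function(truth, response) sqrt(mse(truth, response))

mape(c(10, 20, 30), c(11, 18, 33))  # each APE is 10%, so the MAPE is 10
```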
```{r}
data.table::as.data.table(new_pred_regr.cv_glmnet) %>%
  mutate(ape = round(abs(truth - response) / truth, 3) * 100) %>%
  summarise(mape = mean(ape),
            mse = mean((truth - response)^2),
            rmse = sqrt(mean((truth - response)^2))) %>%
  round(3)
```
The model performs fairly well, with relatively low error:
- An average relative error of \~11% is typically considered
acceptable in many forecasting or prediction tasks.
- The small RMSE (0.098) and MSE (0.01) suggest that large errors are
rare or small.
Much more can be done to improve the models, such as **hyperparameter
tuning**\index{Hyperparameter tuning}, **feature
engineering**\index{Feature engineering}, and **model
ensembles**\index{Model ensemble}, but this is a good starting point.
We could also examine the confidence intervals of the predictions to
see how certain we are about them.
This is the result of the `cv_glmnet` application on new data:
```{r}
#| fig-cap: "Dengue Predictions for 2017-2021: cv_glmnet. The solid line shows the historical data, the brown line shows the new data, and the dashed line the predictions."
#| fig-alt: "Dengue Predictions for 2017-2021: cv_glmnet"
#| label: fig-dengue-predictions-cv-glmnet
#| fig-width: 8
#| fig-height: 4
regr.cv_glmnet_new <- as.data.table(new_pred_regr.cv_glmnet) %>%
  mutate(learner = "regr.cv_glmnet") %>%
  cbind(new_data)

regr.cv_glmnet_new %>%
  ggplot(aes(x = year)) +
  geom_line(data = old_data, aes(y = DALYs)) +
  geom_line(aes(y = DALYs), color = "brown") +
  geom_line(aes(y = response), linetype = "dashed") +
  facet_wrap(vars(location_id),
             labeller = as_labeller(c(`182` = "Malawi",
                                      `191` = "Zambia",
                                      `169` = "Central African Republic")),
             scales = "free_y") +
  labs(title = "Dengue Predictions for 2017-2021: cv_glmnet",
       x = "Time (Year)")
```
The same steps are taken for the `xgboost` model, and the results are
shown below to compare the two models.
```{r}
new_pred_regr.xgboost <-
  rr2$learners[[1]]$predict_newdata(new_data,
                                    task = rr2$task)
```
```{r}
regr.xgboost_new <- as.data.table(new_pred_regr.xgboost) %>%
  mutate(learner = "regr.xgboost") %>%
  cbind(new_data)
```
```{r}
rbind(regr.cv_glmnet_new,
      regr.xgboost_new) %>%
  mutate(ape = round(abs(truth - response) / truth, 3) * 100) %>%
  group_by(learner) %>%
  summarise(mape = mean(ape),
            mse = mean((truth - response)^2),
            rmse = sqrt(mean((truth - response)^2)))
```
The `glmnet` model performs better than `xgboost`. The
xgboost\index{xgboost} model may be overfitting or may simply be poorly
tuned, given its higher error despite being the more complex model.
```{r}
#| fig-cap: "Dengue Predictions for 2017-2021: xgboost. The solid line shows the historical data, the brown line shows the new data, and the dashed line the predictions."
#| fig-alt: "Dengue Predictions for 2017-2021: xgboost"
#| label: fig-dengue-predictions-xgboost
#| fig-width: 8
#| fig-height: 4
regr.xgboost_new %>%
  ggplot(aes(x = year)) +
  geom_line(data = old_data, aes(y = DALYs)) +
  geom_line(aes(y = DALYs), color = "brown") +
  geom_line(aes(y = response), linetype = "dashed") +
  facet_wrap(vars(location_id),
             labeller = as_labeller(c(`182` = "Malawi",
                                      `191` = "Zambia",
                                      `169` = "Central African Republic")),
             scales = "free_y") +
  labs(title = "Dengue Predictions for 2017-2021: xgboost")
```
In conclusion, predictive modelling is a valuable tool for
forecasting\index{Forecasting} future trends and making informed
decisions based on historical data. By leveraging the insights gained
from the models, we can anticipate the trajectory of infectious
diseases, estimate disease burden, and evaluate the impact of
interventions on population health. The models can be further refined
and optimised to improve their predictive performance and provide more
accurate forecasts.
By combining predictive modelling with time series analysis and mixed
models, we can gain a comprehensive understanding of the complex
dynamics of public health and make data-driven decisions to improve
population health outcomes.
## Time Series Analysis
One important aspect of the analysis is to consider how the phenomenon
evolves over time, which we can do by including `time` as a factor in
the model. **Time series**\index{Time series} analysis serves as a tool for
understanding and forecasting temporal patterns. Techniques such as
**AutoRegressive Integrated Moving Average
(ARIMA)**\index{AutoRegressive Integrated Moving Average (ARIMA)} models
offer a systematic approach to modelling time-dependent data, capturing
seasonal\index{Seasonal} variations, trends\index{Trend}, and
irregularities to generate accurate forecasts.
**Mixed models**\index{Mixed models}, on the other hand, provide a
versatile framework for incorporating various sources of
variability\index{Variability} and correlation\index{Correlation} within
the data. Also known as **hierarchical
models**\index{Hierarchical models}, **multilevel
models**\index{Multilevel models}, or **random effects
models**\index{Random effects models}, they are a type of statistical
model that includes both fixed effects\index{Fixed effects} and random
effects\index{Random effects}. By combining the two, mixed models
accommodate complex data structures, such as hierarchical or
longitudinal data\index{Longitudinal data}, while accounting for
individual-level and group-level factors that may influence the outcome
of interest.
To analyse temporal data\index{Temporal data}, where observations are
collected at regular intervals over time, time series
analysis\index{Time series} allows for studying the patterns, trends,
and dependencies present in the data to make forecasts or infer
relationships. Time series data often exhibit inherent characteristics
such as trend, seasonality\index{Seasonality}, cyclic
patterns\index{Cyclic patterns}, and irregular fluctuations, which can
be explored using various methods such as
**decomposition**\index{Decomposition}, **smoothing**\index{Smoothing},
and **modelling**. This analysis is widely used in fields like
economics, finance, epidemiology, and environmental science to
understand and predict future trends, identify anomalies, and make
informed decisions based on historical patterns.
**Decomposition methods**\index{Decomposition} play a crucial role by
extracting the underlying structure of temporal data. The components of
a time series are `trend`\index{Trend},
`seasonality`\index{Seasonality}, and
`random fluctuations`\index{Random fluctuations}. Understanding the
trend allows for the identification of long-term changes or shifts in
the data, while detecting seasonal patterns helps uncover recurring
fluctuations that may follow seasonal or cyclical patterns.
By separating these components, through decomposition methods, it is
possible to discover the temporal dynamics within the data, facilitating
more accurate forecasting and predictive modelling. Moreover,
decomposing time series data can aid in anomaly
detection\index{Anomaly detection}, trend analysis, and seasonal
adjustment, making it a fundamental tool for interpreting complex
temporal phenomena.
Modelling time series data often requires a combination of techniques to
capture its complex nature. **Mixed models**\index{Mixed models},
**Splines**\index{Splines}, and
**ARIMA**\index{AutoRegressive Integrated Moving Average (ARIMA)} are
powerful tools commonly used for this purpose. Splines provide a
flexible way to capture nonlinear relationships and smooth out the data,
which is particularly useful for capturing trends or seasonal patterns.
ARIMA models, on the other hand, are best at capturing the
autocorrelation structure present in the data, including both short-term
and long-term dependencies.
Time series analysis can also be performed on estimated values produced
by a machine learning model. This involves using vectors of estimates,
actual values (ground truth), and time (such as years). By incorporating
these elements, the time series analysis can help identify trends,
seasonal patterns, and other temporal structures within the data. This
approach enhances the robustness and accuracy of the predictions by
combining the strengths of machine learning models with traditional time
series techniques.
## Example: SDI Time Series Analysis
In this example, we provide a brief overview of the **Socio-Demographic
Index (SDI)**\index{Socio Demographic Index (SDI)}, its components, and
how it can be predicted using time series analysis. From projecting the
spread of emerging pathogens to assessing the long-term effects of
public health policies, predictive modelling offers valuable insights
into the future of global health. As discussed in @sec-04-causes_risks,
incorporating indicators such as the **SDI** can offer a more
comprehensive understanding of how social and economic determinants
influence health outcomes.
The SDI is a **composite index** developed by the Global Burden of
Disease (GBD) Study\index{Global Burden of Disease (GBD)} to capture a
country's level of socio-demographic development. It combines social
and demographic factors into a single comprehensive measure of a
population's health and well-being.
It is based on three components:
1. Total fertility rate (TFR)\index{Total Fertility Rate (TFR)} among
women under age 25 (TFU25)
2. Average educational attainment in the population over age 15
(EDU15+)\index{Education level over age 15 (EDU15+)}
3. Income per capita\index{Income per capita} (lag-distributed)
It ranges from 0 to 1 but is usually multiplied by 100 to give a scale
of 0 to 100.
**Standard Calculation of SDI**
The SDI is calculated as the geometric mean of these three components:
$$
SDI = \sqrt[3]{TFU25 \times EDU15+ \times LDI}
$$ {#eq-sdi}
Each component TFU25, EDU15+, and LDI is normalised before taking the
geometric mean to ensure they are on a comparable scale.
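As a quick numeric illustration of @eq-sdi (with made-up,
already-normalised component values, where higher always means more
developed):

```{r}
# Geometric mean of three normalised components (hypothetical values)
sdi <- function(tfu25, edu15, ldi) (tfu25 * edu15 * ldi)^(1/3)
sdi(0.8, 0.7, 0.9) * 100   # rescaled to the 0-100 range, roughly 79.6
```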
In particular, the **Total Fertility Rate (TFR)**, defined as the
average number of children a woman would have over her lifetime given
current age-specific fertility rates, is a key demographic indicator
with broad implications for national well-being.
**Standard Calculation of TFR**
$$
\text{TFR} = \sum (\text{ASFR}_i \times 5)
$$ {#eq-tfr}
where $\text{ASFR}_i$ represents the age-specific fertility
rates\index{Age Specific Fertility Rates (ASFR)} for age group $i$ , and
the sum is typically calculated over all childbearing age groups (often
5-year intervals from ages 15-49).
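A direct computation of @eq-tfr from a vector of hypothetical
age-specific rates for the seven 5-year groups spanning ages 15-49
looks like this (the rates are illustrative, not real data):

```{r}
# Hypothetical ASFRs (births per woman per year) for groups 15-19, 20-24, ..., 45-49
asfr <- c(0.02, 0.08, 0.10, 0.08, 0.04, 0.01, 0.002)
tfr <- sum(asfr * 5)   # each rate applies over 5 years of exposure
tfr                    # 1.66 children per woman
```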
The TFR formula is specific to calculating fertility rates, while the
SDI formula incorporates a specific **subset of the TFR (TFU25)** along
with education and income metrics to create a broader socio-demographic
index.
In general, the level of this index helps identify key drivers in the
development of health outcomes. SDI can be used as a predictor in
predictive models to estimate the burden of diseases, mortality rates,
and other health metrics.
For example, the relationship between SDI and the incidence of a disease
can be modelled using a logistic regression\index{Logistic regression}
model:
$$
\log \left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 \times \text{SDI} + \beta_2 \times X_2 + \cdots + \beta_k \times X_k
$$ {#eq-logit-sdi}
where $p$ is the probability of the incidence of the disease,
$\log \left( \frac{p}{1 - p} \right)$ is the log-odds (logit) of the
disease incidence, $\beta_0$ is the intercept term,
$\beta_1, \beta_2, \ldots, \beta_k$ are the coefficients for the
predictor variables, and $X_2, \ldots, X_k$ are other predictor
variables that might influence the disease incidence.
To convert the `log-odds`\index{Log-odds} back to the **probability of
disease incidence** given the value of SDI:
$$
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{SDI})}}
$$ {#eq-prob-sdi}
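@eq-prob-sdi is the standard inverse-logit transform; with hypothetical
coefficient values it can be evaluated directly (the intercept and
slope below are made up for illustration):

```{r}
# Convert log-odds to a probability for hypothetical coefficients
inv_logit <- function(x) 1 / (1 + exp(-x))
beta0 <- -2                          # hypothetical intercept
beta1 <- 1.5                         # hypothetical SDI coefficient
inv_logit(beta0 + beta1 * 0.6)       # predicted incidence probability at SDI = 0.6
```

The result is a probability of about 0.25, and it always stays between
0 and 1 regardless of the coefficient values.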
### SDI Data and Packages
In this example we use the SDI values from 1990 to 2019 for 3 locations
plus the Global region, and evaluate how the average SDI differs over
time.
We start by loading the data and the required libraries. We use the
`{hmsidwR}` package, which contains the SDI data from 1990 to 2019 in
the `sdi90_19` object, and the `{fpp3}` [@forecast] package for time
series analysis, mentioned in @sec-07-timeseries. We also show how to
use the `|>` native pipe operator to filter and manipulate the data,
instead of the `%>%` pipe operator from the `{dplyr}` package.
```{r}
library(tidyverse)
library(hmsidwR)
library(fpp3)
```
```{r}
hmsidwR::sdi90_19 |> head()
```
```{r}
#| label: fig-sdi-france
#| fig-cap: "Social Demographic Index (SDI) from 1990 to 2019"
#| fig-alt: "Social Demographic Index (SDI) from 1990 to 2019"
sdi90_19 |>
  filter(location %in% c("Global", "Italy",
                         "France", "Germany")) |>
  ggplot(aes(x = year, y = value,
             linetype = location)) +
  geomtextpath::geom_textline(aes(label = location),
                              hjust = 0, vjust = 0) +
  labs(title = "Social Demographic Index (SDI) from 1990 to 2019") +
  theme(legend.position = "none")
```
Grouping the data by location and calculating the average SDI, we can
note that the highest average value is found in Germany, followed by
France, Italy, and the Global average.
```{r}
sdi90_19 |>
  group_by(location) |>
  reframe(avg = round(mean(value), 3)) |>
  arrange(desc(avg)) |>
  filter(location %in% c("Global", "Italy", "France", "Germany"))
```
Focusing on France as the location of interest, we analyse the time
series by decomposing the series into its components and evaluating the
autocorrelation\index{Autocorrelation} function to assess underlying
temporal patterns, and apply an
`ARIMA(1,0,0)`\index{AutoRegressive Integrated Moving Average (ARIMA)}
model.
```{r}
sdi_fr <- sdi90_19 |>
  filter(location == "France")

sdi_fr |>
  head() |>
  str()
```
The `{fpp3}` package is a meta-package that contains other packages
such as `{tsibble}`, `{tsibbledata}`, `{feasts}`, `{fable}`, and
`{fabletools}`.
The `{tsibble}` package prepares the data for use inside the model: the
`tsibble::as_tsibble()` function coerces our data to a `tsibble`
object:
```{r}
library(tsibble)
sdi_fr_ts <- tsibble::as_tsibble(sdi_fr, index = year)
sdi_fr_ts |> head()
```
```{r}
#| fig-cap: "Social Demographic Index (SDI) in France"
#| fig-alt: "Social Demographic Index (SDI) in France"
#| label: fig-sdi-france2
#| fig-width: 5
#| fig-height: 3
sdi_fr_ts |> autoplot()
```
Then, the `fabletools::model()` function is used to train and estimate
models. In this case we use the `STL()` (Multiple seasonal decomposition
by Loess) function, which decomposes a time series into seasonal, trend
and remainder components.
```{r}
dcmp <- sdi_fr_ts |>
  model(stl = STL(value))

components(dcmp) |> head()
```
There are three main types of time series patterns: trend, seasonality,
and cycles. When a time series is decomposed into components, the trend
and cycle are typically combined into a single trend-cycle component,
often referred to simply as the trend for the sake of simplicity. As a
result, a time series can be understood as comprising three components:
a trend-cycle component, a seasonal component, and a remainder
component, which captures any other variations in the series[@forecast].
The components of a time series can be additive or multiplicative,
depending on the nature of the data:
$$y_t = S_t + T_t + R_t$$ {#eq-additive}
$$y_t = S_t \times T_t \times R_t$$ {#eq-multiplicative}
$$\log(y_t) = \log(S_t) + \log(T_t) + \log(R_t)$$ {#eq-log-decomposition}
where:
- $y_t$ is the observed value of the time series at time t,
- $S_t$ represents the seasonal component, capturing regular
fluctuations that repeat over fixed periods (e.g., annually or
quarterly),
- $T_t$ is the trend component, reflecting the long-term progression
or direction in the data (such as gradual increases or declines),
- $R_t$ denotes the remainder or residual component, accounting for
short-term noise or random variation not explained by seasonality or
trend.
The additive form (@eq-additive) assumes that the components combine
additively and are independent of one another, which is suitable when
the seasonal variation remains relatively constant over time. The
multiplicative form (@eq-multiplicative) is used when seasonal or
residual effects vary proportionally with the trend (e.g., higher
variability during periods of higher values). Finally,
@eq-log-decomposition represents a log-transformed multiplicative
model, which stabilises variance and allows for additive
decomposition\index{Decomposition} on the logarithmic scale; this form
is often preferred when dealing with exponential growth or
heteroscedasticity\index{Heteroscedasticity}.
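A one-line numeric check (with arbitrary positive component values at a
single time point) confirms that taking logs turns the multiplicative
form @eq-multiplicative into the additive form @eq-log-decomposition:

```{r}
# Arbitrary positive components at one time point
S_t <- 1.2; T_t <- 50; R_t <- 0.98
y_t <- S_t * T_t * R_t                               # multiplicative model
all.equal(log(y_t), log(S_t) + log(T_t) + log(R_t))  # TRUE
```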
The `autoplot()` function is from the `{fabletools}` package and allows
for the visualization of the components. In particular, the
**remainder** component shown in the bottom panel is what is left over
when the seasonal and trend-cycle components have been subtracted from
the data.
```{r}
#| layout-ncol: 2
#| label: fig-decomposition
#| fig-cap: "Components of SDI in France"
#| fig-subcap:
#| - "Components of SDI in France"
#| - "Trend line of SDI in France"
#| fig-alt: "Components of SDI in France"
components(dcmp) |> fabletools::autoplot()

components(dcmp) |>
  as_tsibble() |>
  autoplot(value) +
  geom_line(aes(y = trend),
            linetype = "dashed") +
  labs(title = "Trend line of SDI in France")
```
### Autocorrelation and Stationarity
Autocorrelation\index{Autocorrelation} is central to time series
analysis because it captures whether past values influence future
values. If a variable today is similar to its value one period ago, or
$k$ periods ago, we say it is autocorrelated at lag 1, or more
generally at lag $k$.
$$
r_k = \frac{\sum_{t = k + 1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t = 1}^{T} (y_t - \bar{y})^2}
$$ {#eq-autocorr}
where:
- $r_k$ is the autocorrelation coefficient at lag $k$,
- $y_t$ is the value of the time series at time $t$,
- $\bar{y}$ is the mean of the series,
- $T$ is the total number of time points.
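@eq-autocorr can be checked against base R's `acf()` on simulated data
(a sketch; the series below is a random walk, not the SDI series):

```{r}
# Manual lag-1 autocorrelation versus stats::acf() on simulated data
set.seed(1)
y <- cumsum(rnorm(100))   # a random walk is strongly autocorrelated
ybar <- mean(y); n <- length(y); k <- 1
r_k <- sum((y[(k + 1):n] - ybar) * (y[1:(n - k)] - ybar)) / sum((y - ybar)^2)
r_k
acf(y, lag.max = 1, plot = FALSE)$acf[2]   # same value
```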
Time series that exhibit **no autocorrelation**, that is, no
predictable relationship between observations at different time points,
are referred to as **white noise series**\index{White noise}: each
value is essentially a random draw from the same distribution, with
constant mean and variance and no discernible pattern.
On the contrary, the presence of **autocorrelation**, or statistically
significant correlations between current and past values, indicates the
absence of white noise. This means the time series has temporal
structure, which can be modelled for prediction. For example, positive
autocorrelation suggests that high (or low) values tend to be followed
by similarly high (or low) values, while negative autocorrelation
implies an alternating pattern.
$$
r_k =
\begin{cases}
> 0 & \text{positive autocorrelation (e.g., trends)} \\
< 0 & \text{negative autocorrelation (e.g., alternating pattern)} \\
\approx 0 & \text{no autocorrelation (white noise)}
\end{cases}
$$ {#eq-autocorr-res}
To check for autocorrelation in our data, we can use the `ACF()`
function from the `{feasts}` package:
```{r}
#| fig-cap: "Autocorrelation Function of SDI in France. The blue dashed lines represent the 95% confidence interval for the null hypothesis: `There is no autocorrelation at lag k.`"
#| fig-alt: "Autocorrelation Function of SDI in France"
#| label: fig-acf
#| fig-width: 5
#| fig-height: 3
sdi_fr_ts |>
  ACF(value) |>
  autoplot()
```
The output clearly shows that our data displays autocorrelation. This
validates the use of time series models such as ARIMA, which rely on an
autocorrelated structure.
To assess whether the series is **stationary**\index{Stationarity}, we
use a dedicated test such as the **KPSS
test**\index{KPSS test (Kwiatkowski, Phillips, Schmidt, and Shin)} [@kpsstes2023]
or the **Augmented Dickey-Fuller test**\index{Augmented Dickey-Fuller test}.
If such a test, generally referred to as a **unit root test**,
indicates non-stationarity, we can apply a **first-order difference**
to remove the trend and convert the series into a stationary one.
To apply the KPSS test we use the `features()` function from the
`{fabletools}` package and set the feature to `unitroot_kpss`. The
`{urca}` package might be needed for this task; if so, run
`install.packages("urca")`.
```{r}
sdi_fr_ts |>
  features(value,
           features = unitroot_kpss)
```
A $p-value = 0.01$, which is less than 0.05, indicates strong evidence
against the null hypothesis of stationarity.
In summary, the presence of autocorrelation combined with the
**non-stationarity** result from the KPSS test suggests that the series
has a trend or random walk component, and is not stationary.
::: {.callout .callout-info}
A fundamental concept in time series analysis is
**stationarity**\index{Stationarity}. A time series is said to be
stationary when its statistical properties do not change over time. This
means the mean, variance, and autocorrelation structure remain constant
throughout the series.
On the other hand, a **non-stationary** time series exhibits properties
that change over time—for example, a long-term trend, changing
variability, or evolving seasonal patterns.
:::
In this case we apply first-order differencing (`differences = 1`) to
the data before applying models that assume stationarity, such as
ARIMA.
$$
y^{\prime}_t=y_t-y_{t-1}
$$ {#eq-diff}
Differencing transforms the series by subtracting the previous value
from each observation, thereby removing trends and stabilising the mean
over time. This transformation is essential because non-stationary time
series can lead to unreliable or misleading model results.
```{r}
# Apply first-order differencing to remove trend
sdi_fr_diff <- sdi_fr_ts |>
  mutate(diff_value = difference(value, differences = 1))

sdi_fr_diff |> head()
```
```{r}
#| fig-cap: "First-order Differenced SDI (France)"
#| fig-alt: "Differencing Function of SDI in France"
#| label: fig-diff
#| fig-width: 5
#| fig-height: 3
sdi_fr_diff |>
  ggplot(aes(x = year, y = diff_value)) +
  geom_line() +
  labs(title = "First-order Differenced SDI (France)",
       x = "Year", y = "Differenced SDI")
```
### Partial Autocorrelations
The **Partial Autocorrelation Function
(PACF)**\index{Partial Autocorrelation Function (PACF)} removes the
effects of intermediate lags. It is a measure of the direct
relationship between a time series and its own lagged values. The PACF
allows for a more specific identification of how many lags to include
in the $AR$ part of the `ARIMA` model: the number of autoregressive
($AR$) terms ($p$) is determined by the number of significant lags in
the PACF plot.
```{r}
#| layout-ncol: 2
#| fig-cap: "Partial Autocorrelations of SDI in France"
#| fig-subcap:
#| - "PACF"
#| - "gg_tsdisplay"
#| fig-alt: "Partial Autocorrelations of SDI in France"
#| label: fig-pacf
sdi_fr_ts |>
PACF(value) |>
autoplot()
sdi_fr_ts |>
gg_tsdisplay(y = value, plot_type = "partial")
```
The PACF shows a significant spike at lag 1, indicating that the first
lag is important for predicting the current value. The subsequent lags
are not significant, suggesting that only the first lag should be
included in the ARIMA model.
### ARIMA Model
The **ARIMA**\index{AutoRegressive Integrated Moving Average (ARIMA)}
model forecasts time series values based on past observations.
::: {.callout-note}
In time series analysis, the terms forecast\index{Forecast} and
predict\index{Predict} are sometimes used interchangeably, but there’s a
subtle distinction:
- Forecast typically refers to estimating future values based on a
model trained on historical data.
- Predict can refer more generally to estimating outcomes, including
within-sample estimates (e.g., fitting values during model
training).
:::
An **auto-ARIMA** procedure establishes the best fit by combining
autoregressive, moving average, and differencing components. However,
careful analysis of stationarity, autocorrelations, and residuals is
still required to fine-tune the model and improve forecasting accuracy.
The `ARIMA(value)` specification automatically selects the best $p$,
$d$, and $q$ values.
In general, the ARIMA model is defined as:
$$
\text{ARIMA}(p,d,q) = \text{AR}(p) + \text{I}(d) + \text{MA}(q)
$$ {#eq-arima-model}
where:
- $\text{AR}(p)$ is the autoregressive part of the model, which uses
past values of the time series to predict future values,
- $\text{I}(d)$ is the integrated part of the model, which represents
the differencing of the time series to make it stationary,
- $\text{MA}(q)$ is the moving average part of the model, which uses
past forecast errors to predict future values.
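Written out with the backshift operator $B$ (where $B y_t = y_{t-1}$), the general $\text{ARIMA}(p,d,q)$ model takes the standard form:

$$
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right)\varepsilon_t
$$ {#eq-arima-backshift}

where the $\phi_i$ are the autoregressive coefficients, the $\theta_j$ are the moving average coefficients, $c$ is a constant (drift) term, and $\varepsilon_t$ is white noise.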
The `ARIMA()` function from the `{fable}` package automatically selects
the appropriate degree of differencing needed to achieve stationarity
and fits the ARIMA model accordingly. The `report()` function provides a
summary of the fitted model, including parameter estimates and
diagnostic statistics.
```{r}
fit <- sdi_fr_ts |>
model(ARIMA(value))
fit |> report()
```
The model suggests that the differenced series follows a stable upward
trend, with moderate autocorrelation and low residual variance, implying
a strong and reliable forecast model for the observed data.
In particular, the selected `ARIMA(1,1,0)` model indicates:
- One autoregressive ($AR$) term ($ar1 = 0.5971$): there is moderate
    persistence; the differenced value at time $t$ is positively
    correlated with the value at the previous time point.
- First-order differencing ($d = 1$) was applied to achieve
stationarity
- No moving average ($MA$) term ($q = 0$)
- The model includes a drift term ($\text{constant} = 0.0013$), which
accounts for a consistent upward trend in the differenced series.
- Standard errors (s.e.) for both coefficients are small, suggesting
estimates are fairly precise.
-   $\sigma^2 = 8.634 \times 10^{-7}$: Very low residual variance,
    indicating a good fit.
- $\text{Log-likelihood} = 162.46$: Used for model comparison.
- $\text{AIC} = -318.92$, $\text{AICc} = -317.96$,
$\text{BIC} = -314.82$: All are low values, which generally indicate
a better-fitting model when compared to alternatives.
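Putting the reported coefficients together, the fitted `ARIMA(1,1,0)` model with drift can be written (approximately) for the differenced series $y^{\prime}_t = y_t - y_{t-1}$ as:

$$
y^{\prime}_t = 0.0013 + 0.5971 \, y^{\prime}_{t-1} + \varepsilon_t
$$

so each year's change in SDI is a small positive drift plus about 60% of the previous year's change, perturbed by white noise.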
### ARIMA Forecast
The `forecast()` function from the `{fable}` package is used to generate
forecasts based on the fitted ARIMA model. The `h` parameter specifies
the forecast horizon, which is the number of future time points to
predict. In this case, we use $h = 10$ to forecast the next 10 years.
```{r}
#| fig-width: 5
#| fig-height: 3
#| fig-cap: "Forecast of SDI in France"
#| fig-alt: "Forecast of SDI in France"
#| label: fig-forecast
fit |>
forecast(h = 10) |>
ggplot(aes(x = year, y = .mean)) +
geom_line(data = sdi_fr_ts,
aes(y = value)) +
  geom_line(color = "purple", linetype = "dashed") +
labs(title = "Forecast of SDI in France")
```
### Model Ensembles
In time series analysis, an individual model trained on historical data
to generate forecasts is known as a **single
learner**\index{Single learner}, such as a single ARIMA model. However,
a single learner may not be sufficient to capture the full complexity of
the data, especially when dealing with intricate patterns, nonlinear
relationships, or multiple influencing factors, leading to poor
generalisation on unseen data. Single models also often fail to account
for crucial factors such as external shocks and structural changes, so
they may miss important aspects of the data, including trend and
seasonality.
This is where **ensemble learning**\index{Ensemble learning} comes into
play.
By aggregating the predictions from **multiple single learners**, we can
form a **model ensemble**\index{Model ensemble}, to improve predictive
performance and generate more robust forecasts.
**Ensemble learning**\index{Ensemble learning} is a machine learning
technique that combines the predictions of diverse individual models to
make more accurate predictions and reduce overfitting. Techniques such
as bagging\index{Bagging}, boosting\index{Boosting}, and
stacking\index{Stacking} can reduce both the variance and the bias of
the predictions, creating a stronger collective model that outperforms
its components.
These methods are widely used in machine learning and predictive
modelling to enhance the predictive power of the model and achieve
better generalisation performance on unseen data.
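A simple combination ensemble can also be built directly in `{fable}` by averaging model columns of a fitted model table (mable); the `ETS()` component below is an assumption added for illustration, not a model fitted elsewhere in this chapter:

```{r}
# Sketch: average the forecasts of an ARIMA and an ETS model.
# Arithmetic on mable columns creates a combination model in {fabletools}.
ens_fit <- sdi_fr_ts |>
  model(
    arima = ARIMA(value),
    ets   = ETS(value)
  ) |>
  mutate(ensemble = (arima + ets) / 2)

ens_fit |>
  forecast(h = 10) |>
  filter(.model == "ensemble")
```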
In this example, we will use the `ARIMA()` function to fit multiple
ARIMA models with different orders and compare their performance using
the `glance()` function from the `{fabletools}` package.
```{r}
caf_fit <- sdi_fr_ts |>
model(
arima210 = ARIMA(value ~ pdq(2, 1, 0)),
arima013 = ARIMA(value ~ pdq(0, 1, 3)),
stepwise = ARIMA(value),
search = ARIMA(value, stepwise = FALSE)
)
```
```{r}
caf_fit |>
pivot_longer(cols = everything(),
names_to = "Model name",
values_to = "Orders")
```
```{r}
glance(caf_fit) |>
arrange(AICc) |>
select(.model:BIC)
```
The results of the model ensemble analysis reveal that both the
**stepwise** and **search** ARIMA models outperform the others in terms
of model fit, as indicated by their lower AIC and AICc values. These two
models exhibit identical log-likelihood values and minimal residual
variance, suggesting that they provide the most accurate forecasts while
maintaining model simplicity. In contrast, the **arima210** and
**arima013** models show slightly higher AIC and AICc values, indicating
that they are less efficient at capturing the underlying patterns in the
data.
Based on these findings, we select the **search** model with
ARIMA(1,1,0), and visualize the residuals of the fitted model.
```{r}
#| layout-ncol: 2
#| fig-cap: "Residuals of ARIMA Models"
#| fig-subcap:
#| - "Residuals of ARIMA Models"
#| - "Forecast of SDI in France"
#| fig-alt: "Residuals of ARIMA Models"
#| label: fig-residuals
caf_fit |>
select(search) |>
gg_tsresiduals()
caf_fit |>
forecast(h = 5) |>
filter(.model == "search") |>
ggplot(aes(x = year, y = .mean)) +
geom_line(data = sdi_fr_ts,
aes(y = value)) +
geom_line(color = "purple", linetype = "dashed")
```
In conclusion, the analysis of the **Socio-Demographic Index (SDI)** in
France from 1990 to 2019 demonstrates the effectiveness of time series
analysis and mixed models in understanding and forecasting complex
health-related data. By employing techniques such as decomposition,
autocorrelation analysis, and ARIMA modelling, we can gain valuable
insights into the underlying patterns and trends in the data. The use of
ensemble learning further enhances the predictive performance of the
models, allowing for more accurate forecasts and better-informed
decision-making in public health. However, additional model validation
and diagnostic checks, such as residual analysis and cross-validation,
should be conducted to ensure that these models generalize well to
unseen data and are not overfitting.
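One such check, time series cross-validation, can be sketched with `stretch_tsibble()` from `{tsibble}`; the initial window of 15 training years and the one-step horizon are illustrative choices:

```{r}
# Sketch: expanding-window cross-validation of the automatic ARIMA model.
# Each fold adds one year to the training window and forecasts one step ahead;
# accuracy() then compares the fold forecasts against the observed values.
sdi_fr_ts |>
  stretch_tsibble(.init = 15, .step = 1) |>
  model(ARIMA(value)) |>
  forecast(h = 1) |>
  accuracy(sdi_fr_ts)
```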
## Mixed Models
**Mixed models**\index{Mixed models} are a powerful statistical tool for
analysing data with a hierarchical or nested structure, where
observations are grouped within higher-level units. By incorporating
both **fixed** and **random effects**, mixed models can account for the
variability within and between groups, providing a flexible framework
for modelling complex data structures. Mixed models are widely used in
various fields, and particularly useful in fields like healthcare and
infectious diseases where data might be collected across different
subjects or time points.
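As a minimal sketch, a model of this kind could be specified with the `{lme4}` package; the data frame `yld_data` and its columns (`yld_rate`, `age_group`, `region`) are hypothetical and serve only to illustrate the formula syntax:

```{r}
#| eval: false
library(lme4)

# Fixed effects for age group and year; a random intercept for each region
# captures between-region variability around the overall trend.
fit_mm <- lmer(yld_rate ~ age_group + year + (1 | region), data = yld_data)
summary(fit_mm)
```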
In the following section, we will discuss the application of mixed
models in estimating the **Years Lived with Disability
(YLDs)**\index{Years Lived with Disability (YLDs)} and the different
ratios used in forecasting non-fatal disease burden.
### Mixed-Effects Models in Estimating YLDs
Mixed-effects models are particularly useful in estimating YLDs because
they can handle hierarchically structured data.
**Fixed Effects**\index{Fixed effects}: These are the primary effects of
interest and include variables such as age, gender, year, and specific
interventions or treatments.
**Random Effects**\index{Random effects}: These capture the variability