-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
799 lines (613 loc) · 26.2 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
---
title: "Micro-Claims Analysis and Modelling: Claim Status and Payment Model"
author: "Jimmy Briggs"
date: "May 9, 2016"
output:
html_notebook:
toc: yes
toc_depth: 2
code_folding: hide
highlight: tango
theme: readable
html_document:
toc: yes
toc_float: yes
toc_depth: 2
code_folding: hide
theme: readable
highlight: tango
number_sections: yes
resource_files:
- index_files/figure-html/zm_plots-1.png
- index_files/figure-html/unnamed-chunk-4-1.png
- index_files/figure-html/sim_v_actual-1.png
- index_files/figure-html/cm_logit_plot-1.png
runtime: shiny
---
```{r setup, echo=FALSE, eval=TRUE}
knitr::opts_chunk$set(
warning = FALSE, message = FALSE, error = FALSE,
cache = TRUE, autodep = TRUE
)
```
[Source Code](https://github.com/jimbrig/claims-pm)
# Introduction
## Purpose and Scope
This report describes a model for predicting incremental paid losses on
an individual claim basis ("The Model"). The model uses a mixture of
predictive modeling and simulation techniques to mimic real world claim
development.
Note that this report is a much more simplified version of the full
ensemble of model's used in the full analysis performed and is merely
meant as a starting point for teaching/learning purposes.
## Background
Models, by definition, use simplified assumptions of reality to reveal
information or predict events based on the underlying data. Models which
closely mimic the fundamental forces driving the data have the best
chance of providing valuable insights.
Due to data and computing limitations, actuaries have traditionally
aggregated loss information by policy, accident, or calendar period to
project future losses. By aggregating losses, the actuary loses valuable
claim level information.
The model assumes that individual claims and their claim level
characteristics are the fundamental drivers of future payments.
Therefore, In accordance with the philosophy that the best models are
those which mimic reality most closely, the model uses information on an
individual claim level, and runs statistically rigorous techniques to
fit and simulate individual claim development.
## Overview
The model is meant to be a starting point for anyone looking to discover
new and advanced methods for performing micro-claims analysis and
machine-learning modelling techniques that provide insights beyond the
typical aggregated actuarial practices in P&C. Additionally, the model
is a showcase for the statistical power that the R Programming language
can provide, specifically for those with apriori statistical and
mathematical knowledge in applied predictive analytics and probability
theory.
I decided to use only a few very common predictor variables so the model
could easily be applied to other data sets. For transparency and to aid
interested individuals, I provide this report with access to the `R`
code used to fit and run predictions. The code can be viewed by clicking
the `code` boxes on the right side of the report. The `R` savvy reader
can run the `R` code to reproduce the output, apply the model to other
data sets, and expand and improve upon the model.
The model is only applicable to reported claims and their corresponding
incremental payments. IBNR claim predictions are beyond the scope of
this model.
## Vocabulary
For consistency and clarity I use the following terms:
- **Response Variable** The value being predicted by the model (claim
status and claim incremental payment in this report)
- **Predictor Variable** The values used to fit the model or to
predict the response variable (i.e. I use certain claim
characteristics as predictor variable to model and predict the
response variable)
## Data
In the spirit of mimicking the real world, this report communicates the
model through a working example using real auto-liability data supplied
mostly from the [insuranceData R
Package](https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf)
([GitHub Repo](https://github.com/cran/insuranceData)) as well as
publicly available data supplied by the **CAS**.
Note that this specific model has been tuned to form predictions related
specifically to *Bodily Injury* claims only, as these claims drive the
foundation of the risk behind Auto Liability reserving and rate-making.
That being said, although the results of the model are specific to Auto
Liability in this instance, the modeling techniques and machine-learned
tuning procedures can easily be generalized to other lines of coverage,
areas of business, and risk portfolios.
- R Code for Data Load and Ingestion:
```{r load_data, message=FALSE}
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(caret, warn.conflicts = FALSE)
library(lubridate)
library(ggplot2, warn.conflicts = FALSE)
library(knitr)
library(DiagrammeR)
library(scales)
library(shiny)
require(e1071)
require(bindrcpp)
library(qs)
library(webshot)
# turn off scientific notation in printing
options(scipen = 999)
# load data
claims <- qs::qread("model-claims.qs")
# development age to project from
# e.g. if I set this to 18 I will use information available
# as of the 18 month evaluation to predict stuff at age 30)
devt_period <- 30
# year to predict with model
# evaluations at or greater than this time will not be included in
# the model fit
predict_eval <- as.Date("2010-11-30")
# remove unneeded data
claims <- dplyr::filter(claims,
eval <= predict_eval,
eval >= predict_eval - years(5)) %>%
dplyr::select(eval, devt, claim_number,
status, tot_rx, tot_pd_incr,
status_act, tot_pd_incr_act)
devt_periods_needed <- seq(6, devt_period + 12, by = 12)
# for showing in triangle
claims_display <- dplyr::filter(claims, devt %in% devt_periods_needed)
# only need the claims at the selected `dev_period`
claims <- dplyr::filter(claims, devt %in% devt_period)
```
I am using data from fiscal years 2003 to
`r lubridate::year(predict_eval) + 1`.
Fiscal years begin at 6/1 of the year prior to the fiscal year and end
at 5/31 of the fiscal year (i.e. fiscal year 2003 includes claims which
occurred between 6/1/2002 and 5/31/2003).
The claims are evaluated at 11/30 of each year from 2002 to
`r lubridate::year(predict_eval)`.
I created a large data set containing all the data used in fitting the
model and making predictions from the model.
The original data and the script used to prepare the large data set used
in this report is located in the `data/` directory and details can be
viewed in this report's Appendix.
# The Model
The model uses several advanced statistical techniques. For compactness
and because I lack the expertise to explain everything in detail, a
comprehensive explanation of these techniques is beyond the scope of
this analysis.
Where-ever possible I have included links to additional resources for
diving into the statistics behind the model. The statistics will be only
very briefly touched upon as each technique is used in the model fit.
### Train and Test Data
The first step in fitting the model is to feed training data into the
model.
I am using all data from fiscal years prior to
`r lubridate::year(predict_eval) + 1` from development time
`r devt_period` months to `r devt_period + 12` months to fit the model.
Later I will pass the test data (i.e. claims from fiscal year
`r lubridate::year(predict_eval) + 1` at `r devt_period` months) to
predict the status of each of these claims at `r devt_period + 12`
months and the incremental payment per claim from `r devt_period` to
`r devt_period + 12` months.
## Model Overview Diagram
The following diagram illustrates how the model is fit:
```{r diagram1}
mermaid("
graph TD
A(Claim Train Data)-->B{Fit Closure Model}
A(Claim Train Data)-->C[Remove Closed-Closed]
A(Claim Train Data)-->D[Remove Zero Paid]
C-->E{Fit Zero Model}
D-->F{Fit Payment Model}
")
```
After fitting the models (pictured as a rhombus in the above diagram)
with the claims training data I can use the three models to predict a
probability for status and zero payment or a dollar value for
incremental payments on the test data.
The claims test data flows through the following diagram to arrive at
the final output:
```{r diagram2, echo=FALSE}
mermaid("
graph TD
A(Claim Test Data)-->B{Closure Model + Simulation}
B-->C(Claims Simulated to be Closed Closed)
B-->D(All Other Claims)
D-->E{Zero Model + Simulation}
E-->F(Claims Simulated with Zero Payment)
E-->G(Claims Simulated with Payment)
G-->H{Payment Model + Simulation}
H-->J(Output: Claims with Simulated Status and Payment)
C-->I[Simulated Payment set to Zero]
F-->I[Simulated Payment set to Zero]
I-->J
", height = 700)
```
At each model (pictured as a rhombus in the above diagram) the claim in
the test data is given a predicted value based on the model. I then run
a simulation based on this predicted value to model real world
variability.
At each step the simulated claims are passed to the next model based on
the results of the simulation in the previous model's simulated
results/probabilities.
## Claim Closure Model
### Assumptions
To predict whether a claim will close within a given period of time, I
use a logistic regression with *center, scale, and
[Yeo-Johnson](https://www.stat.umn.edu/arc/yjpower.pdf) transformations*
applied to all continuous predictor variables.
I am modeling the following variables:
**Response Variable**
- *status_act* Actual claim status at `r devt_period + 12` months.
**Predictor Variables**
- *status* Claim status at `r devt_period` months ("C" for Closed and
"O" for open).
- *tot_rx* Total case reserve dollar value at `r devt_period`.
- *tot_pd_incr* Total incremental paid loss dollar value between
`r devt_period - 12` months and `r devt_period` months.
### Data Preparation
```{r cm_data_prep}
# remove data the same valuation or newer than the prediction eval
# Only claims from valuations before the valuation I am predicting will be used
# to fit the model
model_data <- dplyr::filter(claims, eval < predict_eval)
```
### Model Fit
The model fit uses **10-fold cross validation** to optimize coefficient
estimation and a *stepwise Akaiki Information Critereon (AIC)* algorithm
for feature selection:
```{r cm_model_fit, message=FALSE, warning=FALSE}
cm_model <- caret::train(status_act ~ status + tot_rx + tot_pd_incr,
data = model_data,
method = "glmStepAIC",
trace = FALSE,
preProcess = c("center", "scale", "YeoJohnson"),
trControl = trainControl(method = "repeatedcv",
repeats = 2))
```
```{r cm_summary, results = "asis"}
cm_summary <- cm_model$results[, -1]
kable(cm_summary,
digits = 5,
row.names = FALSE)
```
For a more detailed statistical summary of the claim closure model fits
see Appendix `cm_summary`
```{r cm_prob, warning=FALSE, message=FALSE}
cm_probs <- cbind(model_data, predict(cm_model,
newdata = model_data,
type = "prob"))
# find the logit value
cm_probs$logits <- log(cm_probs$O / cm_probs$C)
```
In the plots below the blue line indicates the fitted probability of the
claim at age `r devt_period` months being open at age
`r devt_period + 12` months.
The red dots at the top and bottom are the actual status for the
training data at `r devt_period + 12` months (i.e. model fits the blue
line to the red dots).
```{r cm_logit_plot, fig.height=3, warning=FALSE}
cm_probs$status_act <- ifelse(cm_probs$status_act == "C", 0, 1)
ggplot(cm_probs, aes(x = logits, y = status_act)) +
geom_point(colour = "red",
position = position_jitter(height = 0.1, width = 0.1),
size = 0.5,
alpha = 0.2) +
geom_smooth(method = "glm", method.args = list(family = "binomial"),
size = 1) +
ylab("Probability Open") +
xlab("Logit Odds") +
ggtitle(paste0("Age ", devt_period, " to ", devt_period + 12, " Months Claim Open Probabilities"))
```
## Zero Payment Model
### Assumptions
The zero payment model is similar to the claim closure model in that I
am looking at a *binomial response variable*. I am modeling whether the
claim has zero or nonzero incremental payments.
I remove all claims that have a status at `r devt_period` months of
closed and a status of closed at `r devt_period + 12` (I refer to these
claims as `closed-closed` claims).
Additionally, I assume that all of these claims will ultimately have zero incremental payments in the final payment model.
**Reponse Variable**
- *zero* Factor indicating whether the claim had zero or nonzero
incremental payments between age `r devt_period` months and
`r devt_period + 12` months.
**Predictor Variables**
- *status_act* Actual claim status at `r devt_period + 12` months
("C"" for Closed and "O"" for open).
- *status* Claim status at `r devt_period` months
- *tot_rx* Total case reserve dollar value at `r devt_period` months.
- *tot_pd_incr* Total incremental paid loss dollar value between
`r devt_period - 12` months and `r devt_period` months.
Note: I could use `status_act` as a predictor variable here because for
the test data I will simulate the status at `r devt_period + 12` first
and then use that simulated status as a predictor variable in the zero
payment model.
### Data Prep
```{r zm_data_prep}
# remove all claims that have a closed closed status from the data
# these will be set to incremental payments of 0
zm_model_data <- filter(model_data, status == "O" | status_act == "O")
# Add in response variable for zero payment:
zm_model_data$zero <- factor(ifelse(zm_model_data$tot_pd_incr_act == 0,
"Zero", "NonZero"))
```
### Fit
I use the same data prepared for the claim closure model to fit the zero
payment model.
```{r zm_model_fit, cache=TRUE, message = FALSE, warning = FALSE}
zm_model <- caret::train(zero ~ status + status_act + tot_rx + tot_pd_incr,
data = zm_model_data,
method = "glmStepAIC",
trace = FALSE,
preProcess = c("center", "scale", "YeoJohnson"),
trControl = trainControl(method = "repeatedcv",
repeats = 2))
```
```{r zm_summary, results = "asis"}
zm_summary <- zm_model$results[, -1]
kable(zm_summary,
row.names = FALSE)
```
```{r zm_prob, warning=FALSE, message=FALSE}
zm_probs <- cbind(zm_model_data, predict(zm_model,
newdata = zm_model_data,
type = "prob"))
zm_probs$logits <- log(zm_probs$NonZero / zm_probs$Zero)
```
In the plots below, the blue line indicates the fitted probability of
the claim having a payments between age `r devt_period` and
`r devt_period + 12` months. The red dots at the top are the actual
claims with payments between age `r devt_period` and
`r devt_period + 12` months, and the dots at the bottom are the claims
with zero payments during this time period. (i.e. Zero Payment model
fits the blue line to the red dots)
```{r zm_plots, fig.height=3, warning=FALSE}
zm_probs$zero <- ifelse(zm_probs$zero == "Zero", 0, 1)
ggplot(zm_probs, aes(x = logits, y = zero)) +
geom_point(colour = "red",
position = position_jitter(height = 0.1, width = 0.1),
size = 0.5,
alpha = 0.2) +
geom_smooth(method = "glm", method.args = list(family = "binomial"),
size = 1) +
ylab("Payment Probability") +
xlab("Logit Odds") +
ggtitle(paste0("Age ", devt_period, " to ", devt_period + 12,
" Non-Zero Incremental Payment"))
```
## Incremental Payment Model
### Assumptions
The incremental payment model models incremental payments between
`r devt_period` and `r devt_period + 12`. The incremental payment model
uses a generalized additive model (GAM) with an integrated smoothness
estimation and a **quasi-poisson log link function** ;).
**Response Variable**
- *tot_pd_incr_act* Total incremental payment between `r devt_period`
and `r devt_period + 12` months.
**Predictor Variables**
- *status_act* The actual status at `r devt_period + 12` months.
- *tot_rx* Total case reserve dollar value at `r devt_period` months.
- *tot_pd_incr* Incremental payments between `r devt_period - 12` and
`r devt_period` months.
### Data Prep
```{r payment_data_prep}
#Take out zero pmnts:
nzm_model_data <- zm_model_data[zm_model_data$tot_pd_incr_act > 0, ]
```
### Fit
```{r nzm_model_fit}
# fit incremental payment model
nzm_model <- mgcv::gam(tot_pd_incr_act ~ status_act + s(tot_rx) + s(tot_pd_incr),
data = nzm_model_data,
family = quasipoisson(link = "log"))
```
```{r nzm_predict_training, warning=FALSE, message=FALSE}
nzm_fit <- cbind(nzm_model_data,
tot_pd_incr_sim = exp(predict(nzm_model, newdata = nzm_model_data)))
```
```{r nzm_plots, fig.height=3, warning=FALSE}
# plots to be determined
```
# Simulation
## Closure Status
```{r cm_predict, cache = TRUE}
set.seed(1234)
n_sims <- 2000
cm_pred_data <- dplyr::filter(claims, eval == predict_eval)
cm_probs <- cbind(cm_pred_data,
predict(cm_model, newdata = cm_pred_data, type = "prob"))
cm_pred <- lapply(cm_probs$O, rbinom, n = n_sims, size = 1)
cm_pred <- matrix(unlist(cm_pred), ncol = n_sims, byrow = TRUE)
cm_pred <- ifelse(cm_pred == 1, "O", "C")
cm_pred <- as.data.frame(cm_pred)
```
I use the probabilities returned from the closure model to simulate the
status of all of the claims.
I simulate each claim `r n_sims` times.
The table below shows selected age `r devt_period` claims after they had
their closure probability predicted by the closure model and their
status simulated using a simulated binomial random variable.
```{r cm_predict_table, results = "asis"}
cm_out <- cm_probs
cm_out <- dplyr::select(cm_out, claim_number, status, tot_rx, tot_pd_incr, O)
cm_out$status_sim <- cm_pred[, 1]
cm_out <- cm_out[c(1, 6, 24, 2, 10), ]
names(cm_out) <- c("Claim Number", "Status", "Case", "Paid Incre",
"Prob Open", "Sim Status")
kable(cm_out,
row.names = FALSE)
```
The `Prob Open` column is the probability that the age `r devt_period`
claim will be open at age `r devt_period + 12` as modeled in the closure
model. The `Sim Status` column is the result of a *Bernoulli simulation*
on each of those probabilities.
I am running this simulation `r n_sims` times to simulate `r n_sims`
closure scenarios.
The simulations allow me to determine the corresponding distribution's
confidence intervals.
## Zero Payment Model
Next the simulated claims with their simulated statuses have their
probability of having a non zero incremental payment simulated by the
zero payment model. This probability is then simulated using the same
random binomial simulation approach as used when simulating closure
status.
```{r zm_predict_data_prep}
# put closure model predictions together
cm_pred <- cbind(cm_probs[, c("claim_number"), drop = FALSE], cm_pred)
# gather `cm_pred` into a long data frame
cm_pred <- tidyr::gather(cm_pred, key = "sim_num",
value = "status_sim",
-claim_number)
# join `zm_pred_data` to predictions from closure model
# remove status_act and rename the simulated states as status_act
zm_pred_data <- left_join(cm_pred, cm_probs, by = "claim_number") %>%
dplyr::select(-status_act) %>%
dplyr::rename(status_act = status_sim)
# remove all claims that have a closed closed status from the data
# these will be set to incremental payments of 0
closed_closed_data <- dplyr::filter(zm_pred_data, status == "C" & status_act == "C")
zm_pred_data <- filter(zm_pred_data, status == "O" | status_act == "O")
```
```{r zm_predict, cache = TRUE}
zm_pred <- cbind(zm_pred_data,
predict(zm_model, newdata = zm_pred_data, type = "prob"))
zm_pred$zero_sim <- sapply(zm_pred$NonZero, rbinom, n = 1, size = 1)
zm_pred$zero_sim <- ifelse(zm_pred$zero_sim == 1, "NonZero", "Zero")
```
```{r zm_predict_table, results = "asis"}
zm_out <- zm_pred
zm_out <- dplyr::select(zm_out, claim_number, status, tot_rx, tot_pd_incr,
status_act, NonZero, zero_sim)
zm_out <- head(zm_out, 8)
names(zm_out) <- c("Claim Number", "Status", "Case", "Paid Incre",
"Sim Status", "Prob Non Zero", "Zero Sim")
kable(zm_out,
row.names = FALSE)
```
## Incremental Payment Simulation
Since I am only interested in predicting incremental payments for claims
that were simulated to have a non-zero incremental payment, all claims
that were closed at age `r devt_period` and were simulated to be closed
at `r devt_period + 12` will be given an incremental payment of zero.
Additionally, all claims that were simulated by the Zero Payment Model
to have a Zero payment will be given an incremental payment of zero.
```{r nzm_predict_data_prep}
# separate zeros from non zeros
zero_claims <- filter(zm_pred, zero_sim == "Zero")
nzm_pred <- filter(zm_pred, zero_sim == "NonZero")
```
Now for the final simulations I simulate all the claims that were
predicted to have a non-zero incremental payment.
```{r nzm_predict, cache = TRUE}
### Quasi Poisson Simulation
nzm_pred$tot_pd_incr_fit <- exp(predict(nzm_model, newdata = nzm_pred))
# use negative binomial to randomly disperse claims from predicted fit
nzm_pred$tot_pd_incr_sim <- sapply(nzm_pred$tot_pd_incr_fit,
function(x) {
rnbinom(n = 1, size = x ^ (1/5), prob = 1 / (1 + x ^ (4/5)))
})
```
```{r combine_predictions}
closed_closed_data$tot_pd_incr_sim <- 0
zero_claims$tot_pd_incr_sim <- 0
closed_closed_data$sim_type <- "Close_Close"
zero_claims$sim_type <- "Zero"
nzm_pred$sim_type <- "Non_Zero"
cols <- c("sim_num", "claim_number", "status_act", "tot_pd_incr_sim", "sim_type")
sim_1 <- closed_closed_data[, cols]
sim_2 <- zero_claims[, cols]
sim_3 <- nzm_pred[, cols]
full_sim <- rbind(sim_1, sim_2, sim_3)
kable(
full_sim[sample(1:nrow(full_sim), 20), ],
row.names = FALSE,
col.names = c("Sim Num", "Claim Num", "Sim Status", "Sim Payment", "Sim Type"))
```
## Results
### All Claims Aggregated by Simulation
```{r agg}
# find actual number of open claims and incremental payment dollars
pred_data_actuals <- mutate(cm_pred_data, status_act = ifelse(status_act == "C", 0, 1))
open_actual <- sum(pred_data_actuals$status_act)
payments_actual <- sum(pred_data_actuals$tot_pd_incr_act)
```
The blue dashed vertical line marks the actual number of open claims in
the test data at `r devt_period + 12` months development. The white
histogram shows the simulated distribution of open claims at
`r devt_period + 12` as determined from the simulation based on the
claim closure model.
```{r sim_v_actual, fig.height = 3, warning=FALSE, message=FALSE}
full_sim_agg <- mutate(full_sim, open = ifelse(status_act == "C", 0, 1)) %>%
group_by(sim_num) %>%
summarise(n = n(),
open_claims = sum(open),
incremental_paid = sum(tot_pd_incr_sim))
ggplot(full_sim_agg, aes(x = open_claims)) +
geom_histogram(fill = "white", colour = "black") +
ggtitle("Histogram of Simulated Open Claim Counts") +
ylab("Number of Observations") +
xlab("Open Claim Counts") +
geom_vline(xintercept = open_actual, size = 1,
colour = "blue", linetype = "longdash")
```
The blue dashed vertical line marks the actual incremental payments in
the test data between age `r devt_period` and `r devt_period + 12`. The
white histogram shows the simulated distribution of incremental payments
between age `r devt_period` and `r devt_period + 12` months for all
claims in the test data. The simulation is based on the incremental
payment model.
```{r plot}
ggplot(full_sim_agg, aes(x = incremental_paid)) +
geom_histogram(fill = "white", colour = "black") +
ggtitle("Histogram of Simulated Incremental Payments") +
ylab("Number of Observations") +
xlab("Incremental Payments") +
geom_vline(xintercept = payments_actual, size = 1,
colour = "blue", linetype = "longdash") +
scale_x_continuous(labels = dollar)
```
### Individual Claim
The blue dashed vertical line marks the actual incremental payments in
the test data for the claim in the `Select Claim Number` input box
between age `r devt_period` and `r devt_period + 12`.
```{r shiny, message=FALSE, warning=FALSE, cache=FALSE}
selectInput(
"sel_claim",
"Select Claim number",
choices = unique(claims$claim_number)[1:50],
selected = 2008146184
)
plot_data <- reactive({
indiv <- full_sim[full_sim$claim_number == input$sel_claim, ]
indiv_act <- claims[claims$claim_number == input$sel_claim, "tot_pd_incr_act"]
list(
indiv,
indiv_act
)
})
renderPlot({
ggplot(plot_data()[[1]], aes(x = tot_pd_incr_sim)) +
geom_histogram(fill = "white", colour = "black") +
ggtitle(paste0("Histogram of Simulated Incremental Payments for claim ", input$sel_claim)) +
ylab("Number of Observations") +
xlab("Incremental Payments") +
geom_vline(xintercept = plot_data()[[2]], size = 1,
colour = "blue", linetype = "longdash") +
scale_x_continuous(labels = dollar)
})
```
```{r table, cache=FALSE}
claim_stats <- reactive({
out <- claims[claims$claim_number == input$sel_claim, 3:8]
names(out) <- c("Claim Num", "Status", "Case", "Incemental Payment", "Actual Status", "Actual Payment")
out
})
renderTable({
claim_stats()
},
include.rownames = FALSE
)
```
# Conclusion
**WIP**
# Appendices
## A. Software
I used `R`, the free and open source statistical programming
environment, for all the data analysis, model fitting, simulations,
graphics, and data output.
The `caret` package was used extensively for the heavy lifting predictive modeling.
Detail of the `R` environment at the time this report is available below:
```{r session_code}
sessionInfo()
```
## Closure Model Summary Statistics
```{r}
summary(cm_model)
```
## Zero Payment Model Summary Statistics}
```{r }
summary(zm_model)
```
## Incremental Payment Model Summary Statistics
```{r }
summary(nzm_model)
```