generic_data_generation.Rmd

---
title: "Generic Data Generation"
author: "Gigi"
date: "`r Sys.Date()`"
output: 
  html_document: 
    toc: yes
---

```{r setup, include=FALSE}
rm( list = ls() )

knitr::opts_chunk$set(
  echo = TRUE,
  fig.height = 8,
  fig.width = 11,
  cache = FALSE
)

library( tidyverse )

source( 'lib.R' )
# source( 'utility.R' )
```

# Data Generation

The aim of this document is to help make data generation as easy as possible.

By using a set of generic functions hopefully it can be relatively easy to build up a complete data set from scratch.

## Setting initial parameters

```{r set_general_parameters}

```

## Building the initial table

Set any required variables and read in any tables that are required.

```{r set_tbl_options}
# Define basic column data types
col_types = cols(
    out_value = col_character(),
    odds = col_double()
)
```
```{r read_in_tbls}
# date related tables
years_odds_tbl <- read_csv( 
    'data_in/years.csv', 
    col_types = col_types 
) %>% 
    calcCumulative

months_odds_tbl <- read_csv( 
    'data_in/months.csv',
    col_types = col_types
) %>% 
    calcCumulative

year_month_adj_odds_tbl <- read_csv( 
    'data_in/year_month_override.csv',
    col_types = col_types 
) %>%
    calcCumulative

# policy type information
policy_type_odds_tbl <- read_csv( 
    'data_in/policy_type.csv',
    col_types = col_types
) %>% 
    calcCumulative

policy_type_2_odds_tbl <- read_csv( 
    'data_in/policy_type_level_2.csv',
    col_types = col_types
) %>% 
    calcCumulative

policy_type_3_odds_tbl <- read_csv( 
    'data_in/policy_type_level_3.csv',
    col_types = col_types
) %>% 
    calcCumulative
```

### Policies table

Policy table starting with the dates of the policy start.
 
```{r}  
# initialise the table with years
first_tbl <- 10000 %>% 
    level1Values( years_odds_tbl, . ) %>% 
    # convert to a better data object
    data.frame( year = . ) %>% 
    as.tibble %>% 
    # set the months
    group_by( year ) %>% 
    mutate(
        month = level1Values( months_odds_tbl, n() ),
        month = level1AdjValues(
            month,
            year,
            year_month_adj_odds_tbl,
            n()
        ),
        policy_type = level1Values( 
            policy_type_odds_tbl, 
            n() 
        )
    ) %>% 
    # assign a day for the policy to start
    group_by( month ) %>% 
    mutate(
        day = sample( 
            seq( 
                1, 
                months_odds_tbl %>% 
                    filter( out_value == month[1] ) %>% 
                    select( max_days ) %>% 
                    .[[1]] 
            ), 
            n(), 
            replace = TRUE 
        )
    ) %>% 
    # put the full policy date as a full string
    ungroup %>% 
    mutate(
        policy_date = as.Date( 
                paste0( year, '/', month, '/', day ), 
                "%Y/%m/%d" 
            )
    ) %>% 
    # second level policy type
    group_by( policy_type ) %>% 
    mutate(
        policy_type_level_2 = level2Values(
            policy_type_2_odds_tbl,
            policy_type,
            n()
        )
    ) %>% 
    # Add in third level policy type and policy reference
    group_by( policy_type, policy_type_level_2 ) %>% 
    mutate(
        policy_type_level_3 = level3Values(
            policy_type_3_odds_tbl,
            policy_type,
            policy_type_level_2,
            n(),
            default = "0"
        ),
        policy_code = plyr::mapvalues( 
            policy_type_level_3,
            from = c( policy_type_3_odds_tbl$out_value, "0" ),
            to = c( policy_type_3_odds_tbl$code, "ZZ" )
        ),
        policy_ref = paste0( 
            policy_code, 
            str_pad( row_number( year ), 6, "left", "0" ),
            sample( LETTERS, 1 ),
            sample( 1:10, 1 )
        )
    ) %>% 
    # Add in a premium amount and an incident count
    ungroup %>% 
    mutate(
        premium = numericValues(
            n(),
            type = "unif",
            min = 5,
            max = 15,
            multiplier = 20,
            round = 2
        ),
        incidents = numericValues(
            n(),
            type = "chisq",
            df = 4,
            multiplier = 1/5
        )
    ) %>% 
    # Remove intermediate columns
    select( -policy_code )

glimpse( first_tbl )
```

Create the incidents table.

```{r incidents-tbl}
# Based off the incident counts create a base incidents table
second_tbl <- first_tbl[ 
    rep( rownames( first_tbl ) %>% as.integer, first_tbl$incidents ), 
    
] %>% 
    # Remove intermediate columns
    select( -incidents )

glimpse( second_tbl )
```