For a deployment architecture covering ML batch-scoring scenarios with R code, the core components of a deployable architecture are:
- Azure Data Factory
- Azure Data Lake Storage
- Azure Databricks
Step 0: This is a nice community post to read if SparkR is new to you:
Important quote:
- “The SparkR API presents a full R interface, supplemented with the {SparkR} package. As an experienced R user, you will be familiar with the R data.frame object. Here's the critical point - SparkR has its own DataFrame object, which is not the same thing as an R data.frame. You can convert between them easily (sometimes too easily), but you must respect which is which.”
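The distinction in the quote can be illustrated with a short sketch. This assumes a running Spark session (one already exists in a Databricks notebook); `faithful` is a built-in R dataset used only for illustration:

```r
library(SparkR)

# On Databricks a Spark session already exists; elsewhere, start one:
# sparkR.session()

# A plain R data.frame lives in local driver memory
local_df <- faithful
class(local_df)        # "data.frame"

# createDataFrame() converts it to a distributed SparkR DataFrame
spark_df <- createDataFrame(local_df)
class(spark_df)        # "SparkDataFrame"

# SparkR verbs operate on the distributed object
filtered <- filter(spark_df, spark_df$waiting > 70)

# collect() brings a result back as a plain R data.frame --
# safe only when the result fits in driver memory
back_local <- collect(filtered)
class(back_local)      # "data.frame"
```

The practical rule: keep large data in the `SparkDataFrame` form as long as possible, and only `collect()` results small enough for the driver.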
Step 1: Create an Azure Databricks Workspace
Step 2: Create an ADLS (Azure Data Lake Storage)
- Note: Create the ADLS account in the same region where you provisioned Azure Databricks
Step 3: Create a cluster inside Databricks
Step 4: Execute and understand this sample code (SparkR + ADLS.r). Tasks performed in this sample:
- ADLS (Azure Data Lake Storage Gen1) Mount for usage with R and SparkR
- Usage of Databricks dbutils library
- R and SparkR read/write tasks
- DataFrame/data.frame mapping between R and SparkR
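A condensed sketch of the tasks listed above. All account names, paths, and secret scopes are placeholders, not values from the sample, and this assumes `dbutils` is exposed to R on your Databricks runtime (if not, run the mount from a Python or Scala cell):

```r
library(SparkR)

# Mount ADLS Gen1 via dbutils using service-principal (OAuth2) credentials.
# Every <...> value below is an illustrative placeholder.
dbutils.fs.mount(
  source = "adl://<your-adls-account>.azuredatalakestore.net/",
  mountPoint = "/mnt/adls",
  extraConfigs = list(
    "dfs.adls.oauth2.access.token.provider.type" = "ClientCredential",
    "dfs.adls.oauth2.client.id" = "<application-id>",
    "dfs.adls.oauth2.credential" = dbutils.secrets.get(scope = "<scope>", key = "<key>"),
    "dfs.adls.oauth2.refresh.url" = "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
  )
)

# SparkR read/write against the mount point
sdf <- read.df("/mnt/adls/input.csv", source = "csv",
               header = "true", inferSchema = "true")
write.df(sdf, path = "/mnt/adls/output_parquet",
         source = "parquet", mode = "overwrite")

# DataFrame <-> data.frame mapping
rdf  <- collect(sdf)            # SparkDataFrame -> R data.frame (driver memory)
sdf2 <- createDataFrame(rdf)    # R data.frame -> SparkDataFrame (distributed)
```

Storing the service-principal secret in a Databricks secret scope (rather than in the notebook) keeps credentials out of source control.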
Here are some additional resources for understanding the orchestration of R model execution:
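For orientation before diving into those resources: a common orchestration pattern is an Azure Data Factory pipeline that invokes the scoring notebook through a Databricks Notebook activity. A minimal activity sketch, where the activity name, linked service, notebook path, and parameters are all placeholders:

```json
{
  "name": "RunRScoringNotebook",
  "type": "DatabricksNotebook",
  "linkedServiceName": {
    "referenceName": "AzureDatabricksLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "notebookPath": "/Shared/score_batch",
    "baseParameters": { "input_path": "/mnt/adls/input.csv" }
  }
}
```

ADF triggers (schedule or tumbling window) can then run this pipeline on the batch cadence you need.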