---
copyright:
  years: 2019
lastupdated: "2019-10-14"
keywords: r, tutorial, cloudyr, data science
subcollection: cloud-object-storage
---
{{site.data.keyword.attribute-definition-list}}
# Data science with R and cloudyr
{: #cloudyr-data-science}
When you use the R programming language{: external} for your projects, you can get the most out of the data science features of {{site.data.keyword.cos_full}} by using cloudyr{: external}.
{: shortdesc}
This tutorial shows you how to integrate data from the {{site.data.keyword.cloud}} Platform within your R project. Your project uses {{site.data.keyword.cos_full_notm}} for storage with S3-compatible connectivity.
## Before you begin
{: #cloudyr-prereqs}
Make sure that you have the following prerequisites before continuing:

- An {{site.data.keyword.cloud_notm}} Platform account
- An instance of {{site.data.keyword.cos_full_notm}}
- R installed and configured
- S3-compatible authentication configuration
## Create HMAC credentials
{: #cloudyr-hmac}
Before you begin, you might need to create a set of HMAC credentials as part of a Service Credential by supplying the configuration parameter `{"HMAC":true}` when you create the credentials. For example, use the {{site.data.keyword.cos_full_notm}} CLI as shown here.
```sh
ibmcloud resource service-key-create <key-name-without-spaces> Writer --instance-name "<instance name--use quotes if your instance name has spaces>" --parameters '{"HMAC":true}'
```
{: pre}
To store the results of the generated key, append the text `> cos_credentials` to the end of the command in the example. For the purposes of this tutorial, you need to find the `cos_hmac_keys` heading with the child keys `access_key_id` and `secret_access_key`.
```yaml
cos_hmac_keys:
  access_key_id: 7xxxxxxxxxxxxxxa6440da12685eee02
  secret_access_key: 8xxxx8ed850cddbece407xxxxxxxxxxxxxx43r2d2586
```
{: screen}
While it is a best practice to set credentials in environment variables, you can also set your credentials inside your local copy of the R script itself. Alternatively, you can set the environment variables before you start R by using an `Renviron.site` or `.Renviron` file, which R reads during startup to set environment variables.
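For example, a `.Renviron` file in your home or project directory might contain entries like the following. The values shown are the sample keys from earlier in this tutorial, and the endpoint is an illustrative public regional endpoint that you replace with the endpoint for your own instance.

```
AWS_ACCESS_KEY_ID=7xxxxxxxxxxxxxxa6440da12685eee02
AWS_SECRET_ACCESS_KEY=8xxxx8ed850cddbece407xxxxxxxxxxxxxx43r2d2586
AWS_S3_ENDPOINT=s3.us-south.cloud-object-storage.appdomain.cloud
AWS_DEFAULT_REGION=
```
{: codeblock}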
You need to set the actual values for the `access_key_id` and `secret_access_key` in your code, along with the {{site.data.keyword.cos_full_notm}} endpoint for your instance.
{: note}
## Configure credentials
{: #cloudyr-credentials}
Installing the R language and its suite of applications is beyond the scope of this tutorial, so it is assumed that you already installed them. Before you add any libraries or code to your project, ensure that you have credentials available to connect to {{site.data.keyword.cos_full_notm}}. You need the appropriate region and endpoint for your bucket.
```r
Sys.setenv("AWS_ACCESS_KEY_ID" = "access_key_id",
           "AWS_SECRET_ACCESS_KEY" = "secret_access_key",
           "AWS_S3_ENDPOINT" = "myendpoint",
           "AWS_DEFAULT_REGION" = "")
```
{: codeblock}
## Use an S3-compatible library
{: #cloudyr-s3-library}
Use a cloudyr S3-compatible client{: external} to test your credentials by listing your buckets. To get additional packages, use CRAN{: external}, the source code collective that operates through a series of mirrors{: external}. This example uses aws.s3{: external}, added to the code that sets or accesses your credentials.
```r
library("aws.s3")
bucketlist()
```
{: codeblock}
## Use library methods
{: #cloudyr-s3-library-methods}
You can learn a lot from working with sample packages. For example, the package for Cosmic Microwave Background Data Analysis{: external} presents a conundrum: the executables of the project are small enough to compile and run on a personal machine, but working with the source data locally is constrained by the size of the data.
When you use version `0.3.21` of the package, you must add `region=""` in a request to connect to COS.
{: tip}
In addition to PUT, HEAD, and other compatible API commands, you can GET objects with the S3-compatible client that you included earlier, as shown in the following example.
```r
# return object using 'S3 URI' syntax, with progress bar
get_object("s3://mybucketname-only/example.csv", show_progress = TRUE)
```
{: codeblock}
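As a sketch of the other verbs, the aws.s3 package also provides `put_object()` for uploads and `head_object()` for metadata checks. The bucket and file names here are placeholders, not values from your account.

```r
library("aws.s3")

# Upload a local file to a bucket (PUT); names are placeholders
put_object(file = "example.csv",
           object = "example.csv",
           bucket = "mybucketname-only")

# Check that the object exists and fetch its metadata (HEAD)
head_object("example.csv", bucket = "mybucketname-only")
```
{: codeblock}

As with `get_object()`, these calls read your credentials and endpoint from the environment variables that you set earlier.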
## Add data
{: #cloudyr-add-data}
As you might guess, the library discussed earlier has a `save_object()` method that can save an object from your bucket directly to a local file. While there are many ways to load data{: external}, you can use cloudSimplifieR{: external} to work with an open data set{: external}.
```r
library(cloudSimplifieR)
d <- as.data.frame(csvToDataframe("s3://mybucket/example.csv"))
plot(d)
```
{: codeblock}
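A minimal alternative sketch with aws.s3 alone: download the object with `save_object()`, then read and plot it with base R. The bucket and file names are placeholders.

```r
library("aws.s3")

# Download the object from the bucket to a local file (names are placeholders)
save_object("example.csv", bucket = "mybucket", file = "example.csv")

# Read the downloaded CSV with base R and plot it
d <- read.csv("example.csv")
plot(d)
```
{: codeblock}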
## Next steps
{: #cloudyr-next-steps}
In addition to creating your own projects, you can also use RStudio to analyze data{: external}.