RLink User Guide

This guide will show you how to use RLink for communication between the Wolfram Language and R.

Loading, Installing, and Uninstalling RLink

RLink Resources (RLinkRuntime)

The RLink application consists of two parts: RLink proper and the RLinkRuntime paclet. The former contains the Wolfram Language code implementing the RLink API and the Java code interfacing RLink with R, and comes with the Wolfram Language. The latter contains essentially the R distribution, with a few additional libraries, and is packaged as a downloadable paclet.

If you want to use the R distribution bundled with RLink (as opposed to your own R installation), the RLinkRuntime paclet must be downloaded and installed on your machine before you can use RLink. By default, the InstallR function (described in the next section), which starts the R runtime, does this for you automatically if it finds no paclet installed.

Sometimes, however, it may be more convenient to download and install the paclet manually. You can do this with the RLinkResourcesInstall function.

First, you have to load RLink.

Then, you call the following.
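The input cells for these two steps are not reproduced in this text. A minimal sketch of what they might look like, assuming the package context is RLink` and that RLinkResourcesInstall is called with no arguments:

```wolfram
(* Load the RLink package (context name RLink` is an assumption) *)
Needs["RLink`"]

(* Download and install the RLinkRuntime paclet manually;
   a result of $Failed signals that the installation did not succeed *)
RLinkResourcesInstall[]
```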

If the result is $Failed, the installation failed. This may happen for various reasons, such as if you are not connected to the internet or have disabled the Wolfram System's internet access. This has to be corrected before RLink can function properly (unless you use your own R distribution, in which case you do not need RLinkResourcesInstall at all; this is described in the next section).

Using R Distribution Bundled with RLink

Before you can start working with RLink, it must be loaded.

While the package has been loaded, the R runtime has not been started yet. To do that, you need to call the InstallR function.
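As a sketch, loading the package and starting the bundled R runtime with default settings might look like this (the context name RLink` is an assumption):

```wolfram
Needs["RLink`"]   (* load the package; this alone does not start R *)
InstallR[]        (* start the R runtime with the default settings *)
```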

When you use InstallR with the default R distribution (which comes with RLink), InstallR checks for the RLinkRuntime paclet (containing that R distribution). If it does not find the paclet installed, it internally calls the RLinkResourcesInstall function, which attempts to download the paclet from the Wolfram paclet server and install it. This is a one-time operation, but you will need to have internet connectivity enabled for it to succeed. Once the paclet has been successfully downloaded and installed, it will be used by RLink for all your subsequent RLink sessions automatically (unless you give an explicit option to InstallR to point it to a different R distribution, in which case that one will be used).

If you need to start R with some specific command-line options, you can specify them with the option "RCommandLine", as a string.
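For illustration only, such a call might look as follows. The flag shown is a standard R command-line option chosen as an example. It is not taken from this guide, and which flags RLink accepts is not verified here.

```wolfram
(* "--no-save" is an illustrative R command-line flag, passed as a single string *)
InstallR["RCommandLine" -> "--no-save"]
```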

To stop and uninstall the R runtime, you can use the UninstallR command. This will uninstall R.

Now, you can no longer use RLink functions that communicate with R, until you call InstallR again. For example, this attempts to send some data to R.

This installs R again, using InstallR.

After that, you can communicate with R normally.
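The uninstall and reinstall cycle described above can be sketched as follows. The variable name x and the data are hypothetical, and RSet and REvaluate are the functions introduced later in this guide.

```wolfram
UninstallR[]              (* stop and uninstall the R runtime *)
RSet["x", {1, 2, 3}]      (* expected to fail: the R runtime is not running *)

InstallR[]                (* start the R runtime again *)
RSet["x", {1, 2, 3}]      (* the same call can now reach R *)
REvaluate["x"]            (* read the variable back from the R workspace *)
```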

Chances are that you will rarely need to use UninstallR. One situation where it is useful is when you need to relaunch the R runtime with a different set of command-line options. Another such situation is when you have several different R distributions installed on the same machine and would like to point RLink to a different one than currently used.

Using Your Own R Distribution (Windows only)

On Windows, it is currently possible for you to use your own R distribution with RLink, which may have a number of advantages in certain circumstances. For example, you may already have a customized R distribution that you would like to continue using for everything, including work with the Wolfram Language and RLink. Or you may want to install extra packages.

Specify the Location of the R Distribution When Installing RLink

You can specify the location by using the "RHomeLocation" option to InstallR, calling it as follows.

The value of the option should correspond to the R_HOME variable you usually set for your R distribution, and point at the root of it.
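A sketch of such a call is shown below. The path is a placeholder, not a real location. Substitute the root directory of your own R installation.

```wolfram
(* Hypothetical path: replace with the root (R_HOME) of your own R distribution *)
InstallR["RHomeLocation" -> "C:\\Program Files\\R\\R-x.y.z"]
```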

Sending Data to R

To send data to R, you have to use the RSet function. Your data will have to be expressed in a form that RLink can understand. For most common data types, such as (multidimensional) arrays, you can use the usual Wolfram Language nested list representation of them. More details on this can be found in "R Data Types in RLink".

First, load RLink:

This will send an integer to R and assign a variable in the R workspace to its value.

You can test the assignment with the help of REvaluate.

Since scalars are interpreted as one-element vectors, the result is a list. More details on this can also be found in "R Data Types in RLink".
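The assignment and the check described above might be written like this (the R variable name myint is chosen for illustration):

```wolfram
RSet["myint", 1]      (* assign the integer 1 to the R variable myint *)
REvaluate["myint"]    (* per the text, the scalar comes back as a one-element list *)
```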

The following will send a nested ragged list to R, which will be interpreted as an R list.

You do not have to indicate the type of data you are sending to R, in the majority of cases. The data type is determined for you automatically by RLink, based on the form of your data. The procedure for the automatic type detection is described in more detail in the reference page for ToRForm, and also in "R Data Types in RLink".

You can also use RSet on expressions more general than variables. In particular, you can make part assignments to elements of lists and arrays.

You can test whether or not the assignment took place on the R side.

In general, the only requirement is that the R expression represented by the string passed as a first argument to RSet can be assigned a value (is an L-value in R).
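A sketch of sending a ragged nested list and then making a part assignment. The variable name mylist and the data are hypothetical.

```wolfram
(* A ragged nested list, which RLink is described as sending as an R list *)
RSet["mylist", {1, {2, 3}, {4, {5, 6}}}]

(* The first argument is any assignable R expression, here a list element *)
RSet["mylist[[1]]", 10]

(* Check on the R side that the element was replaced *)
REvaluate["mylist[[1]]"]
```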

Executing R Code and Getting Data from R

To execute any string of valid R code and get the results back to the Wolfram Language, you can use the REvaluate function. You have seen some examples of its use already.

First, load RLink.

Here, for example, this will be used to generate a vector of random integers.

In this case, the result was transferred back to the Wolfram Language, but not saved anywhere in the R workspace. If you wish to also save the result, you can assign it to some variable in the R workspace.

You can test now that the values were saved in the R variable.
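These steps might look as follows. The R variable name rint and the sample sizes are illustrative choices.

```wolfram
(* Generate random integers in R and return them without saving *)
REvaluate["sample(1:20, 10, replace = TRUE)"]

(* Generate again, this time also saving the result in the R variable rint *)
REvaluate["rint <- sample(1:20, 10, replace = TRUE)"]

(* Confirm that the values are stored in the R workspace *)
REvaluate["rint"]
```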

Here is a slightly more involved computation: compute the frequency of each value with table and return the result as an R data frame.

Sometimes you may also wish to suppress the output on the R side, that is, suppress the data transfer from R to the Wolfram Language. To do this, you have to put a semicolon at the end of your code.

You can now check that the assignment took place.
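A sketch of the semicolon form, using the illustrative variable name rint:

```wolfram
(* Trailing semicolon inside the R code string: the assignment runs in R,
   but the result is not transferred back *)
REvaluate["rint <- sample(1:20, 10, replace = TRUE);"]

(* A separate call retrieves the stored value *)
REvaluate["rint"]
```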

One reason why this may be needed is when the result is of the type that cannot be returned to the Wolfram Language via RLink, but can be worked with further in R.

You can execute multiline chunks of R code with REvaluate, but in this case, you have to enclose the code in curly braces.

Here also, the semicolon at the end of the code string would suppress the data transfer from R to the Wolfram Language, while the code will still be executed.
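For example, a multiline chunk might be written as below. The R variable names are hypothetical. In R, the value of a braced block is the value of its last expression.

```wolfram
(* Multiline R code enclosed in curly braces *)
REvaluate["{
    x <- 1:10
    y <- x^2
    sum(y)
}"]

(* The same chunk with a trailing semicolon: executed, but nothing is transferred back *)
REvaluate["{
    x <- 1:10
    y <- x^2
    sum(y)
};"]
```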

Many more examples can be found on the documentation page for REvaluate.

Defining Your Own R Functions

While the topic of this section is logically connected to the discussion in the previous section, it is important enough to have a separate discussion. If you go slightly beyond using the functions already available in R or its various extensions, one of the main things you may want to do is to define your own R functions, from the Wolfram Language.

This is perfectly possible with RLink. The details of how it is done are discussed in "Functions". Here only a few simple examples will be considered. Generally, functions in RLink are represented by opaque references, which point to functions defined in the R workspace. Such references can be stored in variables or used directly, to call R functions on Wolfram Language expressions as arguments and get the result back to the Wolfram Language.

First, load RLink.

Here is a simple example.

A function reference was just generated, and stored in a variable sq. But also, an assignment in the R workspace was performed. There are now a number of ways you can call this function. First, you can call it directly in R.

This, however, does not allow you to pass arguments from the Wolfram Language. This is when the function reference is handy.

It is not necessary to store the reference in a variable. You could use it in the same piece of code where it is generated.
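The calls discussed so far in this section can be sketched as follows. The function body and the name sq are illustrative.

```wolfram
(* Define a function in the R workspace; the returned reference is stored in sq *)
sq = REvaluate["sq <- function(x){x^2}"]

(* Call it directly in R; the arguments are written inside the R code string *)
REvaluate["sq(1:5)"]

(* Call it through the reference, passing Wolfram Language data as the argument *)
sq[Range[5]]

(* Use a reference in the same expression that generates it *)
REvaluate["function(x){x^2}"][Range[5]]
```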

However, constructing function references through REvaluate and using them in such a manner is often not the best option, in particular because a new copy of a function reference is generated at every call, and also because such references have a lifetime of only the current RLink session (this is explained in detail in "Functions"). There is a special device for creating "better" function references, which are cached and have an indefinite lifespan, by using RFunction.

Here is how you would use it for the preceding example.

You can use it as before.

Or, just as before, it can also be used directly without being stored in a variable.

Such a use does not produce a new copy of a function reference on each call, since references produced by RFunction are cached.
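A sketch of the RFunction variant of the squaring example, assuming RFunction takes the R function definition as a string (the variable name sqr is hypothetical):

```wolfram
(* Create a cached function reference *)
sqr = RFunction["function(x){x^2}"]

(* Use it through the variable *)
sqr[Range[5]]

(* Or use it directly, without storing it *)
RFunction["function(x){x^2}"][Range[5]]
```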

You can use function references as you would other objects in RLink; in particular, you can send them to R and pass them as arguments to other (higher-order) functions, etc. For example, now a previous function reference will be assigned to a variable in R.

You can now use it on the R side.

You can pass function references as arguments to other functions. For example, you can define an analog of the Wolfram Language's Select function for R as follows.

You can now use it with some custom filtering function, which you can also define with RFunction.
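The following sketch puts these steps together. All names (rsq, rselect, isEven) and the R function bodies are illustrative and are not the ones from the original examples.

```wolfram
(* Send a function reference to R and call it there *)
sqr = RFunction["function(x){x^2}"];
RSet["rsq", sqr]
REvaluate["rsq(1:5)"]

(* An analog of Select: keep the elements of x for which pred gives TRUE *)
rselect = RFunction["function(x, pred){ x[sapply(x, pred)] }"];

(* A custom filtering function, also defined with RFunction *)
isEven = RFunction["function(n){ n %% 2 == 0 }"];

(* Pass one function reference as an argument to the other *)
rselect[Range[10], isEven]
```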

There are a number of more subtle points on using function references in RLink, discussed in "Functions".

Error Handling

RLink attempts to handle possible errors on several levels. First, on the Wolfram Language side, RLink attempts to catch invalid Wolfram Language input and issue the relevant error messages.

First, load RLink.

For example, if you try to transfer to R some general symbolic expression, you get an error message telling you that RLink does not know how to convert this input to a data type that it can transfer to R.
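An input of this kind might look as follows. The symbols a and b are assumed to have no values, and the variable name x is hypothetical.

```wolfram
(* A general symbolic expression has no corresponding R data type,
   so this is expected to produce an error message rather than an assignment *)
RSet["x", Sin[a] + b]
```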

To learn which inputs can and cannot be transferred to R, please see "R Data Types in RLink", which has a detailed discussion on this.

Some errors do not manifest themselves during the data transfer to R, but show up as R runtime errors. In such cases, RLink attempts to deliver the R error message generated in the R workspace to the Wolfram System.

For example, here is an attempt to reference an undefined variable.

Here, the array bounds are violated.

Here, a function is called with the wrong number of arguments.
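The three kinds of R runtime error just listed can be provoked with inputs like the following. The R snippets are illustrative and the exact messages reported are not reproduced here.

```wolfram
(* Reference to a variable that does not exist in the R workspace *)
REvaluate["undefinedVariable"]

(* Element 10 of a five-element vector: subscript out of bounds in R *)
REvaluate["(1:5)[[10]]"]

(* sqrt takes one argument, so passing two is an error in R *)
REvaluate["sqrt(1, 2)"]
```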

Parse errors are usually detected, but rather indirectly, and the error message can both be somewhat cryptic and involve some implementation details.

Performance Tuning

RLink is a rather high-level interface, built on top of J/Link, which itself is built on top of the Wolfram Symbolic Transfer Protocol (WSTP), and rJava/JRI, which is a Java interface to the R runtime (the latter used as a standalone set of dynamic native libraries). Also, RLink often uses flexible means of data transfer and execution, sometimes involving runtime R code generation and execution. This flexibility allows RLink to handle a rather large subset of possible R objects, and also things like part assignments to arrays and lists, in a uniform way. But the price to pay for this is an (often very considerable) overhead. In cases when this overhead is not acceptable, there are ways to optimize the data transfer between the Wolfram Language and R. Some of them are described in this section.

Vectors versus Lists

The main advice here is to avoid sending and returning R lists whenever you can and to send and return R vectors (arrays) instead. The reason is that using arrays is much more efficient at all stages of communication with R. Arrays are more efficiently transformed to the internal representation (since Wolfram Language packed arrays can often be utilized), more efficiently transferred to R, and more efficiently processed by R, and the result is again obtained by the Wolfram Language much faster.

First, load RLink.

Here, as an example, first transform a vector to RLink internal representation.

Now do the same for a vector interpreted as a list.

You can see that in the latter case, the time complexity is also linear, but with a much larger constant, about 40 times that of a vector. This is, more or less, the typical speed ratio between top-level iteration (when done right) and iteration using packed arrays.

Now look at the overhead of sending data to R. First, prepare internal forms of the vectors, so that you only count the time it takes to transfer the data.

Here are the timings to send these vectors to R.

The same now for lists.

The overhead is very noticeable (it was not even feasible to go to the same larger number of elements as for vectors). While future versions of RLink will likely have more efficient means of data transfer for lists, the current advice is to avoid sending back and forth lists of more than a few thousand entries. Since R lists are frequently used as an aggregate data structure, chances are that huge R lists will not often appear naturally.
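A related comparison can be sketched using only REvaluate. Note that this measures the return direction (R to the Wolfram Language) rather than the sending direction timed above, and the size n is an arbitrary choice. The timings are not predicted here.

```wolfram
n = 5000;   (* illustrative size, kept to a few thousand entries *)

(* Warm-up call, so that one-time startup costs do not distort the measurements *)
REvaluate["1"];

(* Time the return of an integer vector *)
vectorTime = First @ AbsoluteTiming[REvaluate["1:" <> ToString[n]];]

(* Time the return of the same data wrapped as an R list *)
listTime = First @ AbsoluteTiming[REvaluate["as.list(1:" <> ToString[n] <> ")"];]
```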

Function Calls

While the syntax for function calls is convenient, it results in an overhead, which can often be unacceptably large.

First, load RLink.

This illustrates the point.

The more you can use vectorized operations and push the work to be done in R as a whole (minimizing data exchange), the faster your code will be.

The second run is much faster (the first one was necessary since some Java class loading and other events were triggered by the first call on a fresh kernel, which makes the measurement based on the first call inaccurate).

There is a constant overhead of a function call, which dominated the running time for the previous examples. Here, the number of points will be increased 100 times, but the time it takes to compute the result is almost the same.
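The effect of the per-call overhead can be sketched by comparing many small calls with one vectorized call. The function and the sizes are illustrative, and no particular timings are claimed.

```wolfram
sqr = RFunction["function(x){x^2}"];
sqr[{1, 2, 3}];   (* warm-up call on a fresh kernel *)

(* 100 separate calls: the constant overhead is paid 100 times *)
perElement = First @ AbsoluteTiming[Map[sqr, Range[100]];]

(* One vectorized call on the same data: the overhead is paid once *)
vectorized = First @ AbsoluteTiming[sqr[Range[100]];]
```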

General

One general piece of advice is that, whatever you do, you should try to minimize both the amount of data being transferred in either direction and, sometimes even more importantly, the number of times that functions like REvaluate, RSet, and RFunction are called.

The worst possible scenario here is a lightweight function defined in R (or a piece of R code doing very little), called a large number of times from the Wolfram Language. In such a case, you can be almost certain that the total running time will be dominated by the time spent on data transfer and other inner workings of RLink, rather than time spent in R doing the actual computation.

The best scenario is when most of the hard, computationally intensive work is done in R, and data is transferred to R and back in an efficient manner (for example, using data structures containing mostly vectors). Lists are OK as long as they do not have a huge number of elements. Since lists are most frequently used as an aggregate data structure to hold together heterogeneous collections of vectors (and possibly other data structures), they do not usually have a huge number of elements unless used inappropriately in cases where vectors should be used. One exception that is quite problematic for RLink is when the result of computation in R is a large ragged array (for example, of integers), which can only be represented as an R list. In such cases, one way to speed up the transfer would be to pad such an array to a rectangular one, whenever this is possible.
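The padding idea can be sketched as follows. The sentinel value -1 is an assumption and works only if it cannot occur in the real data. The R variable names are hypothetical, and the sketch assumes the integer matrix comes back to the Wolfram Language as a list of rows.

```wolfram
(* On the R side, pad each element of a ragged list to a common width,
   producing a rectangular integer matrix *)
padded = REvaluate["{
    ragged <- list(1:3, 1:5, 1:2)
    width <- max(sapply(ragged, length))
    t(sapply(ragged, function(v) c(v, rep(-1L, width - length(v)))))
}"];

(* On the Wolfram Language side, strip the padding to recover the ragged structure *)
restored = DeleteCases[#, -1] & /@ padded
```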