RLink User Guide
This guide will show you how to use RLink for communication between the Wolfram Language and R.
Loading, Installing and Uninstalling RLink
Installing and Configuring R Distribution
RLink connects the Wolfram Language with one of the existing R distributions on your machine. Because RLink does not, by itself, carry out the R installation process, at least one R distribution should be pre-installed. If multiple R installations are present on the same machine, you will typically need to point RLink to the one you need to work with. Details on how to do that are provided below. Alternatively, RLink has a built-in R discovery system and will attempt to use the first R installation it finds—typically, that will be the default R installation on your system. Finally, RLink remembers the last successfully used R installation, and therefore will by default attempt to use the same installation again after the first use, unless instructed otherwise.
For RLink to be able to work with a given R installation, it must be configured. The configuration process essentially consists of installing the R package "rJava" and providing RLink with the path to that installed package (RLink needs this because it uses Java as an intermediate layer). In the majority of cases, RLink should be able to carry out this configuration process automatically; however, it is also possible, and in some cases necessary, to configure an R installation manually.
In either case, a successful R configuration process is subject to prerequisites in terms of build tools and dependencies that should be installed on your system. Information on these prerequisites can be found in the tutorial on the manual R configuration process, "Configuring an External R Installation to Work with RLink". On Windows and macOS these prerequisites are usually satisfied automatically on most systems, whereas on Linux they do matter.
Using the Default R Distribution with RLink
Before you can start working with RLink, it must be loaded:
Although the package has been loaded, the R runtime has not yet been started. To do that, you need to call the InstallR function:
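A minimal sketch of these two steps, assuming an R installation that RLink can discover on its own:

```wolfram
(* Load the RLink package into the current kernel session *)
Needs["RLink`"]

(* Start the R runtime; with no arguments, RLink chooses the R installation itself *)
InstallR[]
```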
The use of InstallR with no explicit arguments will prompt RLink to use the first R installation it finds (typically the default one on your system), unless some existing R installation has already been used before, in which case it will connect to that previously used one.
If you need to start R with some specific command-line options, you can specify them with the option "RCommandLine", as a string.
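As a hedged illustration, the call might look like the following. The flag "--vanilla" is a standard R command-line flag, but its use here is only an assumption chosen for the example; which flags are appropriate depends on your setup.

```wolfram
Needs["RLink`"]

(* "--vanilla" is an illustrative value only; substitute the options you actually need *)
InstallR["RCommandLine" -> "--vanilla"]
```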
To stop and uninstall the R runtime, you can use the UninstallR command. This will uninstall R:
Now you can no longer use RLink functions that communicate with R until you call InstallR again. For example, this attempts to send some data to R:
This installs R again, using InstallR:
After this, you can communicate with R normally:
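The stop-and-restart cycle described above can be sketched as follows. The variable name myint is illustrative, and the exact message printed by the failing call is not shown here.

```wolfram
Needs["RLink`"]
InstallR[]

UninstallR[]          (* stop and uninstall the R runtime *)
RSet["myint", 1]      (* expected to fail: no R runtime is running *)

InstallR[]            (* start the R runtime again *)
RSet["myint", 1]      (* the assignment now goes through *)
```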
Chances are that you will rarely need to use UninstallR. One situation where it is useful is when you need to relaunch the R runtime with a different set of command-line options. Another such situation is when you have several different R distributions installed on the same machine and would like to point RLink to a different one than the one currently used.
Using a Specific R Distribution
RLink supports connecting to an external R distribution for all platforms where RLink and R are supported. In simple cases, no additional configuration steps will be required to establish the connection. However, in general, a one-time R configuration process will be required for a given R installation. In most such cases, RLink will be able to carry out this process automatically. But in some cases (particularly on Linux), you will need to go through the manual steps, described in detail in "Configuring an External R Installation to Work with RLink".
The only required argument for an external R installation is the option "RHomeLocation", which tells RLink where to look for the R home directory. The configuration process is necessary if InstallR fails when passed this option only. Once this configuration process has been completed for the specified R installation, one should be able to connect to that installation by calling InstallR, with some extra options passed in some cases.
Specify the Location of the R Distribution When Installing RLink
In simple cases, you can specify the location by using the "RHomeLocation" option to InstallR, calling it as follows:
The value of the option should correspond to the R_HOME variable you usually set for your R distribution and point at the root of it.
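A minimal sketch of such a call follows. The path is hypothetical; replace it with the R home directory of the distribution you want to use.

```wolfram
Needs["RLink`"]

(* Hypothetical path: use the R_HOME directory of your own R distribution *)
InstallR["RHomeLocation" -> "/usr/lib/R"]
```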
Specify Additional Options
In some of the cases when the configuration process is required to set up external R distributions for use with RLink, you will need to pass additional information to RLink—specifically, the location of the native JRI library built in the mentioned setup process (the "JRINativeLibraryLocation" option) and the version number for this specific R installation (the "RVersion" option) to InstallR. This can be done as follows:
More details and examples for all supported platforms can be found in "Configuring an External R Installation to Work with RLink".
Note that in practice, RLink in most cases should be able to automatically determine the values of these parameters for a given R distribution, after the configuration process has been successfully performed, so the need to explicitly pass these options to InstallR should only emerge in special situations.
Installing R Packages
Much of the useful functionality offered by the R language and ecosystem resides in R packages, which are not part of the standard R installation and must be installed separately. The R language offers a number of tools for package installation, the most widely known one being R's standard install.packages() function. RLink offers a specialized function RInstallPackage, which wraps the functionality of install.packages() and exposes a Wolfram Language interface to it.
Here is a simple example of how one can use it. First, load RLink:
Next, install the package of interest. For the sake of this example, it will be the R package named prettyunits, which offers various pretty-printing functionality for common R data types:
You have to load the package into the R session before you can use it:
Here are a few examples of use:
To uninstall the package, use RInstallPackage with the option "InstallationMethod" -> "Remove":
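The whole workflow of this section might be sketched as follows, assuming a running R session and network access to a package repository. The functions pretty_bytes and pretty_ms belong to the prettyunits package; the arguments passed to them are arbitrary.

```wolfram
Needs["RLink`"]
InstallR[]

(* Install the package into the R installation used by the running session *)
RInstallPackage["prettyunits"]

(* Load the package into the R session *)
REvaluate["library(prettyunits)"]

(* Use it: human-readable byte sizes and time intervals, returned as strings *)
REvaluate["pretty_bytes(1337)"]
REvaluate["pretty_ms(123456)"]

(* Remove the package again *)
RInstallPackage["prettyunits", "InstallationMethod" -> "Remove"]
```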
Note that a running RLink session is not required for RInstallPackage to work, but having a running session allows one to use RInstallPackage with a single argument (the name of the installed package) and omit the second, optional argument (the path to the R home directory, as it will be inferred from the running R session).
For more details on installing and uninstalling R packages, please consult the RInstallPackage documentation.
Sending Data to R
To send data to R, you have to use the RSet function. Your data will have to be expressed in a form that RLink can understand. For most common data types, such as (multidimensional) arrays, you can use the usual Wolfram Language nested list representation of them. More details on this can be found in "R Data Types in RLink".
This will send an integer to R and assign a variable in the R workspace to its value.
You can test the assignment with the help of REvaluate.
Since scalars are interpreted as one-element vectors, the result is a list. More details on this can also be found in "R Data Types in RLink".
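A minimal sketch of this round trip, with an illustrative variable name:

```wolfram
Needs["RLink`"]
InstallR[]

RSet["myint", 42]      (* assign the R variable myint *)
REvaluate["myint"]     (* comes back as a one-element list, since R scalars are length-1 vectors *)
```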
The following will send a nested ragged list to R, which will be interpreted as an R list.
You do not have to indicate the type of data you are sending to R, in the majority of cases. The data type is determined for you automatically by RLink, based on the form of your data. The procedure for the automatic type detection is described in more detail in the reference page for ToRForm, and also in "R Data Types in RLink".
You can also use RSet on expressions more general than variables. In particular, you can make part assignments to elements of lists and arrays.
You can test whether or not the assignment took place on the R side.
In general, the only requirement is that the R expression represented by the string passed as a first argument to RSet can be assigned a value (is an L-value in R).
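As a sketch of the last few steps, the following sends a ragged list and then assigns to one of its parts. The name mylist and the data are illustrative; the first argument of the second RSet call is an R expression that can appear on the left side of an R assignment.

```wolfram
Needs["RLink`"]
InstallR[]

(* A ragged (non-rectangular) nested list is sent as an R list *)
RSet["mylist", {1, {2, 3}, {4, 5, 6}}]
REvaluate["mylist"]

(* Part assignment: replace the first element of the R list *)
RSet["mylist[[1]]", 10]
REvaluate["mylist[[1]]"]
```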
Executing R Code and Getting Data from R
To execute any string of valid R code and get the results back to the Wolfram Language, you can use the REvaluate function. You have seen some examples of its use already.
Here, for example, this will be used to generate a vector of random integers.
In this case, the result was transferred back to the Wolfram Language, but not saved anywhere in the R workspace. If you wish to also save the result, you can assign it to some variable in the R workspace.
You can test now that the values were saved in the R variable.
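These steps might look as follows. The variable name rints and the sample size are illustrative, and the output is random, so none is shown.

```wolfram
Needs["RLink`"]
InstallR[]

(* Generate random integers in R and return them without storing them *)
REvaluate["sample(1:10, 5, replace = TRUE)"]

(* The same, but also keep the result in the R variable rints *)
REvaluate["rints <- sample(1:10, 5, replace = TRUE)"]

(* Confirm that the values were saved on the R side *)
REvaluate["rints"]
```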
Here is a slightly more involved computation: count how often each value occurs with table and return the result as an R data frame.
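One way such a computation could be written is sketched below. The variable name vals is illustrative, and the code is wrapped in curly braces because it spans several lines. The result carries R attributes, so it is returned in RLink's representation of an R object rather than as a plain nested list.

```wolfram
Needs["RLink`"]
InstallR[]

REvaluate["{
  vals <- sample(1:5, 20, replace = TRUE)
  as.data.frame(table(vals))
}"]
```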
Sometimes you may wish to also suppress the output on the R side—that is, suppress the data transfer from R to the Wolfram Language. To do this, you have to put a semicolon at the end of your code.
You can now check that the assignment took place.
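A short sketch of the suppressed form, with an illustrative variable name:

```wolfram
Needs["RLink`"]
InstallR[]

(* Trailing semicolon: the code runs in R, but no data is transferred back *)
REvaluate["squares <- (1:10)^2;"]

(* The assignment still took place *)
REvaluate["squares"]
```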
One reason why this may be needed is when the result is of the type that cannot be returned to the Wolfram Language via RLink, but can be worked with further in R.
You can execute multiline chunks of R code with REvaluate, but in this case, you have to enclose the code in curly braces.
Here also, the semicolon at the end of the code string would suppress the data transfer from R to the Wolfram Language, while the code will still be executed.
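A minimal multiline sketch, with illustrative variable names. The value of the last expression inside the braces is what gets returned.

```wolfram
Needs["RLink`"]
InstallR[]

REvaluate["{
  x <- 1:10
  y <- x^2
  sum(y)
}"]
```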
Many more examples can be found on the documentation page for REvaluate.
Defining Your Own R Functions
While the topic of this section is logically connected to the previous section, it is important enough to merit a separate discussion. As soon as you go slightly beyond using the functions already available in R or its various extensions, one of the main things you may want to do is to define your own R functions from the Wolfram Language.
This is perfectly possible with RLink. The details of how it is done are discussed in "Functions". Here only a few simple examples will be considered. Generally, functions in RLink are represented by opaque references, which point to functions defined in the R workspace. Such references can be stored in variables or used directly, to call R functions on Wolfram Language expressions as arguments and get the result back to the Wolfram Language.
A function reference was just generated, and stored in a variable sq. But also, an assignment in the R workspace was performed. There are now a number of ways you can call this function. First, you can call it directly in R.
This, however, does not allow you to pass arguments from the Wolfram Language. This is when the function reference is handy.
It is not necessary to store the reference in a variable. You could use it in the same piece of code where it is generated.
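The steps just described can be sketched as follows, using a squaring function as an illustrative example:

```wolfram
Needs["RLink`"]
InstallR[]

(* Define the function in the R workspace and keep the returned reference in sq *)
sq = REvaluate["sq <- function(x) {x^2}"]

(* Call it directly in R; the arguments have to be written as R code *)
REvaluate["sq(1:5)"]

(* Call it through the reference, passing Wolfram Language data *)
sq[Range[5]]

(* Use a reference right where it is generated, without storing it *)
REvaluate["function(x) {x^2}"][Range[5]]
```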
However, constructing function references through REvaluate and using them in this manner is often not the best option, in particular because a new copy of a function reference is generated at every call, and also because such references only live as long as the current RLink session (this is explained in detail in "Functions"). There is a special device for creating "better" function references, which are cached and have an indefinite lifespan: RFunction.
Here is how you would use it for the preceding example.
Or, just as before, it can also be used directly without being stored in a variable.
Such a use does not produce a new copy of a function reference on each call, since references produced by RFunction are cached.
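A sketch of the same squaring example written with RFunction (the name sqf is illustrative):

```wolfram
Needs["RLink`"]
InstallR[]

(* Create a cached function reference *)
sqf = RFunction["function(x) {x^2}"];
sqf[Range[5]]

(* Direct use without storing the reference in a variable *)
RFunction["function(x) {x^2}"][Range[5]]
```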
You can use function references as you would other objects in RLink; in particular, you can send them to R and pass them as arguments to other (higher-order) functions, etc. For example, now a previous function reference will be assigned to a variable in R.
You can pass function references as arguments to other functions. For example, you can define an analog of the Wolfram Language's Select function for R as follows.
You can now use it with some custom filtering function, which you can also define with RFunction.
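The following sketch illustrates both ideas. All names (sqf, sqr, rselect, isEven) are illustrative, and the filtering function is assumed to be vectorized, that is, to return a logical vector for a vector argument.

```wolfram
Needs["RLink`"]
InstallR[]

(* Assign a function reference to a variable in the R workspace *)
sqf = RFunction["function(x) {x^2}"];
RSet["sqr", sqf]
REvaluate["sqr(1:5)"]

(* An analog of Select: keep the elements of x for which pred gives TRUE *)
rselect = RFunction["function(x, pred) {x[pred(x)]}"];

(* A custom filtering function, also defined with RFunction *)
isEven = RFunction["function(x) {x %% 2 == 0}"];

rselect[Range[10], isEven]
```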
There are a number of more subtle points on using function references in RLink, discussed in "Functions".
Error Handling
RLink attempts to handle possible errors on several levels. First, on the Wolfram Language side, RLink attempts to catch the invalid Wolfram Language input and issue the relevant error messages.
For example, if you try to transfer to R some general symbolic expression, you get an error message telling you that RLink does not know how to convert this input to a data type that it can transfer to R.
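A sketch of such a call follows. The variable name is illustrative, and the exact wording of the message is not reproduced here.

```wolfram
Needs["RLink`"]
InstallR[]

(* Sin[a] is a general symbolic expression with no R counterpart, so the transfer is expected to fail with a message *)
RSet["myvar", Sin[a]]
```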
To learn which inputs can and cannot be transferred to R, please see "R Data Types in RLink", which has a detailed discussion on this.
Some errors do not manifest themselves during the data transfer to R, but show up as R runtime errors. In such cases, RLink attempts to deliver the R error message generated in the R workspace to the Wolfram System.
For example, here is an attempt to reference an undefined variable.
Here, the array bounds are violated.
Here, a function is called with the wrong number of arguments.
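The three kinds of runtime errors mentioned above could be provoked as in this sketch. The names are illustrative; each call is expected to fail and report the corresponding R error rather than return a value.

```wolfram
Needs["RLink`"]
InstallR[]

(* Reference to a variable that was never defined in R *)
REvaluate["someUndefinedVariable"]

(* Out-of-bounds access: [[5]] on a vector of length 3 is an error in R *)
REvaluate["{v <- 1:3; v[[5]]}"]

(* A one-argument function called with two arguments *)
REvaluate["(function(x) {x^2})(1, 2)"]
```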
Parse errors are usually detected, but rather indirectly, and the error message can both be somewhat cryptic and involve some implementation details.
Performance Tuning
RLink is a rather high-level interface built on top of JLink, which itself is built on top of the Wolfram Symbolic Transfer Protocol (WSTP), and rJava/JRI, a Java interface to the R runtime (the latter used as a standalone set of dynamic native libraries). RLink also often uses flexible means of data transfer and execution, sometimes involving runtime R code generation and execution. This flexibility allows RLink to handle a rather large subset of possible R objects, as well as operations such as part assignments to arrays and lists, in a uniform way. The price to pay for this is an overhead that is often very considerable. In cases when this overhead is not acceptable, there are ways to optimize the data transfer between the Wolfram Language and R. Some of them are described in this section.
Vectors versus Lists
The main advice here is to avoid sending and returning R lists whenever you can, and to send and return R vectors (arrays) instead. Arrays are much more efficient at every stage of communication with R: an array is converted to the internal representation more efficiently (since Wolfram Language packed arrays can often be utilized), transferred to R more efficiently, and processed by R more efficiently, and the result gets back to the Wolfram Language much faster.
Here, as an example, first transform a vector to RLink internal representation.
Now do the same for a vector interpreted as a list.
You can see that in the latter case, the time complexity is also linear, but with a much larger constant, which is about 40 times that of a vector (which is, more or less, the typical speed ratio between top-level iteration—when done right—and the one using packed arrays).
Now look at the overhead of sending data to R. First, prepare internal forms of the vectors, so that you only count the time it takes to transfer the data.
Here are the timings to send these vectors to R.
The overhead is very noticeable (it was not even feasible to go to the same large number of elements as for vectors). While future versions of RLink will likely have more efficient means of data transfer for lists, the current advice is to avoid sending lists of more than a few thousand entries back and forth. Since R lists are frequently used as an aggregate data structure, chances are that huge R lists will not often appear naturally.
Function Calls
While the syntax for function calls is convenient, it results in an overhead, which can often be unacceptably large.
The more you can use vectorized operations and push the work to be done in R as a whole (minimizing data exchange), the faster your code will be.
The second run is much faster (the first one was necessary since some Java class loading and other events were triggered by the first call on a fresh kernel, which makes the measurement based on the first call inaccurate).
There is a constant overhead of a function call, which dominated the running time for the previous examples. Here, the number of points will be increased 100 times, but the time it takes to compute the result is almost the same.
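A sketch of how this overhead might be observed is given below. The function and the sizes are illustrative and not taken from the original measurements; timings depend on the machine, so none are shown.

```wolfram
Needs["RLink`"]
InstallR[]

sqf = RFunction["function(x) {x^2}"];

(* Warm-up call, so that one-time loading costs do not distort the timings *)
sqf[{1}];

(* 100 separate calls: the per-call overhead is paid 100 times *)
AbsoluteTiming[Map[First[sqf[{#}]] &, Range[100]];]

(* One vectorized call on the same 100 points *)
AbsoluteTiming[sqf[Range[100]];]

(* One vectorized call on 100 times as many points *)
AbsoluteTiming[sqf[Range[10000]];]
```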
General
One general piece of advice is to minimize both the amount of data transferred in either direction and, sometimes even more importantly, the number of times that functions like REvaluate, RSet, and RFunction are called.
The worst possible scenario here is a lightweight function defined in R (or a piece of R code doing very little), called a large number of times from the Wolfram Language. In such a case, you can be almost certain that the total running time will be dominated by the time spent on data transfer and other inner workings of RLink, rather than time spent in R doing the actual computation.
The best scenario is when most of the hard, computationally intensive work is done in R, and data is transferred to R and back in an efficient manner (for example, using data structures containing mostly vectors). Lists are OK as long as they do not have a huge number of elements. Since lists are most frequently used as an aggregate data structure to hold together heterogeneous collections of vectors (and possibly other data structures), they do not usually have a huge number of elements unless used inappropriately in cases where vectors should be used. One exception that is quite problematic for RLink is when the result of computation in R is a large ragged array (for example, of integers), which can only be represented as an R list. In such cases, one way to speed up the transfer would be to pad such an array to a rectangular one, whenever this is possible.
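As a hedged sketch of the padding idea, the R code below pads a small ragged list of integer vectors with zeros on the R side, so that a rectangular integer matrix is returned instead of an R list. The data, the variable names, and the choice of 0 as the padding value are all illustrative; padding is only appropriate when the padding value cannot be confused with real data.

```wolfram
Needs["RLink`"]
InstallR[]

REvaluate["{
  ragged <- list(1:3, 4L, 5:6)
  n <- max(lengths(ragged))
  # pad every vector with zeros to length n, then stack the results as rows
  t(sapply(ragged, function(v) c(v, rep(0L, n - length(v)))))
}"]
```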