RLink User Guide
RLink connects the Wolfram Language with one of the existing R distributions on your machine. Because RLink does not, by itself, carry out the R installation process, at least one R distribution should be pre-installed. If multiple R installations are present on the same machine, you will typically need to point RLink to the one you need to work with. Details on how to do that are provided below. Alternatively, RLink has a built-in R discovery system and will attempt to use the first R installation it finds—typically, that will be the default R installation on your system. Finally, RLink remembers the last successfully used R installation, and therefore will by default attempt to use the same installation again after the first use, unless instructed otherwise.
For RLink to be able to work with a given R installation, it must be configured. The configuration process essentially consists of installing the R package "rJava" and providing RLink with the path to that installed package (RLink needs this because it uses Java as an intermediate layer). In the majority of cases, RLink should be able to carry out this configuration process automatically; however, it is also possible, and in some cases necessary, to configure an R installation manually.
In either case, a successful R configuration process is subject to prerequisites in terms of build tools and dependencies that should be installed on your system. Information on these prerequisites can be found in the tutorial on the manual R configuration process, "Configuring an External R Installation to Work with RLink". On Windows and macOS these prerequisites are usually satisfied automatically, but on Linux they do matter.
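Before any RLink function can be called, the package has to be loaded into the Wolfram Language session. A minimal sketch, assuming RLink is available in your Wolfram installation:

```wolfram
(* Load the RLink package; this alone does not start the R runtime *)
Needs["RLink`"]
```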
Although the package has been loaded, the R runtime has not yet been started. To do that, you need to call the InstallR function:
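A minimal sketch of starting the R runtime with default settings, assuming a usable R installation can be found or was used before:

```wolfram
Needs["RLink`"]

(* Start the R runtime; with no arguments RLink picks the R installation itself *)
InstallR[]
```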
The use of InstallR with no explicit arguments will prompt RLink to use the first R installation it finds (typically the default one on your system), unless some existing R installation has already been used before, in which case it will connect to that previously used one.
To stop and uninstall the R runtime, you can use the UninstallR command. This will uninstall R:
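A minimal sketch, assuming the R runtime was previously started with InstallR:

```wolfram
Needs["RLink`"]

(* Stop the R runtime that RLink is currently connected to *)
UninstallR[]
```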
Now you can no longer use RLink functions that communicate with R until you call InstallR again. For example, this attempts to send some data to R:
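A sketch of such an attempt, assuming UninstallR has just been called. The variable name myvar is only an illustration. The call is expected to fail rather than assign anything, because no R runtime is attached:

```wolfram
Needs["RLink`"]

(* No R runtime is running at this point, so this assignment cannot succeed *)
RSet["myvar", 1]
```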
This installs R again, using InstallR:
Chances are that you will rarely need to use UninstallR. One situation where it is useful is when you need to relaunch the R runtime with a different set of command-line options. Another such situation is when you have several different R distributions installed on the same machine and would like to point RLink to a different one than the one currently used.
RLink supports connecting to an external R distribution for all platforms where RLink and R are supported. In simple cases, no additional configuration steps will be required to establish the connection. However, in general, a one-time R configuration process will be required for a given R installation. In most such cases, RLink will be able to carry out this process automatically. But in some cases (particularly on Linux), you will need to go through the manual steps, described in detail in "Configuring an External R Installation to Work with RLink".
The only required argument for an external R installation is the option "RHomeLocation", which tells RLink where to look for the R home directory. The configuration process is necessary if InstallR fails when passed this option only. Once this configuration process has been completed for the specified R installation, one should be able to connect to that installation by calling InstallR, with some extra options passed in some cases.
Specify the Location of the R Distribution When Installing RLink
In simple cases, you can specify the location by using the "RHomeLocation" option to InstallR, calling it as follows:
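A minimal sketch of such a call. The path below is a hypothetical placeholder and has to be replaced with the R home directory on your machine:

```wolfram
Needs["RLink`"]

(* Hypothetical placeholder: replace with your actual R home directory *)
rHome = "/path/to/your/R/home";

(* Point RLink at that specific R installation *)
InstallR["RHomeLocation" -> rHome]
```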
Specify Additional Options
In some of the cases when the configuration process is required to set up external R distributions for use with RLink, you will need to pass additional information to RLink—specifically, the location of the native JRI library built in the mentioned setup process (the "JRINativeLibraryLocation" option) and the version number for this specific R installation (the "RVersion" option) to InstallR. This can be done as follows:
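A sketch of passing all three options together. Every value below is a hypothetical placeholder. In particular, the expected form of the "RVersion" value is an assumption here and should be taken from the configuration tutorial:

```wolfram
Needs["RLink`"]

(* Hypothetical placeholders: replace all three with values for your system *)
rHome = "/path/to/your/R/home";
jriLib = "/path/to/the/built/JRI/native/library";
rVersion = "<version of this R installation>";

InstallR[
  "RHomeLocation" -> rHome,
  "JRINativeLibraryLocation" -> jriLib,
  "RVersion" -> rVersion
]
```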
More details and examples for all supported platforms can be found in "Configuring an External R Installation to Work with RLink".
Note that in practice, RLink in most cases should be able to automatically determine the values of these parameters for a given R distribution, after the configuration process has been successfully performed, so the need to explicitly pass these options to InstallR should only emerge in special situations.
To send data to R, you have to use the RSet function. Your data will have to be expressed in a form that RLink can understand. For most common data types, such as (multidimensional) arrays, you can use the usual Wolfram Language nested list representation of them. More details on this can be found in "R Data Types in RLink".
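A minimal sketch of sending a scalar and a two-dimensional array to R. The R variable names mint and mymat are only illustrations:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* Assign an integer to the R variable "mint" *)
RSet["mint", 10];

(* A nested list of equal-length rows represents a two-dimensional array *)
RSet["mymat", {{1, 2, 3}, {4, 5, 6}}];
```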
You can test the assignment with the help of REvaluate.
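A minimal sketch of such a check, using the illustrative variable name mint. The assignment is repeated so that the snippet runs on its own:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

RSet["mint", 10];

(* Read the variable back from the R workspace *)
REvaluate["mint"]
```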
Since scalars are interpreted as one-element vectors, the result is a list. More details on this can also be found in "R Data Types in RLink".
In the majority of cases, you do not have to indicate the type of data you are sending to R. RLink determines the data type automatically, based on the form of your data. The procedure for automatic type detection is described in more detail on the reference page for ToRForm, and also in "R Data Types in RLink".
You can also use RSet on expressions more general than variables. In particular, you can make part assignments to elements of lists and arrays.
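A sketch of a part assignment, assuming an illustrative R vector named myvec. The first argument of RSet is an R expression that selects the element to overwrite:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* Create a vector in R, then overwrite its second element *)
RSet["myvec", Range[5]];
RSet["myvec[2]", 100];

(* Inspect the modified vector *)
REvaluate["myvec"]
```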
In general, the only requirement is that the R expression represented by the string passed as a first argument to RSet can be assigned a value (is an L-value in R).
To execute any string of valid R code and get the results back to the Wolfram Language, you can use the REvaluate function. You have seen some examples of its use already.
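A minimal sketch. The R expression is an arbitrary illustration and any valid R code can be used in its place:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* Evaluate R code and bring the result back to the Wolfram Language *)
REvaluate["sum(1:10)"]
```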
In this case, the result was transferred back to the Wolfram Language, but not saved anywhere in the R workspace. If you wish to also save the result, you can assign it to some variable in the R workspace.
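A sketch of keeping the result in the R workspace by making the assignment part of the evaluated R code. The variable name total is only an illustration:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* The R assignment stores the value in the R workspace *)
REvaluate["total <- sum(1:10)"];

(* The stored value can be retrieved later by name *)
REvaluate["total"]
```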
You can execute multiline chunks of R code with REvaluate, but in this case, you have to enclose the code in curly braces.
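A sketch of a multiline chunk wrapped in curly braces. The R code itself is an arbitrary illustration. The value of the last expression in the block is what gets returned:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* The braces turn several R statements into a single R expression *)
REvaluate["{
  x <- 1:10
  y <- x^2
  sum(y)
}"]
```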
Many more examples can be found on the documentation page for REvaluate.
While the topic of this section is logically connected to the discussion in the previous section, it is important enough to have a separate discussion. If you go slightly beyond using the functions already available in R or its various extensions, one of the main things you may want to do is to define your own R functions, from the Wolfram Language.
This is perfectly possible with RLink. The details of how it is done are discussed in "Functions". Here only a few simple examples will be considered. Generally, functions in RLink are represented by opaque references, which point to functions defined in the R workspace. Such references can be stored in variables or used directly, to call R functions on Wolfram Language expressions as arguments and get the result back to the Wolfram Language.
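A minimal sketch of defining an R function from the Wolfram Language, assuming that evaluating an R function definition with REvaluate returns a function reference as described above. The name sq is used on both sides for illustration:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* Define sq in the R workspace and keep the returned reference in sq *)
sq = REvaluate["sq <- function(x){x^2}"]
```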
A function reference was just generated, and stored in a variable sq. But also, an assignment in the R workspace was performed. There are now a number of ways you can call this function. First, you can call it directly in R.
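A sketch of two ways to call the function. The definition is repeated so that the snippet runs on its own. The second form assumes that a function reference can be applied to Wolfram Language arguments with ordinary bracket syntax:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

sq = REvaluate["sq <- function(x){x^2}"];

(* Call the function inside R, by the name it has in the R workspace *)
REvaluate["sq(1:5)"]

(* Call it through the reference, passing Wolfram Language data *)
sq[Range[5]]
```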
However, constructing function references through REvaluate and using them in this manner is often not the best option, in particular because a new copy of the function reference is generated at every call, and also because such references live only as long as the current RLink session (this is explained in detail in "Functions"). RFunction is a special device for creating "better" function references, which are cached and have an indefinite lifespan.
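A minimal sketch of creating a function reference with RFunction and calling it. The name sqf is only an illustration:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* Build a function reference from a string of R code *)
sqf = RFunction["function(x){x^2}"];

(* Apply the reference to Wolfram Language data *)
sqf[Range[5]]
```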
Such a use does not produce a new copy of a function reference on each call, since references produced by RFunction are cached.
You can use function references as you would other objects in RLink; in particular, you can send them to R and pass them as arguments to other (higher-order) functions. For example, the previous function reference can now be assigned to a variable in R.
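A sketch of sending a function reference to R, using the illustrative names sqf and mySq:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

sqf = RFunction["function(x){x^2}"];

(* Assign the function to the R variable mySq *)
RSet["mySq", sqf];

(* The function is now callable by that name from R code *)
REvaluate["mySq(1:5)"]
```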
You can pass function references as arguments to other functions. For example, you can define an analog of the Wolfram Language's Select function for R as follows.
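One possible sketch of such an analog. The name rSelect and the R implementation based on sapply are illustrative choices, not necessarily the definition the original tutorial used:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* Keep those elements of x for which the predicate f returns TRUE *)
rSelect = RFunction["function(x, f){ x[sapply(x, f)] }"];
```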
You can now use it with some custom filtering function, which you can also define with RFunction.
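A sketch of such a call, with the illustrative names rSelect and isEven. The selector is defined again so that the snippet runs on its own:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

rSelect = RFunction["function(x, f){ x[sapply(x, f)] }"];

(* A custom predicate, also defined as an R function *)
isEven = RFunction["function(n){ n %% 2 == 0 }"];

(* Pass one function reference as an argument to the other *)
rSelect[Range[10], isEven]
```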
There are a number of more subtle points on using function references in RLink, discussed in "Functions".
For example, if you try to transfer to R some general symbolic expression, you get an error message telling you that RLink does not know how to convert this input to a data type that it can transfer to R.
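A sketch of such an attempt, assuming the symbols a and b have no values in the current session. The R variable name myexpr is only an illustration. The exact message text is not reproduced here:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* a + b stays symbolic, so it has no R data type it could be converted to *)
RSet["myexpr", a + b]
```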
To learn which inputs can and cannot be transferred to R, please see "R Data Types in RLink", which has a detailed discussion on this.
Some errors do not manifest themselves during the data transfer to R, but show up as R runtime errors. In such cases, RLink attempts to deliver the R error message generated in the R workspace to the Wolfram System.
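A sketch that triggers a runtime error on the R side on purpose, using R's stop function. The message text is an arbitrary illustration:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* The string is valid R code, but R raises an error while running it *)
REvaluate["stop('something went wrong in R')"]
```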
RLink is a rather high-level interface, built on top of J/Link, which itself is built on top of the Wolfram Symbolic Transfer Protocol (WSTP), and rJava/JRI, which is a Java interface to the R runtime (the latter used as a standalone set of dynamic native libraries). Also, RLink often uses flexible means of data transfer and execution, sometimes involving runtime R code generation and execution. This flexibility allows RLink to handle a rather large subset of possible R objects, and also things like part assignments to arrays and lists, in a uniform way. But the price to pay for this is an (often very considerable) overhead. In cases when this overhead is not acceptable, there are ways to optimize the data transfer between the Wolfram Language and R. Some of them are described in this section.
Vectors versus Lists
The main advice here is to avoid sending and returning R lists whenever you can, and to send and return R vectors (arrays) instead. Arrays are much more efficient at every stage of communication with R: an array is transformed to the internal representation more efficiently (since Wolfram Language packed arrays can often be used), it is transferred to R more efficiently, it is processed by R more efficiently, and the result is returned to the Wolfram Language much faster.
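A rough way to compare the two cases on your own system is sketched below. The sizes are arbitrary assumptions, and the list is kept deliberately small. No timings are quoted here, since they depend on the machine and the RLink version:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* Time the return of an R integer vector *)
AbsoluteTiming[REvaluate["1:100000"];]

(* Time the return of an R list with far fewer elements *)
AbsoluteTiming[REvaluate["as.list(1:1000)"];]
```

Running each measurement twice helps separate one-time startup costs from the steady-state transfer cost.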
You can see that in the latter case, the time complexity is also linear, but with a much larger constant, which is about 40 times that of a vector (which is, more or less, the typical speed ratio between top-level iteration—when done right—and the one using packed arrays).
The overhead is very noticeable (it was not even feasible to go to the same number of elements as for vectors). While future versions of RLink will likely have more efficient means of transferring lists, the current advice is to avoid sending lists of more than a few thousand entries back and forth. Since R lists are frequently used as an aggregate data structure, chances are that huge R lists will not often appear naturally.
The second run is much faster (the first one was necessary since some Java class loading and other events were triggered by the first call on a fresh kernel, which makes the measurement based on the first call inaccurate).
There is a constant overhead of a function call, which dominated the running time for the previous examples. Here, the number of points will be increased 100 times, but the time it takes to compute the result is almost the same.
One general piece of advice is to minimize both the amount of data transferred in each direction and, sometimes even more importantly, the number of times that functions like REvaluate, RSet, and RFunction are called.
The worst possible scenario here is a lightweight function defined in R (or a piece of R code doing very little), called a large number of times from the Wolfram Language. In such a case, you can be almost certain that the total running time will be dominated by the time spent on data transfer and other inner workings of RLink, rather than time spent in R doing the actual computation.
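The difference between many small calls and one vectorized call can be sketched as follows. The function and the element count are arbitrary illustrations, and no timings are quoted since they depend on the system:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

sq = RFunction["function(x){x^2}"];

(* Worst case: one round trip to R per element *)
AbsoluteTiming[sq /@ Range[200];]

(* Preferred: a single round trip that sends the whole vector *)
AbsoluteTiming[sq[Range[200]];]
```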
The best scenario is when most of the hard, computationally intensive work is done in R, and data is transferred to R and back in an efficient manner (for example, using data structures containing mostly vectors). Lists are OK as long as they do not have a huge number of elements. Since lists are most frequently used as an aggregate data structure to hold together heterogeneous collections of vectors (and possibly other data structures), they do not usually have a huge number of elements unless used inappropriately in cases where vectors should be used. One exception that is quite problematic for RLink is when the result of computation in R is a large ragged array (for example, of integers), which can only be represented as an R list. In such cases, one way to speed up the transfer would be to pad such an array to a rectangular one, whenever this is possible.
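A sketch of the padding idea, done on the R side before the result is returned. The data and the padding value 0 are assumptions made for illustration. The padding value has to be one that cannot occur in the real data, so that it can be stripped afterward:

```wolfram
(* Skip these two lines if RLink is loaded and the R runtime is running *)
Needs["RLink`"]
InstallR[]

(* Pad every row to the length of the longest one, then return a matrix *)
REvaluate["{
  ragged <- list(c(1L, 2L, 3L), 4L, c(5L, 6L))
  n <- max(sapply(ragged, length))
  t(sapply(ragged, function(v) c(v, rep(0L, n - length(v)))))
}"]
```

The result is a rectangular array instead of an R list, so it can take the faster transfer path for vectors and arrays.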