*RLink* User Guide

This guide will show you how to use *RLink* for communication between *Mathematica* and R.

### Loading, Installing, and Uninstalling *RLink*

*RLink* Resources (*RLinkRuntime*)

The *RLink* application consists of two parts: *RLink* proper and the *RLinkRuntime* paclet. The former contains the *Mathematica* code implementing the *RLink* API and the Java code interfacing *RLink* with R, and comes with *Mathematica*. The latter contains essentially an R distribution with a few additional libraries, and is packaged as a downloadable paclet.

If you want to use the R distribution bundled with *RLink* (as opposed to your own R installation), the *RLinkRuntime* paclet must be downloaded and installed on your machine before you can use *RLink*. By default, the InstallR function (described in the next section), which starts the R runtime, does this for you automatically if it finds no paclet installed.

Sometimes, however, it may be more convenient to download and install the paclet manually. You can do this with the RLinkResourcesInstall function.

First, you have to load *RLink*.
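The manual installation described above can be sketched as follows. This is a minimal sketch, assuming a working internet connection; the variable name `result` and the final check are illustrative and not part of *RLink*.

```mathematica
(* Load the RLink package *)
Needs["RLink`"]

(* Download and install the RLinkRuntime paclet manually *)
result = RLinkResourcesInstall[]

(* Illustrative check: report a failed installation *)
If[result === $Failed, Print["RLinkRuntime paclet installation failed"]]
```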

If RLinkResourcesInstall returns $Failed, the installation failed. This can happen for various reasons, for example if you are not connected to the internet or have disabled *Mathematica*'s internet access. The problem has to be corrected before *RLink* can function properly, unless you use your own R distribution, in which case you do not need RLinkResourcesInstall at all (this is described in the next section).

#### Using R Distribution Bundled with *RLink*

Before you can start working with *RLink*, it must be loaded.

While the package has been loaded, the R runtime has not been started yet. To do that, you need to call the InstallR function.
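As a minimal sketch, loading the package and starting the R runtime with default settings looks like this:

```mathematica
Needs["RLink`"]   (* load the package *)
InstallR[]        (* start the R runtime using the default (bundled) R distribution *)
```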

When you use InstallR with the default R distribution (which comes with *RLink*), InstallR checks for the *RLinkRuntime* paclet containing that R distribution. If the paclet is not installed, InstallR internally calls the RLinkResourcesInstall function, which attempts to download the paclet from the Wolfram paclet server and install it. This is a one-time operation, but internet connectivity must be enabled for it to succeed. Once the paclet has been successfully downloaded and installed, *RLink* uses it automatically in all subsequent sessions, unless you give InstallR an explicit option pointing to a different R distribution, in which case that one is used.

If you need to start R with some specific command-line options, you can specify them with the option "RCommandLine", as a string.
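A hedged sketch of passing command-line options is shown below. The flag `--vanilla` is a standard R start-up flag chosen purely for illustration; which flags the R runtime embedded in *RLink* honors is not covered in this guide, so treat the value as an assumption.

```mathematica
Needs["RLink`"]

(* "--vanilla" is only an illustrative choice of R start-up flag *)
InstallR["RCommandLine" -> "--vanilla"]
```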

To stop and uninstall the R runtime, use the UninstallR function.

After that, *RLink* functions that communicate with R can no longer be used until you call InstallR again. For example, an attempt to send data to R with RSet fails at this point.

Calling InstallR again restarts the R runtime, after which you can communicate with R normally.
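The uninstall and reinstall cycle can be sketched as follows. The variable name `x` and the data are illustrative; the exact form of the failure in the middle step is not shown here.

```mathematica
Needs["RLink`"]
InstallR[]

UninstallR[]             (* stop and uninstall the R runtime *)
RSet["x", {1, 2, 3}]     (* attempted while R is not running: expected to fail *)

InstallR[]               (* start the R runtime again *)
RSet["x", {1, 2, 3}]     (* the transfer should now go through *)
REvaluate["x"]           (* read the value back from R *)
```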

Chances are that you will rarely need to use UninstallR. One situation where it is useful is when you need to relaunch the R runtime with a different set of command-line options. Another is when you have several R distributions installed on the same machine and would like to point *RLink* to a different one than the one currently in use.

#### Using Your Own R Distribution (Windows only)

On Windows, it is currently possible to use your own R distribution with *RLink*, which may have a number of advantages in certain circumstances. For example, you may already have a customized R distribution that you would like to continue using for everything, including work with *Mathematica*/*RLink*. Or you may want to install extra packages.

##### Specify the Location of the R Distribution When Installing *RLink*

You can specify the location by using the "RHomeLocation" option to InstallR, calling it as follows.

The value of the option should correspond to the R_HOME variable you usually set for your R distribution, and should point to its root directory.
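A minimal sketch of such a call follows. The path is a hypothetical example and must be replaced by the root directory of your own R installation.

```mathematica
Needs["RLink`"]

(* Hypothetical path: substitute the R_HOME of your own R distribution *)
InstallR["RHomeLocation" -> "C:\\Program Files\\R\\R-2.14.0"]
```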

### Sending Data to R

To send data to R, you have to use the RSet function. Your data will have to be expressed in a form that *RLink* can understand. For most common data types, such as (multidimensional) arrays, you can use the usual *Mathematica* nested list representation of them. More details on this can be found in "R Data Types in *RLink*".

This will send an integer to R and assign a variable in the R workspace to its value.

You can test the assignment with the help of REvaluate.

Since scalars are interpreted as one-element vectors, the result is a list. More details on this can also be found in "R Data Types in *RLink*".
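The scalar assignment and its check can be sketched as follows; the R variable name `myint` is illustrative.

```mathematica
Needs["RLink`"]
InstallR[]

RSet["myint", 1]      (* assign the integer 1 to the R variable myint *)
REvaluate["myint"]    (* comes back as a one-element list, since R scalars are vectors *)
```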

The following will send a nested ragged list to R, which will be interpreted as an R list.

You do not have to indicate the type of data you are sending to R, in the majority of cases. The data type is determined for you automatically by *RLink*, based on the form of your data. The procedure for the automatic type detection is described in more detail in the reference page for ToRForm, and also in "R Data Types in *RLink*".
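As a small sketch of automatic type detection, the nested ragged list below is sent without any type annotation. The variable name `mylist` and the data are illustrative.

```mathematica
Needs["RLink`"]
InstallR[]

(* A ragged nested list: no type information is given, RLink infers an R list *)
RSet["mylist", {1, {2, 3}, {4, {5, 6}}}]
REvaluate["mylist"]
```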

You can also use RSet on expressions more general than variables. In particular, you can make part assignments to elements of lists and arrays.

You can test whether or not the assignment took place on the R side.

In general, the only requirement is that the R expression represented by the string passed as the first argument to RSet can be assigned a value (that is, it is an L-value in R).
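A part assignment of this kind might look as follows. This is a sketch with illustrative names and data; `v[2]` is used because it is a valid assignment target in R.

```mathematica
Needs["RLink`"]
InstallR[]

RSet["v", {10, 20, 30}]   (* create an R vector *)
RSet["v[2]", 99]          (* assign to its second element; v[2] is an L-value in R *)
REvaluate["v"]            (* check the result on the R side *)
```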

### Executing R Code and Getting Data from R

To execute any string of valid R code and get the results back to *Mathematica*, you can use the REvaluate function. You have seen some examples of its use already.

For example, here REvaluate is used to generate a vector of random integers.

In this case, the result was transferred back to *Mathematica*, but not saved anywhere in the R workspace. If you wish to also save the result, you can assign it to some variable in the R workspace.

You can test now that the values were saved in the R variable.
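These three steps can be sketched as follows; the R variable name `rint` and the sample sizes are illustrative.

```mathematica
Needs["RLink`"]
InstallR[]

REvaluate["sample(1:100, 10)"]           (* result is returned, but not stored in R *)
REvaluate["rint <- sample(1:100, 10)"]   (* result is returned and stored in rint *)
REvaluate["rint"]                        (* the stored values are still available *)
```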

Here is a slightly more involved computation: compute frequency counts with table and return the result as an R data frame.
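One way to write such a computation is sketched below. The data and the variable name `rint` are illustrative, and the form in which the data frame appears on the *Mathematica* side is not shown here.

```mathematica
Needs["RLink`"]
InstallR[]

(* Draw 50 integers between 1 and 10 with repetition and keep them in R *)
REvaluate["rint <- sample(1:10, 50, replace = TRUE)"];

(* Count how often each value occurs and convert the table to a data frame *)
REvaluate["as.data.frame(table(rint))"]
```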

Sometimes you may also wish to suppress the output on the R side, that is, suppress the data transfer from R to *Mathematica*. To do this, put a semicolon at the end of your R code.

You can now check that the assignment took place.

One reason why this may be needed is that the result is of a type that cannot be returned to *Mathematica* via *RLink*, but can be worked with further in R.
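A minimal sketch of suppressing the transfer follows; the variable name `big` is illustrative. Note that the semicolon is inside the R code string, not after the REvaluate call.

```mathematica
Needs["RLink`"]
InstallR[]

REvaluate["big <- rnorm(1000);"]   (* trailing semicolon in the R code: no data is sent back *)
REvaluate["length(big)"]           (* confirm that the assignment took place *)
```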

You can execute multiline chunks of R code with REvaluate, but in this case, you have to enclose the code in curly braces.

Here also, the semicolon at the end of the code string would suppress the data transfer from R to *Mathematica*, while the code will still be executed.
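A multiline chunk can be sketched as follows; the R code itself is illustrative. The value of the last expression inside the braces is what gets returned.

```mathematica
Needs["RLink`"]
InstallR[]

(* Several R statements wrapped in curly braces *)
REvaluate["{
    x <- 1:5
    y <- x^2
    sum(y)
}"]
```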

Many more examples can be found on the documentation page for REvaluate.

### Defining Your Own R Functions

Although this topic is logically connected to the previous section, it is important enough to be discussed separately. As soon as you go slightly beyond the functions already available in R or its various extensions, one of the main things you may want to do is define your own R functions from *Mathematica*.

This is perfectly possible with *RLink*. The details of how it is done are discussed in "Functions". Here only a few simple examples will be considered. Generally, functions in *RLink* are represented by opaque references, which point to functions defined in the R workspace. Such references can be stored in variables or used directly, to call R functions on *Mathematica* expressions as arguments and get the result back to *Mathematica*.
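A simple case can be sketched as follows: define a squaring function in the R workspace and keep the returned reference in a *Mathematica* variable. The names `sq` (on both sides) are illustrative.

```mathematica
Needs["RLink`"]
InstallR[]

(* Defines sq in the R workspace and returns a function reference *)
sq = REvaluate["sq <- function(x) {x^2}"]
```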

A function reference was just generated, and stored in a variable sq. But also, an assignment in the R workspace was performed. There are now a number of ways you can call this function. First, you can call it directly in R.

This, however, does not allow you to pass arguments from *Mathematica*. This is when the function reference is handy.

It is not necessary to store the reference in a variable. You could use it in the same piece of code where it is generated.
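The three ways of calling the function can be sketched side by side. The names and the argument values are illustrative.

```mathematica
Needs["RLink`"]
InstallR[]

sq = REvaluate["sq <- function(x) {x^2}"];

REvaluate["sq(1:5)"]                       (* call inside R; arguments are written in R code *)
sq[Range[5]]                               (* call through the reference with Mathematica data *)
REvaluate["function(x) {x^2}"][Range[5]]   (* reference used in the same expression that creates it *)
```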

However, constructing function references through REvaluate and using them in this manner is often not the best option, in particular because a new copy of the function reference is generated at every call, and because such references only live as long as the current *RLink* session (this is explained in detail in "Functions"). There is a special device for creating "better" function references, which are cached and have an indefinite lifespan: RFunction.

Here is how you would use it for the preceding example.

Or, just as before, it can also be used directly without being stored in a variable.

Such a use does not produce a new copy of a function reference on each call, since references produced by RFunction are cached.
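Both uses of RFunction can be sketched as follows; the variable name `sqr` is illustrative.

```mathematica
Needs["RLink`"]
InstallR[]

sqr = RFunction["function(x) {x^2}"];       (* cached function reference *)
sqr[Range[5]]                               (* call it on Mathematica data *)

RFunction["function(x) {x^2}"][Range[5]]    (* direct use, without storing the reference *)
```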

You can use function references as you would other objects in *RLink*. In particular, you can send them to R and pass them as arguments to other (higher-order) functions. For example, a function reference created earlier can be assigned to a variable in R.
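A sketch of sending a function reference to R follows. The names `sqr` and `mysq` are illustrative.

```mathematica
Needs["RLink`"]
InstallR[]

sqr = RFunction["function(x) {x^2}"];
RSet["mysq", sqr]          (* bind the function to the R variable mysq *)
REvaluate["mysq(1:5)"]     (* call it by that name inside R *)
```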

You can pass function references as arguments to other functions. For example, you can define an analog of *Mathematica*'s Select function for R as follows.

You can now use it with some custom filtering function, which you can also define with RFunction.
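One possible way to write such a Select analog is sketched below. The names `select` and `isEven` and the R implementation based on `sapply` are illustrative choices, not necessarily the ones used elsewhere in the *RLink* documentation.

```mathematica
Needs["RLink`"]
InstallR[]

(* Keep the elements of x for which the predicate f returns TRUE *)
select = RFunction["function(x, f) { x[sapply(x, f)] }"];

(* A custom filtering function, also defined with RFunction *)
isEven = RFunction["function(n) { n %% 2 == 0 }"];

(* Pass the second function reference as an argument to the first *)
select[Range[10], isEven]
```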

There are a number of more subtle points on using function references in *RLink*, discussed in "Functions".

### Error Handling

*RLink* attempts to handle possible errors on several levels. First, on the *Mathematica* side, *RLink* attempts to catch invalid *Mathematica* input and issue the relevant error messages. For example, if you try to transfer a general symbolic expression to R, you get an error message telling you that *RLink* does not know how to convert this input to a data type that it can transfer to R.

To learn which inputs can and cannot be transferred to R, please see "R Data Types in *RLink*", which has a detailed discussion on this.
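A sketch of input that is expected to be rejected on the *Mathematica* side is shown below. The names are illustrative, and the exact message text is not reproduced here.

```mathematica
Needs["RLink`"]
InstallR[]

Clear[a];
RSet["expr", Sin[a]]   (* general symbolic expression: no R data type corresponds to it *)
```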

Some errors do not manifest themselves during the data transfer to R, but show up as R runtime errors. In such cases, *RLink* attempts to deliver the R error message generated in the R workspace to *Mathematica*.

For example, here is an attempt to reference an undefined variable.

Here, the array bounds are violated.

Here, a function is called with the wrong number of arguments.

Parse errors are usually detected, but rather indirectly, and the error message can both be somewhat cryptic and involve some implementation details.
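The four kinds of errors mentioned above can be triggered with calls such as the following. The R snippets are illustrative, and the messages *RLink* prints for them are not reproduced here.

```mathematica
Needs["RLink`"]
InstallR[]

REvaluate["undefinedVariable"]        (* reference to an undefined variable *)
REvaluate["(1:5)[[10]]"]              (* subscript out of bounds *)
REvaluate["(function(x) x)(1, 2)"]    (* function called with too many arguments *)
REvaluate["1 +* 2"]                   (* code that R cannot parse *)
```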

### Performance Tuning

*RLink* is a rather high-level interface, built on top of *JLink* (which itself is built on top of *MathLink*) and rJava/JRI, a Java interface to the R runtime (the latter used as a standalone set of dynamic native libraries). *RLink* also often uses flexible means of data transfer and execution, sometimes involving runtime R code generation and execution. This flexibility allows *RLink* to handle a rather large subset of possible R objects, as well as things like part assignments to arrays and lists, in a uniform way. The price to pay for this is an overhead that is often very considerable. When this overhead is not acceptable, there are ways to optimize the data transfer between *Mathematica* and R. Some of them are described in this section.

##### Vectors versus Lists

The main advice here is to avoid sending and returning R lists whenever you can, and to send and return R vectors (arrays) instead. Arrays are much more efficient at all stages of communication with R: the data is converted to the internal representation more efficiently (since *Mathematica* packed arrays can often be utilized), transferred to R more efficiently, processed by R more efficiently, and the result is again obtained by *Mathematica* much faster.

Here, as an example, first transform a vector to *RLink* internal representation.

Now do the same for a vector interpreted as a list.

You can see that in the latter case, the time complexity is also linear, but with a much larger constant, which is about 40 times that of a vector (which is, more or less, the typical speed ratio between top-level iteration—when done right—and the one using packed arrays).
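A comparison of this kind can be sketched as follows. Appending one nested element to force the list interpretation is an assumption made for this sketch (it relies on ragged nested lists being treated as R lists), and the sizes are illustrative. Timings depend on the machine and are not shown.

```mathematica
Needs["RLink`"]
InstallR[]

n = 100000;
vec = Range[n];               (* flat list of integers: treated as an R vector *)
lst = Append[vec, {0, 0}];    (* one nested element makes it ragged: treated as an R list *)

AbsoluteTiming[ToRForm[vec];]   (* conversion of the vector to the internal form *)
AbsoluteTiming[ToRForm[lst];]   (* conversion of the list to the internal form *)
```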

Now look at the overhead of sending data to R. First, prepare internal forms of the vectors, so that you only count the time it takes to transfer the data.

Here are the timings to send these vectors to R.
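The transfer timings can be measured along these lines. This is a sketch with illustrative names and sizes; the list is deliberately kept much smaller than the vector, and the way the list is constructed is an assumption made for this sketch.

```mathematica
Needs["RLink`"]
InstallR[]

(* Prepare internal forms first, so only the transfer itself is timed *)
vecForm = ToRForm[Range[100000]];
lstForm = ToRForm[Append[Range[2000], {0, 0}]];

AbsoluteTiming[RSet["rvec", vecForm];]   (* send the vector *)
AbsoluteTiming[RSet["rlst", lstForm];]   (* send the list *)
```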

The overhead is very noticeable (it was not even feasible to go to the same large number of elements as for vectors). While future versions of *RLink* will likely have more efficient means of transferring lists, the current advice is to avoid sending lists of more than a few thousand entries back and forth. Since R lists are frequently used as an aggregate data structure, chances are that huge R lists will not often appear naturally.

##### Function Calls

While the syntax for function calls is convenient, it results in an overhead, which can often be unacceptably large.

The more you can use vectorized operations and push the work to be done in R as a whole (minimizing data exchange), the faster your code will be.

There is a constant overhead of a function call, which dominated the running time for the previous examples. Here, the number of points will be increased 100 times, but the time it takes to compute the result is almost the same.
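The effect of the per-call overhead can be sketched with a simple comparison. The function and the point counts are illustrative, and timings are machine dependent and therefore not shown.

```mathematica
Needs["RLink`"]
InstallR[]

sinR = RFunction["function(x) sin(x)"];
pts = N[Range[100]];

AbsoluteTiming[sinR /@ pts;]              (* 100 separate calls, one per point *)
AbsoluteTiming[sinR[pts];]                (* a single vectorized call *)
AbsoluteTiming[sinR[N[Range[10000]]];]    (* 100 times more points, still a single call *)
```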

##### General

One general piece of advice is that, whatever you do, you should try to minimize both the amount of data transferred in either direction and, sometimes even more importantly, the number of times that functions like REvaluate, RSet, and RFunction are called.

The worst possible scenario here is a lightweight function defined in R (or a piece of R code doing very little), called a large number of times from *Mathematica*. In such a case, you can be almost certain that the total running time will be dominated by the time spent on data transfer and other inner workings of *RLink*, rather than by the time spent in R doing the actual computation.

The best scenario is when most of the hard, computationally intensive work is done in R, and data is transferred to R and back in an efficient manner (for example, using data structures containing mostly vectors). Lists are OK as long as they do not have a huge number of elements. Since lists are most frequently used as an aggregate data structure to hold together heterogeneous collections of vectors (and possibly other data structures), they do not usually have a huge number of elements unless used inappropriately in cases where vectors should be used. One exception that is quite problematic for *RLink* is when the result of computation in R is a large ragged array (for example, of integers), which can only be represented as an R list. In such cases, one way to speed up the transfer would be to pad such an array to a rectangular one, whenever this is possible.
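Padding on the R side before the transfer can be sketched as follows. The data, the names, and the fill value `-1` are assumptions made for illustration; the fill value must be one that cannot occur in the real data.

```mathematica
Needs["RLink`"]
InstallR[]

(* Pad each element of a ragged R list to the maximal length, so that the
   result is a rectangular integer matrix rather than an R list *)
REvaluate["{
    ragged <- list(1:3, 4L, 5:6)
    n <- max(sapply(ragged, length))
    t(sapply(ragged, function(v) c(v, rep(-1L, n - length(v)))))
}"]
```

On the *Mathematica* side, the padding entries can then be stripped again if the original ragged shape is needed.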