Functions in RLink
This tutorial will explain how to define and use R functions in RLink, including some more advanced forms of them, such as closures and higher-order functions.
Function References
R functions in RLink are represented by expressions with the head RFunction, which are opaque references to functions defined in the R workspace. Such references can be stored in variables or used directly, to call R functions on Mathematica expressions as arguments and return the result back to Mathematica.
Function References: How to Obtain and Use Them
To call any R function via RLink, passing to it Mathematica expressions as arguments, you first need to obtain a reference to that function. In cases where the function has not yet been defined in the R workspace (e.g. anonymous functions), the function is first defined and stored in the R workspace, as a part of the procedure of obtaining a Mathematica reference to it. This is done automatically, and should be of no concern to the user.
Function References Obtained via REvaluate
To proceed, you first need to load RLink.
There are several ways to generate such a reference. One way to do it is to call REvaluate on a string of code that evaluates to a function in R.
What is now stored in sum is a reference to a (built-in in this case) function defined in the R workspace. This reference can now be used.
You can also use such references directly.
However, for function references returned by REvaluate, such uses are discouraged, for reasons explained in the sections that follow (basically because a new copy of the reference is produced at each such invocation).
Obtaining a function reference via REvaluate will also work for user-defined functions, named or anonymous. For example, you can define a function that will subtract 1 from its argument.
Now you can obtain a reference.
You do not have to name the function on the R side to obtain the reference to it; anonymous functions will do just as well.
Function References Obtained via RFunction
First, load RLink, if you have not already done so.
Another way to obtain a function reference is to use RFunction[code].
Such references can be used in the same way.
Again, you can bypass the assignment to a Mathematica variable and use it directly, as in the following program, which mimics Mathematica's pure functions.
As opposed to the case where the reference is produced by REvaluate, in this case there are no issues with such a direct use, because references returned by RFunction are somewhat different internally from references returned by REvaluate, as will be explained.
This method of obtaining function references is the recommended one, whenever it is possible to do so (in particular, for user-defined R functions that you define through their implementation—code string—in Mathematica). The reasons for this will be explained, and have to do with function references' lifetime. You cannot obtain the references via RFunction for all functions (closures are one notable exception), but for many you can (basically, you can do this for any R function that either has a name or can be fully defined by its code string, but does not depend on any R environment except the default [global] one).
Function References Obtained as a Result of a Function Call
A somewhat more esoteric way of generating a function reference is in the case of closures, which are functions returned as a result of calls to other functions. Closures will be covered systematically later, but here is one example: a function that generates another function that would add a fixed number to its argument.
The following will generate a function that adds 10 to its argument and returns the result.
This is also a function reference. It can be used as follows.
Just like with references obtained via REvaluate (and unlike those obtained via RFunction), direct uses such as the following are allowed but not recommended (direct uses of those references obtained via RFunction are both possible and recommended), for the same reason that new copies of the same function reference will be produced by each such call.
It is better to store a reference in a variable in such cases, like in the previous example with add10. (While the generator function addNum was obtained via RFunction, the reference stored in add10 was produced via a function call to addNum—effectively, it was produced by a route similar to REvaluate.)
The Anatomy and Lifetime of Function References
The most important distinction between function references used in RLink is directly related to the way they were obtained. One class of function references includes those returned by REvaluate or as a result of R function calls (closures)—they will be called type I references here. Another class of references includes those obtained through RFunction, which will be called type II references. The references belonging to these classes have several differences, such as lifetime and caching.
The Full Internal Form of the Function Reference
Here is a simple example of a reference to a built-in R function.
You can inspect the FullForm of it.
The first element is a string, which can be either "builtin" or "closure". This corresponds to the two function types in R with the same names. R has a (comparatively) small number of built-in functions that are called primitive and are not themselves implemented in R; they have a type "builtin". This can be confirmed simply by evaluating.
All other functions have a type "closure" (there is a confusing clash of terminology here, since the R's "closure" type covers a wider class of functions than the more traditional definition of closure as being a result of another function's execution, having access to that function's environment. In particular, R's "closure" type also includes top-level R functions as well). This includes both built-in R functions that are implemented in the top-level R, and user-defined R functions. Again, this is easy to check for a given function.
The second element is a string of code defining that function, wrapped in an RCode wrapper. The code in RCode is there mostly for illustrative purposes, because it is not always enough to fully reconstruct the function (functions may depend on some particular R environment; they may also be byte-compiled, etc.—all such information is lost in this representation).
The third element is a reference index, used by RLink internally to relate the reference to the actual function defined in the R workspace. It can be a positive integer number (for type I references), or Automatic (for type II references).
The last element is an RAttributes[] container, which may be empty or not, depending on whether or not the corresponding R function has a nontrivial set of attributes.
Accessing the Code (Implementation) of R Functions Implemented in the Top-Level R
Functions that are built-in but implemented in top-level R will also have a type "closure".
This is a (trivial) example involving this function.
Often, for such functions, you can extract their code (implementation).
Such code will not be available if you create a (overall more advantageous) type II reference for factor.
The same is true for all named functions: type I references will carry at least some details of their internal implementation, while type II references will only carry their names. Of course, there are other ways to access the R functions implementation details that are available in R.
The Lifetime of the References
Function references of type I have a finite lifetime limited to a current RLink session. References of type II are permanent. This has a number of consequences that will be described here and in the sections that follow.
References Originated from Explicit R Code Execution (Type I)
All type I references are temporary references (which is reflected in the integer indices used to index them). Such references are only valid within the same RLink session in which they were defined. Technically, this is because functions returned from R will generally carry with them the R environment used to define them, and such environments are dynamic and tied to a given running R session.
Define a reference of this type.
You can test it to make sure that it works.
If you try using the same reference as before, you will get an error, since the reference has expired.
This situation is for references for R functions of types both "closure" and "builtin".
References Obtained via RFunction (Type II)
Such references are permanent, in the sense that they remain valid across several RLink and/or Mathematica sessions.
This can be illustrated via a simple example, as before.
If you look at the FullForm of the resulting reference, you can see that the reference index is now Automatic.
Such functions are unambiguously defined by their R code, and do not involve R environments other than the default global R environment. Such function references can therefore be used across several RLink (and Mathematica) sessions. On the other hand, such functions can only access the default (global) environment of R, which restricts the class of functions that can be defined in this way.
You can see that the reference still works.
As another way to say it, you can manually construct the FullForm for such a reference, as an alternative to using the short syntax RFunction[code].
It will still work, even when such a function has not been defined before in this RLink session.
Type I versus Type II References—Other Differences
It is important to realize that each call to REvaluate results in a new distinct type I reference to that function being generated.
This is so because there is no general way for RLink to tell that the function has already been defined, if it is a result of R code execution (REvaluate). In most cases, such new references will be redundant.
For functions defined with RFunction (type II), this is different, since the same reference is being produced by every call.
This happens because such references are cached. Caching is possible in this case, since functions defined via RFunction are entirely defined by their code string.
Given this distinction, code such as the following is strongly discouraged, mostly because you create as many copies of the function's reference as there are function calls.
All those references are then still stored on the R side, taking up memory. On top of that, it is slower than if you store the function's reference in a variable, or use RFunction[code] to define it from Mathematica (when possible).
Type I versus Type II References—When to Use Each Type
Given all the previous discussion, it should be clear that, whenever possible, you should try to use type II references, both because they are permanent, and because they are cached. This is possible to do in most cases, for user-defined functions.
The following defines such a reference.
It can be used instead of the following.
It is also possible in most cases for built-in functions; the following is one example.
It can be used instead of the following.
One case where this is not possible is when you create closures of one type or another. Generally, it is not possible to use type II references whenever the function in question refers to some R environment other than the default (global) one.
Sending Function References to R
Function references do not have nontrivial transformation rules between the longer internal form that RLink uses to represent R data types and the shorter form that usually is used by the end user. In other words, both forms are the same for function references.
Now, for example, create a function reference.
The three expressions following are the same.
You can send function references to R, as any other R objects that RLink understands, by assigning them to some variable in the R workspace, using RSet.
Now you can use the cube function on the R side.
If you want to, you can also get back the reference to this function, via REvaluate or RFunction.
You can use these references instead.
These references are, however, distinct from the original reference stored in the Mathematica variable cube.
Advanced Uses of Functions
Higher-Order Functions
Higher-order functions are functions that accept other functions as some of their arguments. Typical Mathematica built-in higher-order functions are, among others, Map and Apply. In R, like in Mathematica, functions are values, and can be directly passed as parameters to other functions.
RLink's RFunction construct fully supports the higher-order functions paradigm. As an example, a filtering function will now be constructed, similar to Mathematica's Select, that takes a vector and a test function, and selects only those elements of the vector that satisfy the test function.
You can now use it with some custom filtering function, which you also define with RFunction.
You could save the test function in a variable.
But this is not necessary. The following program is Mathematica's analog of the preceding, where a named-argument Function syntax is used to stress the analogy.
Closures
Closures are typically functions returned by other functions as results, such that they have access to an environment (variables, state) of an enclosing function, even after that function executes. R fully supports closures, and so does RLink.
Here is the standard example of a closure, which has already been considered: a function that will take some number and return a function (closure) that would add this number to an input argument.
You can now use it to generate a closure for a specific number.
Since closures are results of other functions' execution, their frequent use facilitates a different style of programming, which emphasizes functions as dynamic blocks used in composing other functions, on equal footing with other data types.
For example, you can create a list of such closures, if you want.
And use this, for example, as follows.
More interesting examples come from closures that access and possibly modify some mutable state. For example, you can create an iterator over some vector. The following function is one possible implementation of a generator of such iterators.
It takes a vector and produces an iterator function with that vector embedded in it. That iterator function has no parameters and a mutable state (variable i), which is changed upon every call to it. Once the end of the vector is reached, it will return R NULL on all subsequent calls. Note the special assignment operator "<<-", which is used to assign to a variable from an enclosing (parent) environment.
Here is a custom iterator over a vector (list) {5,6,7,8,9,10}.
It will return the vector's elements one by one.
Being used on the Mathematica side, it will not be efficient, since RLink function calls have an overhead that is not small. Here, however, the main concern was with illustrating that RLink supports closures as an abstraction in principle. In "Performance", this example will be considered again and it will be shown how its performance can be improved.
Error Handling
Here some common cases of errors, encountered when defining and using R functions in RLink, will be discussed.
Invalid R Code
When you try to define a function and get a reference, but provide a syntactically invalid R code, you will get an error message, and the result will be $Failed.
When you use RFunction, the error message will usually be rather informative (although it will not show an exact place where the syntax error occurs).
For REvaluate, the error message may be more cryptic.
Expired (or Otherwise Invalid) Function References
The following is an attempt to define a function's reference manually, with a very large index, to make it unlikely that a function reference with such an index has been already generated in a current RLink session.
This mimics a case of using a reference to a function defined in one of the past RLink sessions in a new RLink session, where such a reference does not correspond to any function currently defined in the R workspace.
The definition itself went through, since the reference expression by itself is inert. However, an attempt to use it will result in an error.
This can only happen for a function reference of type I. Function references of type II do not have this issue.
Reference Collisions across Multiple RLink Sessions
For references of type I, something worse than just a reference being expired may happen. It may be that in a new RLink session you do get a reference to some function that does happen to be assigned the same reference index as some reference that you stored from previous RLink sessions. In such a case, a new function will silently be used in a place where you might expect the old function to be used.
As an example, start with a fresh RLink session and define some function.
Now enter a new session and define another function.
If you now (erroneously) attempt to use testRef, here is what you get.
This is, of course, what the new function does, not the old one. Such errors can be hard to find, but they can only happen for type I references. This is another good argument in favor of using type II references whenever possible.
It is important to realize that storing type I function references and making it possible for them to extend their existence over the duration of a single RLink session (in the Mathematica code you write, which uses them) is an error, and you have to write your Mathematica code in a way to prevent this, in order to avoid problems like reference expiration or reference collisions.
This situation is somewhat similar to that of local variables generated by Module, with the difference that those are guaranteed to be unique within a single Mathematica session, while type I function references in RLink are guaranteed to be unique within a single RLink session (and you may have multiple RLink sessions within a single Mathematica session).
Wrong Number of Arguments
If you call a function reference with the wrong number of arguments, this will result in an error on the R side, although the error messages might be cryptic sometimes.
Performance
The current RLink implementation has several stages of data transfer between R and Mathematica, which result in an often very significant function call and data transfer overhead. This section will give some advice on how to minimize such an overhead, in the context of function calls.
Efficiency of Function Calls
While this syntax for function calls is convenient, it results in an overhead, which can often be unacceptably large.
This illustrates the overhead involved.
The more you can use vectorized operations and push the work to be done in R as a whole (minimizing data exchange, and, more importantly, maximizing the granularity of data exchange), the faster your code will be.
There is a constant overhead of a function call, which dominated the running time for the previous examples. Here, the number of points will be increased 1000 times, but the time it takes to compute the result is almost the same.
Sometimes, what may look like an inefficient function call is actually an overhead associated with the data transfer.
The following code is pretty slow.
But the reason for this is not the function call overhead but the overhead of transferring lists from R, which can be very significant. One way to estimate which is the dominant reason for the slowness of the code is to transfer some test data similar to what you expect as a result of calling the function of interest.
In this case, it is clear that the running time was dominated by the transfer of a large list from R to Mathematica.
Closures and Higher-Order Functions
As already explained, RLink allows you to return references to R closures and use them in Mathematica. However, often this will be slow, if used straightforwardly.
Consider again the iterator generator example, for which the code will be repeated for consistency.
This generates an iterator for a moderately sized list (vector).
If you use it, for example to iterate through each element and square it, it will be rather slow.
For comparison, here is an analog top-level Mathematica implementation.
If you now generate a similar iterator, it will be three orders of magnitude faster, and that is the top-level Mathematica code.
That means that in this form, iterL is unusable for anything except very small lists.
But you should keep in mind that if you define an iterator on the R side, it will make much more sense to also perform the iteration in R. To move the iteration process to R, you can take another route and define a higher-order function that will take a function to apply and an iterator and perform the iteration on the R side.
Now define a new instance of an iterator.
This is now two orders of magnitude faster, and that includes the time to transfer the data. The speedup is due to the fact that the iteration is done on the R side, while you have all the flexibility of Mathematica (in the sense that it looks "as if" you were calling Mathematica functions, passing in other Mathematica functions). This may still be slow in comparison with a purely imperative rigid implementation involving loops, but may be usable.
The iterator defined by iterGen is a "one-time" object, in the sense that it only allows a single full iteration.
There are a number of lessons to be learned from this example. The first one is that you create the worst possible scenario (performance-wise) when you maximize the number of lightweight function calls (for example by performing some large iteration in Mathematica), with each function call doing very little on the R side.
The second lesson is that even for constructs that are slow due to the RLink function call overhead, you can still prototype and test them in Mathematica, and then there is a natural optimization path, consisting of the gradual moving of computationally heavy parts (such as loops, iteration) to R from the initial Mathematica implementation. In other words, you can program functionally in a mixture of R and Mathematica, and you can treat Mathematica as pretty much as good a prototyping environment for R, as the R command prompt itself is, at least regarding this incremental code-development and prototyping aspect.