This tutorial will explain how to define and use R functions in RLink, including some more advanced forms of them, such as closures and higher-order functions. To proceed, you first need to load RLink.
R functions in RLink are represented by expressions with the head RFunction, which are opaque references to functions defined in the R workspace. Such references can be stored in variables or used directly, to call R functions on Mathematica expressions as arguments and return the result back to Mathematica.
To call any R function via RLink, passing to it Mathematica expressions as arguments, you first need to obtain a reference to that function. In cases where the function has not yet been defined in the R workspace (e.g. anonymous functions), the function is first defined and stored in the R workspace, as a part of the procedure of obtaining a Mathematica reference to it. This is done automatically, and should be of no concern to the user.
There are several ways to generate such a reference. One way to do it is to call REvaluate on a string of code that evaluates to a function in R.
However, for function references returned by REvaluate, such uses are discouraged, for reasons explained in the sections that follow (basically because a new copy of the reference is produced at each such invocation).
Obtaining a function reference via REvaluate will also work for user-defined functions, named or anonymous. For example, you can define a function that will subtract 1 from its argument.
Another way to obtain a function reference is to use RFunction[code].
As opposed to the case where the reference is produced by REvaluate, in this case there are no issues with such a direct use, because references returned by RFunction are somewhat different internally from references returned by REvaluate, as will be explained.
This method of obtaining function references is the recommended one, whenever it is possible to do so (in particular, for user-defined R functions that you define through their implementation—code string—in Mathematica). The reasons for this will be explained, and have to do with function references' lifetime. You cannot obtain the references via RFunction for all functions (closures are one notable exception), but for many you can (basically, you can do this for any R function that either has a name or can be fully defined by its code string, but does not depend on any R environment except the default [global] one).
A somewhat more esoteric way of generating a function reference is in the case of closures, which are functions returned as a result of calls to other functions. Closures will be covered systematically later, but here is one example: a function that generates another function that would add a fixed number to its argument.
Just like with references obtained via REvaluate (and unlike those obtained via RFunction), direct uses such as the following are allowed but not recommended (direct uses of those references obtained via RFunction are both possible and recommended), for the same reason that new copies of the same function reference will be produced by each such call.
It is better to store a reference in a variable in such cases, like in the previous example with add10. (While the generator function addNum was obtained via RFunction, the reference stored in add10 was produced via a function call to addNum—effectively, it was produced by a route similar to REvaluate.)
The most important distinction between function references used in RLink is directly related to the way they were obtained. One class of function references includes those returned by REvaluate or as a result of R function calls (closures)—they will be called type I references here. Another class of references includes those obtained through RFunction, which will be called type II references. The references belonging to these classes have several differences, such as lifetime and caching.
You can inspect the FullForm of it.
The first element is a string, which can be either "builtin" or "closure". This corresponds to the two function types in R with the same names. R has a (comparatively) small number of built-in functions that are called primitive and are not themselves implemented in R; they have a type "builtin". This can be confirmed simply by evaluating.
All other functions have a type "closure" (there is a confusing clash of terminology here, since the R's "closure" type covers a wider class of functions than the more traditional definition of closure as being a result of another function's execution, having access to that function's environment. In particular, R's "closure" type also includes top-level R functions as well). This includes both built-in R functions that are implemented in the top-level R, and user-defined R functions. Again, this is easy to check for a given function.
The second element is a string of code defining that function, wrapped in an RCode wrapper. The code in RCode is there mostly for illustrative purposes, because it is not always enough to fully reconstruct the function (functions may depend on some particular R environment; they may also be byte-compiled, etc.—all such information is lost in this representation).
The third element is a reference index, used by RLink internally to relate the reference to the actual function defined in the R workspace. It can be a positive integer number (for type I references), or Automatic (for type II references).
The last element is an RAttributes container, which may be empty or not, depending on whether or not the corresponding R function has a nontrivial set of attributes.
The same is true for all named functions: type I references will carry at least some details of their internal implementation, while type II references will only carry their names. Of course, there are other ways to access the R functions implementation details that are available in R.
Function references of type I have a finite lifetime limited to a current RLink session. References of type II are permanent. This has a number of consequences that will be described here and in the sections that follow.
References Originated from Explicit R Code Execution (Type I)
All type I references are temporary references (which is reflected in the integer indices used to index them). Such references are only valid within the same RLink session in which they were defined. Technically, this is because functions returned from R will generally carry with them the R environment used to define them, and such environments are dynamic and tied to a given running R session.
References Obtained via RFunction (Type II)
Such functions are unambiguously defined by their R code, and do not involve R environments other than the default global R environment. Such function references can therefore be used across several RLink (and Mathematica) sessions. On the other hand, such functions can only access the default (global) environment of R, which restricts the class of functions that can be defined in this way.
It is important to realize that each call to REvaluate results in a new distinct type I reference to that function being generated.
This is so because there is no general way for RLink to tell that the function has already been defined, if it is a result of R code execution (REvaluate). In most cases, such new references will be redundant.
For functions defined with RFunction (type II), this is different, since the same reference is being produced by every call.
This happens because such references are cached. Caching is possible in this case, since functions defined via RFunction are entirely defined by their code string.
All those references are then still stored on the R side, taking up memory. On top of that, it is slower than if you store the function's reference in a variable, or use RFunction[code] to define it from Mathematica (when possible).
Given all the previous discussion, it should be clear that, whenever possible, you should try to use type II references, both because they are permanent, and because they are cached. This is possible to do in most cases, for user-defined functions.
One case where this is not possible is when you create closures of one type or another. Generally, it is not possible to use type II references whenever the function in question refers to some R environment other than the default (global) one.
Function references do not have nontrivial transformation rules between the longer internal form that RLink uses to represent R data types and the shorter form that usually is used by the end user. In other words, both forms are the same for function references.
You can send function references to R, as any other R objects that RLink understands, by assigning them to some variable in the R workspace, using RSet.
Higher-order functions are functions that accept other functions as some of their arguments. Typical Mathematica built-in higher-order functions are, among others, Map and Apply. In R, like in Mathematica, functions are values, and can be directly passed as parameters to other functions.
RLink's RFunction construct fully supports the higher-order functions paradigm. As an example, a filtering function will now be constructed, similar to Mathematica's Select, that takes a vector and a test function, and selects only those elements of the vector that satisfy the test function.
You can now use it with some custom filtering function, which you also define with RFunction.
But this is not necessary. The following program is Mathematica's analog of the preceding, where a named-argument Function syntax is used to stress the analogy.
Closures are typically functions returned by other functions as results, such that they have access to an environment (variables, state) of an enclosing function, even after that function executes. R fully supports closures, and so does RLink.
Since closures are results of other functions' execution, their frequent use facilitates a different style of programming, which emphasizes functions as dynamic blocks used in composing other functions, on equal footing with other data types.
More interesting examples come from closures that access and possibly modify some mutable state. For example, you can create an iterator over some vector. The following function is one possible implementation of a generator of such iterators.
It takes a vector and produces an iterator function with that vector embedded in it. That iterator function has no parameters and a mutable state (variable i), which is changed upon every call to it. Once the end of the vector is reached, it will return R NULL on all subsequent calls. Note the special assignment operator "<<-", which is used to assign to a variable from an enclosing (parent) environment.
Being used on the Mathematica side, it will not be efficient, since RLink function calls have an overhead that is not small. Here, however, the main concern was with illustrating that RLink supports closures as an abstraction in principle. In "Performance", this example will be considered again and it will be shown how its performance can be improved.
When you try to define a function and get a reference, but provide a syntactically invalid R code, you will get an error message, and the result will be $Failed.
When you use RFunction, the error message will usually be rather informative (although it will not show an exact place where the syntax error occurs).
For REvaluate, the error message may be more cryptic.
The following is an attempt to define a function's reference manually, with a very large index, to make it unlikely that a function reference with such an index has been already generated in a current RLink session.
This mimics a case of using a reference to a function defined in one of the past RLink sessions in a new RLink session, where such a reference does not correspond to any function currently defined in the R workspace.
For references of type I, something worse than just a reference being expired may happen. It may be that in a new RLink session you do get a reference to some function that does happen to be assigned the same reference index as some reference that you stored from previous RLink sessions. In such a case, a new function will silently be used in a place where you might expect the old function to be used.
This is, of course, what the new function does, not the old one. Such errors can be hard to find, but they can only happen for type I references. This is another good argument in favor of using type II references whenever possible.
It is important to realize that storing type I function references and making it possible for them to extend their existence over the duration of a single RLink session (in the Mathematica code you write, which uses them) is an error, and you have to write your Mathematica code in a way to prevent this, in order to avoid problems like reference expiration or reference collisions.
This situation is somewhat similar to that of local variables generated by Module, with the difference that those are guaranteed to be unique within a single Mathematica session, while type I function references in RLink are guaranteed to be unique within a single RLink session (and you may have multiple RLink sessions within a single Mathematica session).
The current RLink implementation has several stages of data transfer between R and Mathematica, which result in an often very significant function call and data transfer overhead. This section will give some advice on how to minimize such an overhead, in the context of function calls.
The more you can use vectorized operations and push the work to be done in R as a whole (minimizing data exchange, and, more importantly, maximizing the granularity of data exchange), the faster your code will be.
There is a constant overhead of a function call, which dominated the running time for the previous examples. Here, the number of points will be increased 1000 times, but the time it takes to compute the result is almost the same.
But the reason for this is not the function call overhead but the overhead of transferring lists from R, which can be very significant. One way to estimate which is the dominant reason for the slowness of the code is to transfer some test data similar to what you expect as a result of calling the function of interest.
As already explained, RLink allows you to return references to R closures and use them in Mathematica. However, often this will be slow, if used straightforwardly. Consider again the iterator generator example, for which the code will be repeated for consistency.
But you should keep in mind that if you define an iterator on the R side, it will make much more sense to also perform the iteration in R. To move the iteration process to R, you can take another route and define a higher-order function that will take a function to apply and an iterator and perform the iteration on the R side.
This is now two orders of magnitude faster, and that includes the time to transfer the data. The speedup is due to the fact that the iteration is done on the R side, while you have all the flexibility of Mathematica (in the sense that it looks "as if" you were calling Mathematica functions, passing in other Mathematica functions). This may still be slow in comparison with a purely imperative rigid implementation involving loops, but may be usable.
There are a number of lessons to be learned from this example. The first one is that you create the worst possible scenario (performance-wise) when you maximize the number of lightweight function calls (for example by performing some large iteration in Mathematica), with each function call doing very little on the R side.
The second lesson is that even for constructs that are slow due to the RLink function call overhead, you can still prototype and test them in Mathematica, and then there is a natural optimization path, consisting of the gradual moving of computationally heavy parts (such as loops, iteration) to R from the initial Mathematica implementation. In other words, you can program functionally in a mixture of R and Mathematica, and you can treat Mathematica as pretty much as good a prototyping environment for R, as the R command prompt itself is, at least in this incremental code-development and prototyping aspect.