CUDA Programming

CUDA is a general-purpose, C-like programming language developed by NVIDIA to program graphics processing units (GPUs). CUDALink provides an easy interface to program the GPU by removing many of the required steps. Compilation, linking, data transfer, etc. are all handled by Mathematica's CUDALink. This allows the user to concentrate on writing the algorithm rather than on interface code.

This section describes how to start programming CUDA in Mathematica.

CUDAFunctionLoad: loads a CUDA function into Mathematica

CUDA programming in Mathematica.

This document describes the GPU architecture and how to write a CUDA kernel. In the end, many applications written using CUDALink are demonstrated.

Introduction

The Compute Unified Device Architecture (CUDA) was developed by NVIDIA in late 2007 as a way to make the GPU more general. While GPU programming had been around for many years, the difficulty of programming it had limited adoption. CUDALink aims at making GPU programming easy and at accelerating its adoption.

When using Mathematica, you need not worry about many of the steps. With Mathematica, you only need to write CUDA kernels. This is done by utilizing all levels of the NVIDIA architecture stack.


CUDA Architecture

CUDA's programming is based on the data parallel model. From a high-level standpoint, the problem is first partitioned onto hundreds or thousands of threads for computation. If the following is your computation:

OutputData = Table[fun[InputData[[i, j]]], {i, 10000}, {j, 10000}]

then in the CUDA programming paradigm, this computation is equivalent to:

CUDALaunch[fun, InputData, OutputData, {10000, 10000}]

where fun is the computation function. The above launches 10000×10000 threads, passes each thread its index, applies the function fun to InputData, and places the results in OutputData. CUDALink's equivalent of CUDALaunch is CUDAFunction.

The reason CUDA can launch thousands of threads lies in its hardware architecture. The following sections discuss this, along with how threads are partitioned for execution.

CUDA Grid and Blocks

Unlike the message-passing or thread-based parallel programming models, CUDA programming maps problems on a one-, two-, or three-dimensional grid. The following shows a typical two-dimensional CUDA thread configuration.

[Figure: a typical two-dimensional CUDA thread configuration]

Each grid contains multiple blocks, and each block contains multiple threads. In terms of the above image, a grid, block, and thread are as follows.

[Figure: the grid, block, and thread decomposition]

Choosing whether to have a one-, two-, or three-dimensional thread configuration is dependent on the problem. In the case of image processing, for example, you map the image onto the threads as shown in the following figure and apply a function to each pixel.

[Figure: an image mapped onto a two-dimensional thread configuration]

The one-dimensional cellular automaton can map onto a one-dimensional grid.

[Figure: a one-dimensional cellular automaton mapped onto a one-dimensional grid]

CUDA Program Cycle

The gist of CUDA programming is to copy data from the host to the device, launch many threads (typically in the thousands), wait until the GPU execution finishes (or perform CPU calculations while waiting), and, finally, copy the result from the device back to the host.

[Figure: the typical cycle of a CUDA program]

The above figure details the typical cycle of a CUDA program.

1.  Allocate memory on the GPU. GPU and CPU memory are physically separate, and the programmer must manage the allocation and the copies between them.

2.  Copy the memory from the CPU to the GPU.

3.  Choose the thread configuration: the correct block and grid dimensions for the problem.

4.  Launch the configured threads.

5.  Synchronize the CUDA threads to ensure that the device has completed all its tasks before doing further operations on the GPU memory.

6.  Once the threads have completed, memory is copied back from the GPU to the CPU.

7.  The GPU memory is freed.

When using Mathematica, you need not worry about many of the steps. With Mathematica, you only need to write CUDA kernels.

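As a rough sketch of how these steps surface in CUDALink (the kernel, variable names, and block size below are made up for illustration, not taken from this document):

Needs["CUDALink`"]

src = "__global__ void addTwo(mint * arr, mint n) {
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    if (index < n) arr[index] += 2;
}";

addTwo = CUDAFunctionLoad[src, "addTwo", {{_Integer, _, "InputOutput"}, _Integer}, 256]; (* compile and load the kernel *)
mem = CUDAMemoryLoad[Range[1000]];   (* steps 1-2: allocate GPU memory and copy the input to it *)
addTwo[mem, 1000];                   (* steps 3-5: configure, launch, and synchronize the threads *)
result = CUDAMemoryGet[mem];         (* step 6: copy the result back from the GPU *)
CUDAMemoryUnload[mem]                (* step 7: free the GPU memory *)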

Memory Hierarchy

CUDA memory is divided into different levels, each with its own advantages and limitations. The following figure depicts all types of memory available to CUDA.

[Figure: the CUDA memory hierarchy]

Global Memory

The most abundant (but slowest) memory available on the GPU. This is the memory advertised on the packaging: 128, 256, or 512 MB. All threads can access elements in global memory, although for performance reasons these accesses tend to be kept to a minimum and are subject to further constraints.

The performance constraints on global memory have been relaxed on recent hardware and will likely be relaxed even further. The general rule is that performance deteriorates if global memory is accessed more than once.

Texture Memory

Texture memory resides in the same location as global memory, but it is read-only. Texture memory does not suffer from the performance deterioration found in global memory. On the flip side, only char, int, and float element types are supported.

Constant Memory

Fast constant memory that is accessible from any thread in the grid. The memory is cached, but limited to 64 KB globally.

Shared Memory

Fast memory that is local to a specific block. On current hardware, the amount of shared memory is limited to 16 KB per block.

Local Memory

Local memory is local to each thread, but resides in global memory unless the compiler places the variables in registers. While a general performance consideration is to keep local memory use to a minimum, since the accesses are slow, local memory does not have the same access-pattern problems as global memory.

Compute Capability

The compute capability determines what operations the device is capable of. Currently, only compute capabilities 1.0, 1.1, 1.2, 1.3, and 2.0 exist, with the main differences listed below.

Compute Capability   Extra Features
1.0                  base implementation
1.1                  atomic operations
1.2                  shared atomic operations and warp vote functions
1.3                  support for double-precision operations
2.0                  double precision, L2 cache, and concurrent kernels

Information about the compute capability on the current system can be retrieved using CUDAInformation.

If you have not done so already, import CUDALink.


This displays the names of all CUDA devices on the system along with their compute capabilities.
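A minimal sketch of such a query follows; the property names used here ("Name", "Compute Capabilities") are assumptions and may differ slightly between versions.

Needs["CUDALink`"]
CUDAInformation[]                            (* full property list for every CUDA device *)
CUDAInformation[1, "Name"]                   (* name of the first device *)
CUDAInformation[1, "Compute Capabilities"]   (* its compute capability *)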

Multiple CUDA Devices

If the system hardware supports it, CUDA allows you to select which device computation is performed on. By default, the fastest is chosen, but this can be overridden by the user. Once a device is set (whether chosen automatically or by the user) the device cannot be changed in the kernel session.

Writing a CUDA Kernel

CUDA kernels are small functions that are executed in parallel by many threads; usually they correspond to the few lines found inside a serial program's For loop. The following develops a kernel that operates on each element of a vector.

First Kernel

A CUDA kernel is a small piece of code that performs a computation on each element of an input list. Your first kernel will add 2 to each element.

__global__ void addTwo_kernel(mint * arry, mint len) {
    
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    if (index >= len) return;

    arry[index] += 2;
}

__global__ is a function qualifier that instructs the compiler that the function should be run on the GPU. __global__ functions can be called from C. The other function qualifier is __device__, which denotes functions that can be called from other __global__ or __device__ functions, but cannot be called from C.

A CUDA kernel must have void output, so to get a result back you must pass in pointer arguments and overwrite them. In this case, you pass arry and overwrite its elements (you can think of arry as an input/output parameter).

This gets the index. The CUDA kernel provides the following variables that are set depending on the launch grid and block size:

  • threadIdx: the index of the current thread within its block; the thread index is between 0 and blockDim-1
  • blockIdx: the index of the current block; the block index is between 0 and gridDim-1
  • blockDim: the block size dimensions
  • gridDim: the grid size dimensions

These parameters are set by CUDA automatically, based on the kernel launch parameters (the block and grid dimensions). The higher dimensions are automatically set to 1 when launching a lower-dimension computation, so when launching a 1D grid the y and z components are set to 1.

In the most common cases, you use threadIdx.x + blockIdx.x*blockDim.x to find the global offset in a 1D grid, and the analogous expressions in x and y to find the global position in a 2D grid.

Since threads are launched in multiples of the block dimensions, the user needs to make sure not to write past the end of the buffer when the length of the input is not a multiple of the block dimension. The check if (index >= len) return; assures that.

This is the function applied to each element in the list, adding 2 to the input in this case.

Second Kernel

The second kernel implements a finite difference method (forward difference).

__device__ float f(float x) {
    return tanf(x);
}

__global__ void secondKernel(float * diff, float h, mint listSize) {
    mint index = threadIdx.x + blockIdx.x * blockDim.x;
    float f_n = f(((float) index) / h);
    float f_n1 = f((index + 1.0) / h);
    if (index < listSize) {
        diff[index] = (f_n1 - f_n) / h;
    }
}

This is a device function. __device__ functions can be called from __global__ functions, but cannot be called from Mathematica directly.

Compute the tangent of the floating-point input using tanf.

This is the kernel function. It returns its result in the list diff and takes two scalar inputs, h and listSize.

Get the global thread offset.

Evaluate the function at index/h. Since index is an integer, it is cast to a floating-point number to avoid truncation. This calls the device function f.

Evaluate the function at (index + 1)/h.

Make sure not to overwrite the output buffer.

Calculate the forward difference and store the result in diff.

The next section shows how to load this CUDA program into Mathematica.

Loading a CUDA Kernel into Mathematica

If you have not done so already, import CUDALink.


Place the code in the previous section into a string.


This loads the CUDAFunction. Pass the kernel code as the first argument, the name of the kernel to be executed (secondKernel) as the second argument, the function parameters as the third argument, and the block size as the fourth argument. The result is a CUDAFunction, which is assigned to a symbol so it can be called later.
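A sketch of what this might look like is given below; the symbol names code and secondFun are hypothetical, and the "Float" type specifications are assumptions chosen to match the float arguments of the kernel.

code = "
__device__ float f(float x) { return tanf(x); }

__global__ void secondKernel(float * diff, float h, mint listSize) {
    mint index = threadIdx.x + blockIdx.x * blockDim.x;
    float f_n = f(((float) index) / h);
    float f_n1 = f((index + 1.0) / h);
    if (index < listSize) {
        diff[index] = (f_n1 - f_n) / h;
    }
}";

secondFun = CUDAFunctionLoad[code, "secondKernel", {{"Float", _, "Output"}, "Float", _Integer}, 256]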

Behind the scenes a few things are happening:

  • CUDA availability of the system is checked.
  • The CUDA code is compiled to a binary optimized for the selected GPU.
  • The compiled code is being cached to avoid future compilation.
    • A check is performed as to whether the kernel exists in the compiled code.

Once loaded, a CUDAFunction can be used like any Mathematica function. Before executing it, a buffer is needed to store the result. This creates a buffer of size 1000.
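For instance (mem is a hypothetical name used in the sketches that follow):

mem = CUDAMemoryAllocate["Float", 1000]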

Here, you execute the function.
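Continuing the sketch, the call passes the output buffer, a value for h (arbitrary here), and the list size:

secondFun[mem, 10.0, 1000]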

This gets the first 30 elements of the output buffer.
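Continuing the sketch:

Take[CUDAMemoryGet[mem], 30]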

This plots the output buffer.

The buffer must be manually freed by the user.
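Continuing the sketch:

CUDAMemoryUnload[mem]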

C Sequential to CUDA Programming

The following details the progression of a program from the serial CPU version to a CUDA version.

The program implements a three-element moving average, replacing each element of an array with the average of itself and its two neighbors.

Generally, you start with the serial implementation. This allows you to gauge the problem and find possible avenues of parallelization. The serial implementation can also be used as the reference implementation as well as determining whether CUDA is fit for the task.

In this case, the serial implementation is as follows.

void xAverage_cpu (mint * in, mint * out, mint width, mint height, mint pitch) {
    int xIndex, yIndex, idx;
    mint left, curr, right;
    
    for (yIndex = 0; yIndex < height; yIndex++) {
        for (xIndex = 0; xIndex < width; xIndex++) {
            idx = xIndex + yIndex*pitch;
            
            left = xIndex > 0 ? in[idx - 1] : in[yIndex*pitch];
            curr = in[idx];
            right = xIndex < width - 1 ? in[idx + 1] : in[width - 1 + yIndex*pitch];
            
            out[idx] = (left + curr + right)/3;
        }
    }
}

It is a good idea to find possible areas of optimization. In this case you remove the two "if" statements, one to check if the element is the rightmost, and the other to check if it is the leftmost.

void xAverage_cpu(mint * in, mint * out, mint width, mint height, mint pitch) {
    int xIndex, yIndex, idx;
    
    for (yIndex = 0; yIndex < height; yIndex++) {
        out[yIndex*pitch] = (2*in[yIndex*pitch] + in[yIndex*pitch + 1])/3;
        
        for (xIndex = 1; xIndex < width-1; xIndex++) {
            idx = xIndex + yIndex*pitch;
            out[idx] = (in[idx-1] + in[idx] + in[idx+1])/3;
        }
        out[width - 1 + yIndex*pitch] = (in[width - 2 + yIndex*pitch] +
                                         2*in[width - 1 + yIndex*pitch])/3;
    }
}

Another optimization is to avoid re-reading adjacent elements on every iteration by carrying a running sum.

void xAverage_cpu(mint * in, mint * out, mint width, mint height, mint pitch) {
    int xIndex, yIndex, idx;
    mint accum;
    
    for (yIndex = 0; yIndex < height; yIndex++) {
        accum = 2*in[yIndex*pitch] + in[yIndex*pitch + 1];
        out[yIndex*pitch] = accum/3;
        
        accum -= in[yIndex*pitch];
        
        for (xIndex = 1; xIndex < width-1; xIndex++) {
            idx = xIndex + yIndex*pitch;
            
            accum += in[idx+1];
            out[idx] = accum/3;
            accum -= in[idx-1];
        }
        accum += in[width - 1 + yIndex*pitch];
        out[width - 1 + yIndex*pitch] = accum/3;
    }
}

Finally, write a program that is based on the most optimized CPU implementation.

__global__ void xAverage_kernel(mint * in, mint * out, mint width, mint height, mint pitch) {
    int xIndex = blockDim.x*blockIdx.x + threadIdx.x;
    int yIndex = blockDim.y*blockIdx.y + threadIdx.y;
    int index = xIndex + yIndex*pitch;
    
    if (xIndex >= width || yIndex >= height) return;

    if (xIndex == 0) {
        out[index] = (2*in[index] + in[index + 1])/3;
    } else if (xIndex < width - 1) {
        out[index] = (in[index - 1] + in[index] + in[index + 1])/3;
    } else {
        out[index] = (in[index - 1] + 2*in[index])/3;
    }
}

Notice that each input element is read from global memory three times, once by each of three neighboring threads. This can be avoided by staging the data in shared memory.

__global__ void xAverage_improved_kernel(mint * in, mint * out, mint width, mint height, mint pitch) {
    extern __shared__ mint smem[];
    
    int tx = threadIdx.x, dx = blockDim.x;
    int xIndex = dx*blockIdx.x + tx;
    int yIndex = blockDim.y*blockIdx.y + threadIdx.y;
    int index = xIndex + yIndex*pitch;
    
    if (yIndex >= height) return;
    
    smem[tx+1] = xIndex < width ? in[index] : in[yIndex*pitch + width - 1];
    
    if (tx == 0) {
        smem[0] = xIndex > 0 ? in[index-1] : in[yIndex*pitch];
        smem[dx+1] = xIndex+dx < width ? in[index+dx] : in[yIndex*pitch + width - 1];
    }
    
    __syncthreads();
    
    if (xIndex < width)
        out[index] = (smem[tx] + smem[tx+1] + smem[tx+2])/3;
}

This CUDA program can then be saved and loaded via CUDAFunctionLoad.
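A sketch of such a load for the simpler (non-shared-memory) kernel is shown below; the string kernelSource, the symbol xAverage, and the {16, 16} block size are choices made for this illustration, not prescribed by CUDALink.

kernelSource = "
__global__ void xAverage_kernel(mint * in, mint * out, mint width, mint height, mint pitch) {
    int xIndex = blockDim.x*blockIdx.x + threadIdx.x;
    int yIndex = blockDim.y*blockIdx.y + threadIdx.y;
    int index = xIndex + yIndex*pitch;
    if (xIndex >= width || yIndex >= height) return;
    if (xIndex == 0)             out[index] = (2*in[index] + in[index + 1])/3;
    else if (xIndex < width - 1) out[index] = (in[index - 1] + in[index] + in[index + 1])/3;
    else                         out[index] = (in[index - 1] + 2*in[index])/3;
}";

xAverage = CUDAFunctionLoad[kernelSource, "xAverage_kernel",
    {{_Integer, _, "Input"}, {_Integer, _, "Output"}, _Integer, _Integer, _Integer}, {16, 16}]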

Compiling for CUDA

CUDALink provides a portable compilation mechanism for CUDA code. This is done through the NVCCCompiler, which registers the NVCC compiler found on the system with the C compiler driver (so that it shows up in CCompilers) and uses it to compile DLLs and executables.

Before loading the NVCCCompiler in CUDALink, notice the available compilers on the system.

The NVCCCompiler appears in this list only after CUDALink has been loaded. If it is missing, load the CUDALink application.

Now check the compilers again.

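A sketch of this check, using the standard CCompilerDriver interface:

Needs["CCompilerDriver`"]
CCompilers[]        (* compilers registered before CUDALink is loaded *)

Needs["CUDALink`"]
CCompilers[]        (* the NVCCCompiler should now appear in this list *)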

The NVCCCompiler can now be used.

Load an example test file.


You can see the contents of the file.

Compile the code into a library.

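A sketch of such a compilation is shown below; the source file name and library name are hypothetical.

Needs["CUDALink`"]
Needs["CCompilerDriver`"]

lib = CreateLibrary[{"vectorAdd.cu"}, "vectorAdd", "Compiler" -> NVCCCompiler]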

The return value gives the location where the library was placed.

Generating PTX Files

PTX is CUDA bytecode that is able to run on different CUDA cards. The bytecode is then compiled on the fly (using a JIT mechanism).

This defines a simple CUDA kernel.


Setting "CreatePTX"->True when creating the executable will compile to PTX.


Here, you print the first 500 characters of the compiled code.

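Putting these steps together, a sketch might look like the following; the kernel, the names, and the use of Import to read back the generated file are assumptions made for this illustration.

Needs["CUDALink`"]

code = "__global__ void addTwo(mint * arr, mint n) { int i = threadIdx.x + blockIdx.x*blockDim.x; if (i < n) arr[i] += 2; }";

ptxFile = CreateExecutable[code, "addTwoPTX", "Compiler" -> NVCCCompiler, "CreatePTX" -> True];
StringTake[Import[ptxFile, "Text"], 500]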

Generating CUBIN Files

CUBIN files are CUDA binary files that are compiled for a specific CUDA architecture. CUDAFunctionLoad, for example, will compile the input code to a CUBIN file.

This defines a simple CUDA function.


Setting "CreateCUBIN"->True when creating the executable will compile to CUBIN.


Here, you print the first 100 characters of the compiled code.

Note that, unlike a PTX file, which contains a human-readable (ASCII) instruction listing, a CUBIN file is an ELF binary.

Printing Debug and Register Information

When debugging or optimizing CUDA code, it is useful to find out what the compiler output is. It is customary, for example, to find the number of registers used up by a CUDA function.

This defines a simple CUDA function.

Setting "Debug"->True instructs the compiler to generate debug information and to be verbose when generating the binary.
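A sketch of such a build follows; "ShellOutputFunction" -> Print is a general CCompilerDriver option that echoes the compiler's messages (including register counts), and the kernel and names here are made up.

Needs["CUDALink`"]

code = "__global__ void addTwo(mint * arr, mint n) { int i = threadIdx.x + blockIdx.x*blockDim.x; if (i < n) arr[i] += 2; }";

CreateLibrary[code, "addTwoDebug", "Compiler" -> NVCCCompiler,
    "Debug" -> True, "ShellOutputFunction" -> Print]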

This simple kernel uses 2 registers, as indicated in the compiler output.

Writing Kernels for CUDALink

Users familiar with CUDA programming might consider the following strategies when programming for CUDALink. The strategies help in making the CUDA code more portable, correct, easier to debug, and efficient.

Using Real Types

Because of the lack of double-precision support on most CUDA cards, a CUDA C programmer is often forced to use floats to represent floating-point numbers. With CUDALink, you can use the Real_t type to define a floating-point number that uses the maximum precision available on the CUDA card.

Therefore, users are advised to use code like that below.

__device__ Real_t f(Real_t x) {
    return tanf(x);
}

__global__ void secondKernel(Real_t * diff, Real_t h, mint listSize) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    Real_t f_n = f(((Real_t) index) / h);
    Real_t f_n1 = f((index + 1.0) / h);
    if (index < listSize) {
        diff[index] = (f_n1 - f_n) / h;
    }
}

Instead of code like this.

__device__ float f(float x) {
    return tanf(x);
}

__global__ void secondKernel(float * diff, float h, mint listSize) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    float f_n = f(((float) index) / h);
    float f_n1 = f((index + 1.0) / h);
    if (index < listSize) {
        diff[index] = (f_n1 - f_n) / h;
    }
}

This would ensure maximum compatibility and use the maximum precision available.

Optimization

The compiler can optimize some loops if the user makes the loop parameters a constant. The following is a slightly more optimized version over passing channels as a variable, since the compiler can perform more code optimization.

__global__ void colorNegate(unsigned char * img, mint width, mint height) {
    int xIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int yIndex = threadIdx.y + blockIdx.y * blockDim.y;
    int index = (xIndex + yIndex*width)*CHANNELS;
    if (xIndex < width && yIndex < height) {
#pragma unroll
        for (mint ii = 0; ii < CHANNELS; ii++)
            img[index + ii] = 255 - img[index + ii];
    }
}

Rather than hard-coding CHANNELS in the source, the user can pass it in as a preprocessor define using the "Defines" option of CUDAFunctionLoad.

CUDAFunctionLoad[...,"Defines"→{"CHANNELS"→3}]
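The same mechanism on a small, self-contained kernel might look like this; the kernel, symbol names, and block size are made up for illustration.

code = "__global__ void scale(mint * arr, mint n) { int i = threadIdx.x + blockIdx.x*blockDim.x; if (i < n) arr[i] *= FACTOR; }";

scaleFun = CUDAFunctionLoad[code, "scale", {{_Integer, _, "InputOutput"}, _Integer}, 256,
    "Defines" -> {"FACTOR" -> 3}];
scaleFun[Range[10], 10]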

Generate C Code and Load It as a Library

Once prototyped, and if code is used very often, the C code representation of the code can be exported using CUDACodeStringGenerate. The resulting C code can be compiled and loaded into CUDAFunctionLoad. Doing so will result in CUDAFunctionLoad taking a different execution path that does less error checking.

Off-Line Compilation

When passing source code into CUDAFunctionLoad, the code is compiled behind the scenes. To avoid the performance hit each time, the user can use CreateExecutable with the NVCCCompiler to compile the CUDA code into a binary and load the CUDAFunction from CUDA binary.

Porting OpenCL to CUDA

Since both CUDALink and OpenCLLink handle all the underlying bookkeeping for CUDA and OpenCL, the user need only concentrate on the kernel code. This section covers an OpenCL-to-CUDA translation. It is assumed that the user has already used OpenCLLink to develop the function.

The following summarizes some of the common changes in the kernel code needed to port OpenCL programs to CUDA.

This is a simple OpenCL function that takes input image and color negates it.


Recall that to launch the OpenCL function you perform the following.


The following is changed in order to port to CUDA.

  • The __kernel qualifier was changed to __global__.
  • The __global memory space qualifier on the image pointer was removed, since CUDA pointers to global memory need no qualifier.
  • The get_global_id(0) call was changed to threadIdx.x + blockIdx.x*blockDim.x, and the same for get_global_id(1) in the y dimension.

In many cases, the above are the only changes required for OpenCL-to-CUDA porting in Mathematica. There are a handful of other one-to-one function translations that are readily found in an OpenCL or CUDA programming guide.

This is the ported CUDA code.
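Since the original notebook cells are not reproduced here, the following hypothetical pair of kernel sources (written as Wolfram Language strings) illustrates the changes listed above; they are reconstructions, not the document's exact code.

openCLSource = "
__kernel void colorNegate(__global mint * img, mint width, mint height, mint channels) {
    int xIndex = get_global_id(0), yIndex = get_global_id(1);
    int index = channels*(xIndex + yIndex*width);
    if (xIndex < width && yIndex < height)
        for (int c = 0; c < channels; c++)
            img[index + c] = 255 - img[index + c];
}";

cudaSource = "
__global__ void colorNegate(mint * img, mint width, mint height, mint channels) {
    int xIndex = threadIdx.x + blockIdx.x*blockDim.x, yIndex = threadIdx.y + blockIdx.y*blockDim.y;
    int index = channels*(xIndex + yIndex*width);
    if (xIndex < width && yIndex < height)
        for (int c = 0; c < channels; c++)
            img[index + c] = 255 - img[index + c];
}";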

This uses the exact same sequence of Mathematica calls as the OpenCL version, but with OpenCLFunctionLoad replaced by CUDAFunctionLoad.

An alternative to the find-and-replace mechanism shown above is to generate the kernel code symbolically and manipulate it in Mathematica. The following section covers that approach.

Symbolic Code Generation

Mathematica provides SymbolicC, which permits a hierarchical view of C code as Mathematica's own language. This makes it well suited to creating, manipulating, and optimizing C code.

In conjunction with this capability, users can generate CUDA kernel code for several different targets, allowing greater portability, less platform dependency, and better code optimization.

An advantage of using SymbolicC is that you can ensure that the C code contains no syntactical errors, you can generate CUDA programs using SymbolicC and run them using CUDAFunctionLoad, and you can ease the porting of CUDA to OpenCL (or vice versa).
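As a minimal, non-CUDA illustration of the SymbolicC idea, the following builds a small C function symbolically and renders it with ToCCodeString; the function itself is made up for this example.

Needs["SymbolicC`"]

fun = CFunction["int", "addTwo", {{"int", "x"}},
    CBlock[{CReturn[COperator[Plus, {"x", 2}]]}]];
ToCCodeString[fun]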

This section uses CUDALink's symbolic code-generation capabilities to write an "RGB"-to-"HSV" and "RGB"-to-"HSB" color converter. This conversion is a fairly common operation for image and video processing.

For this tutorial, the CUDALink application must first be loaded.


Introduction

CUDALink includes symbolic tools that help in writing kernels using SymbolicC.

SymbolicCUDAFunction: symbolic representation of a CUDA function
SymbolicCUDABlockIndex: symbolic representation of a block index CUDA call
SymbolicCUDAThreadIndex: symbolic representation of a thread index CUDA call
SymbolicCUDABlockDimension: symbolic representation of a block dimension CUDA call
SymbolicCUDACalculateKernelIndex: symbolic representation of a CUDA index calculation
SymbolicCUDADeclareIndexBlock: symbolic representation of a CUDA index declaration

Symbolic representations of CUDA programs.

The following is an example kernel written symbolically.


This displays the C code generated using ToCCodeString.


Another reason for writing code symbolically is that you can rewrite the above kernel in OpenCL form by just changing the SymbolicCUDA* heads to their SymbolicOpenCL* counterparts.

Again, you display the C code using ToCCodeString.


Simple Kernel

CUDALink provides utility functions that perform common GPU kernel programming tasks as discussed above. With them, you will write your first kernel function.

First, load the CUDALink application.


Your first function will take an integer array and set all its elements to 0. You first need to define the function prototype, which accepts an input list and the list size. Use the utility functions as follows.

Since you will be building this example incrementally, and to avoid writing the above each time, you can define a helper function for this example.

Test it.


Getting the Index

Using the SymbolicCUDADeclareIndexBlock, you can find the index of the list element based on the thread index, block index, and block dimensions.


Declaring the Body

Since you can launch more threads than you actually have elements, you need to place the function body inside an "if" guard. Not doing so might make your program unpredictable.


You can now define the body of the kernel. The body is trivial, setting each element in the list to 0.


Testing the First Kernel

Import the CUDALink package, if you have not already done so.


This defines the inputs.


This generates the source code.


This loads the function.


This runs the function.


Advanced Kernels

In this section you are shown how to write an RGB-to-HSV and RGB-to-HSB converter. These are common conversions in image and video processing.

Helper Functions

SymbolicC only includes C constructs. To make your code more Mathematica-like, define the following.


This rewrites your expression using the helper definitions.

This converts the above expression into C code.


Use this to reimplement the code found in FileNameJoin[{$CUDALinkPath, "SupportFiles", "colorConvert.cu"}].

Color Convert

Break up the problem into its components. First, define the RGB-to-hue function. This function is common between RGB-to-HSV and RGB-to-HSB.

The following writes the RGB-to-hue conversion function.

This transforms the expression into C.

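For orientation, the hue computation being built here corresponds to the standard formula; written directly in the Wolfram Language (with r, g, b assumed to lie in the unit interval and the result in degrees), it looks roughly like this.

hue[r_, g_, b_] := Module[{max = Max[r, g, b], min = Min[r, g, b], d},
    d = max - min;
    Which[
        d == 0,   0.,
        max == r, 60. Mod[(g - b)/d, 6],
        max == g, 60. ((b - r)/d + 2),
        True,     60. ((r - g)/d + 4)
    ]]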

This writes the saturation in case of HSB.


This defines the formula to find the brightness.


The same is done for HSV.

Notice that you need to define what the min and max of the color channels are.

Putting everything together, the RGB-to-HSB conversion is defined as follows.

Similarly, the RGB-to-HSV conversion is defined as follows.

To generate the kernel function, you use Mathematica's functional capabilities. This defines a helper function.


Using the above function, you can define the RGB-to-HSB operation to use single-precision arithmetic.


This generates the code for double precision.


The same can be done for the RGB-to-HSV operation.


Running the Program

The above code can be run using CUDAFunctionLoad. First, generate the source in single precision.


Before running the code on the GPU, you need to define some parameters that are required. This defines the input and output image. Notice that the output is of type Real, since HSV and HSB are real valued.


This loads the function into Mathematica.


This runs the function.


This displays the result.
