{% include JB/setup %}

R Interpreter for Apache Zeppelin

Overview

R is a free software environment for statistical computing and graphics.

To run R code and visualize plots in Apache Zeppelin, you will need R on your zeppelin server node (or your dev laptop).

For Centos: yum install R R-devel libcurl-devel openssl-devel
For Ubuntu: apt-get install r-base

Validate your installation with a simple R command:

R -e "print(1+1)"

To enjoy plots, install additional libraries with:

devtools with

bash

R -e "install.packages('devtools', repos = 'http://cran.us.r-project.org')"

knitr with

bash

R -e "install.packages('knitr', repos = 'http://cran.us.r-project.org')"

ggplot2 with

bash

R -e "install.packages('ggplot2', repos = 'http://cran.us.r-project.org')"

Other visualization libraries:

bash

R -e "install.packages(c('devtools','mplot', 'googleVis'), repos = 'http://cran.us.r-project.org'); 
require(devtools); install_github('ramnathv/rCharts')"

We recommend you to also install the following optional R libraries for happy data analytics:

glmnet
pROC
data.table
caret
sqldf
wordcloud

Supported Interpreters

Zeppelin supports R language in 3 interpreters

<table class="table-configuration"> <tr> <th>Name</th> <th>Class</th> <th>Description</th> </tr> <tr> <td>%r.r</td> <td>RInterpreter</td> <td>Vanilla r interpreter, with least dependencies, only R environment and knitr are required. It is always recommended to use the fully qualified interpreter name <code>%r.r</code>, because <code>%r</code> is ambiguous, it could mean <code>%spark.r</code> when current note's default interpreter is <code>%spark</code> and <code>%r.r</code> when the default interpreter is <code>%r</code></td> </tr> <tr> <td>%r.ir</td> <td>IRInterpreter</td> <td>Provide more fancy R runtime via [IRKernel](https://github.com/IRkernel/IRkernel), almost the same experience like using R in Jupyter. It requires more things, but is the recommended interpreter for using R in Zeppelin.</td> </tr> <tr> <td>%r.shiny</td> <td>ShinyInterpreter</td> <td>Run Shiny app in Zeppelin</td> </tr> </table>

If you want to use R with Spark, it is almost the same via %spark.r, %spark.ir & %spark.shiny . You can refer Spark interpreter docs for more details.

Configuration

<table class="table-configuration"> <tr> <th>Property</th> <th>Default</th> <th>Description</th> </tr> <tr> <td>zeppelin.R.cmd</td> <td>R</td> <td>Path of the installed R binary. You should set this property explicitly if R is not in your <code>$PATH</code>(example: /usr/bin/R). </td> </tr> <tr> <td>zeppelin.R.knitr</td> <td>true</td> <td>Whether to use knitr or not. It is recommended to install [knitr](https://yihui.org/knitr/)</td> </tr> <tr> <td>zeppelin.R.image.width</td> <td>100%</td> <td>Image width of R plotting</td> </tr> <tr> <td>zeppelin.R.shiny.iframe_width</td> <td>100%</td> <td>IFrame width of Shiny App</td> </tr> <tr> <td>zeppelin.R.shiny.iframe_height</td> <td>500px</td> <td>IFrame height of Shiny App</td> </tr> <tr> <td>zeppelin.R.shiny.portRange</td> <td>:</td> <td>Shiny app would launch a web app at some port, this property is to specify the portRange via format 'start':'end', e.g. '5000:5001'. By default it is ':' which means any port.</td> </tr> <tr> <td>zeppelin.R.maxResult</td> <td>1000</td> <td>Max number of dataframe rows to display when using z.show</td> </tr> </table>

Play R in Zeppelin docker

For beginner, we would suggest you to play R in Zeppelin docker first. In the Zeppelin docker image, we have already installed R and lots of useful R libraries including IRKernel's prerequisites, so %r.ir is available.

Without any extra configuration, you can run most of tutorial notes under folder R Tutorial directly.

docker run -u $(id -u) -p 8080:8080 -p:6789:6789 --rm --name zeppelin apache/zeppelin:0.10.0

After running the above command, you can open http://localhost:8080 to play R in Zeppelin. The port 6789 exposed in the above command is for R shiny app. You need to make the following 2 interpreter properties to enable shiny app accessible as iframe in Zeppelin docker container.

zeppelin.R.shiny.portRange to be 6789:6789
Set ZEPPELIN_LOCAL_IP to be 0.0.0.0

Interpreter binding mode

The default interpreter binding mode is globally shared. That means all notes share the same R interpreter. So we would recommend you to ues isolated per note which means each note has own R interpreter without affecting each other. But it may run out of your machine resource if too many R interpreters are created. You can run R in yarn mode to avoid this problem.

How to use R Interpreter

There are two different implementations of R interpreters: %r.r and %r.ir.

Vanilla R Interpreter(%r.r) behaves like an ordinary REPL and use SparkR to communicate between R process and JVM process. It requires knitr to be installed.
IRKernel R Interpreter(%r.ir) behaves like using IRKernel in Jupyter notebook. It is based on jupyter interpreter. Besides jupyter interpreter's prerequisites, IRkernel needs to be installed as well.

Take a look at the tutorial note R Tutorial/1. R Basics for how to write R code in Zeppelin.

R basic expressions

R basic expressions are supported in both %r.r and %r.ir.

R base plotting

R base plotting is supported in both %r.r and %r.ir.

Other plotting

Besides R base plotting, you can use other visualization libraries in both %r.r and %r.ir, e.g. ggplot and googleVis

z.show

z.show() is only available in %r.ir to visualize R dataframe, e.g.

By default, z.show would only display 1000 rows, you can specify the maxRows via z.show(df, maxRows=2000)

Make Shiny App in Zeppelin

Shiny is an R package that makes it easy to build interactive web applications (apps) straight from R. %r.shiny is used for developing R shiny app in Zeppelin notebook. It only works when IRKernel Interpreter(%r.ir) is enabled. For developing one Shiny App in Zeppelin, you need to write at least 3 paragraphs (server type paragraph, ui type paragraph and run type paragraph)

Server type R shiny paragraph


%r.shiny(type=server)

# Define server logic to summarize and view selected dataset ----
server <- function(input, output) {

    # Return the requested dataset ----
    datasetInput <- reactive({
        switch(input$dataset,
        "rock" = rock,
        "pressure" = pressure,
        "cars" = cars)
    })

    # Generate a summary of the dataset ----
    output$summary <- renderPrint({
        dataset <- datasetInput()
        summary(dataset)
    })

    # Show the first "n" observations ----
    output$view <- renderTable({
        head(datasetInput(), n = input$obs)
    })
}

UI type R shiny paragraph

%r.shiny(type=ui)

# Define UI for dataset viewer app ----
ui <- fluidPage(

    # App title ----
    titlePanel("Shiny Text"),
    
    # Sidebar layout with a input and output definitions ----
    sidebarLayout(

        # Sidebar panel for inputs ----
        sidebarPanel(
        
        # Input: Selector for choosing dataset ----
        selectInput(inputId = "dataset",
        label = "Choose a dataset:",
        choices = c("rock", "pressure", "cars")),
        
        # Input: Numeric entry for number of obs to view ----
        numericInput(inputId = "obs",
        label = "Number of observations to view:",
        value = 10)
        ),

        # Main panel for displaying outputs ----
        mainPanel(
        
        # Output: Verbatim text for data summary ----
        verbatimTextOutput("summary"),
        
        # Output: HTML table with requested number of observations ----
        tableOutput("view")
        
        )
    )
)

Run type R shiny paragraph


%r.shiny(type=run)

After executing the run type R shiny paragraph, the shiny app will be launched and embedded as iframe in paragraph. Take a look at the tutorial note R Tutorial/2. Shiny App for how to develop R shiny app.

Run multiple shiny apps

If you want to run multiple shiny apps, you can specify app in paragraph local property to differentiate different shiny apps.

e.g.

%r.shiny(type=ui, app=app_1)

%r.shiny(type=server, app=app_1)

%r.shiny(type=run, app=app_1)

Run R in yarn cluster

Zeppelin support to run interpreter in yarn cluster. But there's one critical problem to run R in yarn cluster: how to manage the R environment in yarn container. Because yarn cluster is a distributed cluster which is composed of many nodes, and your R interpreter can start in any node. It is not practical to manage R environment in each node.

So in order to run R in yarn cluster, we would suggest you to use conda to manage your R environment, and Zeppelin can ship your R conda environment to yarn container, so that each R interpreter can have its own R environment without affecting each other.

To be noticed, you can only run IRKernel interpreter(%r.ir) in yarn cluster. So make sure you include at least the following prerequisites in the below conda env:

python
jupyter
grpcio
protobuf
r-base
r-essentials
r-irkernel

python, jupyter, grpcio and protobuf are required for jupyter interpreter, because IRKernel interpreter is based on jupyter interpreter. Others are for R runtime.

Following are instructions of how to run R in yarn cluster. You can find all the code in the tutorial note R Tutorial/3. R Conda Env in Yarn Mode.

Step 1

We would suggest you to use conda pack to create archive of conda environment.

Here's one example of yaml file which is used to generate a conda environment with R and some useful R libraries.

Create a yaml file for conda environment, write the following content into file r_env.yml

text

name: r_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9 
  - jupyter
  - grpcio
  - protobuf
  - r-base=3
  - r-essentials
  - r-evaluate
  - r-base64enc
  - r-knitr
  - r-ggplot2
  - r-irkernel
  - r-shiny
  - r-googlevis

Create conda environment via this yaml file using either conda or mamba

bash


conda env create -f r_env.yml

bash


mamba env create -f r_env.yml

Pack the conda environment using conda

bash


conda pack -n r_env

Step 2

Specify the following properties to enable yarn mode for R interpreter via inline configuration

%r.conf

zeppelin.interpreter.launcher yarn
zeppelin.yarn.dist.archives hdfs:///tmp/r_env.tar.gz#environment
zeppelin.interpreter.conda.env.name environment

zeppelin.yarn.dist.archives is the R conda environment tar file which is created in step 1. This tar will be shipped to yarn container and untar in the working directory of yarn container. hdfs:///tmp/r_env.tar.gz is the R conda archive file you created in step 2. environment in hdfs:///tmp/r_env.tar.gz#environment is the folder name after untar. This folder name should be the same as zeppelin.interpreter.conda.env.name.

Step 3

Now you can use run R interpreter in yarn container and also use any R libraries you specify in step 1.