.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. SCOPE OF THIS SECTION
.. Concise tutorial on making a PR for a simple feature.

.. _r-tutorial:


R tutorials


R tutorial on adding a lubridate binding

In this tutorial, we will document the contribution of a binding to the Arrow R package, following the steps specified in the :ref:`quick-ref-guide` section of the guide and the more detailed :ref:`step_by_step` section. Refer to those sections whenever you find that some information is missing here.

The binding will be added to the expression.R file in the R package, but you can follow the same steps if you are adding a binding that lives somewhere else.

This tutorial differs from the :ref:`step_by_step` section in that we work through one specific case. It is not meant as a general step-by-step guide.

Let's start!

Set up

Let's set up the Arrow repository. We presume here that Git is already installed. Otherwise please see the :ref:`set-up` section.

Once the `Apache Arrow repository <https://github.com/apache/arrow>`_ is forked (see :ref:`fork-repo-guide`), we will clone the fork and add the main repository as our upstream remote.

.. code:: console

   $ git clone https://github.com/<your username>/arrow.git
   $ cd arrow
   $ git remote add upstream https://github.com/apache/arrow

Building R package

The steps to follow for building the R package differ depending on the operating system you are using. For this reason, this tutorial only refers to the instructions for the building process.

.. seealso::

   For the introduction to the building process refer to the :ref:`build-arrow-guide` section.

   For the instructions on how to build the R package refer to the `R developer docs <https://arrow.apache.org/docs/r/articles/developing.html>`_.

The issue

In this tutorial we will tackle an issue for implementing a simple binding for the mday() function that matches the behavior of the existing R function from lubridate.

.. note::

   If you do not have an issue and you need help finding one, please refer to the :ref:`finding-issues` part of the guide.

Once you have an issue picked out and assigned to yourself, you can proceed to the next step.

Start the work on a new branch

Before we start working on adding the binding we should create a new branch from the updated main.

.. code:: console

   $ git checkout main
   $ git fetch upstream
   $ git pull --ff-only upstream main
   $ git checkout -b ARROW-14816

Now we can start with researching the R function and the C++ Arrow compute function we want to expose or connect to.

Examine the lubridate mday() function

Going through the `lubridate documentation <https://lubridate.tidyverse.org/reference/day.html>`_ we can see that mday() takes a date object and returns the day of the month as a numeric object.

We can run some examples in the R console to help us understand the function better:

.. code-block:: R

   > library(lubridate)
   > mday(as.Date("2000-12-31"))
   [1] 31
   > mday(ymd(080306))
   [1] 6
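
As a further illustration, the sketch below shows that mday() is vectorised and that it also accepts date-times. It assumes the lubridate package is installed. The date values are arbitrary examples chosen for this sketch, and the base R comparison through as.POSIXlt() is added here for orientation only.

.. code-block:: R

   library(lubridate)

   # Arbitrary example dates (an assumption of this sketch), including a leap day.
   dates <- as.Date(c("2000-12-31", "2008-03-06", "2024-02-29"))

   # mday() is vectorised: it returns one day-of-month value per element.
   days <- mday(dates)

   # It also accepts date-times, which is why a binding should be tested
   # against both date and timestamp columns.
   dt <- as.POSIXct("2000-12-31 23:59:59", tz = "UTC")
   day_from_datetime <- mday(dt)

   # Base R exposes the same component through POSIXlt.
   base_days <- as.POSIXlt(dates)$mday

   print(days)
   print(day_from_datetime)
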

Examine the Arrow C++ day() function

From the `compute function documentation <https://arrow.apache.org/docs/cpp/compute.html#containment-tests>`_ we can see that day is a unary function, which means that it takes a single data input. The data input must be a Temporal class and the returned value is an Integer/numeric type.

The Temporal class is specified as: Date types (Date32, Date64), Time types (Time32, Time64), Timestamp, Duration, Interval.

We can call an Arrow C++ function from an R console using call_function to see how it works:

.. code-block:: R

   > call_function("day", Scalar$create(lubridate::ymd("2000-12-31")))
   Scalar
   31

We can see that lubridate and Arrow functions operate on and return equivalent data types. lubridate's mday() function has no additional arguments and there are also no option classes associated with Arrow C++ function day().
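
To check this equivalence on more than one value, a minimal sketch can call the compute function on an Arrow Array and compare the result with lubridate. It assumes the arrow and lubridate packages are installed; the dates are arbitrary examples and the variable names are illustrative.

.. code-block:: R

   library(arrow)
   library(lubridate)

   # Arbitrary example dates (an assumption of this sketch).
   dates <- as.Date(c("2000-12-31", "2008-03-06", "2024-02-29"))

   # Call the C++ compute function directly on an Arrow Array of dates.
   arrow_days <- call_function("day", Array$create(dates))

   # Convert the Arrow result back to an R vector and compare it,
   # element by element, with what lubridate returns.
   same_result <- all(as.vector(arrow_days) == mday(dates))
   print(same_result)
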

Looking at the code in expression.R we can see that the day function is already specified/mapped on the R package side: `<https://github.com/apache/arrow/blob/658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11/r/R/expression.R#L63-L64>`_

We only need to add mday() to the list of expressions connecting it to the C++ day function.

.. code-block:: R

   # second is defined in dplyr-functions.R
   # wday is defined in dplyr-functions.R
   "mday" = "day",
   "yday" = "day_of_year",
   "year" = "year",

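Conceptually, the list above acts as a lookup table from R function names to Arrow compute function names. The sketch below illustrates that idea in plain R. The names ``toy_mappings`` and ``lookup_compute_function()`` are hypothetical and exist only for this example; they are not part of the arrow package, whose real lookup logic may differ.

.. code-block:: R

   # Hypothetical, trimmed-down name mapping (not the package's real object).
   toy_mappings <- c(
     "mday" = "day",
     "yday" = "day_of_year",
     "year" = "year"
   )

   # Hypothetical helper: return the compute function name registered for an
   # R function name, or fail with an informative error if there is none.
   lookup_compute_function <- function(r_name, mappings = toy_mappings) {
     if (!r_name %in% names(mappings)) {
       stop("No Arrow binding registered for ", r_name, "()", call. = FALSE)
     }
     unname(mappings[[r_name]])
   }

   mday_target <- lookup_compute_function("mday")
   print(mday_target)

Because mday() needs no extra arguments or options, a single entry in such a table is enough to connect it to day.
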
Adding a test

Now we need to add a test that checks that everything works as expected. If there were additional options or edge cases, we would have to add more tests. Looking at the tests for similar functions (for example yday() or day()), we can see that test-dplyr-funcs-datetime.R is a good place to add our two tests:

.. code-block:: R

   test_that("extract mday from timestamp", {
     compare_dplyr_binding(
       .input %>%
         mutate(x = mday(datetime)) %>%
         collect(),
       test_df
     )
   })

And

.. code-block:: R

   test_that("extract mday from date", {
     compare_dplyr_binding(
       .input %>%
         mutate(x = mday(date)) %>%
         collect(),
       test_df
     )
   })
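
The idea behind these tests is that the same dplyr expression should give the same answer on a regular R data frame and on Arrow data. The sketch below spells out that comparison by hand. It is only an approximation of what the package's compare_dplyr_binding() test helper does, and it assumes that the arrow, dplyr and lubridate packages are installed and that the installed arrow version already includes the mday() binding. The data frame ``df`` is made up for this example; the real tests use the shared ``test_df`` fixture.

.. code-block:: R

   library(arrow)
   library(dplyr)
   library(lubridate)

   # Made-up input data for this sketch.
   df <- data.frame(date = as.Date(c("2000-12-31", "2008-03-06")))

   # Expected result: dplyr and lubridate evaluated in R.
   expected <- df %>%
     mutate(x = mday(date))

   # Actual result: the same expression evaluated by Arrow, then collected
   # back into R.
   actual <- arrow_table(df) %>%
     mutate(x = mday(date)) %>%
     collect()

   # Both results should contain the same values.
   results_match <- all.equal(as.data.frame(actual), as.data.frame(expected))
   print(results_match)
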

Now we need to check whether the tests pass, or whether we need to do some more research and correct the code.

.. code-block:: R

   > devtools::test(filter="datetime")
   ℹ Loading arrow
   See arrow_info() for available features
   ℹ Testing arrow
   See arrow_info() for available features
   ✔ | F W S  OK | Context
   ✖ | 1     230 | dplyr-funcs-datetime [1.4s]
   ────────────────────────────────────────────────────────────────────────
   Failure (test-dplyr-funcs-datetime.R:187:3): strftime
   ``%>%`(...)` did not throw the expected error.
   Backtrace:
    1. testthat::expect_error(...) test-dplyr-funcs-datetime.R:187:2
    2. testthat:::expect_condition_matching(...)
   ────────────────────────────────────────────────────────────────────────

   ══ Results ═════════════════════════════════════════════════════════════
   Duration: 1.4 s

   [ FAIL 1 | WARN 0 | SKIP 0 | PASS 230 ]

We get one failure, for the strftime function, but looking at the code we can see that it is not connected to our work. We can move on and perhaps ask others whether they get a similar failure when running the tests. It could be that we only need to rebuild the library.

Check styling

We should also run linters to check that the styling of the code follows the `tidyverse style <https://style.tidyverse.org/>`_. To do that we run the following command in the shell:

.. code:: console

   $ make style
   R -s -e 'setwd(".."); if (requireNamespace("styler")) styler::style_file(setdiff(system("git diff --name-only | grep r/.*R$", intern = TRUE), file.path("r", source("r/.styler_excludes.R")$value)))'
   Loading required namespace: styler
   Styling  2  files:
    r/R/expression.R                             ✔
    r/tests/testthat/test-dplyr-funcs-datetime.R ℹ
   ────────────────────────────────────────────
   Status   Count   Legend
   ✔        1       File unchanged.
   ℹ        1       File changed.
   ✖        0       Styling threw an error.
   ────────────────────────────────────────────
   Please review the changes carefully!

Creating a Pull Request

First let’s review our changes in the shell using git status to see which files have been changed and to commit only the ones we are working on.

.. code:: console

   $ git status
   On branch ARROW-14816
   Changes not staged for commit:
     (use "git add <file>..." to update what will be committed)
     (use "git restore <file>..." to discard changes in working directory)
      modified:   R/expression.R
      modified:   tests/testthat/test-dplyr-funcs-datetime.R

We can also use git diff to review the changes in the files and spot any errors we might have made.

.. code:: console

   $ git diff
   diff --git a/r/R/expression.R b/r/R/expression.R
   index 37fc21c25..0e71803ec 100644
   --- a/r/R/expression.R
   +++ b/r/R/expression.R
   @@ -70,6 +70,7 @@
      "quarter" = "quarter",
      # second is defined in dplyr-functions.R
      # wday is defined in dplyr-functions.R
   +  "mday" = "day",
      "yday" = "day_of_year",
      "year" = "year",

   diff --git a/r/tests/testthat/test-dplyr-funcs-datetime.R b/r/tests/testthat/test-dplyr-funcs-datetime.R
   index 359a5403a..228eca56a 100644
   --- a/r/tests/testthat/test-dplyr-funcs-datetime.R
   +++ b/r/tests/testthat/test-dplyr-funcs-datetime.R
   @@ -444,6 +444,15 @@ test_that("extract wday from timestamp", {
      )
    })

   +test_that("extract mday from timestamp", {
   +  compare_dplyr_binding(
   +    .input %>%
   +      mutate(x = mday(datetime)) %>%
   +      collect(),
   +    test_df
   +  )
   +})
   +
    test_that("extract yday from timestamp", {
      compare_dplyr_binding(
        .input %>%

   @@ -626,6 +635,15 @@ test_that("extract wday from date", {
      )
    })

   +test_that("extract mday from date", {
   +  compare_dplyr_binding(
   +    .input %>%
   +      mutate(x = mday(date)) %>%
   +      collect(),
   +    test_df
   +  )
   +})
   +
    test_that("extract yday from date", {
      compare_dplyr_binding(
        .input %>%

Everything looks OK. Now we can make the commit (save our changes to the branch history):

.. code:: console

   $ git commit -am "Adding a binding and a test for mday() lubridate"
   [ARROW-14816 ed37d3a3b] Adding a binding and a test for mday() lubridate
    2 files changed, 19 insertions(+)

We can use git log to check the history of commits:

.. code:: console

   $ git log
   commit ed37d3a3b3eef76b696532f10562fea85f809fab (HEAD -> ARROW-14816)
   Author: Alenka Frim <[email protected]>
   Date:   Fri Jan 21 09:15:31 2022 +0100

       Adding a binding and a test for mday() lubridate

   commit c5358787ee8f7b80f067292f49e5f032854041b9 (upstream/main, upstream/HEAD, main, ARROW-15346, ARROW-10643)
   Author: Krisztián Szűcs <[email protected]>
   Date:   Thu Jan 20 09:45:59 2022 +0900

       ARROW-15372: [C++][Gandiva] Gandiva now depends on boost/crc.hpp which is missing from the trimmed boost archive

       See build error https://github.com/ursacomputing/crossbow/runs/4871392838?check_suite_focus=true#step:5:11762

       Closes #12190 from kszucs/ARROW-15372

       Authored-by: Krisztián Szűcs <[email protected]>
       Signed-off-by: Sutou Kouhei <[email protected]>

If we started the branch some time ago, we may need to rebase to upstream main to make sure there are no merge conflicts:

.. code:: console

   $ git pull upstream main --rebase

And now we can push our work to the forked Arrow repository on GitHub called origin.

.. code:: console

   $ git push origin ARROW-14816
   Enumerating objects: 233, done.
   Counting objects: 100% (233/233), done.
   Delta compression using up to 8 threads
   Compressing objects: 100% (130/130), done.
   Writing objects: 100% (151/151), 35.78 KiB | 8.95 MiB/s, done.
   Total 151 (delta 129), reused 33 (delta 20), pack-reused 0
   remote: Resolving deltas: 100% (129/129), completed with 80 local objects.
   remote:
   remote: Create a pull request for 'ARROW-14816' on GitHub by visiting:
   remote:      https://github.com/AlenkaF/arrow/pull/new/ARROW-14816
   remote:
   To https://github.com/AlenkaF/arrow.git
    * [new branch]          ARROW-14816 -> ARROW-14816

Now we have to go to the `Arrow repository on GitHub <https://github.com/apache/arrow>`_ to create a Pull Request. On the GitHub Arrow page (main or forked) we will see a yellow notice bar with a note that we made recent pushes to the branch ARROW-14816. That’s great, now we can make the Pull Request by clicking on Compare & pull request.

.. figure:: ../../images/R_tutorial_create_pr_notice.jpeg
   :scale: 60 %
   :alt: GitHub page of the Apache Arrow repository showing a notice bar
         indicating change has been made in our branch and a Pull Request
         can be created.

   Notice bar on the Apache Arrow repository.

First we need to change the Title to ARROW-14816: [R] Implement bindings for lubridate::mday() in order to match it with the issue. Note that a punctuation mark (the colon after the issue ID) was added!

Extra note: when this tutorial was created, we had been using the Jira issue tracker. As we are currently using GitHub issues, the title would be prefixed with GH-14816: [R] Implement bindings for lubridate::mday().
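
The title convention can be sketched as a small check in plain R. The helper ``is_valid_pr_title()`` and its regular expression are hypothetical, written only for this illustration; they are not part of the Arrow tooling and may be stricter or looser than the checks the project actually applies.

.. code-block:: R

   # Hypothetical helper: TRUE when a title looks like
   # "<ARROW|GH>-<number>: [Component] Summary", with one or more components.
   is_valid_pr_title <- function(title) {
     grepl("^(ARROW|GH)-[0-9]+: (\\[[^]]+\\])+ .+", title)
   }

   jira_title <- "ARROW-14816: [R] Implement bindings for lubridate::mday()"
   github_title <- "GH-14816: [R] Implement bindings for lubridate::mday()"
   missing_colon <- "ARROW-14816 [R] Implement bindings for lubridate::mday()"

   print(is_valid_pr_title(jira_title))
   print(is_valid_pr_title(github_title))
   print(is_valid_pr_title(missing_colon))
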

We will also add a description to make it clear to others what we are trying to do.

.. figure:: ../../images/R_tutorial_pr_descr.jpeg
   :scale: 50 %
   :alt: GitHub page of the Pull Request showing the editor for the
         title and a description.

   Editing the title and the description of our Pull Request.

Once we click Create pull request our code can be reviewed as a Pull Request in the Apache Arrow repository.

.. figure:: ../../images/R_tutorial_pr.jpeg
   :scale: 50 %
   :alt: GitHub page of the Pull Request showing the title and a
         description.

   Here it is, our Pull Request!

The pull request gets connected to the issue and the CI is running. After some time passes and we get a review we can correct the code, comment, resolve conversations and so on.

.. seealso::

   For more information about Pull Request workflow see :ref:`pr_lifecycle`.

The Pull Request we made can be viewed `here <https://github.com/apache/arrow/pull/12218>`_.