Motivation

Lately, I, Edward, have been programming in Julia a lot. Two times in the last week, I’ve needed a function like R’s tapply(), but in Julia. With a little bit of searching around the interwebs, I hacked together a reasonable equivalent to R’s tapply() written in Julia. This blog post will explain the function tapply() and briefly introduce the two examples where I used tapply(). Last, I’ll provide my Julia code which replicates R’s tapply() function.

R’s tapply()

R’s function tapply() is a tricky little guy, but only at first. Let’s start with just a few examples to get an understanding of the function tapply().

First we define a vector x of some made-up data and another vector g that assigns elements of x to the groups in g. In our first example, g contains five groups, A, B, C, D, E.

x <- c( 1,   2,   3,   4,   5,   6,   7,   8,   9,   10)
g <- c("A", "A", "B", "B", "C", "C", "C", "D", "E", "E")

The first four elements of x are evenly split amongst the first two groups of g, namely groups A and B. The fifth through seventh elements of x belong to group C, while elements 9,10 belong to group E. Lonely element 8 belongs to its own group, D.

Let’s see tapply() in action.

tapply(x, g, length)

## A B C D E 
## 2 2 3 1 2

By picking simple vectors like x,g and a simpler function like length() we can start to see what tapply() does. In this case, tapply() calculates the length all of the elements of x as grouped by g.

Let’s try a more abstract phrasing: tapply() applies the function length() to the elements of x as grouped by g.

tapply() is quite general. It doesn’t only calculate lengths of grouped elements. One more step up in abstraction: tapply() applies a function to the elements of x as grouped by g. This arbitrary function gets passed in as the third argument to tapply().

Here is our last attempt at an abstract phrase to define what tapply() does: “Apply a function to each cell of a ragged array…”. Ok, this last one is a bit confusing because it introduces a new phrase, ragged array. In R, arrays are vectors. A ragged vector is just a (outer) vector of (inner) vectors, so a vector that holds vectors, where the inner vectors can be of different lengths. Outside of R, these are sometimes called jagged arrays.

I put the last phrase in quotes, because I stole it directly from the help file for tapply(). Try ?tapply() at the Console in R. It helps to know what tapply() does before reading this help file :)

Let’s consider the ragged vector that is implicitly defined by tapply() in the case of our simpler example above. The inner vectors looks like

xA <- c(1, 2) # the elements of x as defined by g's group A 
xB <- c(3, 4)
xC <- c(5, 6, 7)
xD <- c(8)
xE <- c(9, 10)

With these inner vectors, we can now “apply” the function length() to each.

length(xA)

## [1] 2

length(xB)

## [1] 2

length(xC)

## [1] 3

length(xD)

## [1] 1

length(xE)

## [1] 2

So tapply() is implicitly creating a ragged vector that holds xA, xB, xC, xD, xE. I say implicitly here, because such a ragged vector is never realized; inside tapply(), length() is applied to each inner vector without creating the outer vector.

Practice

Try replacing the function length() above with the function sum(). As a challenge, think about your answer before asking starting to type. What will the output of the following code be?

tapply(x, g, sum)

Two Examples

In the last week, I’ve needed a function like R’s tapply() while programming in Julia. The two example below are simplified versions of where this function tapply() showed up.

Example 1

In fact, the first time this came up for me was for a sub-problem that involved the practice problem above. I wanted to sum elements of a vector as grouped by another vector. My sub-problem involved groups as defined by integers instead of letters, but that’s no issue for tapply().

h <- c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5)
tapply(x, h, sum)

##  1  2  3  4  5 
##  3  7 18  8 19

Since this example is no different than above, I’ll provide some more details of the broader problem I was working on. While working on a non-DSL version of Stan, I was attempting to take a derivative of something like a random effects linear model with respect to the random intercept. As it turns out, such a derivative needs the sum of the intercepts as grouped by their unique ID.

Example 2

This next example is a not obvious use-case of tapply(), even if the original task seemed easy. I was asked by a colleague to read through a bunch of Excel files (about 25) and print out some integers (about 17,000 integers) in a “pretty” format. For example, with the integers

y <- c(1, 2, 3, 6, 7, 9, 10, 11, 13)

my colleague wanted them printed out as “1 - 3, 6, 7, 9 - 11, 13”. Essentially the task was to hyphenate consecutive integers and comma-separate everything.

So here it goes, using tapply(). First, I define the function which will be applied to each inner vector. Notice that if my function f() simply returned the variable l then it would be equivalent to using length() in place of f(). Instead, my function f() returns appropriately formatted strings.

f <- function(x) {
  l <- length(x)
  if (l == 1) {
    return(paste0(x))
  } else if(l == 2) {
    return(paste0(x[1], ", ",  x[l]))
  } else {
    return(paste0(x[1], " - ", x[l]))
  }
}

groups <- cumsum(c(1, diff(y) != 1))
tapply(y, groups, f)

##        1        2        3        4 
##  "1 - 3"   "6, 7" "9 - 11"     "13"

So close! But let’s take a step back, because from here it’s not too bad. Consider the vector groups, printed alongside y.

y

## [1]  1  2  3  6  7  9 10 11 13

groups

## [1] 1 1 1 2 2 3 3 3 4

Notice that it’s doing exactly what we want; defining groups by consecutive numbers: 1, 2, 3 all get assigned to group 1, 6,7 get assigned to group 2, 9,10,11 are assigned to 3, and lonely 13 assigned to group 4. How?!

The function diff() calculates the difference between consecutive values in its argument. When applied to y specifically, it returns, in one vector, y[2] - y[1], y[3] - y[2], y[4] - y[3], and so on. That’s nearly what we want, but we need to account for two extra bits.

First, diff() alone is insufficient. We want to know, more specifically, when the differences are equal to 1, since this inherently defines consecutive integers. So diff(y) != 1 returns a vector of 1s and 0s. The cumulative sum of these is nearly what we want.

Second, diff(y) returns a vector of length length(y) - 1 since it starts working on y[2]. To account for this we prepend a 1 to the cumsum()ed vector.

Let’s store our output into a vector.

s <- tapply(y, groups, f)

The vector s in now a vector of the strings we want.

s

##        1        2        3        4 
##  "1 - 3"   "6, 7" "9 - 11"     "13"

Since we want only one string, not a vector of strings, we combine these into a single, comma-separated string.

paste(s, collapse = ", ")

## [1] "1 - 3, 6, 7, 9 - 11, 13"

For your sake, I leave out the details of repeatedly applying tapply() like this to some 25 Excel files. Your welcome.

tapply() in Julia

We build up to tapply() by solving a number of sub-problems.

tabulate values

In order to efficiently run a function like tapply(), I want a count of how many times each distinct value appears in a vector. R calls such a function table. Let’s first load up Julia and share with Julia some of the vectors we’ve been using in our examples.

library(JuliaCall)
julia_assign("x", x)

## Julia version 1.5.0 at location C:\Users\rache\AppData\Roaming\R\data\R\JULIAC~1\julia\V15~1.0\bin will be used.

## Loading setup script for JuliaCall...

## Finish loading setup script for JuliaCall.

julia_assign("g", g)
julia_assign("y", y)
julia_assign("groups", groups)

I called my version of this function table in Julia, too. Here’s the source code, much of which I borrowed from the Julia package StatsBase.jl,

function table(x::Vector)
    d = Dict{eltype(x), Int}()
    for v in x
        idx = Base.ht_keyindex2!(d, v)
        if idx > 0
            @inbounds d.vals[idx] += 1
        else
            @inbounds Base._setindex!(d, 1, v, -idx)
        end
    end
    return d
end

## table (generic function with 1 method)

So now we can call our Julia function.

table(g)

## Dict{String,Int64} with 5 entries:
##   "B" => 2
##   "A" => 2
##   "C" => 3
##   "D" => 1
##   "E" => 2

Because Julia doesn’t inherently have a table type, I used a dictionary.

Next, I wrote a function which I named group. This is equivalent to R’s split(). Since Julia already has a function named split, I chose a different name.

function group(x::Vector, g::Vector)
    gt = eltype(g)
    gt <: Number && gt != Int && (g = convert.(Int, g))
    t = table(g)
    d = Dict(k => Vector{eltype(x)}(undef, v) for (k, v) in t)
    ug = collect(keys(t))
    u = gt <: Number ? ug : Dict(k => i for (i, k) in pairs(ug))
    c = zeros(Int, length(d))
    @inbounds for i in eachindex(g, x)
        gi = g[i]
        ui = u[gi]
        c[ui] += 1
        d[gi][c[ui]] = x[i]
    end
    return d
end

## group (generic function with 1 method)

The function group() groups the elements of x into groups based on the groups defined in g.

group(x, g)

## Dict{String,Array{Float64,1}} with 5 entries:
##   "B" => [3.0, 4.0]
##   "A" => [1.0, 2.0]
##   "C" => [5.0, 6.0, 7.0]
##   "D" => [8.0]
##   "E" => [9.0, 10.0]

Let’s compare this to what R’s function split() does. The order is not the same, so take a second to convince yourself the grouped elements of x appropriately match.

split(x, g)

## $A
## [1] 1 2
## 
## $B
## [1] 3 4
## 
## $C
## [1] 5 6 7
## 
## $D
## [1] 8
## 
## $E
## [1]  9 10

Finally, here is an equivalent to tapply(), which I named mapg() (think map groups). I also added an option to first sort the groups in g, just in case that proves helpful (which it did, in the details of the second example). Notice that I changed the order of the arguments to better match the order of the arguments to Julia’s function map(), namely the function to be applied to the groups comes first.

function mapg(f, x, g; sortfirst = false)
    v = collect(group(x, g))
    z = sortfirst ? sort(v, by = x -> first(x)) : v
    return map(y -> f(last(y)), z)
end

## mapg (generic function with 1 method)

Let’s use mapg() to print out some integers in a “pretty” format.

function f(x)
  lx = length(x)
  out = ""
  if lx == 1
    out *= "$(x[1])"
  else
    out *= lx == 2 ? "$(x[1]), $(x[2])" : "$(x[1]) - $(x[end])"
  end
  return out
end

## f (generic function with 1 method)


s = mapg(f, y, groups, sortfirst = true)

## 4-element Array{String,1}:
##  "1.0 - 3.0"
##  "6.0, 7.0"
##  "9.0 - 11.0"
##  "13.0"

join(s, ", ")

## "1.0 - 3.0, 6.0, 7.0, 9.0 - 11.0, 13.0"

If you are following along, try calling mapg() without the sortfirst = true argument to better understand how sorting helps.

So What?

Well, I certainly learned some more about Julia by writing the function mapg(). And I certainly merged some relatively different problems down to one function, that’s always fun! But moreover, R’s function tapply() is a classic example of the two language problem; tapply() uses some underlying functions which are written in C, factor() and split() (possibly others). While my function mapg() might not be the fastest re-writing of tapply() that is possible in Julia, it is still fast. Recall, tapply() does most of the heavy lifting in C.

Here’s a crude and imperfect (aren’t they all?) benchmark. In each language, I find the minimum run time, across three runs, of tapply() and mapg(), applied to a randomly sampled vector of 10^{6} standard Normal data grouped first into 26 groups defined by uniformly sampling letters from the English alphabet and then grouped into the integers 1 to 50, also uniformly sampled. All times are reported in seconds. First, in R.

timeit <- function(gidx) {
  set.seed(54231184)
  N <- 1e6
  xx <- rnorm(N)
  gg <- sample(gidx, N, replace = TRUE)
  mintime <- Inf
  for (j in 1:3) {
    time <- system.time(tapply(xx, gg, sum))[1]
    if (time < mintime) mintime <- time
  }
  return(mintime)
}

timeit(letters)

## user.self 
##      0.04

timeit(1:50)

## user.self 
##      0.01

Next, in Julia.

using Random

function timeit(gidx)
  Random.seed!(54231184)
  N = convert(Int, 1e6)
  xx = randn(N)
  gg = rand(gidx, N)
  mintime = Inf
  for j in 1:3
    time = @elapsed mapg(sum, xx, gg)
    time < mintime && (mintime = time)
  end
  return mintime
end

## timeit (generic function with 1 method)


timeit('a':'z')

## 0.036550301

timeit(1:50)

## 0.0226102

My function mapg() appears to be on par with R’s tapply(), at least within this narrow benchmark.

An implementation using a radix sorting algorithm might prove faster than my use of a dictionary in table(). Though it might also depend on the number of groups. Maybe another blog post?

Conclusion

In this blog post, we discussed the function tapply(), showcased some examples, and then dove head first into some fairly serious Julia code. It all started out fun and games, until the end :) Hopefully you learned some use-cases of an occasionally helpful function, namely tapply(), and got a chance to think about its underlying implementation.

Image credit R and Julia