--- title: "Vector Names Hashing" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Vector Names Hashing} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` To speed up named-indexing, cppally implements hashing of vector names via a lazy cache-on-lookup approach. To better understand how this works, we will show what name-indexing in R looks like, how to improve it with hashing, as well as the benefits and limitations of cppally's inherent lazy-hashing method. Let's first load cppally ```{r setup} library(cppally) ``` Indexing into a vector by name is a useful idiom ```{r} # Name-value list x <- list( a = 10, b = 20, c = 30 ) x[["c"]] # Index-by-name x[[3]] # Index-by-location ``` It signifies to those reading the code the exact key we want. Compared to indexing by location, it is resilient to keys changing location. ```{r} # If we insert an element between 2 and 3, x[[3]] changes x <- c( x[1:2], list(d = 40), # New element x[3] ) x[[3]] # Now 40 x[["c"]] # Still the same value ``` ## Performance of indexing by location versus name For this we need a large name-value list ```{r} set.seed(42) large <- as.list(sample.int(10^5)) names(large) <- paste0("name_", seq_along(large)) ``` Lookup by location is O(1) and very fast in R, regardless of the index ```{r} library(bench) mark(large[[1]]) mark(large[[length(large)]]) ``` Since R scans the names each time, lookup by name is O(n) and can be slow if names happen to be near the end of the vector ```{r} mark( by_name = large[["name_1"]], by_index = large[[1]] ) mark( by_name = large[["name_100000"]], by_index = large[[100000]] ) ``` If we created a hash table of names-values, we could speedup repeated lookups ```{r} names_hashtab <- hashtab(size = length(large)) for (i in seq_along(names(large))){ nm <- names(large)[[i]] sethash(names_hashtab, nm, large[[i]]) } # Confirm it worked identical(gethash(names_hashtab, "name_10"), large[["name_10"]]) ``` Let's now compare all three methods: lookup by location, name and hashed name ```{r} mark( by_name = large[["name_100000"]], by_index = large[[100000]], by_hashed_name = gethash(names_hashtab, "name_100000") ) ``` That worked! Extracting the value associated with the last name using the hash table was almost as fast as looking it up by location. cppally uses this same approach internally, attaching a lazy hash map cache to each `r_vector`. Due to R/C++ limitations, the cache cannot survive the R/C++ boundary as we cannot easily attach the cache to the returned `SEXP`. Therefore instead of a cache-on-1st-lookup approach, which would be most efficient, we implement a cache-on-2nd-lookup approach. This brings certain limitations when writing R functions that use cppally code. To understand the limitation, it is first crucial to understand how cppally registers C++ functions to R. R can only handle `SEXP` objects, a generic type that represents any R object. cppally automatically handles all conversions between cppally classes like `r_vector` and R's `SEXP`. To illustrate this, suppose we have the identity function `foo()` that accepts an integer vector `x` and returns the same vector unchanged. ``` cpp [[cppally::register]] r_vector foo(r_vector x){ return x; } ``` The function is then registered to R (via `cppally::load_all()`) and the following R and C++ code is automatically generated. ### Generated R code ``` r foo function(x) { .Call(`_cppally_foo`, x) } ``` ### Generated C++ code ``` cpp r_vector foo(r_vector x); extern "C" SEXP _cppally_foo(SEXP x) { BEGIN_CPPALLY return cpp_to_r(::foo(as>(x))); END_CPPALLY } ``` Looking at the generated R function `foo()`, we see `.Call()` is called and is passed the input `x`. `.Call()` is the R interface that allows us to pass R objects to C++. Looking at the generated C++ function - `BEGIN_CPPALLY/END_CPPALLY` are helpers that run the main code in try/catch blocks to catch any C++ errors and promote them to R errors. In theory, all R errors thrown by cppally are done via `R_UnwindProtect` which safely runs all C++ destructors on error. Looking at the innermost function, we can see the expression `as>(x)` - this converts the input `SEXP` from R into a fresh cppally integer vector. The C++ function `foo()` is then called with that vector, which is then passed to `cpp_to_r()` - a function that converts C/C++ objects back to `SEXP`. The conversions look like this: ``` SEXP (R) ↓ r_vector (C++) ↓ r_vector (C++, via foo()) ↓ SEXP (R) ``` This round-trip from `SEXP` to `r_vector` and back to `SEXP` happens every single time `foo()` is called (which happens via `.Call()`). Because `r_vector` is the class that builds and stores the cache, that cache is lost when converting to `SEXP` and therefore, the hashing approach becomes difficult. To mitigate the negative performance impacts of this limitation, we implement a hash-on-2nd-lookup approach. This means the first lookup is a linear scan, the second lookup triggers the hash map build before doing a hash lookup. All lookups from the 2nd onwards use the hash map. Practically speaking this shouldn't negatively impact performance when doing lookups from R as one would have to write a function that performs exactly two lookups on each call to experience the highest cost penalty, something that while not impossible, is likely uncommon. In C++, the cache persists across calls since there is no `SEXP` round-trip. The first lookup is a linear scan, the second builds the hash map, and all subsequent lookups are O(1). ``` cpp r_int a = x.get("a"); // linear scan r_int b = x.get("b"); // hash map built + O(1) lookup r_int c = x.get("c"); // O(1) lookup ``` Let's illustrate this by visualising the per call cost as the number of lookups increases, comparing R and cppally. ```{r} cpp_source(code = ' #include using namespace cppally; [[cppally::register]] r_vec do_lookup(r_vec x, r_str name, int n_iterations){ r_vec out(n_iterations); for (int i = 0; i < n_iterations; ++i){ out.set(i, x.get(name)); } return out; } ') ``` R equivalent using `[[` ```{r} r_do_lookup <- function(x, name, n_iterations){ out <- vector("list", n_iterations) for (i in seq_along(out)){ out[[i]] <- x[[name]] } out } ``` ## The cost of first lookup Here we are simply comparing the cost of doing a 1st-lookup, in both R and C++ ```{r} nm <- names(large)[length(large)] mark( cppally_one_lookup = do_lookup(large, nm, 1), base_one_lookup = r_do_lookup(large, nm, 1) ) ``` While I'm not sure why cppally's linear scan is faster than R's, it may be due to the fact that direct data access is used to compare elements instead of `STRING_ELT()` ## The cost of many lookups We will now plot the per-call cost of doing many repeated lookups to see how many lookups we need to do before we start seeing the benefits of hashing ```{r, fig.width=11, fig.height=7, out.width="100%", echo=FALSE} cost_per_lookup <- numeric(10^3) measure_time <- function(expr, scale = 1){ unclass(bench::bench_time(expr))[["real"]] * scale } for (i in 1:10^3){ cost_per_lookup[i] <- measure_time(do_lookup(large, nm, i), scale = 10^6) / i } pt_cols <- ifelse( seq_along(cost_per_lookup) == 1, "orange", "black" ) pt_cols <- ifelse( seq_along(cost_per_lookup) == 2, "#0072B2", pt_cols ) plot( cost_per_lookup, xlab = "N lookups", ylab = "Cost per lookup (µs)", main = "Time per name lookup (microseconds) as lookups increase", col = pt_cols ) points(1, cost_per_lookup[1], pch = 19, col = "orange") points(2, cost_per_lookup[2], pch = 19, col = "#0072B2") abline(h = cost_per_lookup[1], lty = 2, col = "orange") legend( "topright", legend = c("1st lookup (linear scan)", "2nd lookup (hash build)"), col = c("orange", "#0072B2"), pch = 19 ) symbols(1, cost_per_lookup[1], circles = 1, add = TRUE, inches = 0.1, fg = "orange", bg = NA) symbols(2, cost_per_lookup[2], circles = 1, add = TRUE, inches = 0.1, fg = "#0072B2", bg = NA) ``` The plot shows an interesting story - the first lookup (linear scan) sets a baseline cost. The second lookup pays the cost of the hash map build, making it the most expensive per-lookup. From there, each additional lookup amortises the one-time hash build cost over more calls, so the average drops and stabilises well below the first-lookup baseline. Let's see how this compares to a pseudo hash-on-1st lookup method. To achieve this we will set the first lookup to be a fast linear scan of the first name which means the rest of real lookups use hashing. ```{r} cpp_source(code = ' #include using namespace cppally; [[cppally::register]] r_vec do_first_lookup_hashed(r_vec x, r_str name, int n_iterations){ r_vec out(n_iterations); // Initial lookup as fast linear scan to force all other lookups to be hashed static_cast(x.get(x.names().get(0))); for (int i = 0; i < n_iterations; ++i){ out.set(i, x.get(name)); } return out; } ') ``` ```{r, fig.width=11, fig.height=7, out.width="100%", echo=FALSE} cost_per_lookup1 <- numeric(250) cost_per_lookup2 <- numeric(250) for (i in 1:250){ cost_per_lookup1[i] <- measure_time(do_lookup(large, nm, i), scale = 10^6) / i cost_per_lookup2[i] <- measure_time(do_first_lookup_hashed(large, nm, i), scale = 10^6) / i } plot( cost_per_lookup1, xlab = "N lookups", ylab = "Cost per lookup (µs)", col = "darkblue", main = "Hash on 1st lookup vs hash on 2nd lookup", type = "l" ) lines(cost_per_lookup2, col = "darkorange") legend( "topright", legend = c("Hash on 2nd lookup", "Hash on 1st lookup"), col = c("darkblue", "darkorange"), lty = 1 ) ``` The two lines are overlapping after a few lookups, signifying that there is little penalty to waiting until the 2nd lookup to build the hash map. Let's also plot some comparisons comparing repeated hash lookups against repeated linear scan lookups. At the same time we will also use a larger list. ```{r} large <- as.list(sample.int(5e05)) names(large) <- paste0("name_", seq_along(large)) cpp_source(code = ' #include using namespace cppally; [[cppally::register]] r_vec do_linear_lookup(r_vec x, r_str name, int n_iterations){ r_vec out(n_iterations); r_vec names = as>(x.names()); int n = names.length(); auto *p = names.data(); // Use ptr to allow O2 optimisations here for fair comparison for (int i = 0; i < n_iterations; ++i){ for (int j = 0; j < n; ++j){ if (unwrap(name) == p[j]){ out.set(i, x.view(j)); break; } } } return out; } ') ``` ```{r, fig.width=11, fig.height=7, out.width="100%", echo=FALSE} measure_ms <- function(expr){ measure_time(expr, scale = 10^3) } cost_per_hash_lookup <- numeric(250) nm <- names(large)[length(large) %/% 2] for (i in 1:length(cost_per_hash_lookup)){ cost_per_hash_lookup[i] <- measure_ms(do_lookup(large, nm, i)) / i } nm <- names(large)[length(large)] cost_per_linear_lookup_worst_case <- numeric(250) for (i in 1:length(cost_per_linear_lookup_worst_case)){ cost_per_linear_lookup_worst_case[i] <- measure_ms(do_linear_lookup(large, nm, i)) / i } nm <- names(large)[1] cost_per_linear_lookup_best_case <- numeric(250) for (i in 1:length(cost_per_linear_lookup_best_case)){ cost_per_linear_lookup_best_case[i] <- measure_ms(do_linear_lookup(large, nm, i)) / i } nms <- sample(names(large)) cost_per_linear_lookup_mixed_case <- numeric(250) for (i in 1:length(cost_per_linear_lookup_mixed_case)){ cost_per_linear_lookup_mixed_case[i] <- measure_ms(do_linear_lookup(large, nms[i], i)) / i } # Plot from before plot( cost_per_hash_lookup, xlab = "N lookups", ylab = "Cost per lookup (ms)", main = "Hashed lookups compared against three linear scan scenarios", col = "#0072B2" ) lines(cost_per_linear_lookup_best_case, col = "purple") lines(cost_per_linear_lookup_worst_case, col = "red") lines(cost_per_linear_lookup_mixed_case, col = "brown") legend( "topright", legend = c( "Lazy-hash on 2nd lookup", "Linear scan: best case (name always found at start)", "Linear scan: worst case (name always found at end)", "Linear scan: mixed case (name at random location)" ), col = c("#0072B2", "purple", "red", "brown"), pch = c(19, NA, NA, NA), lty = c(NA, 1, 1, 1) ) ``` The best and worst case linear scans are presented for illustrative purposes even though they are not realistic with real data. The most informative comparison is between the trend of the hash lookup cost versus the linear scan (mixed case). We can see that for the first few lookups the linear scan dominates (as expected), but as the number of lookups increases, the hash approach begins to outperform the linear scan (mixed case) and stabilises well below it. In summary, cppally's lazy hashing trades a one-time hash build cost for fast repeated lookups. The approach works best in C++, where the cache persists across calls. From R, each call starts fresh, so the benefit is limited to functions that perform multiple lookups within a single call. ## Future development While not explored in this vignette, it is possible to preserve the cache across the R/C++ boundary. By attaching an `externalptr` to the `names_map` as an attribute of a custom R vector class, the cache lifetime is tied to the lifetime of the R object (provided that `externalptr` is kept as an attribute). This may require cppally support for injecting a pre-built cache into a fresh `r_vector` wrapper, but is otherwise a tractable extension.