Title: | Simple Functions to Save Time and Memory |
---|---|
Description: | Fast and memory-efficient (or 'cheap') tools to facilitate efficient programming, saving time and memory. It aims to provide 'cheaper' alternatives to common base R functions, as well as some additional functions. |
Authors: | Nick Christofides [aut, cre] |
Maintainer: | Nick Christofides <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.9.92.9000 |
Built: | 2025-01-06 08:32:06 UTC |
Source: | https://github.com/nicchr/cheapr |
In this package, 'cheap' means fast and efficient.
cheapr aims to provide a set of functions for programmers to write cheaper code, saving time and memory.
Maintainer: Nick Christofides [email protected] (ORCID)
Useful links:
Report bugs at https://github.com/NicChr/cheapr/issues
This is a cheapr version of cut.numeric()
which is more efficient and
prioritises pretty-looking breaks by default through
the use of get_breaks()
.
Out-of-bounds values can be included naturally through the
include_oob
argument. Left-closed (right-open) intervals are
returned by default in contrast to cut's default right-closed intervals.
Furthermore there is flexibility in formatting the interval bins,
allowing the user to specify formatting functions and symbols for
the interval close and open symbols.
as_discrete(x, ...) ## S3 method for class 'numeric' as_discrete( x, breaks = if (left_closed) get_breaks(x) else cheapr_rev(-get_breaks(-x)), left_closed = TRUE, include_endpoint = FALSE, include_oob = FALSE, ordered = FALSE, intv_start_fun = prettyNum, intv_end_fun = prettyNum, intv_closers = c("[", "]"), intv_openers = c("(", ")"), intv_sep = ",", inf_label = NULL, ... ) ## S3 method for class 'integer64' as_discrete(x, ...)
as_discrete(x, ...) ## S3 method for class 'numeric' as_discrete( x, breaks = if (left_closed) get_breaks(x) else cheapr_rev(-get_breaks(-x)), left_closed = TRUE, include_endpoint = FALSE, include_oob = FALSE, ordered = FALSE, intv_start_fun = prettyNum, intv_end_fun = prettyNum, intv_closers = c("[", "]"), intv_openers = c("(", ")"), intv_sep = ",", inf_label = NULL, ... ) ## S3 method for class 'integer64' as_discrete(x, ...)
x |
A numeric vector. |
... |
Extra arguments passed onto methods. |
breaks |
Break-points.
The default option creates pretty looking breaks.
Unlike |
left_closed |
Left-closed intervals or right-closed intervals? |
include_endpoint |
Include endpoint? Default is |
include_oob |
Include out-of-bounds values? Default is cut(10, c(9, 10, Inf), right = F, include.lowest = T) != as_discrete(10, c(9, 10), include_endpoint = T, include_oob = T) |
ordered |
Should result be an ordered factor? Default is |
intv_start_fun |
Function used to format interval start points. |
intv_end_fun |
Function used to format interval end points. |
intv_closers |
A length 2 character vector denoting the symbol to use for closing either left or right closed intervals. |
intv_openers |
A length 2 character vector denoting the symbol to use for opening either left or right closed intervals. |
intv_sep |
A length 1 character vector used to separate the start and end points. |
inf_label |
Label to use for intervals that include infinity.
If left |
A factor of discrete bins (intervals of start/end pairs).
library(cheapr) # `as_discrete()` is very similar to `cut()` # but more flexible as it allows you to supply # formatting functions and symbols for the discrete bins # Here is an example of how to use the formatting functions to # categorise age groups nicely ages <- 1:100 age_group <- function(x, breaks){ age_groups <- as_discrete( x, breaks = breaks, intv_sep = "-", intv_end_fun = function(x) x - 1, intv_openers = c("", ""), intv_closers = c("", ""), include_oob = TRUE, ordered = TRUE ) # Below is just renaming the last age group lvls <- levels(age_groups) n_lvls <- length(lvls) max_ages <- paste0(max(breaks), "+") attr(age_groups, "levels") <- c(lvls[-n_lvls], max_ages) age_groups } age_group(ages, seq(0, 80, 20)) age_group(ages, seq(0, 25, 5)) age_group(ages, 5) # To closely replicate `cut()` with `as_discrete()` we can use the following cheapr_cut <- function(x, breaks, right = TRUE, include.lowest = FALSE, ordered.result = FALSE){ if (length(breaks) == 1){ breaks <- get_breaks(x, breaks, pretty = FALSE, expand_min = FALSE, expand_max = FALSE) adj <- diff(range(breaks)) * 0.001 breaks[1] <- breaks[1] - adj breaks[length(breaks)] <- breaks[length(breaks)] + adj } as_discrete(x, breaks, left_closed = !right, include_endpoint = include.lowest, ordered = ordered.result, intv_start_fun = function(x) formatC(x, digits = 3, width = 1), intv_end_fun = function(x) formatC(x, digits = 3, width = 1)) } x <- rnorm(100) cheapr_cut(x, 10) identical(cut(x, 10), cheapr_cut(x, 10))
library(cheapr) # `as_discrete()` is very similar to `cut()` # but more flexible as it allows you to supply # formatting functions and symbols for the discrete bins # Here is an example of how to use the formatting functions to # categorise age groups nicely ages <- 1:100 age_group <- function(x, breaks){ age_groups <- as_discrete( x, breaks = breaks, intv_sep = "-", intv_end_fun = function(x) x - 1, intv_openers = c("", ""), intv_closers = c("", ""), include_oob = TRUE, ordered = TRUE ) # Below is just renaming the last age group lvls <- levels(age_groups) n_lvls <- length(lvls) max_ages <- paste0(max(breaks), "+") attr(age_groups, "levels") <- c(lvls[-n_lvls], max_ages) age_groups } age_group(ages, seq(0, 80, 20)) age_group(ages, seq(0, 25, 5)) age_group(ages, 5) # To closely replicate `cut()` with `as_discrete()` we can use the following cheapr_cut <- function(x, breaks, right = TRUE, include.lowest = FALSE, ordered.result = FALSE){ if (length(breaks) == 1){ breaks <- get_breaks(x, breaks, pretty = FALSE, expand_min = FALSE, expand_max = FALSE) adj <- diff(range(breaks)) * 0.001 breaks[1] <- breaks[1] - adj breaks[length(breaks)] <- breaks[length(breaks)] + adj } as_discrete(x, breaks, left_closed = !right, include_endpoint = include.lowest, ordered = ordered.result, intv_start_fun = function(x) formatC(x, digits = 3, width = 1), intv_end_fun = function(x) formatC(x, digits = 3, width = 1)) } x <- rnorm(100) cheapr_cut(x, 10) identical(cut(x, 10), cheapr_cut(x, 10))
.bincode()
When x
is an integer vector, bin()
is cheaper than .bincode()
as
no coercion to a double vector occurs. This alternative also has more
arguments that allow you to return the start values of the binned vector,
as well as including out-of-bounds intervals.
bin( x, breaks, left_closed = TRUE, include_endpoint = FALSE, include_oob = FALSE, codes = TRUE )
bin( x, breaks, left_closed = TRUE, include_endpoint = FALSE, include_oob = FALSE, codes = TRUE )
x |
A numeric vector. |
breaks |
A numeric vector of breaks. |
left_closed |
Should intervals be left-closed (and right-open)?
Default is |
include_endpoint |
Equivalent to |
include_oob |
Should out-of-bounds interval be included?
Default is |
codes |
Should an integer vector indicating which bin the values
fall into be returned? Default is |
Either an integer vector of codes indicating which bin the values fall into, or the start of the intervals for which each value falls into.
case
and val_match
are cheaper alternatives to dplyr::case_when
and
dplyr::case_match
respectively.
case(..., .default = NULL) val_match(.x, ..., .default = NULL)
case(..., .default = NULL) val_match(.x, ..., .default = NULL)
... |
Logical expressions or scalar values in the case of |
.default |
Catch-all value or vector. |
.x |
Vector used to switch values. |
val_match()
is a very efficient special case of the
case()
function when all lhs expressions are scalars,
i.e. length-1 vectors. RHS expressions can be vectors the
same length as .x
.
The below 2 expressions are equivalent.
val_match( x, 1 ~ "one", 2 ~ "two", .default = "Unknown" ) case( x == 1 ~ "one", x == 2 ~ "two", .default = "Unknown" )
A vector the same length as .x
or same length as the first condition in the
case of case
, unless the condition length is smaller than the
rhs, in which case the length of the rhs is used.
ifelse()
Cheaper version of ifelse()
cheapr_if_else(condition, true, false, na = false[NA_integer_])
cheapr_if_else(condition, true, false, na = false[NA_integer_])
condition |
logical A condition which will be used to evaluate the if else operation. |
true |
Value(s) to replace |
false |
Value(s) to replace |
na |
Catch-all value(s) to replace all other instances,
where |
A vector the same length as condition,
using a common type between true
, false
and default
.
This is not a 1-to-1 copy of base::table()
as some behaviours differ.
It is more flexible as it accepts inputs such as data frames and
vctrs_rcrd
objects.
cheapr_table( ..., names = TRUE, order = FALSE, na_exclude = FALSE, classed = FALSE )
cheapr_table( ..., names = TRUE, order = FALSE, na_exclude = FALSE, classed = FALSE )
... |
|
names |
Should level names be kept? Default is |
order |
Should result be ordered by level names? Default is
|
na_exclude |
Should |
classed |
Should a |
A named integer vector if one object is supplied, otherwise an array.
factor()
along with cheaper utilitiesA fast version of factor()
using the collapse package.
There are some additional utilities, most of which begin with the prefix
'levels_', such as
as_factor()
which is an efficient way to coerce both vectors and factors,
levels_factor()
which returns the levels of a factor, as a factor,
levels_used()
which returns the used levels of a factor,
levels_unused()
which returns the unused levels of a factor,
levels_add()
adds the specified levels onto the existing levels,
levels_rm()
removes the specified levels,
levels_add_na()
which adds an explicit NA
level,
levels_drop_na()
which drops the NA
level,
levels_drop()
which drops unused factor levels,
levels_rename()
for renaming levels,
levels_lump()
which returns top n levels and lumps all others into the
same category, levels_count()
which returns the counts of each level,
and finally levels_reorder()
which reorders the levels of x
based on y
using the ordered median values of y
for each level.
factor_( x = integer(), levels = NULL, order = TRUE, na_exclude = TRUE, ordered = is.ordered(x) ) as_factor(x) levels_factor(x) levels_used(x) levels_unused(x) used_levels(x) unused_levels(x) levels_rm(x, levels) levels_add(x, levels, where = c("last", "first")) levels_add_na(x, name = NA, where = c("last", "first")) levels_drop_na(x) levels_drop(x) levels_reorder(x, order_by, decreasing = FALSE) levels_rename(x, ..., .fun = NULL) levels_lump( x, n, prop, other_category = "Other", ties = c("min", "average", "first", "last", "random", "max") ) levels_count(x)
factor_( x = integer(), levels = NULL, order = TRUE, na_exclude = TRUE, ordered = is.ordered(x) ) as_factor(x) levels_factor(x) levels_used(x) levels_unused(x) used_levels(x) unused_levels(x) levels_rm(x, levels) levels_add(x, levels, where = c("last", "first")) levels_add_na(x, name = NA, where = c("last", "first")) levels_drop_na(x) levels_drop(x) levels_reorder(x, order_by, decreasing = FALSE) levels_rename(x, ..., .fun = NULL) levels_lump( x, n, prop, other_category = "Other", ties = c("min", "average", "first", "last", "random", "max") ) levels_count(x)
x |
A vector. |
levels |
Optional factor levels. |
order |
Should factor levels be sorted? Default is |
na_exclude |
Should |
ordered |
Should the result be an ordered factor? |
where |
Where should |
name |
Name of |
order_by |
A vector to order the levels of |
decreasing |
Should the reordered levels be in decreasing order?
Default is |
... |
Key-value pairs where the key is the new name and
value is the name to replace that with the new name. For example
|
.fun |
Renaming function applied to each level. |
n |
Top n number of levels to calculate. |
prop |
Top proportion of levels to calculate. This is a proportion of the total unique levels in x. |
other_category |
Name of 'other' category. |
ties |
Ties method to use. See |
This operates similarly to collapse::qF()
.
The main difference internally is that collapse::funique()
is used
and therefore s3 methods can be written for it.
Furthermore, for date-times factor_
differs in that it differentiates
all instances in time whereas factor
differentiates calendar times.
Using a daylight savings example where the clocks go back: factor(as.POSIXct(1729984360, tz = "Europe/London") + 3600 *(1:5))
produces 4 levels whereas factor_(as.POSIXct(1729984360, tz = "Europe/London") + 3600 *(1:5))
produces 5 levels.
levels_lump()
is a cheaper version of forcats::lump_n()
but returns
levels in order of highest frequency to lowest. This can be very useful
for plotting.
A factor
or character
in the case of levels_used
and levels_unused
.
levels_count
returns a data frame of counts and proportions for each level.
library(cheapr) x <- factor_(sample(letters[sample.int(26, 10)], 100, TRUE), levels = letters) x # Used/unused levels levels_used(x) levels_unused(x) # Drop unused levels levels_drop(x) # Top 3 letters by by frequency lumped_letters <- levels_lump(x, 3) levels_count(lumped_letters) # To remove the "other" category, use `levels_rm()` levels_count(levels_rm(lumped_letters, "Other")) # We can use levels_lump to create a generic top n function for non-factors too get_top_n <- function(x, n){ f <- levels_lump(factor_(x, order = FALSE), n = n) levels_count(f) } get_top_n(x, 3) # A neat way to order the levels of a factor by frequency # is the following: levels(levels_lump(x, prop = 1)) # Highest to lowest levels(levels_lump(x, prop = -1)) # Lowest to highest
library(cheapr) x <- factor_(sample(letters[sample.int(26, 10)], 100, TRUE), levels = letters) x # Used/unused levels levels_used(x) levels_unused(x) # Drop unused levels levels_drop(x) # Top 3 letters by by frequency lumped_letters <- levels_lump(x, 3) levels_count(lumped_letters) # To remove the "other" category, use `levels_rm()` levels_count(levels_rm(lumped_letters, "Other")) # We can use levels_lump to create a generic top n function for non-factors too get_top_n <- function(x, n){ f <- levels_lump(factor_(x, order = FALSE), n = n) levels_count(f) } get_top_n(x, 3) # A neat way to order the levels of a factor by frequency # is the following: levels(levels_lump(x, prop = 1)) # Highest to lowest levels(levels_lump(x, prop = -1)) # Lowest to highest
Fast greatest common divisor and smallest common multiple using the Euclidean algorithm.
gcd()
returns the greatest common divisor. scm()
returns the smallest common multiple. gcd2()
is a vectorised binary version of gcd
. scm2()
is a vectorised binary version of scm
.
gcd( x, tol = sqrt(.Machine$double.eps), na_rm = TRUE, round = TRUE, break_early = TRUE ) scm(x, tol = sqrt(.Machine$double.eps), na_rm = TRUE) gcd2(x, y, tol = sqrt(.Machine$double.eps), na_rm = TRUE) scm2(x, y, tol = sqrt(.Machine$double.eps), na_rm = TRUE)
gcd( x, tol = sqrt(.Machine$double.eps), na_rm = TRUE, round = TRUE, break_early = TRUE ) scm(x, tol = sqrt(.Machine$double.eps), na_rm = TRUE) gcd2(x, y, tol = sqrt(.Machine$double.eps), na_rm = TRUE) scm2(x, y, tol = sqrt(.Machine$double.eps), na_rm = TRUE)
x |
A numeric vector. |
tol |
Tolerance. This must be a single positive number strictly less than 1. |
na_rm |
If |
round |
If |
break_early |
This is experimental and
applies only to floating-point numbers.
When |
y |
A numeric vector. |
The GCD is calculated using a binary function that takes input
GCD(gcd, x[i + 1])
where the output of this function is passed as input
back into the same function iteratively along the length of x
.
The first gcd value is x[1]
.
Zeroes are handled in the following way: GCD(0, 0) = 0
GCD(a, 0) = a
This has the nice property that zeroes are essentially ignored.
This is calculated using the GCD and the formula is: SCM(x, y) = (abs(x) / GCD(x, y) ) * abs(y)
If you want to calculate the gcd & lcm for 2 values
or across 2 vectors of values, use gcd2
and scm2
.
A very common solution to finding the GCD of a vector of values is to use
Reduce()
along with a binary function like gcd2()
.
e.g. Reduce(gcd2, seq(5, 20, 5))
.
This is exactly identical to gcd(seq(5, 20, 5))
, with gcd()
being much
faster and overall cheaper as it is written in C++ and heavily optimised.
Therefore it is recommended to always use gcd()
.
For example we can compare the two approaches below, x <- seq(5L, length = 10^6, by = 5L)
bench::mark(Reduce(gcd2, x), gcd(x))
This example code shows gcd()
being ~200x faster on my machine than
the Reduce
+ gcd2
approach, even though gcd2
itself is written in C++
and has little overhead.
A number representing the GCD or SCM.
library(cheapr) library(bench) # Binary versions gcd2(15, 25) gcd2(15, seq(5, 25, 5)) scm2(15, seq(5, 25, 5)) scm2(15, 25) # GCD across a vector gcd(c(0, 5, 25)) mark(gcd(c(0, 5, 25))) x <- rnorm(10^5) gcd(x) gcd(x, round = FALSE) mark(gcd(x))
library(cheapr) library(bench) # Binary versions gcd2(15, 25) gcd2(15, seq(5, 25, 5)) scm2(15, seq(5, 25, 5)) scm2(15, 25) # GCD across a vector gcd(c(0, 5, 25)) mark(gcd(c(0, 5, 25))) x <- rnorm(10^5) gcd(x) gcd(x, round = FALSE) mark(gcd(x))
The distances between break-points are always equal in this implementation.
get_breaks(x, n = 10, ...) ## Default S3 method: get_breaks(x, n = 10, ...) ## S3 method for class 'numeric' get_breaks( x, n = 10, pretty = TRUE, expand_min = FALSE, expand_max = pretty, ... ) ## S3 method for class 'integer64' get_breaks(x, n = 10, ...)
get_breaks(x, n = 10, ...) ## Default S3 method: get_breaks(x, n = 10, ...) ## S3 method for class 'numeric' get_breaks( x, n = 10, pretty = TRUE, expand_min = FALSE, expand_max = pretty, ... ) ## S3 method for class 'integer64' get_breaks(x, n = 10, ...)
x |
A numeric vector. |
n |
Number of breakpoints. You may get less or more than requested. |
... |
Extra arguments passed onto methods. |
pretty |
Should pretty break-points be prioritised? Default is |
expand_min |
Should smallest break be extended beyond the
minimum of the data? Default is |
expand_max |
Should largest break be extended beyond the maximum
of the data? Default is |
A numeric vector of break-points.
library(cheapr) set.seed(123) ages <- sample(0:80, 100, TRUE) # Pretty get_breaks(ages, n = 10) # Not-pretty # bin-width is diff(range(ages)) / n_breaks get_breaks(ages, n = 10, pretty = FALSE) # `get_breaks()` is left-biased in a sense, meaning that # the first break is always <= `min(x)` but the last break # may be < `max(x)` # To get right-biased breaks we can use a helper like so.. right_breaks <- function(x, ...){ -get_breaks(-x, ...) } get_breaks(4:24, 10) right_breaks(4:24, 10) # Use `rev()` to ensure they are in ascending order rev(right_breaks(4:24, 10))
library(cheapr) set.seed(123) ages <- sample(0:80, 100, TRUE) # Pretty get_breaks(ages, n = 10) # Not-pretty # bin-width is diff(range(ages)) / n_breaks get_breaks(ages, n = 10, pretty = FALSE) # `get_breaks()` is left-biased in a sense, meaning that # the first break is always <= `min(x)` but the last break # may be < `max(x)` # To get right-biased breaks we can use a helper like so.. right_breaks <- function(x, ...){ -get_breaks(-x, ...) } get_breaks(4:24, 10) right_breaks(4:24, 10) # Use `rev()` to ensure they are in ascending order rev(right_breaks(4:24, 10))
is_na()
is a parallelised alternative to is.na()
. num_na(x)
is a faster and more efficient sum(is.na(x))
. which_na(x)
is a more efficient which(is.na(x))
which_not_na(x)
is a more efficient which(!is.na(x))
row_na_counts(x)
is a more efficient rowSums(is.na(x))
row_all_na()
returns a logical vector indicating which rows are empty
and have only NA
values. row_any_na()
returns a logical vector indicating which rows have at least
1 NA
value.
The col_
variants are the same, but operate by-column.
is_na(x) ## Default S3 method: is_na(x) ## S3 method for class 'POSIXlt' is_na(x) ## S3 method for class 'vctrs_rcrd' is_na(x) ## S3 method for class 'data.frame' is_na(x) num_na(x, recursive = TRUE) which_na(x) which_not_na(x) any_na(x, recursive = TRUE) all_na(x, recursive = TRUE) row_na_counts(x, names = FALSE) col_na_counts(x, names = FALSE) row_all_na(x, names = FALSE) col_all_na(x, names = FALSE) row_any_na(x, names = FALSE) col_any_na(x, names = FALSE)
is_na(x) ## Default S3 method: is_na(x) ## S3 method for class 'POSIXlt' is_na(x) ## S3 method for class 'vctrs_rcrd' is_na(x) ## S3 method for class 'data.frame' is_na(x) num_na(x, recursive = TRUE) which_na(x) which_not_na(x) any_na(x, recursive = TRUE) all_na(x, recursive = TRUE) row_na_counts(x, names = FALSE) col_na_counts(x, names = FALSE) row_all_na(x, names = FALSE) col_all_na(x, names = FALSE) row_any_na(x, names = FALSE) col_any_na(x, names = FALSE)
x |
A vector, list, data frame or matrix. |
recursive |
Should the function be applied recursively to lists?
The default is |
names |
Should row/col names be added? |
These functions are designed primarily for programmers, to increase the speed
and memory-efficiency of NA
handling.
Most of these functions can be parallelised through options(cheapr.cores)
.
To replicate complete.cases(x)
, use !row_any_na(x)
.
To find rows with any empty values,
use which_(row_any_na(df))
.
To find empty rows use which_(row_all_na(df))
or which_na(df)
.
To drop empty rows use na_rm(df)
or sset(df, which_(row_all_na(df), TRUE))
.
is_na
is_na
Is an S3 generic function. It will internally fall back on
using is.na
if it can't find a suitable method.
Alternatively you can write your own is_na
method.
For example there is a method for vctrs_rcrd
objects that simply converts it to a data frame and then calls row_all_na()
.
There is also a POSIXlt
method for is_na
that is much faster than is.na
.
When x
is a list, num_na
, any_na
and all_na
will recursively search
the list for NA
values. If recursive = F
then is_na()
is used to
find NA
values. is_na
differs to is.na
in 2 ways:
List elements are counted as NA
if either that value is NA
, or
if it's a list, then all values of that list are NA
.
When called on a data frame, it returns TRUE
for empty rows that contain
only NA
values.
Number or location of NA
values.
library(cheapr) library(bench) x <- 1:10 x[c(1, 5, 10)] <- NA num_na(x) which_na(x) which_not_na(x) row_nas <- row_na_counts(airquality, names = TRUE) col_nas <- col_na_counts(airquality, names = TRUE) row_nas col_nas df <- sset(airquality, j = 1:2) # Number of NAs in data num_na(df) # Which rows are empty? row_na <- row_all_na(df) sset(df, row_na) # Removing the empty rows sset(df, which_(row_na, invert = TRUE)) # Or na_rm(df) # Or sset(df, row_na_counts(df) < ncol(df))
library(cheapr) library(bench) x <- 1:10 x[c(1, 5, 10)] <- NA num_na(x) which_na(x) which_not_na(x) row_nas <- row_na_counts(airquality, names = TRUE) col_nas <- col_na_counts(airquality, names = TRUE) row_nas col_nas df <- sset(airquality, j = 1:2) # Number of NAs in data num_na(df) # Which rows are empty? row_na <- row_all_na(df) sset(df, row_na) # Removing the empty rows sset(df, which_(row_na, invert = TRUE)) # Or na_rm(df) # Or sset(df, row_na_counts(df) < ncol(df))
Fast lags and leads optionally using dynamic vectorised lags, ordering and run lengths.
lag_(x, n = 1L, fill = NULL, set = FALSE, recursive = TRUE) lag2_( x, n = 1L, order = NULL, run_lengths = NULL, fill = NULL, recursive = TRUE )
lag_(x, n = 1L, fill = NULL, set = FALSE, recursive = TRUE) lag2_( x, n = 1L, order = NULL, run_lengths = NULL, fill = NULL, recursive = TRUE )
x |
A vector or data frame. |
n |
Number of lags. Negative values are accepted. |
fill |
Value used to fill first n values. Default is |
set |
Should x be updated by reference? If |
recursive |
Should list elements be lagged as well?
If |
order |
Optionally specify an ordering with which to apply the lags. This is useful for example when applying lags chronologically using an unsorted time variable. |
run_lengths |
Optional integer vector of run lengths that defines the size of each
lag run. For example, supplying |
For most applications, it is more efficient and recommended to use lag_()
.
For anything that requires dynamic lags, lag by order of another variable,
or by-group lags, one can use lag2_()
.
To do cyclic lags, see the examples below for an implementation.
lag2_
lag2_
is a generalised form of lag_
that by default performs
simple lags and leads.
It has 3 additional features but does not support updating by reference or
long vectors.
These extra features include:
n
- This shares the same name as the n
argument in lag_
for consistency. The difference is that lag_
accepts a
lag vector of length 1 whereas
this accepts a vector of dynamic lags allowing for
flexible combinations of variable sized lags and leads.
These are recycled to the length of the data and will always align
with the data, meaning that if you supply a custom order
argument, this
ordering is applied both to x
and the recycled lag vector n
simultaneously.
order
- Apply lags in any order you wish. This can be useful for
reverse order lags, lags against unsorted time variables, and by-group lags.
run_lengths
- Specify the size of individual lag runs. For example, if you
specify run_lengths = c(3, 4, 2)
, this will apply your lags to the first 3 elements and
then reset, applying lags to the next 4 elements, to reset again and apply lags to the
final 2 elements. Each time the reset occurs, it treats each run length sized 'chunk' as
a unique and separate vector. See the examples for a showcase.
lag_
and lag2_
Description | lag_ |
lag2_ |
Lags | Yes | Yes |
Leads | Yes | Yes |
Long vector support | Yes | No |
Lag by reference | Yes | No |
Dynamic vectorised lags | No | Yes |
Data frame row lags | Yes | Yes |
Alternative order lags | No | Yes |
A lagged object the same size as x.
library(cheapr) library(bench) # A use-case for data.table # Adding 0 because can't update ALTREP by reference df <- data.frame(x = 1:10^5 + 0L) # Normal data frame lag sset(lag_(df), 1:10) # Lag these behind by 3 rows sset(lag_(df, 3, set = TRUE), 1:10) df$x[1:10] # x variable was updated by reference! # The above can be used naturally in data.table to lag data # without any copies # To perform regular R row lags, just make sure set is `FALSE` sset(lag_(as.data.frame(EuStockMarkets), 5), 1:10) # lag2_ is a generalised version of lag_ that allows # for much more complex lags x <- 1:10 # lag every 2nd element lag2_(x, n = c(1, 0)) # lag vector is recycled # Explicit Lag(3) using a vector of lags lags <- lag_sequence(length(x), 3, partial = FALSE) lag2_(x, n = lags) # Alternating lags and leads lag2_(x, c(1, -1)) # Lag only the 3rd element lags <- integer(length(x)) lags[3] <- 1L lag2_(x, lags) # lag in descending order (same as a lead) lag2_(x, order = 10:1) # lag that resets after index 5 lag2_(x, run_lengths = c(5, 5)) # lag with a time index years <- sample(2011:2020) lag2_(x, order = order(years)) # Example of how to do a cyclical lag n <- length(x) # When k >= 0 k <- min(3, n) lag2_(x, c(rep(-n + k, k), rep(k, n - k))) # When k < 0 k <- max(-3, -n) lag2_(x, c(rep(k, n + k), rep(n + k, -k))) # As it turns out, we can do a grouped lag # by supplying group sizes as run lengths and group order as the order set.seed(45) g <- sample(c("a", "b"), 10, TRUE) # NOTE: collapse::flag will not work unless g is already sorted! # This is not an issue with lag2_() collapse::flag(x, g = g) lag2_(x, order = order(g), run_lengths = collapse::GRP(g)$group.sizes) # For production code, we can of course make # this more optimised by using collapse::radixorderv() # Which calculates the order and group sizes all at once o <- collapse::radixorderv(g, group.sizes = TRUE) lag2_(x, order = o, run_lengths = attr(o, "group.sizes")) # Let's finally wrap this up in a nice grouped-lag function grouped_lag <- function(x, n = 1, g = integer(length(x))){ o <- collapse::radixorderv(g, group.sizes = TRUE, sort = FALSE) lag2_(x, n, order = o, run_lengths = attr(o, "group.sizes")) } # And voila! grouped_lag(x, g = g) # A method to extract this information from dplyr ## We can actually get this information easily from a `grouped_df` object ## Uncomment the below code to run the implementation # library(dplyr) # library(timeplyr) # eu_stock <- EuStockMarkets |> # ts_as_tibble() |> # group_by(stock_index = group) # groups <- group_data(eu_stock) # Group information # group_order <- unlist(groups$.rows) # Order of groups # group_sizes <- lengths_(groups$.rows) # Group sizes # # # by-stock index lag # lag2_(eu_stock$value, order = group_order, run_lengths = group_sizes) # # # Verifying this output is correct # eu_stock |> # ungroup() |> # mutate(lag1 = lag_(value), .by = stock_index) |> # mutate(lag2 = lag2_(value, order = group_order, run_lengths = group_sizes)) |> # summarise(lags_are_equal = identical(lag1, lag2)) # Let's compare this to data.table library(data.table) default_threads <- getDTthreads() setDTthreads(1) dt <- data.table(x = 1:10^5, g = sample.int(10^4, 10^5, TRUE)) bench::mark(dt[, y := shift(x), by = g][][["y"]], grouped_lag(dt$x, g = dt$g), iterations = 10) setDTthreads(default_threads)
library(cheapr) library(bench) # A use-case for data.table # Adding 0 because can't update ALTREP by reference df <- data.frame(x = 1:10^5 + 0L) # Normal data frame lag sset(lag_(df), 1:10) # Lag these behind by 3 rows sset(lag_(df, 3, set = TRUE), 1:10) df$x[1:10] # x variable was updated by reference! # The above can be used naturally in data.table to lag data # without any copies # To perform regular R row lags, just make sure set is `FALSE` sset(lag_(as.data.frame(EuStockMarkets), 5), 1:10) # lag2_ is a generalised version of lag_ that allows # for much more complex lags x <- 1:10 # lag every 2nd element lag2_(x, n = c(1, 0)) # lag vector is recycled # Explicit Lag(3) using a vector of lags lags <- lag_sequence(length(x), 3, partial = FALSE) lag2_(x, n = lags) # Alternating lags and leads lag2_(x, c(1, -1)) # Lag only the 3rd element lags <- integer(length(x)) lags[3] <- 1L lag2_(x, lags) # lag in descending order (same as a lead) lag2_(x, order = 10:1) # lag that resets after index 5 lag2_(x, run_lengths = c(5, 5)) # lag with a time index years <- sample(2011:2020) lag2_(x, order = order(years)) # Example of how to do a cyclical lag n <- length(x) # When k >= 0 k <- min(3, n) lag2_(x, c(rep(-n + k, k), rep(k, n - k))) # When k < 0 k <- max(-3, -n) lag2_(x, c(rep(k, n + k), rep(n + k, -k))) # As it turns out, we can do a grouped lag # by supplying group sizes as run lengths and group order as the order set.seed(45) g <- sample(c("a", "b"), 10, TRUE) # NOTE: collapse::flag will not work unless g is already sorted! # This is not an issue with lag2_() collapse::flag(x, g = g) lag2_(x, order = order(g), run_lengths = collapse::GRP(g)$group.sizes) # For production code, we can of course make # this more optimised by using collapse::radixorderv() # Which calculates the order and group sizes all at once o <- collapse::radixorderv(g, group.sizes = TRUE) lag2_(x, order = o, run_lengths = attr(o, "group.sizes")) # Let's finally wrap this up in a nice grouped-lag function grouped_lag <- function(x, n = 1, g = integer(length(x))){ o <- collapse::radixorderv(g, group.sizes = TRUE, sort = FALSE) lag2_(x, n, order = o, run_lengths = attr(o, "group.sizes")) } # And voila! grouped_lag(x, g = g) # A method to extract this information from dplyr ## We can actually get this information easily from a `grouped_df` object ## Uncomment the below code to run the implementation # library(dplyr) # library(timeplyr) # eu_stock <- EuStockMarkets |> # ts_as_tibble() |> # group_by(stock_index = group) # groups <- group_data(eu_stock) # Group information # group_order <- unlist(groups$.rows) # Order of groups # group_sizes <- lengths_(groups$.rows) # Group sizes # # # by-stock index lag # lag2_(eu_stock$value, order = group_order, run_lengths = group_sizes) # # # Verifying this output is correct # eu_stock |> # ungroup() |> # mutate(lag1 = lag_(value), .by = stock_index) |> # mutate(lag2 = lag2_(value, order = group_order, run_lengths = group_sizes)) |> # summarise(lags_are_equal = identical(lag1, lag2)) # Let's compare this to data.table library(data.table) default_threads <- getDTthreads() setDTthreads(1) dt <- data.table(x = 1:10^5, g = sample.int(10^4, 10^5, TRUE)) bench::mark(dt[, y := shift(x), by = g][][["y"]], grouped_lag(dt$x, g = dt$g), iterations = 10) setDTthreads(default_threads)
Functions to help work with lists.
lengths_(x, names = FALSE) unlisted_length(x) new_list(length = 0L, default = NULL)
lengths_(x, names = FALSE) unlisted_length(x) new_list(length = 0L, default = NULL)
x |
A list. |
names |
Should names of list elements be added? Default is |
length |
Length of list. |
default |
Default value for each list element. |
lengths_()
returns the list lengths. unlisted_length()
is an alternative to length(unlist(x))
. new_list()
is like vector("list", length)
but also allows you to specify
a default value for each list element. This can be useful for
initialising with a catch-all value so that when you unlist you're guaranteed
a list of length >= to the specified length.
library(cheapr) l <- list(1:10, NULL, list(integer(), NA_integer_, 2:10)) lengths_(l) # Faster lengths() unlisted_length(l) # length of vector if we unlist paste0("length: ", length(print(unlist(l)))) unlisted_length(l) - na_count(l) # Number of non-NA elements # We can create and initialise a new list with a default value l <- new_list(20, 0L) l[1:5] # This works well with vctrs_list_of objects
library(cheapr) l <- list(1:10, NULL, list(integer(), NA_integer_, 2:10)) lengths_(l) # Faster lengths() unlisted_length(l) # length of vector if we unlist paste0("length: ", length(print(unlist(l)))) unlisted_length(l) - na_count(l) # Number of non-NA elements # We can create and initialise a new list with a default value l <- new_list(20, 0L) l[1:5] # This works well with vctrs_list_of objects
...
) into a named listA fast and useful function for always returning a named list from ...
named_list(..., .keep_null = TRUE)
named_list(..., .keep_null = TRUE)
... |
Key-value pairs. |
.keep_null |
Should |
A named list.
Fast data frame constructor
new_df(..., .nrows = NULL, .recycle = FALSE, .name_repair = FALSE)
new_df(..., .nrows = NULL, .recycle = FALSE, .name_repair = FALSE)
... |
Key-value pairs. |
.nrows |
|
.recycle |
|
.name_repair |
|
A data.frame
summary()
inspired by the skimr packageA cheaper summary()
function, designed for larger data.
overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## Default S3 method: overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'logical' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'integer' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'numeric' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'integer64' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'character' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'factor' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'Date' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'POSIXt' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'ts' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'zoo' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'data.frame' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2))
overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## Default S3 method: overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'logical' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'integer' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'numeric' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'integer64' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'character' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'factor' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'Date' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'POSIXt' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'ts' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'zoo' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2)) ## S3 method for class 'data.frame' overview(x, hist = TRUE, digits = getOption("cheapr.digits", 2))
x |
A vector or data frame. |
hist |
Should in-line histograms be returned? Default is |
digits |
How many decimal places should the summary statistics be printed as? Default is 2. |
No rounding of statistics is done except in printing which can be controlled
either through the digits
argument in overview()
, or by setting the
option options(cheapr.digits)
.
To access the underlying data, for example the numeric summary,
just use $numeric
, e.g. overview(rnorm(30))$numeric
.
An object of class "overview". Under the hood this is just a list of data frames. Key summary statistics are reported in each data frame.
library(cheapr) overview(iris) # With histograms overview(airquality, hist = TRUE) # Round to 0 decimal places overview(airquality, digits = 0) # We can set an option for all overviews options(cheapr.digits = 1) overview(rnorm(100)) options(cheapr.digits = 2) # The default
library(cheapr) overview(iris) # With histograms overview(airquality, hist = TRUE) # Round to 0 decimal places overview(airquality, digits = 0) # We can set an option for all overviews options(cheapr.digits = 1) overview(rnorm(100)) options(cheapr.digits = 2) # The default
A convenience function to recycle R objects to either a common or specified size.
recycle(..., length = NULL)
recycle(..., length = NULL)
... |
Objects to recycle. |
length |
Optional length to recycle objects to. |
Data frames are recycled by recycling their rows. recycle()
is optimised to only recycle objects that need recycling. NULL
objects are ignored and not recycled or returned.
A list of recycled R objects.
library(cheapr) recycle(Sys.Date(), 1:10) # Any vectors of zero-length are all recycled to zero-length recycle(integer(), 1:10) # Data frame rows are recycled recycle(sset(iris, 1:3), length = 3 * 3) # To recycle list items, use `do.call()` my_list <- list(from = 1L, to = 10L, by = seq(0.1, 1, 0.1)) do.call(recycle, my_list)
library(cheapr) recycle(Sys.Date(), 1:10) # Any vectors of zero-length are all recycled to zero-length recycle(integer(), 1:10) # Data frame rows are recycled recycle(sset(iris, 1:3), length = 3 * 3) # To recycle list items, use `do.call()` my_list <- list(from = 1L, to = 10L, by = seq(0.1, 1, 0.1)) do.call(recycle, my_list)
sequence_
is an extension to sequence which
accepts decimal number increments. seq_id
can be paired with sequence_
to group individual sequences. seq_
is a vectorised version of seq. window_sequence
creates a vector of window sizes for rolling calculations. lag_sequence
creates a vector of lags for rolling calculations. lead_sequence
creates a vector of leads for rolling calculations.
sequence_(size, from = 1L, by = 1L, add_id = FALSE) seq_id(size) seq_(from = 1L, to = 1L, by = 1L, add_id = FALSE) seq_size(from, to, by = 1L) window_sequence(size, k, partial = TRUE, ascending = TRUE, add_id = FALSE) lag_sequence(size, k, partial = TRUE, add_id = FALSE) lead_sequence(size, k, partial = TRUE, add_id = FALSE)
sequence_(size, from = 1L, by = 1L, add_id = FALSE) seq_id(size) seq_(from = 1L, to = 1L, by = 1L, add_id = FALSE) seq_size(from, to, by = 1L) window_sequence(size, k, partial = TRUE, ascending = TRUE, add_id = FALSE) lag_sequence(size, k, partial = TRUE, add_id = FALSE) lead_sequence(size, k, partial = TRUE, add_id = FALSE)
size |
Vector of sequence lengths. |
from |
Start of sequence(s). |
by |
Unit increment of sequence(s). |
add_id |
Should the ID numbers of the sequences be added as names?
Default is |
to |
End of sequence(s). |
k |
Window/lag size. |
partial |
Should partial windows/lags be returned? Default is |
ascending |
Should window sequence be ascending? Default is |
sequence_()
works in the same way as sequence()
but can accept
non-integer by
values.
It also recycles from
and to
, in the same way as sequence()
.
If any of the sequences contain values > .Machine$integer.max
,
then the result will always be a double vector.
from
can be also be a date, date-time, or any object that supports
addition and multiplication.
seq_()
is a vectorised version of seq()
that strictly accepts
only the arguments from
, to
and by
.
A vector of length sum(size)
except for seq_
which
returns a vector of size sum((to - from) / (by + 1))
library(cheapr) sequence(1:3) sequence_(1:3) sequence(1:3, by = 0.1) sequence_(1:3, by = 0.1) # Add IDs to the sequences sequence_(1:3, by = 0.1, add_id = TRUE) # Turn this quickly into a data frame enframe_(sequence_(1:3, by = 0.1, add_id = TRUE)) sequence(c(3, 2), by = c(-0.1, 0.1)) sequence_(c(3, 2), by = c(-0.1, 0.1)) # Vectorised version of seq() seq_(1, 10, by = c(1, 0.5)) # Same as below c(seq(1, 10, 1), seq(1, 10, 0.5)) # Programmers may use seq_size() to determine final sequence lengths sizes <- seq_size(1, 10, by = c(1, 0.5)) print(paste(c("sequence sizes: (", sizes, ") total size:", sum(sizes)), collapse = " ")) # We can group sequences using seq_id from <- Sys.Date() to <- from + 10 by <- c(1, 2, 3) x <- seq_(from, to, by, add_id = TRUE) class(x) <- "Date" x # Utilities for rolling calculations window_sequence(c(3, 5), 3) window_sequence(c(3, 5), 3, partial = FALSE) window_sequence(c(3, 5), 3, partial = TRUE, ascending = FALSE) # One can for example use these in data.table::frollsum
library(cheapr) sequence(1:3) sequence_(1:3) sequence(1:3, by = 0.1) sequence_(1:3, by = 0.1) # Add IDs to the sequences sequence_(1:3, by = 0.1, add_id = TRUE) # Turn this quickly into a data frame enframe_(sequence_(1:3, by = 0.1, add_id = TRUE)) sequence(c(3, 2), by = c(-0.1, 0.1)) sequence_(c(3, 2), by = c(-0.1, 0.1)) # Vectorised version of seq() seq_(1, 10, by = c(1, 0.5)) # Same as below c(seq(1, 10, 1), seq(1, 10, 0.5)) # Programmers may use seq_size() to determine final sequence lengths sizes <- seq_size(1, 10, by = c(1, 0.5)) print(paste(c("sequence sizes: (", sizes, ") total size:", sum(sizes)), collapse = " ")) # We can group sequences using seq_id from <- Sys.Date() to <- from + 10 by <- c(1, 2, 3) x <- seq_(from, to, by, add_id = TRUE) class(x) <- "Date" x # Utilities for rolling calculations window_sequence(c(3, 5), 3) window_sequence(c(3, 5), 3, partial = FALSE) window_sequence(c(3, 5), 3, partial = TRUE, ascending = FALSE) # One can for example use these in data.table::frollsum
These functions transform your variable by reference, with no copies being made. It is advisable to only use these if you know what you are doing.
set_abs(x) set_floor(x) set_ceiling(x) set_trunc(x) set_exp(x) set_sqrt(x) set_change_sign(x) set_round(x, digits = 0) set_log(x, base = exp(1)) set_pow(x, y) set_add(x, y) set_subtract(x, y) set_multiply(x, y) set_divide(x, y)
set_abs(x) set_floor(x) set_ceiling(x) set_trunc(x) set_exp(x) set_sqrt(x) set_change_sign(x) set_round(x, digits = 0) set_log(x, base = exp(1)) set_pow(x, y) set_add(x, y) set_subtract(x, y) set_multiply(x, y) set_divide(x, y)
x |
A numeric vector. |
digits |
Number of digits to round to. |
base |
Logarithm base. |
y |
A numeric vector. |
These functions are particularly useful for situations
where you have made a copy and then
wish to perform further operations without creating more copies. NA
and NaN
values are ignored though in some instances NaN
values may
be replaced with NA
.
These functions will not work on any classed objects, meaning they
only work on standard integer and numeric vectors and matrices.
A copy is only made in certain instances, e.g. when passing an integer vector
to set_log()
. A warning will always be thrown in this instance alerting the user
to assign the output to an object because x
has not been updated by reference.
To ensure consistent and expected outputs, always assign the output to the same object,
e.g.
x <- set_log(x)
(do this) set_log(x)
(don't do this) x2 <- set_log(x)
(Don't do this either)
No copy is made here unless x is an integer vector.
The exact same object with no copy made, just transformed.
library(cheapr) library(bench) x <- rnorm(2e05) options(cheapr.cores = 2) mark( base = exp(log(abs(x))), cheapr = set_exp(set_log(set_abs(x))) ) options(cheapr.cores = 1)
library(cheapr) library(bench) x <- rnorm(2e05) options(cheapr.cores = 2) mark( base = exp(log(abs(x))), cheapr = set_exp(set_log(set_abs(x))) ) options(cheapr.cores = 1)
Extra utilities
setdiff_(x, y, dups = TRUE) intersect_(x, y, dups = TRUE) cut_numeric( x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, dig.lab = 3L, ordered_result = FALSE, ... ) x %in_% table x %!in_% table enframe_(x, name = "name", value = "value") deframe_(x) sample_(x, size = vector_length(x), replace = FALSE, prob = NULL) val_insert(x, value, n = NULL, prop = NULL) na_insert(x, n = NULL, prop = NULL) vector_length(x) cheapr_var(x, na.rm = TRUE) cheapr_rev(x) with_local_seed(expr, .seed = NULL, ...)
setdiff_(x, y, dups = TRUE) intersect_(x, y, dups = TRUE) cut_numeric( x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, dig.lab = 3L, ordered_result = FALSE, ... ) x %in_% table x %!in_% table enframe_(x, name = "name", value = "value") deframe_(x) sample_(x, size = vector_length(x), replace = FALSE, prob = NULL) val_insert(x, value, n = NULL, prop = NULL) na_insert(x, n = NULL, prop = NULL) vector_length(x) cheapr_var(x, na.rm = TRUE) cheapr_rev(x) with_local_seed(expr, .seed = NULL, ...)
x |
A vector or data frame. |
y |
A vector or data frame. |
dups |
Should duplicates be kept? Default is |
breaks |
See |
labels |
See |
include.lowest |
See |
right |
See |
dig.lab |
See |
ordered_result |
See |
... |
Further arguments passed onto |
table |
See |
name |
The column name to assign the names of a vector. |
value |
The column name to assign the values of a vector. |
size |
See |
replace |
See |
prob |
See |
n |
Number of scalar values (or |
prop |
Proportion of scalar values (or |
na.rm |
Should |
expr |
Expression that will be evaluated with a local seed that is independent and has absolutely no effect on the global RNG state. |
.seed |
A local seed to set which is only used inside
|
enframe()_
converts a vector to a data frame. deframe()_
converts a 1-2 column data frame to a vector. intersect_()
returns a vector of common values between x
and y
. setdiff_()
returns a vector of values in x
but not y
. cut_numeric()
places values of a numeric vector into buckets, defined
through the breaks
argument and returns a factor unless labels = FALSE
,
in which case an integer vector of break indices is returned. %in_%
and %!in_%
both return a logical vector signifying if the values of
x
exist or don't exist in table
respectively. sample_()
is an alternative to sample()
that natively samples
data frame rows through sset()
. It also does not have a special case when
length(x)
is 1. val_insert
inserts scalar values randomly into your vector.
Useful for replacing lots of data with a single value. na_insert
inserts NA
values randomly into your vector.
Useful for generating missing data. vector_length
behaves mostly like NROW()
except
for matrices in which it matches length()
.
cheapr_var
returns the variance of a numeric vector.
No coercion happens for integer vectors and so is very cheap. cheapr_rev
is a much cheaper version of rev()
. with_local_seed
offers no speed improvements but is extremely handy
in executing random number based expressions like rnorm()
without
affecting the global RNG state. It allows you to run these expressions in a
sort of independent 'container' and with an optional seed for that
'container' for reproducibility.
The rationale for including this in 'cheapr' is that it can reduce the need
to set many seed values,
especially for multiple output comparisons of RNG expressions.
Another way of thinking about it is that with_local_seed()
is a helper
that allows you to write reproducible code without side-effects, which
traditionally cannot be avoided when calling set.seed()
directly.
library(cheapr) # Using `with_local_seed()` # The below 2 statements are equivalent # Statement 1 set.seed(123456789) res <- rnorm(10) # Statement 2 res2 <- with_local_seed(rnorm(10), .seed = 123456789) # They are the same identical(res, res2) # As an example we can see that the RNG is unaffected by generating # random uniform deviates in batches between calls to `with_local_seed()` # and comparing to the first result set.seed(123456789) batch1 <- rnorm(2) with_local_seed(runif(10)) batch2 <- rnorm(2) with_local_seed(runif(10)) batch3 <- rnorm(1) with_local_seed(runif(10)) batch4 <- rnorm(5) # Combining the batches produces the same result # therefore `with_local_seed` did not interrupt the rng sequence identical(c(batch1, batch2, batch3, batch4), res) # It can be useful in multiple comparisons out1 <- with_local_seed(rnorm(5)) out2 <- with_local_seed(rnorm(5)) out3 <- with_local_seed(rnorm(5)) identical(out1, out2) identical(out1, out3)
library(cheapr) # Using `with_local_seed()` # The below 2 statements are equivalent # Statement 1 set.seed(123456789) res <- rnorm(10) # Statement 2 res2 <- with_local_seed(rnorm(10), .seed = 123456789) # They are the same identical(res, res2) # As an example we can see that the RNG is unaffected by generating # random uniform deviates in batches between calls to `with_local_seed()` # and comparing to the first result set.seed(123456789) batch1 <- rnorm(2) with_local_seed(runif(10)) batch2 <- rnorm(2) with_local_seed(runif(10)) batch3 <- rnorm(1) with_local_seed(runif(10)) batch4 <- rnorm(5) # Combining the batches produces the same result # therefore `with_local_seed` did not interrupt the rng sequence identical(c(batch1, batch2, batch3, batch4), res) # It can be useful in multiple comparisons out1 <- with_local_seed(rnorm(5)) out2 <- with_local_seed(rnorm(5)) out3 <- with_local_seed(rnorm(5)) identical(out1, out2) identical(out1, out3)
Cheaper alternative to [
that consistently subsets data frame
rows, always returning a data frame. There are explicit methods for
enhanced data frames like tibbles, data.tables and sf.
sset(x, ...) ## S3 method for class 'Date' sset(x, i, ...) ## S3 method for class 'POSIXct' sset(x, i, ...) ## S3 method for class 'factor' sset(x, i, ...) ## S3 method for class 'data.frame' sset(x, i, j, ...) ## S3 method for class 'tbl_df' sset(x, i, j, ...) ## S3 method for class 'POSIXlt' sset(x, i, j, ...) ## S3 method for class 'data.table' sset(x, i, j, ...) ## S3 method for class 'sf' sset(x, i, j, ...)
sset(x, ...) ## S3 method for class 'Date' sset(x, i, ...) ## S3 method for class 'POSIXct' sset(x, i, ...) ## S3 method for class 'factor' sset(x, i, ...) ## S3 method for class 'data.frame' sset(x, i, j, ...) ## S3 method for class 'tbl_df' sset(x, i, j, ...) ## S3 method for class 'POSIXlt' sset(x, i, j, ...) ## S3 method for class 'data.table' sset(x, i, j, ...) ## S3 method for class 'sf' sset(x, i, j, ...)
x |
Vector or data frame. |
... |
Further parameters passed to |
i |
A logical or vector of indices. |
j |
Column indices, names or logical vector. |
sset
is an S3 generic.
You can either write methods for sset
or [
. sset
will fall back on using [
when no suitable method is found.
To get into more detail, using sset()
on a data frame, a new
list is always allocated through new_list()
.
When i
is a logical vector, it is passed directly to which_()
.
This means that NA
values are ignored and this also means that i
is not recycled, so it is good practice to make sure the logical vector
matches the length of x. To return NA
values, use sset(x, NA_integer_)
.
When i
is an ALTREP compact sequence which can be commonly created
using e.g. 1:10
or using seq_len
, seq_along
and seq.int
,
sset
internally uses a range-based subsetting method which is faster and doesn't
allocate i
into memory.
A new vector, data frame, list, matrix or other R object.
library(cheapr) library(bench) # Selecting columns sset(airquality, j = "Temp") sset(airquality, j = 1:2) # Selecting rows sset(iris, 1:5) # Rows and columns sset(iris, 1:5, 1:5) sset(iris, iris$Sepal.Length > 7, c("Species", "Sepal.Length")) # Comparison against base x <- rnorm(10^4) mark(x[1:10^3], sset(x, 1:10^3)) mark(x[x > 0], sset(x, x > 0)) df <- data.frame(x = x) mark(df[df$x > 0, , drop = FALSE], sset(df, df$x > 0), check = FALSE) # Row names are different ## EXTRA: An easy way to incorporate cheapr into dplyr's filter() # cheapr_filter <- function(.data, ..., .by = NULL, .preserve = FALSE){ # filter_df <- .data |> # dplyr::mutate(..., .by = {{ .by }}, .keep = "none") # groups <- dplyr::group_vars(filter_df) # filter_df <- cheapr::sset(filter_df, j = setdiff(names(filter_df), groups)) # n_filters <- ncol(filter_df) # if (n_filters < 1){ # .data # } else { # dplyr::dplyr_row_slice(.data, cheapr::which_(Reduce(`&`, filter_df)), # preserve = .preserve) # } # }
library(cheapr) library(bench) # Selecting columns sset(airquality, j = "Temp") sset(airquality, j = 1:2) # Selecting rows sset(iris, 1:5) # Rows and columns sset(iris, 1:5, 1:5) sset(iris, iris$Sepal.Length > 7, c("Species", "Sepal.Length")) # Comparison against base x <- rnorm(10^4) mark(x[1:10^3], sset(x, 1:10^3)) mark(x[x > 0], sset(x, x > 0)) df <- data.frame(x = x) mark(df[df$x > 0, , drop = FALSE], sset(df, df$x > 0), check = FALSE) # Row names are different ## EXTRA: An easy way to incorporate cheapr into dplyr's filter() # cheapr_filter <- function(.data, ..., .by = NULL, .preserve = FALSE){ # filter_df <- .data |> # dplyr::mutate(..., .by = {{ .by }}, .keep = "none") # groups <- dplyr::group_vars(filter_df) # filter_df <- cheapr::sset(filter_df, j = setdiff(names(filter_df), groups)) # n_filters <- ncol(filter_df) # if (n_filters < 1){ # .data # } else { # dplyr::dplyr_row_slice(.data, cheapr::which_(Reduce(`&`, filter_df)), # preserve = .preserve) # } # }
These are primarily intended as very fast scalar-based functions
for developers.
They are particularly useful for working with NA
values in a fast
and efficient manner.
val_count(x, value, recursive = TRUE) count_val(x, value, recursive = TRUE) val_find(x, value, invert = FALSE) which_val(x, value, invert = FALSE) val_replace(x, value, replace, recursive = TRUE) na_replace(x, replace, recursive = TRUE) val_rm(x, value) na_count(x, recursive = TRUE) na_find(x, invert = FALSE) na_rm(x)
val_count(x, value, recursive = TRUE) count_val(x, value, recursive = TRUE) val_find(x, value, invert = FALSE) which_val(x, value, invert = FALSE) val_replace(x, value, replace, recursive = TRUE) na_replace(x, replace, recursive = TRUE) val_rm(x, value) na_count(x, recursive = TRUE) na_find(x, invert = FALSE) na_rm(x)
x |
A vector, list, data frame or matrix. |
value |
A scalar value to count, find, replace or remove. |
recursive |
Should values in a list be counted or replaced recursively?
Default is |
invert |
Should |
replace |
Replacement scalar value. |
The val_
functions allow you to very efficiently work with
scalars, i.e length 1 vectors. Many common common operations like
counting the occurrence of NA
or zeros, e.g. sum(x == 0)
or
sum(is.na(x))
can be replaced more efficiently with
val_count(x, 0)
and na_count(x)
respectively.
At the moment these functions only work for
integer, double and character vectors with the exception of the NA
functions.
They are intended mainly for developers who wish to write cheaper code
and reduce expensive vector operations.
val_count()
- Counts occurrences of a value
val_find()
Finds locations (indices) of a value
val_replace()
- Replaces value with another value
val_rm()
- Removes occurrences of value from an object
There are NA
equivalent convenience functions.
na_count()
== val_count(x, NA)
na_find()
== val_find(x, NA)
na_replace()
== val_replace(x, NA)
na_rm()
== val_rm(x, NA)
val_count()
and val_replace()
can work recursively. For example,
when applied to a data frame, na_replace
will replace NA
values across
the entire data frame with the specified replacement value.
In 'cheapr' function-naming conventions have not been consistent but
going forward
all scalar functions (including the NA
convenience functions) will be
prefixed with 'val_' and 'na_' respectively.
Functions named with the older naming scheme like which_na
may be
removed at some point in the future.
val_count()
returns the number of times a scalar value appears in a vector
or list. val_find()
returns the index locations of that scalar value. val_replace()
replaces a specified scalar value with a replacement scalar
value. If no instances of said value are found then the input x is returned
as is. na_replace()
is a convenience function
equivalent to val_replace(x, NA, ...)
. val_rm()
removes all instances of a specified scalar value.
If no instances are found, the original input x is returned as is.
which()
Exactly the same as which()
but more memory efficient.
which_(x, invert = FALSE)
which_(x, invert = FALSE)
x |
A logical vector. |
invert |
If |
This implementation is similar in speed to which()
but usually more memory efficient.
An unnamed integer vector.
library(cheapr) library(bench) x <- sample(c(TRUE, FALSE), 1e05, TRUE) x[sample.int(1e05, round(1e05/3))] <- NA mark(which_(TRUE), which(TRUE)) mark(which_(FALSE), which(FALSE)) mark(which_(logical()), which(logical())) mark(which_(x), which(x), iterations = 20) mark(base = which(is.na(match(x, TRUE))), collapse = collapse::whichv(x, TRUE, invert = TRUE), cheapr = which_(x, invert = TRUE), iterations = 20)
library(cheapr) library(bench) x <- sample(c(TRUE, FALSE), 1e05, TRUE) x[sample.int(1e05, round(1e05/3))] <- NA mark(which_(TRUE), which(TRUE)) mark(which_(FALSE), which(FALSE)) mark(which_(logical()), which(logical())) mark(which_(x), which(x), iterations = 20) mark(base = which(is.na(match(x, TRUE))), collapse = collapse::whichv(x, TRUE, invert = TRUE), cheapr = which_(x, invert = TRUE), iterations = 20)