| Title: | Fast Alternatives to 'tidyverse' Functions |
|---|---|
| Description: | A full set of fast data manipulation tools with a tidy front-end and a fast back-end using 'collapse' and 'cheapr'. |
| Authors: | Nick Christofides [aut, cre] (ORCID: <https://orcid.org/0000-0002-9743-7342>) |
| Maintainer: | Nick Christofides <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.9.91 |
| Built: | 2026-05-29 19:37:30 UTC |
| Source: | https://github.com/nicchr/fastplyr |
Add a column of useful IDs (group IDs, row IDs & consecutive IDs)

    add_group_id(.data, ...)

    ## S3 method for class 'data.frame'
    add_group_id(
      .data, ..., .order = group_by_order_default(.data), .ascending = TRUE,
      .by = NULL, .cols = NULL, .name = NULL, as_qg = FALSE
    )

    add_row_id(.data, ...)

    ## S3 method for class 'data.frame'
    add_row_id(.data, ..., .ascending = TRUE, .by = NULL, .cols = NULL, .name = NULL)

    add_consecutive_id(.data, ...)

    ## S3 method for class 'data.frame'
    add_consecutive_id(
      .data, ..., .order = group_by_order_default(.data),
      .by = NULL, .cols = NULL, .name = NULL
    )

| .data | A data frame. |
|---|---|
| ... | Additional groups using tidy |
| .order | Should the groups be ordered? |
| .ascending | Should the order be ascending or descending? The default is |
| .by | Alternative way of supplying groups using |
| .cols | (Optional) alternative to |
| .name | Name of the added ID column which should be a character vector of length 1. If |
| as_qg | Should the group IDs be returned as a collapse "qG" class? The default ( |

A data frame with the requested ID column.
group_id row_id f_consecutive_id
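
A minimal sketch of the three helpers on a toy data frame. The data and the column names passed to `.name` are made up for illustration, and `.order` is set explicitly so the result does not depend on any global option.

```r
library(fastplyr)

# Toy data: `g` is a hypothetical grouping column
df <- data.frame(g = c("b", "a", "b", "a", "a"))

# Sorted group IDs: "a" -> 1, "b" -> 2
ids_sorted <- add_group_id(df, g, .order = TRUE, .name = "gid")

# IDs by order of first appearance: "b" -> 1, "a" -> 2
ids_first <- add_group_id(df, g, .order = FALSE, .name = "gid")

# Running row number within each group
ids_row <- add_row_id(df, g, .name = "rid")

# ID that increments whenever `g` differs from the previous row
ids_run <- add_consecutive_id(df, g, .name = "cid")
```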
An alternative to dplyr::desc() which is much faster
for character vectors and factors.

    desc(x)

| x | Vector. |
|---|---|

A numeric vector that can be ordered in ascending or descending order.
Useful in dplyr::arrange() or f_arrange().
collapse version of dplyr::arrange()
This is a fast and near-identical alternative to dplyr::arrange()
using the collapse package.
desc() is like dplyr::desc() but works faster when
called directly on vectors.

    f_arrange(
      .data, ..., .by = NULL, .by_group = FALSE, .cols = NULL,
      .descending = FALSE, .in_place = FALSE
    )

| .data | A data frame. |
|---|---|
| ... | Variables to arrange by. |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using |
| .by_group | If |
| .cols | (Optional) alternative to |
| .descending | |
| .in_place | Should data be sorted in-place? This can be very efficient for large data frames and can be safely used when overwriting a freshly allocated data frame. If you're unsure whether the data frame is a freshly allocated object, use Please note that no new vectors and no copies are created, data is directly sorted in-memory. This only works on data frames consisting of atomic vectors. |

A sorted data.frame.
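
As a small illustration of `f_arrange()` combined with `desc()`, the sketch below sorts a made-up data frame by one column ascending and another descending. The data and column names are hypothetical.

```r
library(fastplyr)

df <- data.frame(g = c("b", "a", "c", "a"), x = c(2, 4, 1, 3))

# Ascending by `g`, then descending by `x` within equal values of `g`
sorted <- f_arrange(df, g, desc(x))
```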
Faster bind rows and columns.

    f_bind_rows(...)

    f_bind_cols(..., .repair_names = TRUE, .recycle = TRUE)

| ... | Data frames to bind. |
|---|---|
| .repair_names | Should duplicate column names be made unique? Default is |
| .recycle | Should inputs be recycled to a common row size? Default is |

f_bind_rows() performs a union of the data frames specified via ... and
joins the rows of all the data frames, without removing duplicates.
f_bind_cols() joins the columns, creating unique column names if there are
any duplicates by default.
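
A short sketch of both functions on two made-up data frames that share column names, so that the duplicate-row and duplicate-name behaviour described above is visible.

```r
library(fastplyr)

a <- data.frame(id = 1:2, value = c(10, 20))
b <- data.frame(id = 2:3, value = c(20, 30))

# Stack rows; the repeated row (id = 2, value = 20) is kept
stacked <- f_bind_rows(a, b)

# Bind side by side; the clashing names are made unique by default
wide <- f_bind_cols(a, b)
```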
Near-identical alternative to dplyr::count().

    f_count(
      .data, ..., wt = NULL, sort = FALSE, .order = group_by_order_default(.data),
      name = NULL, .by = NULL, .cols = NULL
    )

    f_add_count(
      .data, ..., wt = NULL, sort = FALSE, .order = group_by_order_default(.data),
      name = NULL, .by = NULL, .cols = NULL
    )

| .data | A data frame. |
|---|---|
| ... | Variables to group by. |
| wt | Frequency weights. Can be |
| sort | If |
| .order | Should the groups be calculated as ordered groups? If |
| name | The name of the new column in the output. If there's already a column called |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
| .cols | (Optional) alternative to |

This is a fast and near-identical alternative to dplyr::count() using the collapse package.
Unlike collapse::fcount(), this works very similarly to dplyr::count().
The only main difference is that anything supplied to wt
is recycled and added as a data variable.
Other than that everything works exactly as the dplyr equivalent.
f_count() and f_add_count() can be more than 100x faster than the dplyr equivalents.
A data.frame of frequency counts by group.
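
A minimal sketch of counting, adding counts and weighted counting. The data and the output column names are invented for illustration, and the weighted example assumes `wt` is summed within each group as in dplyr::count().

```r
library(fastplyr)

df <- data.frame(g = c("b", "a", "b", "b"), w = c(1, 2, 3, 4))

# One row per group with the number of rows in that group
counts <- f_count(df, g, name = "n_rows")

# Same rows as `df`, with the group count added as a column
with_counts <- f_add_count(df, g, name = "n_rows")

# Weighted count (assumed to sum `w` within each group)
weighted <- f_count(df, g, wt = w, name = "total")
```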
Like dplyr::distinct() but faster when lots of
groups are involved.

    f_distinct(
      .data, ..., .keep_all = FALSE, .order = FALSE, .sort = deprecated(),
      .by = NULL, .cols = NULL
    )

| .data | A data frame. |
|---|---|
| ... | Variables used to find distinct rows. |
| .keep_all | If |
| .order | Should the groups be calculated as ordered groups? Setting to |
| .sort | |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
| .cols | (Optional) alternative to |

A data.frame of distinct groups.
Find duplicate rows

    f_duplicates(
      .data, ..., .keep_all = FALSE, .both_ways = FALSE, .add_count = FALSE,
      .drop_empty = FALSE, .order = FALSE, .sort = deprecated(),
      .by = NULL, .cols = NULL
    )

| .data | A data frame. |
|---|---|
| ... | Variables used to find duplicate rows. |
| .keep_all | If |
| .both_ways | If |
| .add_count | If |
| .drop_empty | If |
| .order | Should the groups be calculated as ordered groups? Setting to |
| .sort | |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
| .cols | (Optional) alternative to |

This function works like dplyr::distinct() in its handling of
arguments and data-masking but returns duplicate rows.
In certain situations it can be much faster than data |> group_by() |> filter(n() > 1)
when there are many groups.
A data.frame of duplicate rows.
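
A small sketch on made-up data. It assumes that the default returns only the repeated occurrences of a row, and that `.both_ways = TRUE` also keeps the first occurrence of each duplicated row; the argument description above is truncated, so treat that reading as an assumption.

```r
library(fastplyr)

df <- data.frame(
  id = c(1, 2, 2, 3, 3, 3),
  v  = c("a", "b", "b", "c", "c", "c")
)

# Rows that repeat an earlier row (assumed default behaviour)
dups <- f_duplicates(df)

# Assumed: first occurrences of duplicated rows are kept as well
dups_all <- f_duplicates(df, .both_ways = TRUE)
```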
Fast versions of tidyr::expand() and tidyr::complete().

    f_expand(.data, ..., .sort = FALSE, .by = NULL, .cols = NULL)

    f_complete(.data, ..., .sort = FALSE, .by = NULL, .cols = NULL, fill = NA)

    crossing(..., .sort = FALSE)

    nesting(..., .sort = FALSE)

| .data | A data frame |
|---|---|
| ... | Variables to expand. |
| .sort | Logical. If |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
| .cols | (Optional) alternative to |
| fill | A named list containing value-name pairs to fill the named implicit missing values. |

crossing and nesting are helpers that are basically identical to
tidyr's crossing and nesting.
A data.frame of expanded groups.
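
The sketch below shows the difference between expanding and completing on a hypothetical data frame that is missing one year/site combination.

```r
library(fastplyr)

df <- data.frame(
  year = c(2020, 2020, 2021),
  site = c("a", "b", "a"),
  n    = c(5, 7, 9)
)

# All 2 x 2 combinations of year and site
combos <- f_expand(df, year, site)

# Original rows plus the missing (2021, "b") combination, with `n` as NA
completed <- f_complete(df, year, site)

# Combinations of free-standing vectors
grid <- crossing(x = 1:2, y = c("u", "v"))
```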
Fill NA values forwards and backwards

    f_fill(
      .data, ..., .by = NULL, .cols = NULL,
      .direction = c("forwards", "backwards"),
      .fill_limit = Inf, .new_names = "{.col}"
    )

| .data | A data frame. |
|---|---|
| ... | Cols to fill |
| .by | Cols to group by for this operation. Specified through |
| .cols | (Optional) alternative to |
| .direction | Which direction should |
| .fill_limit | The maximum number of consecutive |
| .new_names | A name specification for the names of filled variables. The default |

A data frame with NA values filled forward or backward.
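
A minimal sketch of forward, backward and grouped filling. The data are invented; the expected values in the comments follow from filling each NA with the nearest non-missing value in the stated direction.

```r
library(fastplyr)

df <- data.frame(g = c(1, 1, 1, 2, 2), x = c(NA, 3, NA, NA, 5))

# Forward fill: NA 3 3 3 5 (the leading NA has nothing before it)
fwd <- f_fill(df, x)

# Backward fill: 3 3 5 5 5
bwd <- f_fill(df, x, .direction = "backwards")

# Grouped forward fill: values are not carried across groups -> NA 3 3 NA 5
by_g <- f_fill(df, x, .by = g)
```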
Alternative to dplyr::filter()

    f_filter(.data, ..., .by = NULL)

| .data | A data frame. |
|---|---|
| ... | Expressions used to filter the data frame with. |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |

A filtered data frame.
dplyr::group_by()
This works the exact same as dplyr::group_by() and typically
performs around the same speed but uses slightly less memory.

    f_group_by(
      .data, ..., .add = FALSE, .order = group_by_order_default(.data),
      .by = NULL, .cols = NULL, .drop = df_group_by_drop_default(.data)
    )

| .data | data frame. |
|---|---|
| ... | Variables to group by. |
| .add | Should groups be added to existing groups? Default is |
| .order | Should groups be ordered? If |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using |
| .cols | (Optional) alternative to |
| .drop | Should unused factor levels be dropped? Default is |

f_group_by() works almost exactly like the 'dplyr' equivalent.
An attribute "ordered" (TRUE or FALSE) is added to the group data to
signify if the groups are sorted or not.
The distinction between ordered and sorted is somewhat subtle.
Functions in fastplyr that use a sort argument generally refer
to the top-level dataset being sorted in some way, either by sorting
the group columns like in f_expand() or f_distinct(), or
some other columns, like the count column in f_count().
The .order argument, when set to TRUE (the default),
is used to mean that the group data will be calculated
using a sort-based algorithm, leading to sorted group data.
When .order is FALSE, the group data will be returned based on
the order-of-first appearance of the groups in the data.
This order-of-first appearance may still naturally be sorted
depending on the data.
For example, group_id(1:3, order = TRUE) results in the same group IDs
as group_id(1:3, order = FALSE) because 1, 2, and 3 appear in the data in
ascending sequence, whereas group_id(3:1, order = TRUE) does not equal
group_id(3:1, order = FALSE).
Part of the reason for the distinction is that internally fastplyr
can in theory calculate group data
using the sort-based algorithm and still return unsorted groups,
though this combination is only available to the user in limited places like
f_distinct(.order = TRUE, .sort = FALSE).
The other reason is to prevent confusion in the meaning
of sort and order so that order always refers to the
algorithm specified, resulting in sorted groups, and sort implies a
physical sorting of the returned data. It's also worth mentioning that
in most functions, sort will implicitly utilise the sort-based algorithm
specified via order = TRUE.
In many situations (not all) it can be faster to use the
order-of-first appearance algorithm, specified via .order = FALSE.
This can generally be accessed by first calling
f_group_by(data, ..., .order = FALSE) and then
performing your calculations.
To utilise this algorithm more globally and package-wide,
set the '.fastplyr.order.groups' option to FALSE using the code:
options(.fastplyr.order.groups = FALSE).
f_group_by() returns a grouped_df that can be used
for further grouped calculations.
group_ordered() returns TRUE if the group data are sorted,
i.e. if attr(attr(data, "groups"), "ordered") == TRUE. If sorted,
which is usually the default, this leads to summary calculations
like f_summarise() or dplyr::summarise() producing sorted groups.
If FALSE they are returned based on order-of-first appearance in the data.
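
The distinction between the sort-based and the order-of-first-appearance algorithms can be sketched as follows. The data frame is hypothetical, and `.order` is passed explicitly so that the result does not depend on the global option.

```r
library(fastplyr)

# Sorted vs. first-appearance group IDs for the same vector
ids_sorted <- group_id(c(3, 2, 1), order = TRUE)   # 3 2 1
ids_first  <- group_id(c(3, 2, 1), order = FALSE)  # 1 2 3

df <- data.frame(g = c("b", "a", "b"), x = 1:3)

by_sorted <- f_group_by(df, g, .order = TRUE)
by_first  <- f_group_by(df, g, .order = FALSE)

# The "ordered" attribute records which algorithm was used
sorted_flag <- group_ordered(by_sorted)  # TRUE
first_flag  <- group_ordered(by_first)   # FALSE

# Group keys follow the chosen algorithm
keys_sorted <- f_group_keys(by_sorted)$g  # "a" "b"
keys_first  <- f_group_keys(by_first)$g   # "b" "a"
```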
Alternative to dplyr::group_split

    f_group_split(
      .data, ..., .add = FALSE, .order = group_by_order_default(.data),
      .by = NULL, .cols = NULL, .drop = df_group_by_drop_default(.data),
      .group_names = FALSE
    )

| .data | data frame. |
|---|---|
| ... | Variables to group by. |
| .add | Should groups be added to existing groups? Default is |
| .order | Should groups be ordered? If |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using |
| .cols | (Optional) alternative to |
| .drop | Should unused factor levels be dropped? Default is |
| .group_names | Should group names be added? Default is |

A list of data frames split by group.
Mostly a wrapper around collapse::join() that behaves more like
dplyr's joins. List columns, lubridate intervals and vctrs rcrds
work here too.

    f_left_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

    f_right_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

    f_inner_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

    f_full_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

    f_anti_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

    f_semi_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

    f_cross_join(x, y, suffix = c(".x", ".y"), ...)

    f_union_all(x, y, ...)

    f_union(x, y, ...)

| x | Left data frame. |
|---|---|
| y | Right data frame. |
| by | |
| suffix | |
| multiple | |
| keep | |
| ... | Additional arguments passed to |

A joined data frame, joined on the columns specified with by, using an
equality join.
f_cross_join() returns all possible combinations
between the two data frames.
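
A brief sketch of the main join types on two made-up data frames. Passing `by = "id"` as a character vector is an assumption based on the dplyr-like interface, since the `by` description above is empty.

```r
library(fastplyr)

x <- data.frame(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- data.frame(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

left  <- f_left_join(x, y, by = "id")   # all rows of x; `b` is NA for id 1
inner <- f_inner_join(x, y, by = "id")  # ids 2 and 3 only
full  <- f_full_join(x, y, by = "id")   # ids 1 to 4
anti  <- f_anti_join(x, y, by = "id")   # rows of x with no match in y
cross <- f_cross_join(x, y)             # 3 x 3 = 9 combinations
```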
A faster mutate() with per-group optimisations

    f_mutate(.data, ..., .by = NULL, .order = group_by_order_default(.data), .keep = "all")

| .data | A data frame. |
|---|---|
| ... | Name-value pairs of summary functions. Expressions with |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
| .order | Should the groups be returned in sorted order? If |
| .keep | Which columns to keep. Options are 'all', 'used', 'unused' and 'none'. |

A data frame with added columns.
fastplyr data-masking functions like f_mutate and f_summarise operate
very similarly to their dplyr counterparts but with some crucial
differences.
Optimisations for by-group operations kick in for
common statistical functions which are detailed below.
A message will be printed which one can disable
by running options(fastplyr.inform = FALSE).
When this happens, the expressions which become optimised no longer
obey data-masking rules pertaining to sequential and dependent expression
execution.
For example,
the pseudo code
f_summarise(data, mean = mean(x), mean2 = round(mean), .by = g)
when optimised will not work because the named col mean will not be visible
in later expressions.
One can disable fastplyr optimisations
globally by running options(fastplyr.optimise = F).
Some functions are internally optimised using 'collapse' fast statistical functions. This makes execution on many groups very fast.
For fast quantiles (percentiles) by group, see tidy_quantiles
List of currently optimised functions

    dplyr::n -> <custom_expression>
    dplyr::row_number -> <custom_expression> (only for f_mutate)
    dplyr::cur_group -> <custom_expression>
    dplyr::cur_group_id -> <custom_expression>
    dplyr::cur_group_rows -> <custom_expression> (only for f_mutate)
    dplyr::lag -> <custom_expression> (only for f_mutate)
    dplyr::lead -> <custom_expression> (only for f_mutate)
    base::sum -> collapse::fsum
    base::prod -> collapse::fprod
    base::min -> collapse::fmin
    base::max -> collapse::fmax
    stats::mean -> collapse::fmean
    stats::median -> collapse::fmedian
    stats::sd -> collapse::fsd
    stats::var -> collapse::fvar
    dplyr::first -> collapse::ffirst
    dplyr::last -> collapse::flast
    dplyr::n_distinct -> collapse::fndistinct

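
One way to sidestep the sequential-expression caveat described above is to put the dependent expression in a separate call, so that it only refers to columns that already exist. The sketch below is an illustration under that assumption, with made-up data and column names; it is not the only possible workaround.

```r
library(fastplyr)

options(fastplyr.inform = FALSE)  # silence the optimisation message

df <- data.frame(g = c("a", "a", "b"), x = c(1.2, 2.4, 10.6))

# Step 1: grouped summary (mean() is on the optimised list)
means <- f_summarise(df, avg = mean(x), .by = g)

# Step 2: the dependent expression goes in its own call,
# so it can see the `avg` column created in step 1
means <- f_mutate(means, avg_rounded = round(avg))

# Grouped mutate: every row receives its group's total
totals <- f_mutate(df, total = sum(x), .by = g)
```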
A faster nest_by().

    f_nest_by(
      .data, ..., .add = FALSE, .order = group_by_order_default(.data),
      .by = NULL, .cols = NULL, .drop = df_group_by_drop_default(.data)
    )

| .data | data frame. |
|---|---|
| ... | Variables to group by. |
| .add | Should groups be added to existing groups? Default is |
| .order | Should groups be ordered? If |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using |
| .cols | (Optional) alternative to |
| .drop | Should unused factor levels be dropped? Default is |

A row-wise grouped_df of the corresponding data of each group.

    library(dplyr)
    library(fastplyr)

    # Stratified linear-model example

    models <- iris |>
      f_nest_by(Species) |>
      mutate(model = list(lm(Sepal.Length ~ Petal.Width + Petal.Length, data = first(data))),
             summary = list(summary(first(model))),
             r_sq = first(summary)$r.squared)
    models
    models$summary

    # dplyr's `nest_by()` is admittedly more convenient
    # as it performs a double bracket subset `[[` on list elements for you
    # which we have emulated by using `first()`

    # `f_nest_by()` is faster when many groups are involved

    models <- iris |>
      nest_by(Species) |>
      mutate(model = list(lm(Sepal.Length ~ Petal.Width + Petal.Length, data = data)),
             summary = list(summary(model)),
             r_sq = summary$r.squared)
    models$summary

    models$summary[[1]]

A faster reframe() with per-group optimisations

    f_reframe(.data, ..., .by = NULL, .order = group_by_order_default(.data))

| .data | A data frame. |
|---|---|
| ... | Name-value pairs of summary functions. Expressions with |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
| .order | Should the groups be returned in sorted order? If |

A data frame of specified results.
fastplyr data-masking functions like f_mutate and f_summarise operate
very similarly to their dplyr counterparts but with some crucial
differences.
Optimisations for by-group operations kick in for
common statistical functions which are detailed below.
A message will be printed which one can disable
by running options(fastplyr.inform = FALSE).
When this happens, the expressions which become optimised no longer
obey data-masking rules pertaining to sequential and dependent expression
execution.
For example,
the pseudo code
f_summarise(data, mean = mean(x), mean2 = round(mean), .by = g)
when optimised will not work because the named col mean will not be visible
in later expressions.
One can disable fastplyr optimisations
globally by running options(fastplyr.optimise = F).
Some functions are internally optimised using 'collapse' fast statistical functions. This makes execution on many groups very fast.
For fast quantiles (percentiles) by group, see tidy_quantiles
List of currently optimised functions

    dplyr::n -> <custom_expression>
    dplyr::row_number -> <custom_expression> (only for f_mutate)
    dplyr::cur_group -> <custom_expression>
    dplyr::cur_group_id -> <custom_expression>
    dplyr::cur_group_rows -> <custom_expression> (only for f_mutate)
    dplyr::lag -> <custom_expression> (only for f_mutate)
    dplyr::lead -> <custom_expression> (only for f_mutate)
    base::sum -> collapse::fsum
    base::prod -> collapse::fprod
    base::min -> collapse::fmin
    base::max -> collapse::fmax
    stats::mean -> collapse::fmean
    stats::median -> collapse::fmedian
    stats::sd -> collapse::fsd
    stats::var -> collapse::fvar
    dplyr::first -> collapse::ffirst
    dplyr::last -> collapse::flast
    dplyr::n_distinct -> collapse::fndistinct

fastplyr currently cannot handle rowwise_df objects created through
dplyr::rowwise() and so this is a convenience function to allow you to
perform row-wise operations.
For common efficient row-wise functions,
see the 'kit' package.

    f_rowwise(.data, ..., .ascending = TRUE, .cols = NULL, .name = ".row_id")

| .data | data frame. |
|---|---|
| ... | Variables to group by using |
| .ascending | Should data be grouped in ascending row-wise order? Default is |
| .cols | (Optional) alternative to |
| .name | Name of row-id column to be added. |

A row-wise grouped_df.
select()/rename()/pull()
f_select() operates the exact same way as dplyr::select() and
can be used naturally with tidy-select helpers.
It uses collapse to perform the actual selecting of variables and is
considerably faster than dplyr for selecting exact columns,
and even more so when supplying the .cols argument.

    f_select(data, ..., .cols = NULL)

    f_rename(data, ..., .cols = NULL)

    f_pull(data, ..., .cols = NULL)

    nothing()

| data | A data frame. |
|---|---|
| ... | Variables to select using |
| .cols | (Optional) faster alternative to |

A data.frame of selected columns.
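
A minimal sketch of selecting, renaming and pulling columns. The data frame is made up, and the `new_name = old_name` form for `f_rename()` is assumed to follow dplyr::rename().

```r
library(fastplyr)

df <- data.frame(a = 1:3, b = letters[1:3], c = c(TRUE, FALSE, TRUE))

sel_tidy <- f_select(df, a, c)                 # tidy-select
sel_cols <- f_select(df, .cols = c("a", "c"))  # character vector via .cols
renamed  <- f_rename(df, id = a)               # assumed: new_name = old_name
pulled   <- f_pull(df, b)                      # one column as a plain vector
```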
dplyr::slice()
When there are lots of groups, the f_slice() functions are much faster.

    f_slice(
      .data, i = 0L, ..., .by = NULL,
      .order = group_by_order_default(.data), keep_order = FALSE
    )

    f_slice_head(
      .data, n, prop, .by = NULL,
      .order = group_by_order_default(.data), keep_order = FALSE
    )

    f_slice_tail(
      .data, n, prop, .by = NULL,
      .order = group_by_order_default(.data), keep_order = FALSE
    )

    f_slice_min(
      .data, order_by, n, prop, .by = NULL, with_ties = TRUE, na_rm = FALSE,
      .order = group_by_order_default(.data), keep_order = FALSE
    )

    f_slice_max(
      .data, order_by, n, prop, .by = NULL, with_ties = TRUE, na_rm = FALSE,
      .order = group_by_order_default(.data), keep_order = FALSE
    )

    f_slice_sample(
      .data, n, replace = FALSE, prop, .by = NULL,
      .order = group_by_order_default(.data), keep_order = FALSE, weights = NULL
    )

| .data | A data frame. |
|---|---|
| i | An integer vector of slice locations. |
| ... | A temporary argument to give the user an error if dots are used. |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
| .order | Should the groups be returned in sorted order? If |
| keep_order | Should the sliced data frame be returned in its original order? The default is |
| n | Number of rows. |
| prop | Proportion of rows. |
| order_by | Variables to order by. |
| with_ties | Should ties be kept together? The default is |
| na_rm | Should missing values in |
| replace | Should |
| weights | Probability weights used in |

i argument in f_slice
i is first evaluated on an un-grouped basis and then searches for
those locations in each group. Thus if you supply an expression
of slice locations that vary by-group, this will not be respected nor checked.
For example,
do f_slice(data, 10:20, .by = group)
not f_slice(data, sample(1:10), .by = group).
The former results in slice locations that do not vary by group but the latter
will result in different within-group slice locations which f_slice cannot
correctly compute.
To do the latter type of by-group slicing, use f_filter, e.g. f_filter(data, row_number() %in% slices, .by = groups)
or, even faster, after library(cheapr): f_filter(data, row_number() %in_% slices, .by = groups)
f_slice_sample
The arguments of f_slice_sample() align more closely with base::sample(), so
by default it re-samples each entire group without replacement.
A data.frame filtered on the specified row indices.
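
A short sketch of grouped slicing with slice locations that do not vary by group, which is the supported case described above. The data are invented; results are compared after sorting because the row order of the output depends on `.order` and `keep_order`.

```r
library(fastplyr)

df <- data.frame(g = c("b", "a", "b", "a", "b"), x = 1:5)

first_two <- f_slice(df, 1:2, .by = g)                 # rows 1-2 of every group
heads <- f_slice_head(df, n = 1, .by = g)              # first row per group
tails <- f_slice_tail(df, n = 1, .by = g)              # last row per group
top   <- f_slice_max(df, order_by = x, n = 1, .by = g) # largest x per group
```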
Like dplyr::summarise() but with some internal optimisations
for common statistical functions.

    f_summarise(.data, ..., .by = NULL, .order = group_by_order_default(.data))

    f_summarize(.data, ..., .by = NULL, .order = group_by_order_default(.data))

| .data | A data frame. |
|---|---|
| ... | Name-value pairs of summary functions. Expressions with |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
| .order | Should the groups be returned in sorted order? If |

An un-grouped data frame of summaries by group.
fastplyr data-masking functions like f_mutate and f_summarise operate
very similarly to their dplyr counterparts but with some crucial
differences.
Optimisations for by-group operations kick in for
common statistical functions which are detailed below.
A message will be printed which one can disable
by running options(fastplyr.inform = FALSE).
When this happens, the expressions which become optimised no longer
obey data-masking rules pertaining to sequential and dependent expression
execution.
For example,
the pseudo code
f_summarise(data, mean = mean(x), mean2 = round(mean), .by = g)
when optimised will not work because the named col mean will not be visible
in later expressions.
One can disable fastplyr optimisations
globally by running options(fastplyr.optimise = F).
Some functions are internally optimised using 'collapse' fast statistical functions. This makes execution on many groups very fast.
For fast quantiles (percentiles) by group, see tidy_quantiles
List of currently optimised functions

    dplyr::n -> <custom_expression>
    dplyr::row_number -> <custom_expression> (only for f_mutate)
    dplyr::cur_group -> <custom_expression>
    dplyr::cur_group_id -> <custom_expression>
    dplyr::cur_group_rows -> <custom_expression> (only for f_mutate)
    dplyr::lag -> <custom_expression> (only for f_mutate)
    dplyr::lead -> <custom_expression> (only for f_mutate)
    base::sum -> collapse::fsum
    base::prod -> collapse::fprod
    base::min -> collapse::fmin
    base::max -> collapse::fmax
    stats::mean -> collapse::fmean
    stats::median -> collapse::fmedian
    stats::sd -> collapse::fsd
    stats::var -> collapse::fvar
    dplyr::first -> collapse::ffirst
    dplyr::last -> collapse::flast
    dplyr::n_distinct -> collapse::fndistinct


    library(fastplyr)
    library(nycflights13)
    library(dplyr)
    options(fastplyr.inform = FALSE)

    # Number of flights per month, including first and last day
    flights |>
      f_group_by(year, month) |>
      f_summarise(first_day = first(day),
                  last_day = last(day),
                  num_flights = n())

    ## Fast mean summary using `across()`

    flights |>
      f_summarise(
        across(where(is.numeric), mean),
        .by = tailnum
      )

    flights |>
      f_group_by(.cols = "tailnum") |>
      f_summarise(
        across(where(is.numeric), mean)
      )

Un-group grouped_df

    f_ungroup(data)

    group_ordered(data)

| data | A data frame. |
|---|---|

An un-grouped data frame.
Helper functions to allow users to:
Enable or disable optimisations for common functions package-wide
Enable or disable informative messages

    fastplyr_enable_optimisations()

    fastplyr_disable_optimisations()

    fastplyr_enable_informative_msgs()

    fastplyr_disable_informative_msgs()

Enables or disables fastplyr global options invisibly.
Get list of current group-unaware functions

    get_group_unaware_fns()

A named list of functions marked as group-unaware in fastplyr.

    library(fastplyr)

    fns <- get_group_unaware_fns()

    names(fns)
    fns$round

A default value, TRUE or FALSE that controls which algorithm to use
for calculating groups. See f_group_by for more details.

    group_by_order_default(x)

| x | A data frame. |
|---|---|

A logical of length 1, either TRUE or FALSE.
Fast group metadata

    f_group_data(x)

    f_group_keys(x)

    f_group_rows(x)

    f_group_indices(x)

    f_group_vars(x)

    f_group_size(x)

    f_n_groups(x)

| x | A |
|---|---|

Requested group metadata.
These are tidy-based functions for calculating group IDs and row IDs.
group_id() returns an integer vector of group IDs
the same size as the x.
row_id() returns an integer vector of row IDs.
f_consecutive_id() returns an integer vector of consecutive run IDs.
The add_ variants add a column of group IDs/row IDs.

    group_id(x, order = TRUE, ascending = TRUE, as_qg = FALSE)

    row_id(x, ascending = TRUE)

    f_consecutive_id(x)

| x | A vector or data frame. |
|---|---|
| order | Should the groups be ordered? When order is |
| ascending | Should the order be ascending or descending? The default is |
| as_qg | Should the group IDs be returned as a collapse "qG" class? The default ( |

Note - When working with data frames it is highly recommended
to use the add_ variants of these functions. Not only are they more
intuitive to use, they also have optimisations for large numbers of groups.
group_id
This assigns an integer value to unique elements of a vector or unique rows of a data frame. It is an extremely useful function for analysis as you can compress a lot of information into a single column, using that for further operations.
row_id
This assigns a row number to each group. To assign plain row numbers
to a data frame one can use add_row_id().
This function can be used in rolling calculations, finding duplicates and
more.
consecutive_id
An alternative to dplyr::consecutive_id(), f_consecutive_id() also
creates an integer vector with values in the range [1, n] where
n is the length of the vector or number of rows of the data frame.
The ID increments every time x[i] != x[i - 1] thus giving information on
when there is a change in value.
f_consecutive_id has a very small overhead in terms
of calling the function, making it suitable for repeated calls.
An integer vector.
add_group_id add_row_id add_consecutive_id
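
The three vector-level functions can be compared side by side on a small, made-up character vector. The values in the comments follow from the definitions above.

```r
library(fastplyr)

x <- c("b", "a", "b", "b", "a")

gid_sorted <- group_id(x)                 # 2 1 2 2 1 (sorted: "a" -> 1, "b" -> 2)
gid_first  <- group_id(x, order = FALSE)  # 1 2 1 1 2 (order of first appearance)
rid        <- row_id(x)                   # 1 1 2 3 2 (running count within each value)
cid        <- f_consecutive_id(x)         # 1 2 3 3 4 (increments when the value changes)
```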
rlang::list2
Evaluates arguments dynamically like rlang::list2 but objects
created in list_tidy have precedence over environment objects.

    list_tidy(..., .keep_null = TRUE, .named = FALSE)

| ... | Dynamic name-value pairs. |
|---|---|
| .keep_null | |
| .named | |

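
The precedence rule described above can be sketched as follows: an element created earlier in the call shadows an object of the same name in the calling environment. The variable names are hypothetical.

```r
library(fastplyr)

x <- 100  # environment object, shadowed inside list_tidy()

out <- list_tidy(x = 1:3, y = x * 2)

out$y  # 2 4 6, computed from the list element `x`, not from 100
```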
Fast 'tibble' alternatives

    new_tbl(..., .nrows = NULL, .recycle = TRUE, .name_repair = TRUE)

    f_enframe(x, name = "name", value = "value")

    f_deframe(x)

    as_tbl(x)

| ... | Dynamic name-value pairs. |
|---|---|
| .nrows | |
| .recycle | |
| .name_repair | |
| x | A data frame or vector. |
| name | |
| value | |

new_tbl and as_tbl are alternatives to
tibble and as_tibble respectively.
f_enframe(x) where x is a data.frame converts x into a tibble
of column names and list-values.
A tibble or vector.
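
A minimal sketch of creating a tibble and converting a named vector to a two-column tibble and back. The values are made up, and the round trip through `f_deframe()` is assumed to mirror tibble::deframe().

```r
library(fastplyr)

tbl <- new_tbl(x = 1:3, y = c("a", "b", "c"))

v <- c(a = 1, b = 2)
long <- f_enframe(v)     # columns "name" and "value"
back <- f_deframe(long)  # assumed: named vector again
```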
Fast remove rows with NA values

    remove_rows_if_any_na(.data, ..., .cols = NULL)

    remove_rows_if_all_na(.data, ..., .cols = NULL)

| .data | A data frame. |
|---|---|
| ... | Cols to fill |
| .cols | (Optional) alternative to |

A data frame with removed rows containing either any or all NA values.
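
The difference between the two functions can be seen on a small, made-up data frame in which one row is entirely missing and another is partly missing.

```r
library(fastplyr)

df <- data.frame(a = c(1, NA, NA, 4), b = c("x", NA, "z", "w"))

no_any <- remove_rows_if_any_na(df)     # keeps rows 1 and 4
no_all <- remove_rows_if_all_na(df)     # drops only row 2
only_a <- remove_rows_if_any_na(df, a)  # considers column `a` only
```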
Fast grouped sample quantiles

    tidy_quantiles(
      data, ..., probs = seq(0, 1, 0.25), type = 7, pivot = c("long", "wide"),
      na.rm = TRUE, .by = NULL, .cols = NULL,
      .order = group_by_order_default(data), .drop_groups = deprecated()
    )

| data | A data frame. |
|---|---|
| ... | |
| probs | |
| type | |
| pivot | |
| na.rm | |
| .by | (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
| .cols | (Optional) alternative to |
| .order | Should the groups be returned in sorted order? If |
| .drop_groups | |

A data frame of sample quantiles.

    library(fastplyr)
    library(dplyr)
    groups <- 1 * 2^(0:10)

    # Normal distributed samples by group using the group value as the mean
    # and sqrt(groups) as the sd

    samples <- tibble(groups) |>
      reframe(x = rnorm(100, mean = groups, sd = sqrt(groups)), .by = groups) |>
      f_group_by(groups)

    # Fast means and quantiles by group

    quantiles <- samples |>
      tidy_quantiles(x, pivot = "wide")

    means <- samples |>
      f_summarise(mean = mean(x))

    means |>
      f_left_join(quantiles)