Title: Fast Alternatives to 'tidyverse' Functions
Description: A full set of fast data manipulation tools with a tidy front-end and a fast back-end using 'collapse' and 'cheapr'.
Authors: Nick Christofides [aut, cre]
Maintainer: Nick Christofides <[email protected]>
License: MIT + file LICENSE
Version: 0.5.1
Built: 2025-02-21 08:26:43 UTC
Source: https://github.com/nicchr/fastplyr
fastplyr provides a tidy front-end with a faster, more efficient back-end built on two packages: collapse and cheapr.
fastplyr includes dplyr and tidyr alternatives that behave like their tidyverse equivalents but are more efficient.
Similar in spirit to the excellent tidytable package, fastplyr also offers a tidy front-end that is fast and easy to use. Unlike tidytable, fastplyr verbs are interchangeable with dplyr verbs.
You can learn more about the tidyverse, collapse and cheapr using the links below.
Useful links:
Report bugs at https://github.com/NicChr/fastplyr/issues
Add a column of useful IDs (group IDs, row IDs & consecutive IDs)
add_group_id(data, ...)

## S3 method for class 'data.frame'
add_group_id(
  data,
  ...,
  .order = df_group_by_order_default(data),
  .ascending = TRUE,
  .by = NULL,
  .cols = NULL,
  .name = NULL,
  as_qg = FALSE
)

add_row_id(data, ...)

## S3 method for class 'data.frame'
add_row_id(data, ..., .ascending = TRUE, .by = NULL, .cols = NULL, .name = NULL)

add_consecutive_id(data, ...)

## S3 method for class 'data.frame'
add_consecutive_id(
  data,
  ...,
  .order = df_group_by_order_default(data),
  .by = NULL,
  .cols = NULL,
  .name = NULL
)
data: A data frame.
...: Additional groups using tidy data-masking.
.order: Should the groups be ordered? The default is df_group_by_order_default(data).
.ascending: Should the order be ascending or descending? The default is TRUE (ascending).
.by: Alternative way of supplying groups, using tidy-select.
.cols: (Optional) alternative to ... that accepts column names or positions.
.name: Name of the added ID column, which should be a character vector of length 1. If NULL (the default), a default name is used.
as_qg: Should the group IDs be returned as a collapse "qG" class? The default (FALSE) returns a plain integer vector.
A data frame with the requested ID column.
See also: group_id, row_id, f_consecutive_id
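A minimal sketch of the three helpers, assuming fastplyr is installed and using the default ID column names:

```r
library(fastplyr)

df <- data.frame(g = c("b", "a", "b", "a"))

# Integer group IDs (ordered by the sorted groups by default)
add_group_id(df, g)

# Row IDs within each group
add_row_id(df, g)

# Run IDs that increment whenever g changes between consecutive rows
add_consecutive_id(df, g)
```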
An alternative to dplyr::desc() which is much faster for character vectors and factors.

desc(x)
x: Vector.

A numeric vector that can be ordered in ascending or descending order. Useful in dplyr::arrange() or f_arrange().
collapse version of dplyr::arrange()

This is a fast and near-identical alternative to dplyr::arrange() using the collapse package. desc() is like dplyr::desc() but works faster when called directly on vectors.
f_arrange(
  data,
  ...,
  .by = NULL,
  .by_group = FALSE,
  .cols = NULL,
  .descending = FALSE
)
data: A data frame.
...: Variables to arrange by.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.by_group: If TRUE, the data is sorted by the grouping variables first. Default is FALSE.
.cols: (Optional) alternative to ... that accepts column names or positions.
.descending: Should the data be arranged in descending order? Default is FALSE.
A sorted data.frame.
Faster bind rows and columns.
f_bind_rows(..., .fill = TRUE)

f_bind_cols(..., .repair_names = TRUE, .recycle = TRUE, .sep = "...")
...: Data frames to bind.
.fill: Should missing columns be filled with NA? Default is TRUE.
.repair_names: Should duplicate column names be made unique? Default is TRUE.
.recycle: Should inputs be recycled to a common row size? Default is TRUE.
.sep: Separator to use for creating unique column names.
f_bind_rows() performs a union of the data frames specified via ..., joining the rows of all the data frames without removing duplicates. f_bind_cols() joins the columns, by default creating unique column names if there are any duplicates.
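A small sketch of both binders, assuming fastplyr is installed:

```r
library(fastplyr)

a <- data.frame(x = 1:2, y = c("a", "b"))
b <- data.frame(x = 3:4, z = c(TRUE, FALSE))

# Union of rows; missing columns y and z are filled with NA (.fill = TRUE)
f_bind_rows(a, b)

# Duplicate column names are made unique using the .sep separator
f_bind_cols(a, a)
```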
Near-identical alternative to dplyr::count().
f_count(
  data,
  ...,
  wt = NULL,
  sort = FALSE,
  .order = df_group_by_order_default(data),
  name = NULL,
  .by = NULL,
  .cols = NULL
)

f_add_count(
  data,
  ...,
  wt = NULL,
  sort = FALSE,
  .order = df_group_by_order_default(data),
  name = NULL,
  .by = NULL,
  .cols = NULL
)
data: A data frame.
...: Variables to group by.
wt: Frequency weights. Can be NULL or a variable.
sort: If TRUE, the largest groups are shown at the top. Default is FALSE.
.order: Should the groups be calculated as ordered groups? The default is df_group_by_order_default(data).
name: The name of the new column in the output. If NULL (the default), n is used; if there's already a column called n, nn is used.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts column names or positions.
This is a fast and near-identical alternative to dplyr::count() using the collapse package. Unlike collapse::fcount(), this works very similarly to dplyr::count(). The main difference is that anything supplied to wt is recycled and added as a data variable. Other than that, everything works exactly as the dplyr equivalent. f_count() and f_add_count() can be over 100x faster than the dplyr equivalents.
A data.frame of frequency counts by group.
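For illustration, a few counting patterns, assuming fastplyr is installed:

```r
library(fastplyr)

# Frequency of each cylinder group
f_count(mtcars, cyl)

# Weighted counts; wt is recycled and added as a data variable
f_count(mtcars, cyl, wt = gear)

# Keep all rows and add the count as a new column
f_add_count(mtcars, cyl, name = "n_cyl")
```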
Like dplyr::distinct() but faster when lots of groups are involved.
f_distinct(
  data,
  ...,
  .keep_all = FALSE,
  .sort = FALSE,
  .order = .sort,
  .by = NULL,
  .cols = NULL
)
data: A data frame.
...: Variables used to find distinct rows.
.keep_all: If TRUE, all columns are kept. Default is FALSE.
.sort: Should the result be sorted? Default is FALSE.
.order: Should the groups be calculated as ordered groups? Setting this to TRUE uses the sort-based algorithm to calculate groups (see f_group_by). Default is .sort.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts column names or positions.
A data.frame of distinct groups.
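A short sketch, assuming fastplyr is installed:

```r
library(fastplyr)

# Distinct cyl/gear combinations, in order of first appearance
f_distinct(mtcars, cyl, gear)

# Keep all columns of the matching rows and sort the result
f_distinct(mtcars, cyl, gear, .keep_all = TRUE, .sort = TRUE)
```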
Find duplicate rows
f_duplicates(
  data,
  ...,
  .keep_all = FALSE,
  .both_ways = FALSE,
  .add_count = FALSE,
  .drop_empty = FALSE,
  .sort = FALSE,
  .by = NULL,
  .cols = NULL
)
data: A data frame.
...: Variables used to find duplicate rows.
.keep_all: If TRUE, all columns are kept. Default is FALSE.
.both_ways: If TRUE, all duplicates, including first occurrences, are returned. Default is FALSE.
.add_count: If TRUE, a count column is added. Default is FALSE.
.drop_empty: If TRUE, rows where all values are NA are dropped. Default is FALSE.
.sort: Should the result be sorted? If FALSE (the default), the original order is retained.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts column names or positions.
This function works like dplyr::distinct() in its handling of arguments and data-masking but returns duplicate rows. In certain situations it can be much faster than data %>% group_by() %>% filter(n() > 1) when there are many groups.
A data.frame of duplicate rows.
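A minimal sketch, assuming fastplyr is installed; the .both_ways behaviour follows the argument description above:

```r
library(fastplyr)

df <- data.frame(x = c(1, 1, 2, 3, 3, 3))

# Rows that duplicate an earlier row
f_duplicates(df, x)

# Include first occurrences too, with a duplicate count column
f_duplicates(df, x, .both_ways = TRUE, .add_count = TRUE)
```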
Fast versions of tidyr::expand() and tidyr::complete()
f_expand(data, ..., .sort = FALSE, .by = NULL, .cols = NULL)

f_complete(data, ..., .sort = FALSE, .by = NULL, .cols = NULL, fill = NA)

crossing(..., .sort = FALSE)

nesting(..., .sort = FALSE)
data: A data frame.
...: Variables to expand.
.sort: Logical. If TRUE, the result is sorted. Default is FALSE.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts column names or positions.
fill: A named list containing value-name pairs to fill the named implicit missing values.
crossing and nesting are helpers that are basically identical to tidyr's crossing and nesting.

A data.frame of expanded groups.
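A small sketch of expanding and completing, assuming fastplyr is installed:

```r
library(fastplyr)

df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 2), y = c(10, 20, 30))

# All g/x combinations
f_expand(df, g, x, .sort = TRUE)

# Add the missing combinations back to the data, filling y with 0
f_complete(df, g, x, fill = list(y = 0), .sort = TRUE)

# Standalone helper, like tidyr::crossing()
crossing(g = c("a", "b"), x = 1:2)
```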
Fill NA values forwards and backwards
f_fill(
  data,
  ...,
  .by = NULL,
  .cols = NULL,
  .direction = c("forwards", "backwards"),
  .fill_limit = Inf,
  .new_names = "{.col}"
)
data: A data frame.
...: Cols to fill.
.by: Cols to group by for this operation, specified through tidy-select.
.cols: (Optional) alternative to ... that accepts column names or positions.
.direction: Which direction should NA values be filled? Default is "forwards".
.fill_limit: The maximum number of consecutive NA values to fill. Default is Inf.
.new_names: A name specification for the names of filled variables. The default "{.col}" keeps the original column names, filling in place.
A data frame with NA values filled forwards or backwards.
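A brief fill sketch, assuming fastplyr is installed:

```r
library(fastplyr)

df <- data.frame(g = c("a", "a", "b", "b"),
                 x = c(1, NA, NA, 4))

# Fill forwards within each group
f_fill(df, x, .by = g)

# Fill backwards, allowing at most one consecutive NA to be filled
f_fill(df, x, .direction = "backwards", .fill_limit = 1)
```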
Alternative to dplyr::filter()
f_filter(data, ..., .by = NULL)
data: A data frame.
...: Expressions used to filter the data frame.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
A filtered data frame.
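For example, assuming fastplyr is installed:

```r
library(fastplyr)

# Cars with above-average mpg within each cylinder group
f_filter(mtcars, mpg > mean(mpg), .by = cyl)
```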
Alternative to dplyr::group_by()

This works exactly the same as dplyr::group_by() and typically performs at around the same speed but uses slightly less memory.
f_group_by(
  data,
  ...,
  .add = FALSE,
  .order = df_group_by_order_default(data),
  .by = NULL,
  .cols = NULL,
  .drop = df_group_by_drop_default(data)
)

group_ordered(data)

f_ungroup(data)
data: A data frame.
...: Variables to group by.
.add: Should groups be added to existing groups? Default is FALSE.
.order: Should groups be ordered? If FALSE, groups are returned in order of first appearance. Default is df_group_by_order_default(data).
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts column names or positions.
.drop: Should unused factor levels be dropped? Default is df_group_by_drop_default(data).
f_group_by() works almost exactly like the 'dplyr' equivalent. An attribute "ordered" (TRUE or FALSE) is added to the group data to signify whether the groups are sorted or not.

The distinction between ordered and sorted is somewhat subtle. Functions in fastplyr that use a sort argument generally refer to the top-level dataset being sorted in some way, either by sorting the group columns, as in f_expand() or f_distinct(), or some other columns, like the count column in f_count().

The .order argument, when set to TRUE (the default), means that the group data will be calculated using a sort-based algorithm, leading to sorted group data. When .order is FALSE, the group data are returned based on the order of first appearance of the groups in the data. This order of first appearance may still naturally be sorted, depending on the data. For example, group_id(1:3, order = TRUE) results in the same group IDs as group_id(1:3, order = FALSE) because 1, 2 and 3 appear in the data in ascending sequence, whereas group_id(3:1, order = TRUE) does not equal group_id(3:1, order = FALSE).
Part of the reason for the distinction is that internally fastplyr can in theory calculate group data using the sort-based algorithm and still return unsorted groups, though this combination is only available to the user in limited places like f_distinct(.order = TRUE, .sort = FALSE). The other reason is to prevent confusion in the meaning of sort and order, so that order always refers to the algorithm specified, resulting in sorted groups, and sort implies a physical sorting of the returned data. It's also worth mentioning that in most functions, sort will implicitly utilise the sort-based algorithm specified via order = TRUE.
In many situations (though not all) it can be faster to use the order-of-first-appearance algorithm, specified via .order = FALSE. This can generally be accessed by first calling f_group_by(data, ..., .order = FALSE) and then performing your calculations. To utilise this algorithm more globally and package-wide, set the '.fastplyr.order.groups' option to FALSE using the code options(.fastplyr.order.groups = FALSE).
f_group_by() returns a grouped_df that can be used for further grouped calculations.

group_ordered() returns TRUE if the group data are sorted, i.e. if attr(attr(data, "groups"), "ordered") == TRUE. If sorted, which is usually the default, summary calculations like f_summarise() or dplyr::summarise() produce sorted groups. If FALSE, they are returned based on order of first appearance in the data.
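A sketch of switching algorithms, assuming fastplyr is installed:

```r
library(fastplyr)

# Groups calculated by order of first appearance
mtcars |>
  f_group_by(cyl, .order = FALSE) |>
  f_count()

# Make order-of-first-appearance the package-wide default, then restore
options(.fastplyr.order.groups = FALSE)
options(.fastplyr.order.groups = TRUE)
```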
Mostly a wrapper around collapse::join()
that behaves more like
dplyr's joins. List columns, lubridate intervals and vctrs rcrds
work here too.
f_left_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

f_right_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

f_inner_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

f_full_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

f_anti_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

f_semi_join(x, y, by = NULL, suffix = c(".x", ".y"), multiple = TRUE, keep = FALSE, ...)

f_cross_join(x, y, suffix = c(".x", ".y"), ...)

f_union_all(x, y, ...)

f_union(x, y, ...)
x: Left data frame.
y: Right data frame.
by: Columns to join on. If NULL, the common columns of x and y are used.
suffix: Suffixes appended to common column names of x and y that are not used for joining. Default is c(".x", ".y").
multiple: Should rows in x that match multiple rows in y return all matches? Default is TRUE.
keep: Should the join keys from both x and y be kept in the output? Default is FALSE.
...: Additional arguments passed to collapse::join().
A joined data frame, joined on the columns specified with by, using an equality join. f_cross_join() returns all possible combinations between the two data frames.
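A join sketch, assuming fastplyr is installed:

```r
library(fastplyr)

band <- data.frame(name = c("Mick", "John", "Paul"),
                   band = c("Stones", "Beatles", "Beatles"))
inst <- data.frame(name = c("John", "Paul", "Keith"),
                   plays = c("guitar", "bass", "guitar"))

f_left_join(band, inst, by = "name")   # all rows of band
f_inner_join(band, inst, by = "name")  # matching rows only
f_anti_join(band, inst, by = "name")   # rows of band with no match
f_cross_join(band, inst)               # all combinations
```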
A faster nest_by().
f_nest_by(
  data,
  ...,
  .add = FALSE,
  .order = df_group_by_order_default(data),
  .by = NULL,
  .cols = NULL,
  .drop = df_group_by_drop_default(data)
)
data: A data frame.
...: Variables to group by.
.add: Should groups be added to existing groups? Default is FALSE.
.order: Should groups be ordered? If FALSE, groups are returned in order of first appearance. Default is df_group_by_order_default(data).
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts column names or positions.
.drop: Should unused factor levels be dropped? Default is df_group_by_drop_default(data).
A row-wise grouped_df of the corresponding data of each group.
library(dplyr)
library(fastplyr)

# Stratified linear-model example
models <- iris %>%
  f_nest_by(Species) %>%
  mutate(model = list(lm(Sepal.Length ~ Petal.Width + Petal.Length, data = first(data))),
         summary = list(summary(first(model))),
         r_sq = first(summary)$r.squared)
models
models$summary

# dplyr's `nest_by()` is admittedly more convenient
# as it performs a double bracket subset `[[` on list elements for you
# which we have emulated by using `first()`

# `f_nest_by()` is faster when many groups are involved
models <- iris %>%
  nest_by(Species) %>%
  mutate(model = list(lm(Sepal.Length ~ Petal.Width + Petal.Length, data = data)),
         summary = list(summary(model)),
         r_sq = summary$r.squared)
models$summary
models$summary[[1]]
fastplyr currently cannot handle rowwise_df
objects created through
dplyr::rowwise()
and so this is a convenience function to allow you to
perform row-wise operations.
For common efficient row-wise functions,
see the 'kit' package.
f_rowwise(data, ..., .ascending = TRUE, .cols = NULL, .name = ".row_id")
data: A data frame.
...: Variables to group by, using tidy-select.
.ascending: Should data be grouped in ascending row-wise order? Default is TRUE.
.cols: (Optional) alternative to ... that accepts column names or positions.
.name: Name of the row-id column to be added.
A row-wise grouped_df.
Fast select()/rename()/pull()

f_select() operates the exact same way as dplyr::select() and can be used naturally with tidy-select helpers. It uses collapse to perform the actual selecting of variables and is considerably faster than dplyr for selecting exact columns, and even more so when supplying the .cols argument.
f_select(data, ..., .cols = NULL)

f_rename(data, ..., .cols = NULL)

f_pull(data, ..., .cols = NULL)

nothing()
data: A data frame.
...: Variables to select, using tidy-select.
.cols: (Optional) faster alternative to ... that accepts column names or positions.

A data.frame of selected columns.
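A selection sketch, assuming fastplyr is installed:

```r
library(fastplyr)

f_select(mtcars, mpg, cyl)
f_select(mtcars, .cols = c("mpg", "cyl"))  # faster: exact columns
f_rename(mtcars, miles_per_gallon = mpg)
f_pull(mtcars, cyl)
```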
Alternative to dplyr::slice()

When there are lots of groups, the f_slice() functions are much faster.
f_slice(
  data,
  i = 0L,
  ...,
  .by = NULL,
  .order = df_group_by_order_default(data),
  keep_order = FALSE
)

f_slice_head(
  data,
  n,
  prop,
  .by = NULL,
  .order = df_group_by_order_default(data),
  keep_order = FALSE
)

f_slice_tail(
  data,
  n,
  prop,
  .by = NULL,
  .order = df_group_by_order_default(data),
  keep_order = FALSE
)

f_slice_min(
  data,
  order_by,
  n,
  prop,
  .by = NULL,
  with_ties = TRUE,
  na_rm = FALSE,
  .order = df_group_by_order_default(data),
  keep_order = FALSE
)

f_slice_max(
  data,
  order_by,
  n,
  prop,
  .by = NULL,
  with_ties = TRUE,
  na_rm = FALSE,
  .order = df_group_by_order_default(data),
  keep_order = FALSE
)

f_slice_sample(
  data,
  n,
  replace = FALSE,
  prop,
  .by = NULL,
  .order = df_group_by_order_default(data),
  keep_order = FALSE,
  weights = NULL
)
data: A data frame.
i: An integer vector of slice locations.
...: A temporary argument to give the user an error if dots are used.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.order: Should the groups be returned in sorted order? If FALSE, groups are returned in order of first appearance. Default is df_group_by_order_default(data).
keep_order: Should the sliced data frame be returned in its original order? The default is FALSE.
n: Number of rows.
prop: Proportion of rows.
order_by: Variables to order by.
with_ties: Should ties be kept together? The default is TRUE.
na_rm: Should missing values in order_by be removed? Default is FALSE.
replace: Should f_slice_sample() sample with replacement? Default is FALSE.
weights: Probability weights used in f_slice_sample().
The i argument in f_slice

i is first evaluated on an un-grouped basis and then searches for those locations in each group. Thus if you supply an expression of slice locations that vary by group, this will not be respected nor checked. For example, do f_slice(data, 10:20, .by = group), not f_slice(data, sample(1:10), .by = group). The former results in slice locations that do not vary by group, but the latter will result in different within-group slice locations, which f_slice cannot correctly compute. To do the latter type of by-group slicing, use f_filter, e.g. f_filter(data, row_number() %in% slices, .by = groups), or even faster with cheapr: f_filter(data, row_number() %in_% slices, .by = groups).
f_slice_sample

The arguments of f_slice_sample() align more closely with base::sample(), and thus by default each entire group is re-sampled without replacement.
A data.frame filtered on the specified row indices.
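A slicing sketch, assuming fastplyr is installed:

```r
library(fastplyr)

f_slice(mtcars, 1:3)                        # first three rows
f_slice_head(mtcars, n = 2, .by = cyl)      # first two rows per cylinder group
f_slice_max(mtcars, mpg, n = 1, .by = cyl)  # highest-mpg car per group
f_slice_sample(mtcars, n = 5)               # five random rows
```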
Like dplyr::summarise() but with some internal optimisations for common statistical functions.
f_summarise(
  data,
  ...,
  .by = NULL,
  .order = df_group_by_order_default(data),
  .optimise = TRUE
)

f_summarize(
  data,
  ...,
  .by = NULL,
  .order = df_group_by_order_default(data),
  .optimise = TRUE
)
data: A data frame.
...: Name-value pairs of summary functions. Expressions with across() are also supported.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.order: Should the groups be returned in sorted order? If FALSE, groups are returned in order of first appearance. Default is df_group_by_order_default(data).
.optimise: (Optionally) turn off optimisations for common statistical functions by setting to FALSE. Default is TRUE.
f_summarise behaves mostly like dplyr::summarise except that expressions supplied to ... are evaluated independently. Some functions are internally optimised using 'collapse' fast statistical functions, which makes execution on many groups very fast. For fast quantiles (percentiles) by group, see tidy_quantiles.
List of currently optimised functions and their equivalent 'collapse' functions:

base::sum -> collapse::fsum
base::prod -> collapse::fprod
base::min -> collapse::fmin
base::max -> collapse::fmax
stats::mean -> collapse::fmean
stats::median -> collapse::fmedian
stats::sd -> collapse::fsd
stats::var -> collapse::fvar
dplyr::first -> collapse::ffirst
dplyr::last -> collapse::flast
dplyr::n_distinct -> collapse::fndistinct
An un-grouped data frame of summaries by group.
library(fastplyr)
library(nycflights13)

# Number of flights per month, including first and last day
flights %>%
  f_group_by(year, month) %>%
  f_summarise(first_day = first(day),
              last_day = last(day),
              num_flights = n())

## Fast mean summary using `across()`
flights %>%
  f_summarise(
    across(where(is.double), mean),
    .by = tailnum
  )

# To ignore or keep NAs, use collapse::set_collapse(na.rm)
collapse::set_collapse(na.rm = FALSE)
flights %>%
  f_summarise(
    across(where(is.double), mean),
    .by = origin
  )
collapse::set_collapse(na.rm = TRUE)
A default value, TRUE or FALSE, that controls which algorithm to use for calculating groups. See f_group_by for more details.

group_by_order_default(x)
x: A data frame.

A logical of length 1, either TRUE or FALSE.
These are tidy-based functions for calculating group IDs and row IDs. group_id() returns an integer vector of group IDs the same size as x. row_id() returns an integer vector of row IDs. f_consecutive_id() returns an integer vector of consecutive run IDs. The add_ variants add a column of group IDs/row IDs.
group_id(x, order = TRUE, ascending = TRUE, as_qg = FALSE)

row_id(x, ascending = TRUE)

f_consecutive_id(x)
x: A vector or data frame.
order: Should the groups be ordered? When order is TRUE (the default), group IDs correspond to the sorted groups; otherwise they follow order of first appearance.
ascending: Should the order be ascending or descending? The default is TRUE (ascending).
as_qg: Should the group IDs be returned as a collapse "qG" class? The default (FALSE) returns a plain integer vector.
Note - when working with data frames it is highly recommended to use the add_ variants of these functions. Not only are they more intuitive to use, they also have optimisations for large numbers of groups.
group_id
This assigns an integer value to unique elements of a vector or unique rows of a data frame. It is an extremely useful function for analysis as you can compress a lot of information into a single column, using that for further operations.
row_id

This assigns a row number to each group. To assign plain row numbers to a data frame, one can use add_row_id(). This function can be used in rolling calculations, finding duplicates and more.
consecutive_id

An alternative to dplyr::consecutive_id(), f_consecutive_id() also creates an integer vector with values in the range [1, n], where n is the length of the vector or number of rows of the data frame. The ID increments every time x[i] != x[i - 1], thus giving information on when there is a change in value. f_consecutive_id has a very small overhead in terms of calling the function, making it suitable for repeated calls.
An integer vector.
See also: add_group_id, add_row_id, add_consecutive_id
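A vector-level sketch, assuming fastplyr is installed; the ID values follow the ordering rules described above:

```r
library(fastplyr)

x <- c("b", "a", "a", "c")

group_id(x)                 # IDs ordered by the sorted groups
group_id(x, order = FALSE)  # IDs by order of first appearance
row_id(x)                   # row number within each group
f_consecutive_id(x)         # increments whenever the value changes
```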
Alternative to rlang::list2

Evaluates arguments dynamically like rlang::list2, but objects created in list_tidy have precedence over environment objects.
list_tidy(..., .keep_null = TRUE, .named = FALSE)
...: Dynamic name-value pairs.
.keep_null: Should NULL elements be kept? Default is TRUE.
.named: Should all list elements be named? Default is FALSE.
Fast 'tibble' alternatives
new_tbl(..., .nrows = NULL, .recycle = TRUE, .name_repair = TRUE)

f_enframe(x, name = "name", value = "value")

f_deframe(x)

as_tbl(x)
...: Dynamic name-value pairs.
.nrows: (Optional) number of rows. Default is NULL.
.recycle: Should inputs be recycled to a common size? Default is TRUE.
.name_repair: Should column names be repaired to be unique? Default is TRUE.
x: A data frame or vector.
name: Name of the column containing names. Default is "name".
value: Name of the column containing values. Default is "value".
new_tbl and as_tbl are alternatives to tibble and as_tibble respectively. f_enframe(x), where x is a data.frame, converts x into a tibble of column names and list-values.
A tibble or vector.
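A tibble-helper sketch, assuming fastplyr is installed:

```r
library(fastplyr)

new_tbl(x = 1:3, y = "a")              # y is recycled to 3 rows
f_enframe(c(a = 1, b = 2))             # name/value tibble from a named vector
f_deframe(f_enframe(c(a = 1, b = 2)))  # back to a named vector
as_tbl(mtcars)
```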
Fast remove rows with NA values
remove_rows_if_any_na(data, ..., .cols = NULL)

remove_rows_if_all_na(data, ..., .cols = NULL)
data: A data frame.
...: Cols to check for NA values.
.cols: (Optional) alternative to ... that accepts column names or positions.

A data frame with rows containing either any or all NA values removed.
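A short sketch, assuming fastplyr is installed:

```r
library(fastplyr)

df <- data.frame(x = c(1, NA, NA), y = c(NA, 2, NA))

remove_rows_if_any_na(df)  # keeps only complete rows
remove_rows_if_all_na(df)  # drops only rows where every value is NA
```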
Fast grouped sample quantiles
tidy_quantiles(
  data,
  ...,
  probs = seq(0, 1, 0.25),
  type = 7,
  pivot = c("long", "wide"),
  na.rm = TRUE,
  .by = NULL,
  .cols = NULL,
  .order = df_group_by_order_default(data),
  .drop_groups = TRUE
)
data: A data frame.
...: Variables to calculate quantiles for.
probs: Quantile probabilities. Default is seq(0, 1, 0.25).
type: Quantile type, as in stats::quantile. Default is 7.
pivot: Should the result be in "long" (the default) or "wide" format?
na.rm: Should NA values be removed? Default is TRUE.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts column names or positions.
.order: Should the groups be returned in sorted order? If FALSE, groups are returned in order of first appearance. Default is df_group_by_order_default(data).
.drop_groups: Should the groups be dropped after calculation? Default is TRUE.
A data frame of sample quantiles.
library(fastplyr)
library(dplyr)

groups <- 1 * 2^(0:10)

# Normal distributed samples by group using the group value as the mean
# and sqrt(groups) as the sd
samples <- tibble(groups) %>%
  reframe(x = rnorm(100, mean = groups, sd = sqrt(groups)), .by = groups) %>%
  f_group_by(groups)

# Fast means and quantiles by group
quantiles <- samples %>%
  tidy_quantiles(x, pivot = "wide")
means <- samples %>%
  f_summarise(mean = mean(x))
means %>%
  f_left_join(quantiles)