Title: | Miscellaneous Statistical Functions Used in 'guide-R' |
---|---|
Description: | Companion package for the manual 'guide-R : Guide pour l’analyse de données d’enquêtes avec R' available at <https://larmarange.github.io/guide-R/>. 'guideR' implements miscellaneous functions introduced in 'guide-R' to facilitate statistical analysis and manipulation of survey data. |
Authors: | Joseph Larmarange [aut, cre] |
Maintainer: | Joseph Larmarange <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.0.9000 |
Built: | 2025-03-03 18:31:10 UTC |
Source: | https://github.com/larmarange/guider |
This function uses renv::dependencies() to identify R package dependencies in a project and then calls pak::pkg_install() to install / update these packages.
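The workflow described above can be approximated by hand. The following is a minimal sketch, not the package's actual implementation: the helper name install_deps_sketch() and its path argument are hypothetical, and it assumes that both renv and pak are installed.

```r
# Hypothetical helper: a rough approximation of the workflow described above.
# It is NOT the code of install_dependencies().
install_deps_sketch <- function(path = ".", ask = TRUE) {
  # renv::dependencies() scans the project files and returns a data frame
  # with one row per detected dependency (including a `Package` column).
  deps <- renv::dependencies(path = path)
  pkgs <- unique(deps$Package)

  # pak::pkg_install() installs the requested packages, or updates them
  # when a newer version is available.
  pak::pkg_install(pkgs, ask = ask)
}
```

Calling install_deps_sketch() from the root of a project would then install or update whatever renv detects there.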
install_dependencies(ask = TRUE)
ask |
Whether to ask for confirmation when installing a different version of a package that is already installed. Installations that only add new packages never require confirmation. |
(Invisibly) A data frame with information about the installed package(s).
## Not run:
install_dependencies()
## End(Not run)
Comparison tests considering NA as values to be compared
is_different() and is_equal() perform comparison tests, considering NA values as legitimate values (see examples).
is_different(x, y)
is_equal(x, y)
cumdifferent(x)
num_cycle(x)
x, y |
Vectors to be compared. |
cumdifferent() identifies groups of consecutive rows that have the same value. num_cycle() can be used to identify sub-groups that satisfy a certain condition (see examples).
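To illustrate the idea of numbering runs of consecutive identical values, here is a base R sketch built on rle(). It only illustrates the concept and is not the implementation of cumdifferent(): the helper name run_id_sketch() is hypothetical, and unlike the functions documented here it does not treat NA values specially.

```r
# Hypothetical helper: number each run of consecutive identical values.
run_id_sketch <- function(x) {
  runs <- rle(x)  # lengths and values of consecutive runs
  rep(seq_along(runs$lengths), times = runs$lengths)
}

group <- c("a", "a", "b", "b", "a", "b", "c", "a")
run_id_sketch(group)
# returns 1 1 2 2 3 4 5 6
```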
is_equal(x, y) is equivalent to (x == y & !is.na(x) & !is.na(y)) | (is.na(x) & is.na(y)), and is_different(x, y) is equivalent to (x != y & !is.na(x) & !is.na(y)) | xor(is.na(x), is.na(y)).
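The two equivalences above can be checked directly in base R. The sketch below wraps each expression in a small function; the names is_equal_sketch() and is_different_sketch() are hypothetical and simply mirror the formulas quoted above.

```r
# Hypothetical wrappers around the two expressions quoted above.
is_equal_sketch <- function(x, y) {
  (x == y & !is.na(x) & !is.na(y)) | (is.na(x) & is.na(y))
}
is_different_sketch <- function(x, y) {
  (x != y & !is.na(x) & !is.na(y)) | xor(is.na(x), is.na(y))
}

v <- c("a", "b", NA)

v == "a"                     # TRUE FALSE NA: the usual operator propagates NA
is_equal_sketch(v, "a")      # TRUE FALSE FALSE
is_equal_sketch(v, NA)       # FALSE FALSE TRUE
is_different_sketch(v, "a")  # FALSE TRUE TRUE
is_different_sketch(v, NA)   # TRUE TRUE FALSE
```

Unlike == and !=, these expressions never return NA, which is what makes them convenient inside filters and conditions.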
A vector of the same length as x.
v <- c("a", "b", NA)
is_different(v, "a")
is_different(v, NA)
is_equal(v, "a")
is_equal(v, NA)
d <- dplyr::tibble(group = c("a", "a", "b", "b", "a", "b", "c", "a"))
d |>
  dplyr::mutate(
    subgroup = cumdifferent(group),
    sub_a = num_cycle(group == "a")
  )
Add leading zeros
leading_zeros(x, left_digits = NULL, digits = 0, prefix = "", suffix = "", ...)
x |
a numeric vector |
left_digits |
number of digits before decimal point, automatically computed if not provided |
digits |
number of digits after decimal point |
prefix, suffix |
Symbols to display before and after value |
... |
additional parameters passed to |
A character vector of the same length as x.
base::formatC(), base::sprintf()
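For readers curious about how zero-padding can be obtained with the two base functions listed above, here is a small sketch. It illustrates base::formatC() and base::sprintf() only and makes no claim about how leading_zeros() is implemented.

```r
x <- c(2, 12, 103)

# Pad to a total width of 4 characters, with no decimals
formatC(x, width = 4, format = "f", digits = 0, flag = "0")
# returns "0002" "0012" "0103"

# Same result with a C-style format string
sprintf("%04.0f", x)
# returns "0002" "0012" "0103"

# With one decimal, the width also counts the decimal point and the decimals
formatC(c(2, 12.46), width = 6, format = "f", digits = 1, flag = "0")
# returns "0002.0" "0012.5"
```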
v <- c(2, 103.24, 1042.147, 12.4566, NA)
leading_zeros(v)
leading_zeros(v, digits = 1)
leading_zeros(v, left_digits = 6, big.mark = " ")
leading_zeros(c(0, 6, 12, 18), prefix = "M")
Transform a data frame from long format to period format
long_to_periods(data, id, start, stop = NULL, by = NULL)
data |
A data frame, or a data frame extension (e.g. a tibble). |
id |
< |
start |
< |
stop |
< |
by |
< |
A tibble.
d <- dplyr::tibble(
  patient = c(1, 2, 3, 3, 4, 4, 4),
  begin = c(0, 0, 0, 1, 0, 36, 39),
  end = c(50, 6, 1, 16, 36, 39, 45),
  covar = c("no", "no", "no", "yes", "no", "yes", "yes")
)
d
d |> long_to_periods(id = patient, start = begin, stop = end)
d |> long_to_periods(id = patient, start = begin, stop = end, by = covar)
# If stop not provided, it is deduced.
# However, it considers that observation ends at the last start time.
d |> long_to_periods(id = patient, start = begin)
Plot observed vs predicted distribution of a fitted model
observed_vs_theoretical(model)
model |
A statistical model. |
This function has been tested with stats::lm() and stats::glm() models. It may work with other types of models, but without any warranty.
A ggplot2 plot.
# a linear model
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
mod |> observed_vs_theoretical()

# a logistic regression
mod <- glm(
  as.factor(Survived) ~ Class + Sex,
  data = titanic,
  family = binomial()
)
mod |> observed_vs_theoretical()
Transform a data frame from period format to long format
periods_to_long(
  data,
  start,
  stop,
  time_step = 1,
  time_name = "time",
  keep = FALSE
)
data |
A data frame, or a data frame extension (e.g. a tibble). |
start |
< |
stop |
< |
time_step |
(numeric) Desired value for the time variable. |
time_name |
(character) Name of the time variable. |
keep |
(logical) Should the start and stop variables be kept in the results? |
A tibble.
d <- dplyr::tibble(
  patient = c(1, 2, 3, 3),
  begin = c(0, 2, 0, 3),
  end = c(6, 4, 2, 8),
  covar = c("no", "yes", "no", "yes")
)
d
d |> periods_to_long(start = begin, stop = end)
d |> periods_to_long(start = begin, stop = end, time_step = 5)
Plot inertia, absolute loss and relative loss from a classification tree
plot_inertia_from_tree(tree, k_max = 15)
get_inertia_from_tree(tree, k_max = 15)
tree |
A dendrogram, i.e. a stats::hclust object,
a FactoMineR::HCPC object or an object that can be converted to a
stats::hclust object with |
k_max |
Maximum number of clusters to return / plot. |
A ggplot2 plot or a tibble.
hc <- hclust(dist(USArrests))
get_inertia_from_tree(hc)
plot_inertia_from_tree(hc)
proportion() lets you quickly count observations (like dplyr::count()) and compute relative proportions. Proportions are computed separately by group (see examples).
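As a point of comparison, similar counts and percentages can be obtained in base R from the built-in datasets::Titanic table. This sketch is only an analogy for the difference between proportions of the total and proportions computed within groups; it does not use proportion() and makes no claim about its internals.

```r
# Counts of Class by Survived, collapsing the Sex and Age dimensions
# (dimensions 1 and 4 of datasets::Titanic are Class and Survived)
tab <- margin.table(datasets::Titanic, margin = c(1, 4))
tab

# Proportions of the total, in percent
total_pct <- prop.table(tab) * 100

# Proportions computed separately within each Class (row proportions)
row_pct <- prop.table(tab, margin = 1) * 100

round(total_pct, 1)
round(row_pct, 1)
```

In the first table all cells sum to 100, whereas in the second each row sums to 100.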
proportion(data, ...)

## S3 method for class 'data.frame'
proportion(
  data,
  ...,
  .by = NULL,
  .na.rm = FALSE,
  .weight = NULL,
  .scale = 100,
  .sort = FALSE,
  .drop = FALSE,
  .conf.int = FALSE,
  .conf.level = 0.95,
  .options = list(correct = TRUE)
)

## S3 method for class 'survey.design'
proportion(
  data,
  ...,
  .by = NULL,
  .na.rm = FALSE,
  .scale = 100,
  .sort = FALSE,
  .conf.int = FALSE,
  .conf.level = 0.95,
  .options = NULL
)

## Default S3 method:
proportion(
  data,
  ...,
  .na.rm = FALSE,
  .scale = 100,
  .sort = FALSE,
  .drop = FALSE,
  .conf.int = FALSE,
  .conf.level = 0.95,
  .options = list(correct = TRUE)
)
data |
A vector, a data frame, data frame extension (e.g. a tibble), or a survey design object. |
... |
< |
.by |
< |
.na.rm |
Should |
.weight |
< |
.scale |
A scaling factor applied to proportion. Use |
.sort |
If |
.drop |
If |
.conf.int |
If |
.conf.level |
Confidence level for the returned confidence intervals. |
.options |
Additional arguments passed to |
A tibble with one row per group.
# using a vector
titanic$Class |> proportion()

# univariable table
titanic |> proportion(Class)
titanic |> proportion(Class, .sort = TRUE)
titanic |> proportion(Class, .conf.int = TRUE)
titanic |> proportion(Class, .conf.int = TRUE, .scale = 1)

# bivariable table
titanic |> proportion(Class, Survived) # proportions of the total
titanic |> proportion(Survived, .by = Class) # row proportions
titanic |> # equivalent syntax
  dplyr::group_by(Class) |>
  proportion(Survived)

# combining 3 variables or more
titanic |> proportion(Class, Sex, Survived)
titanic |> proportion(Sex, Survived, .by = Class)
titanic |> proportion(Survived, .by = c(Class, Sex))

# missing values
dna <- titanic
dna$Survived[c(1:20, 500:530)] <- NA
dna |> proportion(Survived)
dna |> proportion(Survived, .na.rm = TRUE)

## SURVEY DATA ------------------------------------------------------

ds <- srvyr::as_survey(titanic)

# univariable table
ds |> proportion(Class)
ds |> proportion(Class, .sort = TRUE)
ds |> proportion(Class, .conf.int = TRUE)
ds |> proportion(Class, .conf.int = TRUE, .scale = 1)

# bivariable table
ds |> proportion(Class, Survived) # proportions of the total
ds |> proportion(Survived, .by = Class) # row proportions
ds |> dplyr::group_by(Class) |> proportion(Survived)

# combining 3 variables or more
ds |> proportion(Class, Sex, Survived)
ds |> proportion(Sex, Survived, .by = Class)
ds |> proportion(Survived, .by = c(Class, Sex))

# missing values
dsna <- srvyr::as_survey(dna)
dsna |> proportion(Survived)
dsna |> proportion(Survived, .na.rm = TRUE)
Sometimes, the sum of rounded numbers (e.g., using base::round()) is not the same as their rounded sum.
round_preserve_sum(x, digits = 0)
x |
Numerical vector to sum. |
digits |
Number of decimals for rounding. |
This solution applies the following algorithm:
1. Round down to the specified number of decimal places.
2. Order numbers by their remainder values.
3. Increment the specified decimal place of values with k largest remainders, where k is the number of values that must be incremented to preserve their rounded sum.
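The steps listed above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not necessarily the code of round_preserve_sum(): the function name round_preserve_sum_sketch() is hypothetical, and the sketch does not handle missing values.

```r
# Hypothetical helper following the three steps described above.
round_preserve_sum_sketch <- function(x, digits = 0) {
  up <- 10^digits
  x <- x * up                      # work at the scale of the target decimal
  y <- floor(x)                    # step 1: round everything down
  k <- round(sum(x)) - sum(y)      # number of values that must be incremented
  idx <- utils::tail(order(x - y), k)  # step 2: the k largest remainders
  y[idx] <- y[idx] + 1             # step 3: increment those values
  y / up
}

x <- c(0.333, 0.333, 0.334)
round_preserve_sum_sketch(x, 2)
# returns 0.33 0.33 0.34
sum(round_preserve_sum_sketch(x, 2))
# returns 1
```

Here plain rounding gives 0.33, 0.33 and 0.33, which sum to 0.99. The sketch increments the value with the largest remainder so that the total matches the rounded sum.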
A numerical vector of same length as x.
https://biostatmatt.com/archives/2902
sum(c(0.333, 0.333, 0.334))
round(c(0.333, 0.333, 0.334), 2)
sum(round(c(0.333, 0.333, 0.334), 2))
round_preserve_sum(c(0.333, 0.333, 0.334), 2)
sum(round_preserve_sum(c(0.333, 0.333, 0.334), 2))
Apply step(), taking into account missing values
When your data contains missing values, the affected observations are removed from the model. However, if at a later stage you try to apply a descending stepwise approach to reduce your model by minimization of AIC, you may encounter an error because the number of rows has changed.
step_with_na(model, ...)

## Default S3 method:
step_with_na(model, ..., full_data = eval(model$call$data))

## S3 method for class 'svyglm'
step_with_na(model, ..., design)
model |
A model object. |
... |
Additional parameters passed to |
full_data |
Full data frame used for the model, including missing data. |
design |
Survey design previously passed to |
step_with_na() applies the following strategy:
- recomputes the model using only complete cases;
- applies stats::step();
- recomputes the reduced model using the full original dataset.
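The three steps of this strategy can be reproduced by hand with base R, which may help to understand what happens behind the scenes. This is a sketch under stated assumptions rather than the function's actual code: it uses the built-in datasets::airquality data (which contains missing values) and a simple linear model, and all object names are illustrative.

```r
# A dataset with missing values (Ozone and Solar.R contain NA)
full_data <- datasets::airquality
mod <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = full_data)

# 1. Refit the model using only rows that are complete for the model variables
vars <- all.vars(formula(mod))
complete_data <- full_data[stats::complete.cases(full_data[, vars]), ]
mod_complete <- update(mod, data = complete_data)

# 2. Apply the stepwise selection on this complete-case model
mod_reduced <- step(mod_complete, trace = 0)

# 3. Refit the selected formula on the full original dataset
mod_final <- update(mod, formula. = formula(mod_reduced), data = full_data)

formula(mod_final)
nobs(mod_complete)
nobs(mod_final)
```

Because the reduced model may involve fewer variables, the final fit can use at least as many observations as the complete-case fit.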
step_with_na() has been tested with stats::lm(), stats::glm(), nnet::multinom(), survey::svyglm() and survival::coxph(). It may work with other types of models, but without any warranty.
In some cases, it may be necessary to provide the full dataset initially used to estimate the model.
step_with_na() may not work inside other functions. In that case, you may try to pass full_data to the function.
The stepwise-selected model.
set.seed(42)
d <- titanic |>
  dplyr::mutate(
    Group = sample(
      c("a", "b", NA),
      dplyr::n(),
      replace = TRUE
    )
  )
mod <- glm(as.factor(Survived) ~ ., data = d, family = binomial())
# step(mod) should produce an error
mod2 <- step_with_na(mod)
mod2

## WITH SURVEY ---------------------------------------

library(survey)
ds <- d |>
  dplyr::mutate(Survived = as.factor(Survived)) |>
  srvyr::as_survey()
mods <- survey::svyglm(
  Survived ~ Class + Group + Sex,
  design = ds,
  family = quasibinomial()
)
mod2s <- step_with_na(mods, design = ds)
mod2s
This titanic dataset is equivalent to datasets::Titanic |> dplyr::as_tibble() |> tidyr::uncount(n).
titanic
An object of class tbl_df (inherits from tbl, data.frame) with 2201 rows and 4 columns.
Remove row-wise grouping created with dplyr::rowwise() while preserving any other grouping declared with dplyr::group_by().
unrowwise(data)
data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame. |
A tibble.
titanic |> dplyr::rowwise()
titanic |> dplyr::rowwise() |> unrowwise()
titanic |> dplyr::group_by(Sex, Class) |> dplyr::rowwise()
titanic |> dplyr::group_by(Sex, Class) |> dplyr::rowwise() |> unrowwise()