Package 'guideR'

Title: Miscellaneous Statistical Functions Used in 'guide-R'
Description: Companion package for the manual 'guide-R : Guide pour l’analyse de données d’enquêtes avec R' available at <https://larmarange.github.io/guide-R/>. 'guideR' implements miscellaneous functions introduced in 'guide-R' to facilitate statistical analysis and manipulation of survey data.
Authors: Joseph Larmarange [aut, cre]
Maintainer: Joseph Larmarange <[email protected]>
License: GPL (>= 3)
Version: 0.1.0.9000
Built: 2025-03-03 18:31:10 UTC
Source: https://github.com/larmarange/guider

Help Index


Install / Update project dependencies

Description

This function uses renv::dependencies() to identify R package dependencies in a project and then calls pak::pkg_install() to install / update these packages.

Usage

install_dependencies(ask = TRUE)

Arguments

ask

Whether to ask for confirmation when installing a different version of a package that is already installed. Installations that only add new packages never require confirmation.

Value

(Invisibly) A data frame with information about the installed package(s).

Examples

## Not run: 
install_dependencies()

## End(Not run)

Comparison tests considering NA as values to be compared

Description

is_different() and is_equal() performs comparison tests, considering NA values as legitimate values (see examples).

Usage

is_different(x, y)

is_equal(x, y)

cumdifferent(x)

num_cycle(x)

Arguments

x, y

Vectors to be compared.

Details

cum_different() allows to identify groups of continuous rows that have the same value. num_cycle() could be used to identify sub-groups that respect a certain condition (see examples).

is_equal(x, y) is equivalent to (x == y & !is.na(x) & !is.na(y)) | (is.na(x) & is.na(y)), and is_different(x, y) is equivalent to (x != y & !is.na(x) & !is.na(y)) | xor(is.na(x), is.na(y)).

Value

A vector of the same length as x.

Examples

v <- c("a", "b", NA)
is_different(v, "a")
is_different(v, NA)
is_equal(v, "a")
is_equal(v, NA)
d <- dplyr::tibble(group = c("a", "a", "b", "b", "a", "b", "c", "a"))
d |>
  dplyr::mutate(
    subgroup = cumdifferent(group),
    sub_a = num_cycle(group == "a")
  )

Add leading zeros

Description

Add leading zeros

Usage

leading_zeros(x, left_digits = NULL, digits = 0, prefix = "", suffix = "", ...)

Arguments

x

a numeric vector

left_digits

number of digits before decimal point, automatically computed if not provided

digits

number of digits after decimal point

prefix, suffix

Symbols to display before and after value

...

additional parameters passed to base::formatC(), as big.mark or decimal.mark

Value

A character vector of the same length as x.

See Also

base::formatC(), base::sprintf()

Examples

v <- c(2, 103.24, 1042.147, 12.4566, NA)
leading_zeros(v)
leading_zeros(v, digits = 1)
leading_zeros(v, left_digits = 6, big.mark = " ")
leading_zeros(c(0, 6, 12, 18), prefix = "M")

Transform a data frame from long format to period format

Description

Transform a data frame from long format to period format

Usage

long_to_periods(data, id, start, stop = NULL, by = NULL)

Arguments

data

A data frame, or a data frame extension (e.g. a tibble).

id

<tidy-select> Column containing individual ids

start

<tidy-select> Time variable indicating the beginning of each row

stop

<tidy-select> Optional time variable indicating the end of each row. If not provided, it will be derived from the dataset, considering that each row ends at the beginning of the next one.

by

<tidy-select> Co-variables to consider (optional)

Value

A tibble.

See Also

periods_to_long()

Examples

d <- dplyr::tibble(
  patient = c(1, 2, 3, 3, 4, 4, 4),
  begin = c(0, 0, 0, 1, 0, 36, 39),
  end = c(50, 6, 1, 16, 36, 39, 45),
  covar = c("no", "no", "no", "yes", "no", "yes", "yes")
)
d

d |> long_to_periods(id = patient, start = begin, stop = end)
d |> long_to_periods(id = patient, start = begin, stop = end, by = covar)

# If stop not provided, it is deduced.
# However, it considers that observation ends at the last start time.
d |> long_to_periods(id = patient, start = begin)

Plot observed vs predicted distribution of a fitted model

Description

Plot observed vs predicted distribution of a fitted model

Usage

observed_vs_theoretical(model)

Arguments

model

A statistical model.

Details

Has been tested with stats::lm() and stats::glm() models. It may work with other types of models, but without any warranty.

Value

A ggplot2 plot.

Examples

# a linear model
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
mod |> observed_vs_theoretical()

# a logistic regression
mod <- glm(
  as.factor(Survived) ~ Class + Sex,
  data = titanic,
  family = binomial()
)
mod |> observed_vs_theoretical()

Transform a data frame from period format to long format

Description

Transform a data frame from period format to long format

Usage

periods_to_long(
  data,
  start,
  stop,
  time_step = 1,
  time_name = "time",
  keep = FALSE
)

Arguments

data

A data frame, or a data frame extension (e.g. a tibble).

start

<tidy-select> Time variable indicating the beginning of each row

stop

<tidy-select> Optional time variable indicating the end of each row. If not provided, it will be derived from the dataset, considering that each row ends at the beginning of the next one.

time_step

(numeric) Desired value for the time variable.

time_name

(character) Name of the time variable.

keep

(logical) Should start and stop variable be kept in the results?

Value

A tibble.

See Also

long_to_periods()

Examples

d <- dplyr::tibble(
  patient = c(1, 2, 3, 3),
  begin = c(0, 2, 0, 3),
  end = c(6, 4, 2, 8),
  covar = c("no", "yes", "no", "yes")
)
d

d |> periods_to_long(start = begin, stop = end)
d |> periods_to_long(start = begin, stop = end, time_step = 5)

Plot inertia, absolute loss and relative loss from a classification tree

Description

Plot inertia, absolute loss and relative loss from a classification tree

Usage

plot_inertia_from_tree(tree, k_max = 15)

get_inertia_from_tree(tree, k_max = 15)

Arguments

tree

A dendrogram, i.e. an stats::hclust object, an FactoMineR::HCPC object or an object that can be converted to an stats::hclust object with stats::as.hclust().

k_max

Maximum number of clusters to return / plot.

Value

A ggplot2 plot or a tibble.

Examples

hc <- hclust(dist(USArrests))
get_inertia_from_tree(hc)
plot_inertia_from_tree(hc)

Compute proportions

Description

proportion() lets you quickly count observations (like dplyr::count()) and compute relative proportions. Proportions are computed separately by group (see examples).

Usage

proportion(data, ...)

## S3 method for class 'data.frame'
proportion(
  data,
  ...,
  .by = NULL,
  .na.rm = FALSE,
  .weight = NULL,
  .scale = 100,
  .sort = FALSE,
  .drop = FALSE,
  .conf.int = FALSE,
  .conf.level = 0.95,
  .options = list(correct = TRUE)
)

## S3 method for class 'survey.design'
proportion(
  data,
  ...,
  .by = NULL,
  .na.rm = FALSE,
  .scale = 100,
  .sort = FALSE,
  .conf.int = FALSE,
  .conf.level = 0.95,
  .options = NULL
)

## Default S3 method:
proportion(
  data,
  ...,
  .na.rm = FALSE,
  .scale = 100,
  .sort = FALSE,
  .drop = FALSE,
  .conf.int = FALSE,
  .conf.level = 0.95,
  .options = list(correct = TRUE)
)

Arguments

data

A vector, a data frame, data frame extension (e.g. a tibble), or a survey design object.

...

<data-masking> Variable(s) for those computing proportions.

.by

<tidy-select> Optional additional variables to group by (in addition to those eventually previously declared using dplyr::group_by()).

.na.rm

Should NA values be removed (from variables declared in ...)?

.weight

<data-masking> Frequency weights. Can be NULL or a variable.

.scale

A scaling factor applied to proportion. Use 1 for keeping proportions unchanged.

.sort

If TRUE, will show the highest proportions at the top.

.drop

If TRUE, will remove empty groups from the output.

.conf.int

If TRUE, will estimate confidence intervals.

.conf.level

Confidence level for the returned confidence intervals.

.options

Additional arguments passed to stats::prop.test() or srvyr::survey_prop().

Value

A tibble.

A tibble with one row per group.

Examples

# using a vector
titanic$Class |> proportion()

# univariable table
titanic |> proportion(Class)
titanic |> proportion(Class, .sort = TRUE)
titanic |> proportion(Class, .conf.int = TRUE)
titanic |> proportion(Class, .conf.int = TRUE, .scale = 1)

# bivariable table
titanic |> proportion(Class, Survived) # proportions of the total
titanic |> proportion(Survived, .by = Class) # row proportions
titanic |> # equivalent syntax
  dplyr::group_by(Class) |>
  proportion(Survived)

# combining 3 variables or more
titanic |> proportion(Class, Sex, Survived)
titanic |> proportion(Sex, Survived, .by = Class)
titanic |> proportion(Survived, .by = c(Class, Sex))

# missing values
dna <- titanic
dna$Survived[c(1:20, 500:530)] <- NA
dna |> proportion(Survived)
dna |> proportion(Survived, .na.rm = TRUE)


## SURVEY DATA ------------------------------------------------------

ds <- srvyr::as_survey(titanic)

# univariable table
ds |> proportion(Class)
ds |> proportion(Class, .sort = TRUE)
ds |> proportion(Class, .conf.int = TRUE)
ds |> proportion(Class, .conf.int = TRUE, .scale = 1)

# bivariable table
ds |> proportion(Class, Survived) # proportions of the total
ds |> proportion(Survived, .by = Class) # row proportions
ds |> dplyr::group_by(Class) |> proportion(Survived)

# combining 3 variables or more
ds |> proportion(Class, Sex, Survived)
ds |> proportion(Sex, Survived, .by = Class)
ds |> proportion(Survived, .by = c(Class, Sex))

# missing values
dsna <- srvyr::as_survey(dna)
dsna |> proportion(Survived)
dsna |> proportion(Survived, .na.rm = TRUE)

Round values while preserve their rounded sum in R

Description

Sometimes, the sum of rounded numbers (e.g., using base::round()) is not the same as their rounded sum.

Usage

round_preserve_sum(x, digits = 0)

Arguments

x

Numerical vector to sum.

digits

Number of decimals for rounding.

Details

This solution applies the following algorithm

  • Round down to the specified number of decimal places

  • Order numbers by their remainder values

  • Increment the specified decimal place of values with k largest remainders, where k is the number of values that must be incremented to preserve their rounded sum

Value

A numerical vector of same length as x.

Source

https://biostatmatt.com/archives/2902

Examples

sum(c(0.333, 0.333, 0.334))
round(c(0.333, 0.333, 0.334), 2)
sum(round(c(0.333, 0.333, 0.334), 2))
round_preserve_sum(c(0.333, 0.333, 0.334), 2)
sum(round_preserve_sum(c(0.333, 0.333, 0.334), 2))

Apply step(), taking into account missing values

Description

When your data contains missing values, concerned observations are removed from a model. However, then at a later stage, you try to apply a descending stepwise approach to reduce your model by minimization of AIC, you may encounter an error because the number of rows has changed.

Usage

step_with_na(model, ...)

## Default S3 method:
step_with_na(model, ..., full_data = eval(model$call$data))

## S3 method for class 'svyglm'
step_with_na(model, ..., design)

Arguments

model

A model object.

...

Additional parameters passed to stats::step().

full_data

Full data frame used for the model, including missing data.

design

Survey design previously passed to survey::svyglm().

Details

step_with_na() applies the following strategy:

  • recomputes the models using only complete cases;

  • applies stats::step();

  • recomputes the reduced model using the full original dataset.

step_with_na() has been tested with stats::lm(), stats::glm(), nnet::multinom(), survey::svyglm() and survival::coxph(). It may be working with other types of models, but with no warranty.

In some cases, it may be necessary to provide the full dataset initially used to estimate the model.

step_with_na() may not work inside other functions. In that case, you may try to pass full_data to the function.

Value

The stepwise-selected model.

Examples

set.seed(42)
d <- titanic |>
  dplyr::mutate(
    Group = sample(
      c("a", "b", NA),
      dplyr::n(),
      replace = TRUE
    )
  )
mod <- glm(as.factor(Survived) ~ ., data = d, family = binomial())
# step(mod) should produce an error
mod2 <- step_with_na(mod)
mod2


## WITH SURVEY ---------------------------------------

library(survey)
ds <- d |>
  dplyr::mutate(Survived = as.factor(Survived)) |>
  srvyr::as_survey()
mods <- survey::svyglm(
  Survived ~ Class + Group + Sex,
  design = ds,
  family = quasibinomial()
)
mod2s <- step_with_na(mods, design = ds)
mod2s

Titanic data set in long format

Description

This titanic dataset is equivalent to datasets::Titanic |> dplyr::as_tibble() |> tidyr::uncount(n).

Usage

titanic

Format

An object of class tbl_df (inherits from tbl, data.frame) with 2201 rows and 4 columns.

See Also

datasets::Titanic


Remove row-wise grouping

Description

Remove row-wise grouping created with dplyr::rowwise() while preserving any other grouping declared with dplyr::group_by().

Usage

unrowwise(data)

Arguments

data

A data frame, data frame extension (e.g. a tibble), or a lazy data frame.

Value

A tibble.

Examples

titanic |> dplyr::rowwise()
titanic |> dplyr::rowwise() |> unrowwise()

titanic |> dplyr::group_by(Sex, Class) |> dplyr::rowwise()
titanic |> dplyr::group_by(Sex, Class) |> dplyr::rowwise() |> unrowwise()