--- author: "Joseph Larmarange" title: "Introduction to labelled" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to labelled} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- The purpose of the **labelled** package is to provide functions to manipulate metadata as variable labels, value labels and defined missing values using the `haven_labelled` and `haven_labelled_spss` classes introduced in `haven` package. These classes allow to add metadata (variable, value labels and SPSS-style missing values) to vectors. It should be noted that **value labels** doesn't imply that your vectors should be considered as categorical or continuous. Therefore, value labels are not intended to be use for data analysis. For example, before performing modeling, you should convert vectors with value labels into factors or into classic numeric/character vectors. Therefore, two main approaches could be considered. ![Two main approaches](approaches.png){width=100%} In **approach A**, `haven_labelled` vectors are converted into factors or into numeric/character vectors just after data import, using `unlabelled()`, `to_factor()` or `unclass()`. Then, data cleaning, recoding and analysis are performed using classic **R** vector types. In **approach B**, `haven_labelled` vectors are kept for data cleaning and coding, allowing to preserved original recoding, in particular if data should be reexported after that step. Functions provided by `labelled` will be useful for managing value labels. However, as in approach A, `haven_labelled` vectors will have to be converted into classic factors or numeric vectors before data analysis (in particular modeling) as this is the way categorical and continuous variables should be coded for analysis functions. ## Variable labels A variable label could be specified for any vector using `var_label()`. ```{r} library(labelled) var_label(iris$Sepal.Length) <- "Length of sepal" ``` It's possible to add a variable label to several columns of a data frame using a named list. ```{r} var_label(iris) <- list( Petal.Length = "Length of petal", Petal.Width = "Width of Petal" ) ``` To get the variable label, simply call `var_label()`. ```{r} var_label(iris$Petal.Width) var_label(iris) ``` To remove a variable label, use `NULL`. ```{r} var_label(iris$Sepal.Length) <- NULL ``` In **RStudio**, variable labels will be displayed in data viewer. ```{r, eval=FALSE} View(iris) ``` You can display and search through variable names and labels with `look_for()`: ```{r} look_for(iris) look_for(iris, "pet") look_for(iris, details = FALSE) ``` ## Value labels The first way to create a labelled vector is to use the `labelled()` function. It's not mandatory to provide a label for each value observed in your vector. You can also provide a label for values not observed. ```{r} v <- labelled( c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA), c(yes = 1, no = 3, "don't know" = 8, refused = 9) ) v ``` Use `val_labels()` to get all value labels and `val_label()` to get the value label associated with a specific value. ```{r} val_labels(v) val_label(v, 8) ``` `val_labels()` could also be used to modify all the value labels attached to a vector, while `val_label()` will update only one specific value label. ```{r} val_labels(v) <- c(yes = 1, nno = 3, bug = 5) v val_label(v, 3) <- "no" v ``` With `val_label()`, you can also add or remove specific value labels. ```{r} val_label(v, 2) <- "maybe" val_label(v, 5) <- NULL v ``` To remove all value labels, use `val_labels()` and `NULL`. The `haven_labelled` class will also be removed. ```{r} val_labels(v) <- NULL v ``` Adding a value label to a non labelled vector will apply `haven_labelled` class to it. ```{r} val_label(v, 1) <- "yes" v ``` Note that applying `val_labels()` to a factor will generate an error! ```{r, error = TRUE} f <- factor(1:3) f val_labels(f) <- c(yes = 1, no = 3) ``` You could also apply `val_labels()` to several columns of a data frame. ```{r} df <- data.frame(v1 = 1:3, v2 = c(2, 3, 1), v3 = 3:1) val_label(df, 1) <- "yes" val_label(df[, c("v1", "v3")], 2) <- "maybe" val_label(df[, c("v2", "v3")], 3) <- "no" val_labels(df) val_labels(df[, c("v1", "v3")]) <- c(YES = 1, MAYBE = 2, NO = 3) val_labels(df) val_labels(df) <- NULL val_labels(df) val_labels(df) <- list(v1 = c(yes = 1, no = 3), v2 = c(a = 1, b = 2, c = 3)) val_labels(df) ``` ## Sorting value labels Value labels are sorted by default in the order they have been created. ```{r} v <- c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA) val_label(v, 1) <- "yes" val_label(v, 3) <- "no" val_label(v, 9) <- "refused" val_label(v, 2) <- "maybe" val_label(v, 8) <- "don't know" v ``` It could be useful to reorder the value labels according to their attached values, with `sort_val_labels()`. ```{r} sort_val_labels(v) sort_val_labels(v, decreasing = TRUE) ``` If you prefer, you can also sort them according to the labels. ```{r} sort_val_labels(v, according_to = "l") ``` ## User defined missing values (SPSS's style) `haven` (>= 2.0.0) introduced an additional `haven_labelled_spss` class to deal with user defined missing values. In such case, additional attributes will be used to indicate with values should be considered as missing, but such values will not be stored as internal `NA` values. You should note that most R function will not take this information into account. Therefore, you will have to convert missing values into `NA` if required before analysis. These defined missing values could co-exist with internal `NA` values. It is possible to manipulate this missing values with `na_values()` and `na_range()`. Note that `is.na()` will return `TRUE` as well for user-defined missing values. ```{r} v <- labelled( c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA), c(yes = 1, no = 3, "don't know" = 9) ) v na_values(v) <- 9 na_values(v) v is.na(v) na_values(v) <- NULL v na_range(v) <- c(5, Inf) na_range(v) v ``` Since version 2.1.0, it is not mandatory to define at least one value label before defining missing values. ```{r} x <- c(1, 2, 2, 9) na_values(x) <- 9 x ``` To convert user defined missing values into `NA`, simply use `user_na_to_na()`. ```{r} v <- labelled_spss(1:10, c(Good = 1, Bad = 8), na_values = c(9, 10)) v v2 <- user_na_to_na(v) v2 ``` You can also remove user missing values definition without converting these values to `NA`. ```{r} v <- labelled_spss(1:10, c(Good = 1, Bad = 8), na_values = c(9, 10)) v v2 <- remove_user_na(v) v2 ``` or ```{r} v <- labelled_spss(1:10, c(Good = 1, Bad = 8), na_values = c(9, 10)) v na_values(v) <- NULL v ``` ## Other conversion to NA In some cases, values who don't have an attached value label could be considered as missing. `nolabel_to_na()` will convert them to `NA`. ```{r} v <- labelled(c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA), c(yes = 1, maybe = 2, no = 3)) v nolabel_to_na(v) ``` In other cases, a value label is attached only to specific values that corresponds to a missing value. For example: ```{r} size <- labelled(c(1.88, 1.62, 1.78, 99, 1.91), c("not measured" = 99)) size ``` In such cases, `val_labels_to_na()` could be appropriate. ```{r} val_labels_to_na(size) ``` These two functions could also be applied to an overall data frame. Only labelled vectors will be impacted. ## Converting to factor A labelled vector could easily be converted to a factor with `to_factor()`. ```{r} v <- labelled( c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA), c(yes = 1, no = 3, "don't know" = 8, refused = 9) ) v to_factor(v) ``` The `levels` argument allows to specify what should be used as the factor levels, i.e. the labels (default), the values or the labels prefixed with values. ```{r} to_factor(v, levels = "v") to_factor(v, levels = "p") ``` The `ordered` argument will create an ordinal factor. ```{r} to_factor(v, ordered = TRUE) ``` The argument `nolabel_to_na` specify if the corresponding function should be applied before converting to a factor. Therefore, the two following commands are equivalent. ```{r} to_factor(v, nolabel_to_na = TRUE) to_factor(nolabel_to_na(v)) ``` `sort_levels` specifies how the levels should be sorted: `"none"` to keep the order in which value labels have been defined, `"values"` to order the levels according to the values and `"labels"` according to the labels. `"auto"` (default) will be equivalent to `"none"` except if some values with no attached labels are found and are not dropped. In that case, `"values"` will be used. ```{r} to_factor(v, sort_levels = "n") to_factor(v, sort_levels = "v") to_factor(v, sort_levels = "l") ``` The function `to_labelled()` could be used to turn a factor into a labelled numeric vector. ```{r} f <- factor(1:3, labels = c("a", "b", "c")) to_labelled(f) ``` Note that `to_labelled(to_factor(v))` will not be equal to `v` due to the way factors are stored internally by **R**. ```{r} v to_labelled(to_factor(v)) ``` ## Other type of conversions You can use `to_character()` for converting into a character vector instead of a factor. ```{r} v to_character(v) ``` To remove the `haven_class`, you can simply use `unclass()`. ```{r} unclass(v) ``` Note that value labels will be preserved as an attribute to the vector. ```{r} remove_val_labels(v) ``` To remove value labels, use `remove_val_labels()`. ```{r} remove_val_labels(v) ``` Note that if your vector does have user-defined missing values, you may also want to use `remove_user_na()`. ```{r} x <- c(1, 2, 2, 9) na_values(x) <- 9 val_labels(x) <- c(yes = 1, no = 2) var_label(x) <- "A test variable" x remove_val_labels(x) remove_user_na(x) remove_user_na(x, user_na_to_na = TRUE) remove_val_labels(remove_user_na(x)) unclass(x) ``` You can remove all labels and user-defined missing values with `remove_labels()`. Use `keep_var_label = TRUE` to preserve only variable label. ```{r} remove_labels(x, user_na_to_na = TRUE) remove_labels(x, user_na_to_na = TRUE, keep_var_label = TRUE) ``` ## Conditional conversion to factors{#unlabelled} For any analysis, it is the responsibility of user to identify which labelled numeric vectors should be considered as **categorical** (and therefore converted into factors using `to_factor()`) and which variables should be treated as **continuous** (and therefore unclassed into numeric using `base::unclass()`). It should be noted that most functions expect categorical variables to be coded as factors. It includes most modeling functions (such as `stats::lm()` or `stats::glm()`) or plotting functions from `ggplot2`. In most of cases, if data documentation was properly done, categorical variables corresponds to vectors where all observed values have a value label while vectors where only few values have a value label should be considered as continuous. In that situation, you could apply the `unlabelled()` method to an overall data frame. By default, `unlabelled()` works as follow: - if a column doesn't inherit the `haven_labelled` class, it will be not affected; - if all observed values have a corresponding value label, the column will be converted into a factor (using `to_factor()`); - otherwise, the column will be unclassed (and converted back to a numeric or character vector by applying `base::unclass()`). ```{r} df <- data.frame( a = labelled(c(1, 1, 2, 3), labels = c(No = 1, Yes = 2)), b = labelled(c(1, 1, 2, 3), labels = c(No = 1, Yes = 2, DK = 3)), c = labelled(c(1, 1, 2, 2), labels = c(No = 1, Yes = 2, DK = 3)), d = labelled(c("a", "a", "b", "c"), labels = c(No = "a", Yes = "b")), e = labelled_spss( c(1, 9, 1, 2), labels = c(No = 1, Yes = 2), na_values = 9 ) ) df %>% look_for() unlabelled(df) %>% look_for() unlabelled(df, user_na_to_na = TRUE) %>% look_for() unlabelled(df, drop_unused_labels = TRUE) %>% look_for() ``` ## Importing labelled data In **haven** package, `read_spss`, `read_stata` and `read_sas` are natively importing data using the `labelled` class and the `label` attribute for variable labels. Functions from **foreign** package could also import some metadata from **SPSS** and **Stata** files. `to_labelled` can convert data imported with **foreign** into a labelled data frame. However, there are some limitations compared to using **haven**: - For **SPSS** files, it will be better to set `use.value.labels = FALSE`, `to.data.frame = FALSE` and `use.missings = FALSE` when calling `read.spss`. If `use.value.labels = TRUE`, variable with value labels will be converted into factors by `read.spss` (and kept as factors by `foreign_to_label`). If `to.data.frame = TRUE`, meta data describing the missing values will not be imported. If `use.missings = TRUE`, missing values would have been converted to `NA` by `read.spss`. - For **Stata** files, set `convert.factors = FALSE` when calling `read.dta` to avoid conversion of variables with value labels into factors. So far, missing values defined in Stata are always imported as `NA` by `read.dta` and could not be retrieved by `foreign_to_labelled`. The **memisc** package provide functions to import variable metadata and store them in specific object of class `data.set`. The `to_labelled` method can convert a data.set into a labelled data frame. ```{r, eval=FALSE} # from foreign library(foreign) df <- to_labelled(read.spss( "file.sav", to.data.frame = FALSE, use.value.labels = FALSE, use.missings = FALSE )) df <- to_labelled(read.dta( "file.dta", convert.factors = FALSE )) # from memisc library(memisc) nes1948.por <- UnZip("anes/NES1948.ZIP", "NES1948.POR", package = "memisc") nes1948 <- spss.portable.file(nes1948.por) df <- to_labelled(nes1948) ds <- as.data.set(nes19480) df <- to_labelled(ds) ``` ## Using labelled with dplyr/magrittr If you are using the `%>%` operator, you can use the functions `set_variable_labels()`, `set_value_labels()`, `add_value_labels()` and `remove_value_labels()`. ```{r} library(dplyr) df <- tibble(s1 = c("M", "M", "F"), s2 = c(1, 1, 2)) %>% set_variable_labels(s1 = "Sex", s2 = "Question") %>% set_value_labels(s1 = c(Male = "M", Female = "F"), s2 = c(Yes = 1, No = 2)) df$s2 ``` `set_value_labels()` will replace the list of value labels while `add_value_labels()` will update it. ```{r} df <- df %>% set_value_labels(s2 = c(Yes = 1, "Don't know" = 8, Unknown = 9)) df$s2 df <- df %>% add_value_labels(s2 = c(No = 2)) df$s2 ``` You can also remove some variable and/or value labels. ```{r} df <- df %>% set_variable_labels(s1 = NULL) # removing one value label df <- df %>% remove_value_labels(s2 = 2) df$s2 # removing several value labels df <- df %>% remove_value_labels(s2 = 8:9) df$s2 # removing all value labels df <- df %>% set_value_labels(s2 = NULL) df$s2 ``` To convert variables, the easiest is to use `unlabelled()`. ```{r} library(questionr) data(fertility) glimpse(women) glimpse(women %>% unlabelled()) ``` Alternatively, you can use functions as `dplyr::mutate()` + `dplyr::across()`. See the example below. ```{r} glimpse(to_factor(women)) glimpse(women %>% mutate(across(where(is.labelled), to_factor))) glimpse(women %>% mutate(across(employed:religion, to_factor))) ```