| Title: | Guillem Hurault Functions' Library |
|---|---|
| Description: | Contains various functions for data analysis, notably helpers and diagnostics for Bayesian modelling using Stan. |
| Authors: | Guillem Hurault [aut, cre] (ORCID: <https://orcid.org/0000-0002-1052-3564>) |
| Maintainer: | Guillem Hurault <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.2.1.9000 |
| Built: | 2026-06-04 07:36:07 UTC |
| Source: | https://github.com/ghurault/huraultmisc |
Compute whether x and y are approximately equal given a tolerance level
approx_equal(x, y, tol = .Machine$double.eps^0.5) x %~% yapprox_equal(x, y, tol = .Machine$double.eps^0.5) x %~% y
x |
Numeric scalar. |
y |
Numeric scalar. |
tol |
Tolerance. |
Boolean
approx_equal(1, 1) 1 %~% (1 + 1e-16) 1 %~% 1.01approx_equal(1, 1) 1 %~% (1 + 1e-16) 1 %~% 1.01
Shortcut for c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7").
cbbPalettecbbPalette
An object of class character of length 8.
library(ggplot2) library(dplyr) df <- data.frame(palette = HuraultMisc::cbbPalette) %>% mutate(palette = factor(palette, levels = palette)) ggplot(data = df, aes(x = palette, fill = palette, y = 0)) + geom_tile() + scale_fill_manual(values = levels(df$palette)) + coord_cartesian(expand = FALSE) + labs(x = "", y = "") + theme_classic(base_size = 15) + theme(legend.position = "none")library(ggplot2) library(dplyr) df <- data.frame(palette = HuraultMisc::cbbPalette) %>% mutate(palette = factor(palette, levels = palette)) ggplot(data = df, aes(x = palette, fill = palette, y = 0)) + geom_tile() + scale_fill_manual(values = levels(df$palette)) + coord_cartesian(expand = FALSE) + labs(x = "", y = "") + theme_classic(base_size = 15) + theme(legend.position = "none")
Estimate calibration given forecasts and corresponding outcomes
compute_calibration( forecast, outcome, method = c("smoothing", "binning"), CI = NULL, binwidth = NULL, ... )compute_calibration( forecast, outcome, method = c("smoothing", "binning"), CI = NULL, binwidth = NULL, ... )
forecast |
Vector of probability forecasts. |
outcome |
Vector of observations (0 or 1). |
method |
Method used to estimate calibration, either |
CI |
Confidence level (e.g. 0.95).
CI not computed if |
binwidth |
Binwidth when calibration is estimated by binning.
If |
... |
Arguments of |
Dataframe with columns Forecast (bins), Frequency (frequency of outcomes in the bin),
Lower (lower bound of the CI) and Upper (upper bound of the CI).
N <- 1e4 f <- rbeta(N, 1, 1) o <- sapply(f, function(x) { rbinom(1, 1, x) }) lapply( c("binning", "smoothing"), function(m) { cal <- compute_calibration(f, o, method = m) with(cal, plot(Forecast, Frequency, type = "l")) abline(c(0, 1), col = "red") } )N <- 1e4 f <- rbeta(N, 1, 1) o <- sapply(f, function(x) { rbinom(1, 1, x) }) lapply( c("binning", "smoothing"), function(m) { cal <- compute_calibration(f, o, method = m) with(cal, plot(Forecast, Frequency, type = "l")) abline(c(0, 1), col = "red") } )
The resolution is computed as the mean squared distance to a base rate (reference forecast) and is then normalised by the uncertainty (maximum resolution). This means the output is between 0 and 1, 1 corresponding to the maximum resolution.
compute_resolution(f, p0)compute_resolution(f, p0)
f |
Vector of forecasts |
p0 |
Vector of base rate. In the case rate is usually the prevalence of a uniform forecast (e.g. 1 / number of categories) but can depend on the observation (hence the vector). |
Vector of resolution values
compute_resolution(seq(0, 1, .1), 0.5)compute_resolution(seq(0, 1, .1), 0.5)
Compute RPS for a single forecast
compute_RPS(forecast, outcome)compute_RPS(forecast, outcome)
forecast |
Vector of length N (forecast). |
outcome |
Index of the true outcome (between 1 and N). |
RPS (numeric scalar)
compute_RPS(c(.2, .5, .3), 2)compute_RPS(c(.2, .5, .3), 2)
The function returns a global estimate rather than an estimate for each sample, since the predictive means and residual variance are estimated from replications instead of being computed on a per-sample basis.
compute_rsquared(yrep)compute_rsquared(yrep)
yrep |
Matrix with rows representing samples and columns representing observations |
Bayesian R-squared (scalar, between 0 and 1)
Gelman, A. et al. (2019) "R-squared for Bayesian Regression Models", The American Statistician, 73(3), pp. 307–309. doi:10.1080/00031305.2018.1549100.
N <- 50 N_sample <- 1e2 y <- runif(N, 0, 10) yrep <- do.call( cbind, lapply( 1:N, function(i) { y[i] + rnorm(N_sample) } ) ) compute_rsquared(yrep)N <- 50 N_sample <- 1e2 y <- runif(N, 0, 10) yrep <- do.call( cbind, lapply( 1:N, function(i) { y[i] + rnorm(N_sample) } ) ) compute_rsquared(yrep)
Compute and plot coverage of CI for different confidence level. Useful for fake data check.
compute_coverage( post_samples, truth, CI = seq(0, 1, 0.05), type = c("eti", "hdi") ) plot_coverage( post_samples, truth, CI = seq(0, 1, 0.05), type = c("eti", "hdi") )compute_coverage( post_samples, truth, CI = seq(0, 1, 0.05), type = c("eti", "hdi") ) plot_coverage( post_samples, truth, CI = seq(0, 1, 0.05), type = c("eti", "hdi") )
post_samples |
Matrix of posterior samples. Rows represent a sample and columns represent variables. |
truth |
Vector of true parameter values (should be the same length as the number of columns in |
CI |
Vector of confidence levels. |
type |
Type of confidence intervals: either "eti" (equal-tailed intervals) or "hdi" (highest density intervals). |
compute_coverage returns a Dataframe containing coverage (and 95% uncertainty interval for the coverage) for different confidence level (nominal coverage).
plot_coverage returns a ggplot of the coverage as the function of the nominal coverage with 95% uncertainty interval.
N <- 100 N_post <- 1e3 truth <- rep(0, N) post_samples <- sapply(rnorm(N, 0, 1), function(x) { rnorm(N_post, x, 1) }) compute_coverage(post_samples, truth) plot_coverage(post_samples, truth)N <- 100 N_post <- 1e3 truth <- rep(0, N) post_samples <- sapply(rnorm(N, 0, 1), function(x) { rnorm(N_post, x, 1) }) compute_coverage(post_samples, truth) plot_coverage(post_samples, truth)
Compute empirical p-values
empirical_pval(t_rep, t, alternative = c("two.sided", "less", "greater"))empirical_pval(t_rep, t, alternative = c("two.sided", "less", "greater"))
t_rep |
Vector of samples from a distribution. |
t |
Observation (numeric scalar). |
alternative |
Indicates the alternative hypothesis: must be one of "two.sided", "greater" or "less". |
Empirical p-value.
empirical_pval(rnorm(1e2), 2)empirical_pval(rnorm(1e2), 2)
Extract confidence intervals from a vector of samples
extract_ci(x, CI_level = seq(0.1, 0.9, 0.1), type = c("eti", "hdi"))extract_ci(x, CI_level = seq(0.1, 0.9, 0.1), type = c("eti", "hdi"))
x |
Vector of samples from a distribution. |
CI_level |
Vector containing the level of the confidence/credible intervals. |
type |
"eti" for equal-tailed intervals and "hdi" for highest density intervals. |
Dataframe with columns: Lower, Upper, Level.
x <- rexp(1e4) extract_ci(x, type = "eti") extract_ci(x, type = "hdi")x <- rexp(1e4) extract_ci(x, type = "eti") extract_ci(x, type = "hdi")
The distribution can be extracted as:
a probability density function (type = "continuous"), cf. extract_pdf().
a probability mass function (type = "discrete"), cf. extract_pmf().
a series of equal-tailed confidence/credible intervals (type = "eti"), cf. extract_ci().
a series of highest density confidence/credible intervals (type = "hdi"), cf. extract_ci().
extract_distribution( object, parName = "", type = c("continuous", "discrete", "eti", "hdi"), transform = identity, ... )extract_distribution( object, parName = "", type = c("continuous", "discrete", "eti", "hdi"), transform = identity, ... )
object |
Object specifying the distribution as samples: can be a Stanfit object, a matrix (columns represents parameters, rows samples) or a vector. |
parName |
Name of the parameter to extract. |
type |
Indicates how the distribution is summarised. |
transform |
Function to apply to the samples. |
... |
Arguments to pass to |
Dataframe. The columns depends on the method that is used (see specific function for details).
This function can notably be used to prepare the data for plotting fan charts when type = "eti" or "hdi".
In that case, the ggdist package offers an alternative with ggdist::stat_lineribbon().
extract_draws() for extracting draws of an object.
extract_distribution(runif(1e2), type = "continuous", support = c(0, 1))extract_distribution(runif(1e2), type = "continuous", support = c(0, 1))
Extract parameters' draws
extract_draws(obj, draws)extract_draws(obj, draws)
obj |
Array/Vector/Matrix of draws (cf. first dimension) or list of it. |
draws |
Vector of draws to extract. |
Dataframe with columns: Draw, Index, Value and Parameter.
x <- rnorm(1e3) X <- matrix(x, ncol = 10) a <- array(rnorm(80), dim = c(10, 2, 2, 2)) extract_draws(x, sample(1:length(x), 10)) extract_draws(X, sample(1:nrow(X), 10)) %>% head() extract_draws(a, sample(1:10, 5)) %>% head() extract_draws(list(x = x, X = X, a = a), 1:10) %>% head()x <- rnorm(1e3) X <- matrix(x, ncol = 10) a <- array(rnorm(80), dim = c(10, 2, 2, 2)) extract_draws(x, sample(1:length(x), 10)) extract_draws(X, sample(1:nrow(X), 10)) %>% head() extract_draws(a, sample(1:10, 5)) %>% head() extract_draws(list(x = x, X = X, a = a), 1:10) %>% head()
Extract multiple indices inside bracket(s) as a list
extract_index_nd(x, dim_names = NULL)extract_index_nd(x, dim_names = NULL)
x |
Character vector. |
dim_names |
Optional character vector of dimension names.
If |
Dataframe with columns:
Variable, containing x where brackets have been removed
Index, a list containing values within the brackets.
If dim_names is not NULL, Index is replaced by columns with names dim_names containing numeric values.
extract_index_nd(c("sigma", "sigma[1]", "sigma[1, 1]", "sigma[1][2]"))extract_index_nd(c("sigma", "sigma[1]", "sigma[1, 1]", "sigma[1][2]"))
Extract probability density function from vector of samples
extract_pdf(x, support = NULL, n_density = 2^7)extract_pdf(x, support = NULL, n_density = 2^7)
x |
Vector of samples from a distribution. |
support |
Vector of length 2 corresponding to the range of the distribution. Can be NULL. |
n_density |
Number of equally spaced points at which the density is to be estimated (better to use a power of 2). |
Dataframe with columns: Value, Density.
extract_pdf(rnorm(1e3)) %>% head()extract_pdf(rnorm(1e3)) %>% head()
Extract probability mass function from vector of samples
extract_pmf(x, support = NULL)extract_pmf(x, support = NULL)
x |
Vector of samples from a distribution. |
support |
Vector of all possible values that the distribution can take. Can be NULL. |
Dataframe with columns: Value, Probability.
extract_pmf(round(rnorm(1e3, 0, 10))) %>% head()extract_pmf(round(rnorm(1e3, 0, 10))) %>% head()
Change the type of the column of a dataframe from factor to numeric
factor_to_numeric(df, factor_name)factor_to_numeric(df, factor_name)
df |
Dataframe. |
factor_name |
Vector of names of factors to change to numeric. |
Same dataframe with type of the given columns changed to numeric.
df <- data.frame(A = rep(1:5, each = 10)) df$A <- factor(df$A) str(df) df <- factor_to_numeric(df, "A") str(df)df <- data.frame(A = rep(1:5, each = 10)) df$A <- factor(df$A) str(df) df <- factor_to_numeric(df, "A") str(df)
Test whether x is of length 1
is_scalar(x)is_scalar(x)
x |
Object to be tested. |
Logical
is_scalar(1) # TRUE is_scalar("a") # TRUE is_scalar(c(1, 2)) # FALSEis_scalar(1) # TRUE is_scalar("a") # TRUE is_scalar(c(1, 2)) # FALSE
Test whether an object is of class "stanfit"
is_stanfit(obj)is_stanfit(obj)
obj |
Object. |
Boolean
is_wholenumber() uses base::round() to test whether x is a whole number,
it will therefore issue an error if x is not of mode numeric.
If used in base::stopifnot() for example, this won't be a problem but it may be in conditionals.
is_scalar_wholenumber() comes with the additional argument check_numeric
to check whether x is a numeric before checking it is a whole number.
is_wholenumber(x, tol = .Machine$double.eps^0.5) is_scalar_wholenumber(x, check_numeric = TRUE, ...)is_wholenumber(x, tol = .Machine$double.eps^0.5) is_scalar_wholenumber(x, check_numeric = TRUE, ...)
x |
Object to be tested |
tol |
Tolerance |
check_numeric |
Whether to check whether |
... |
Arguments to pass to |
Logical
is_wholenumber(1) # TRUE is_wholenumber(1.0) # TRUE is_wholenumber(1.1) # FALSE is_scalar_wholenumber(1) # TRUE is_scalar_wholenumber(c(1, 2)) # FALSEis_wholenumber(1) # TRUE is_wholenumber(1.0) # TRUE is_wholenumber(1.1) # FALSE is_scalar_wholenumber(1) # TRUE is_scalar_wholenumber(c(1, 2)) # FALSE
Logit and Inverse logit
logit(x) inv_logit(x)logit(x) inv_logit(x)
x |
Numeric vector. |
Numeric vector.
logit(0.5) inv_logit(0)logit(0.5) inv_logit(0)
Compute and plot posterior predictive p-value (Bayesian p-value) from samples of a distribution. The simulations and observations are first summarised into a test statistics, then the test statistic of the observations is compared to the test statistic of the empirical distribution.
post_pred_pval( yrep, y, test_statistic = mean, alternative = c("two.sided", "less", "greater"), plot = FALSE )post_pred_pval( yrep, y, test_statistic = mean, alternative = c("two.sided", "less", "greater"), plot = FALSE )
yrep |
Matrix of posterior replications with rows corresponding to samples and columns to simulated observations. |
y |
Vector of observations. |
test_statistic |
Function of the test statistic to compute the p-value for. |
alternative |
Indicates the alternative hypothesis: must be one of |
plot |
Whether to output a plot visualising the distribution of the test statistic. |
List containing the p-value and (optionally) a ggplot.
post_pred_pval(matrix(rnorm(1e3), ncol = 10), rnorm(10), plot = TRUE)post_pred_pval(matrix(rnorm(1e3), ncol = 10), rnorm(10), plot = TRUE)
Plot the distribution density of parameters within a same group from a single/multiple draw of the posterior distribution. In the case of a hierarchical model, we might look at the distribution of patient parameter and compare it to the prior for the population distribution.
PPC_group_distribution(obj, parName = "", nDraws = 1)PPC_group_distribution(obj, parName = "", nDraws = 1)
obj |
Matrix (rows: samples, cols: parameter) or Stanfit object. |
parName |
Name of the observation-dependent (e.g. patient-dependent) parameter to consider (optional when |
nDraws |
Number of draws to plot |
Ggplot of the distribution
A. Gelman, J. B. B. Carlin, H. S. S. Stern, and D. B. B. Rubin, Bayesian Data Analysis (Chapter 6), Third Edition, 2014.
X <- matrix(rnorm(1e3), ncol = 10) PPC_group_distribution(X, "", 10)X <- matrix(rnorm(1e3), ncol = 10) PPC_group_distribution(X, "", 10)
combine_prior_posterior subsets and binds the prior and posterior dataframes.
plot_prior_posterior plots posterior CI alongside prior CI.
compute_prior_influence computes diagnostics of how the posterior is influenced by the prior.
plot_prior_influence plots diagnostics from compute_prior_influence.
check_model_sensitivity is a deprecated alias of plot_prior_influence.
combine_prior_posterior(prior, post, pars = NULL, match_exact = TRUE) plot_prior_posterior( prior, post, pars = NULL, match_exact = TRUE, lb = "5%", ub = "95%" ) compute_prior_influence( prior, post, pars = NULL, match_exact = TRUE, remove_index_prior = TRUE ) plot_prior_influence(prior, post, pars = NULL, match_exact = TRUE) check_model_sensitivity(prior, post, pars = NULL)combine_prior_posterior(prior, post, pars = NULL, match_exact = TRUE) plot_prior_posterior( prior, post, pars = NULL, match_exact = TRUE, lb = "5%", ub = "95%" ) compute_prior_influence( prior, post, pars = NULL, match_exact = TRUE, remove_index_prior = TRUE ) plot_prior_influence(prior, post, pars = NULL, match_exact = TRUE) check_model_sensitivity(prior, post, pars = NULL)
prior |
Dataframe of prior parameter estimates.
The dataframe is expected to have columns |
post |
Dataframe of posterior parameter estimates, with same columns as |
pars |
Vector of parameter names to plot. Defaults to all parameters presents in |
match_exact |
Logical indicating whether parameters should be matched exactly (e.g. |
lb |
Name of the column in |
ub |
Name of the column in |
remove_index_prior |
Whether to remove the index variable for |
Posterior shrinkage (PostShrinkage = 1 - Var(Post) / Var(Prior)), capturing how much the model is learning.
Shrinkage near 0 indicates that the data provides little information beyond the prior.
Shrinkage near 1 indicates that the data is much more informative than the prior.
'Mahalanobis' distance between the mean posterior and the prior (DistPrior), capturing whether the prior "includes" the posterior.
combine_prior_posterior returns a dataframe with the same columns as in prior and post and a column Distribution.
compute_prior_influence returns a dataframe with columns: Variable, Index, PostShrinkage, DistPrior.
plot_prior_posterior and plot_prior_influence returns a ggplot object.
For plot_prior_posterior, parameters with the same name but different indices are plotted together.
If their prior distribution is the same, it can be useful to only keep one index in prior.
If not, we can use match_exact = FALSE to plot parameter[1] and parameter[2] separately.
M. Betancourt, “Towards a Principled Bayesian Workflow”, 2018.
library(dplyr) prior <- data.frame( Variable = c("a", "b"), Mean = c(0, 0), sd = c(10, 5), Index = c(NA, NA) ) %>% mutate( `5%` = qnorm(.05, mean = Mean, sd = sd), `95%` = qnorm(.95, mean = Mean, sd = sd) ) post <- data.frame( Variable = c("a", "a", "b"), Mean = c(-1, 1, 2), sd = c(3, 4, 1), Index = c(1, 2, NA) ) %>% mutate( `5%` = qnorm(.05, mean = Mean, sd = sd), `95%` = qnorm(.95, mean = Mean, sd = sd) ) plot_prior_posterior(prior, post) plot_prior_influence(prior, post, pars = c("a", "b"))library(dplyr) prior <- data.frame( Variable = c("a", "b"), Mean = c(0, 0), sd = c(10, 5), Index = c(NA, NA) ) %>% mutate( `5%` = qnorm(.05, mean = Mean, sd = sd), `95%` = qnorm(.95, mean = Mean, sd = sd) ) post <- data.frame( Variable = c("a", "a", "b"), Mean = c(-1, 1, 2), sd = c(3, 4, 1), Index = c(1, 2, NA) ) %>% mutate( `5%` = qnorm(.05, mean = Mean, sd = sd), `95%` = qnorm(.95, mean = Mean, sd = sd) ) plot_prior_posterior(prior, post) plot_prior_influence(prior, post, pars = c("a", "b"))
Extract summary statistics
summary_statistics(fit, pars, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))summary_statistics(fit, pars, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
fit |
Stanfit object. |
pars |
Character vector of parameters to extract. Defaults to all parameters. |
probs |
Numeric vector of quantiles to extract. |
Dataframe of posterior summary statistics
The tidybayes package offers an alternative to this function, for example:
fit %>% tidy_draws() %>% gather_variables() %>% mean_qi().
However, this does not provide information about Rhat or Neff, nor does it process the indexes.
The tidybayes package is more useful for summarising the distribution of a handful of parameters (using tidybayes::spread_draws()).