Title: | Guillem Hurault Functions' Library |
---|---|
Description: | Contains various functions for data analysis, notably helpers and diagnostics for Bayesian modelling using Stan. |
Authors: | Guillem Hurault [aut, cre] |
Maintainer: | Guillem Hurault <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.2.9000 |
Built: | 2025-02-03 06:11:30 UTC |
Source: | https://github.com/ghurault/huraultmisc |
Compute whether x and y are approximately equal given a tolerance level
approx_equal(x, y, tol = .Machine$double.eps^0.5) x %~% y
approx_equal(x, y, tol = .Machine$double.eps^0.5) x %~% y
x |
Numeric scalar. |
y |
Numeric scalar. |
tol |
Tolerance. |
Boolean
approx_equal(1, 1) 1 %~% (1 + 1e-16) 1 %~% 1.01
approx_equal(1, 1) 1 %~% (1 + 1e-16) 1 %~% 1.01
Shortcut for c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
.
cbbPalette
cbbPalette
An object of class character
of length 8.
library(ggplot2) library(dplyr) df <- data.frame(palette = HuraultMisc::cbbPalette) %>% mutate(palette = factor(palette, levels = palette)) ggplot(data = df, aes(x = palette, fill = palette, y = 0)) + geom_tile() + scale_fill_manual(values = levels(df$palette)) + coord_cartesian(expand = FALSE) + labs(x = "", y = "") + theme_classic(base_size = 15) + theme(legend.position = "none")
library(ggplot2) library(dplyr) df <- data.frame(palette = HuraultMisc::cbbPalette) %>% mutate(palette = factor(palette, levels = palette)) ggplot(data = df, aes(x = palette, fill = palette, y = 0)) + geom_tile() + scale_fill_manual(values = levels(df$palette)) + coord_cartesian(expand = FALSE) + labs(x = "", y = "") + theme_classic(base_size = 15) + theme(legend.position = "none")
Change column names of a dataframe
change_colnames(df, current_names, new_names)
change_colnames(df, current_names, new_names)
df |
Dataframe |
current_names |
Vector of column names to change. |
new_names |
Vector of new names. |
Dataframe with new column names
## Not run: df <- data.frame(A = 1:2, B = 3:4, C = 5:6) df <- change_colnames(df, c("A", "C"), c("Aa", "Cc")) ## End(Not run)
## Not run: df <- data.frame(A = 1:2, B = 3:4, C = 5:6) df <- change_colnames(df, c("A", "C"), c("Aa", "Cc")) ## End(Not run)
Estimate calibration given forecasts and corresponding outcomes
compute_calibration( forecast, outcome, method = c("smoothing", "binning"), CI = NULL, binwidth = NULL, ... )
compute_calibration( forecast, outcome, method = c("smoothing", "binning"), CI = NULL, binwidth = NULL, ... )
forecast |
Vector of probability forecasts. |
outcome |
Vector of observations (0 or 1). |
method |
Method used to estimate calibration, either |
CI |
Confidence level (e.g. 0.95).
CI not computed if |
binwidth |
Binwidth when calibration is estimated by binning.
If |
... |
Arguments of |
Dataframe with columns Forecast
(bins), Frequency
(frequency of outcomes in the bin),
Lower
(lower bound of the CI) and Upper
(upper bound of the CI).
N <- 1e4 f <- rbeta(N, 1, 1) o <- sapply(f, function(x) {rbinom(1, 1, x)}) lapply(c("binning", "smoothing"), function(m) { cal <- compute_calibration(f, o, method = m) with(cal, plot(Forecast, Frequency, type = "l")) abline(c(0, 1), col = "red") })
N <- 1e4 f <- rbeta(N, 1, 1) o <- sapply(f, function(x) {rbinom(1, 1, x)}) lapply(c("binning", "smoothing"), function(m) { cal <- compute_calibration(f, o, method = m) with(cal, plot(Forecast, Frequency, type = "l")) abline(c(0, 1), col = "red") })
The resolution is computed as the mean squared distance to a base rate (reference forecast) and is then normalised by the uncertainty (maximum resolution). This means the output is between 0 and 1, 1 corresponding to the maximum resolution.
compute_resolution(f, p0)
compute_resolution(f, p0)
f |
Vector of forecasts |
p0 |
Vector of base rate. In the case rate is usually the prevalence of a uniform forecast (e.g. 1 / number of categories) but can depend on the observation (hence the vector). |
Vector of resolution values
compute_resolution(seq(0, 1, .1), 0.5)
compute_resolution(seq(0, 1, .1), 0.5)
Compute RPS for a single forecast
compute_RPS(forecast, outcome)
compute_RPS(forecast, outcome)
forecast |
Vector of length N (forecast). |
outcome |
Index of the true outcome (between 1 and N). |
RPS (numeric scalar)
compute_RPS(c(.2, .5, .3), 2)
compute_RPS(c(.2, .5, .3), 2)
The approach is not fully Bayesian and provides a global estimate rather than an estimate for each sample (this is because the predictive means and residual variance are estimated from replications than given by the model).
compute_rsquared(yrep)
compute_rsquared(yrep)
yrep |
Matrix with rows representing samples and columns representing observations |
Bayesian R-squared (scalar, between 0 and 1)
N <- 50 N_sample <- 1e2 y <- runif(N, 0, 10) yrep <- do.call(cbind, lapply(1:N, function(i) { y[i] + rnorm(N_sample) })) compute_rsquared(yrep)
N <- 50 N_sample <- 1e2 y <- runif(N, 0, 10) yrep <- do.call(cbind, lapply(1:N, function(i) { y[i] + rnorm(N_sample) })) compute_rsquared(yrep)
Compute and plot coverage of CI for different confidence level. Useful for fake data check.
compute_coverage( post_samples, truth, CI = seq(0, 1, 0.05), type = c("eti", "hdi") ) plot_coverage( post_samples, truth, CI = seq(0, 1, 0.05), type = c("eti", "hdi") )
compute_coverage( post_samples, truth, CI = seq(0, 1, 0.05), type = c("eti", "hdi") ) plot_coverage( post_samples, truth, CI = seq(0, 1, 0.05), type = c("eti", "hdi") )
post_samples |
Matrix of posterior samples. Rows represent a sample and columns represent variables. |
truth |
Vector of true parameter values (should be the same length as the number of columns in |
CI |
Vector of confidence levels. |
type |
Type of confidence intervals: either "eti" (equal-tailed intervals) or "hdi" (highest density intervals). |
compute_coverage
returns a Dataframe containing coverage (and 95% uncertainty interval for the coverage) for different confidence level (nominal coverage).
plot_coverage
returns a ggplot of the coverage as the function of the nominal coverage with 95% uncertainty interval.
N <- 100 N_post <- 1e3 truth <- rep(0, N) post_samples <- sapply(rnorm(N, 0, 1), function(x) {rnorm(N_post, x, 1)}) compute_coverage(post_samples, truth) plot_coverage(post_samples, truth)
N <- 100 N_post <- 1e3 truth <- rep(0, N) post_samples <- sapply(rnorm(N, 0, 1), function(x) {rnorm(N_post, x, 1)}) compute_coverage(post_samples, truth) plot_coverage(post_samples, truth)
Compute empirical p-values
empirical_pval(t_rep, t, alternative = c("two.sided", "less", "greater"))
empirical_pval(t_rep, t, alternative = c("two.sided", "less", "greater"))
t_rep |
Vector of samples from a distribution. |
t |
Observation (numeric scalar). |
alternative |
Indicates the alternative hypothesis: must be one of "two.sided", "greater" or "less". |
Empirical p-value.
empirical_pval(rnorm(1e2), 2)
empirical_pval(rnorm(1e2), 2)
Extract confidence intervals from a vector of samples
extract_ci(x, CI_level = seq(0.1, 0.9, 0.1), type = c("eti", "hdi"))
extract_ci(x, CI_level = seq(0.1, 0.9, 0.1), type = c("eti", "hdi"))
x |
Vector of samples from a distribution. |
CI_level |
Vector containing the level of the confidence/credible intervals. |
type |
"eti" for equal-tailed intervals and "hdi" for highest density intervals. |
Dataframe with columns: Lower
, Upper
, Level.
x <- rexp(1e4) extract_ci(x, type = "eti") extract_ci(x, type = "hdi")
x <- rexp(1e4) extract_ci(x, type = "eti") extract_ci(x, type = "hdi")
The distribution can be extracted as:
a probability density function (type = "continuous"
), cf. extract_pdf()
.
a probability mass function (type = "discrete"
), cf. extract_pmf()
.
a series of equal-tailed confidence/credible intervals (type = "eti"
), cf. extract_ci()
.
a series of highest density confidence/credible intervals (type = "hdi"
), cf. extract_ci()
.
extract_distribution( object, parName = "", type = c("continuous", "discrete", "eti", "hdi"), transform = identity, ... )
extract_distribution( object, parName = "", type = c("continuous", "discrete", "eti", "hdi"), transform = identity, ... )
object |
Object specifying the distribution as samples: can be a Stanfit object, a matrix (columns represents parameters, rows samples) or a vector. |
parName |
Name of the parameter to extract. |
type |
Indicates how the distribution is summarised. |
transform |
Function to apply to the samples. |
... |
Arguments to pass to |
Dataframe. The columns depends on the method that is used (see specific function for details).
This function can notably be used to prepare the data for plotting fan charts when type = "eti"
or "hdi"
.
In that case, the ggdist
package offers an alternative with ggdist::stat_lineribbon()
.
extract_draws()
for extracting draws of an object.
extract_distribution(runif(1e2), type = "continuous", support = c(0, 1))
extract_distribution(runif(1e2), type = "continuous", support = c(0, 1))
Extract parameters' draws
extract_draws(obj, draws)
extract_draws(obj, draws)
obj |
Array/Vector/Matrix of draws (cf. first dimension) or list of it. |
draws |
Vector of draws to extract. |
Dataframe with columns: Draw
, Index
, Value
and Parameter
.
x <- rnorm(1e3) X <- matrix(x, ncol = 10) a <- array(rnorm(80), dim = c(10, 2, 2, 2)) extract_draws(x, sample(1:length(x), 10)) extract_draws(X, sample(1:nrow(X), 10)) %>% head() extract_draws(a, sample(1:10, 5)) %>% head() extract_draws(list(x = x, X = X, a = a), 1:10) %>% head()
x <- rnorm(1e3) X <- matrix(x, ncol = 10) a <- array(rnorm(80), dim = c(10, 2, 2, 2)) extract_draws(x, sample(1:length(x), 10)) extract_draws(X, sample(1:nrow(X), 10)) %>% head() extract_draws(a, sample(1:10, 5)) %>% head() extract_draws(list(x = x, X = X, a = a), 1:10) %>% head()
Extract multiple indices inside bracket(s) as a list
extract_index_nd(x, dim_names = NULL)
extract_index_nd(x, dim_names = NULL)
x |
Character vector. |
dim_names |
Optional character vector of dimension names.
If |
Dataframe with columns:
Variable
, containing x
where brackets have been removed
Index
, a list containing values within the brackets.
If dim_names
is not NULL, Index
is replaced by columns with names dim_names
containing numeric values.
extract_index_nd(c("sigma", "sigma[1]", "sigma[1, 1]", "sigma[1][2]"))
extract_index_nd(c("sigma", "sigma[1]", "sigma[1, 1]", "sigma[1][2]"))
Extract parameters from a single draw
extract_parameters_from_draw(fit, param, draw)
extract_parameters_from_draw(fit, param, draw)
fit |
Stanfit object. |
param |
Vector of parameter names. |
draw |
Index of the draw to extract the parameters from. |
Dataframe
Useful for to generate fake data.
The 'tidybayes' package offers an alternative to this function, for example:
fit %>% tidy_draws() %>% gather_variables() %>% filter(.draw == draw & .variable %in% param)
However, the 'tidybayes' version is less efficient as all draws and parameters are extracted and then filtered (also the draw IDs are not the same).
Using 'tidybayes' would be more recommended when we only want to extract specific parameters,
and that it does not matter which draw are extracted (in that case using tidybayes::spread_draws()
).
Extract probability density function from vector of samples
extract_pdf(x, support = NULL, n_density = 2^7)
extract_pdf(x, support = NULL, n_density = 2^7)
x |
Vector of samples from a distribution. |
support |
Vector of length 2 corresponding to the range of the distribution. Can be NULL. |
n_density |
Number of equally spaced points at which the density is to be estimated (better to use a power of 2). |
Dataframe with columns: Value
, Density.
extract_pdf(rnorm(1e3)) %>% head()
extract_pdf(rnorm(1e3)) %>% head()
Extract probability mass function from vector of samples
extract_pmf(x, support = NULL)
extract_pmf(x, support = NULL)
x |
Vector of samples from a distribution. |
support |
Vector of all possible values that the distribution can take. Can be NULL. |
Dataframe with columns: Value
, Probability.
extract_pmf(round(rnorm(1e3, 0, 10))) %>% head()
extract_pmf(round(rnorm(1e3, 0, 10))) %>% head()
Change the type of the column of a dataframe from factor to numeric
factor_to_numeric(df, factor_name)
factor_to_numeric(df, factor_name)
df |
Dataframe. |
factor_name |
Vector of names of factors to change to numeric. |
Same dataframe with type of the given columns changed to numeric.
df <- data.frame(A = rep(1:5, each = 10)) df$A <- factor(df$A) str(df) df <- factor_to_numeric(df, "A") str(df)
df <- data.frame(A = rep(1:5, each = 10)) df$A <- factor(df$A) str(df) df <- factor_to_numeric(df, "A") str(df)
Test whether x is of length 1
is_scalar(x)
is_scalar(x)
x |
Object to be tested. |
Logical
is_scalar(1) # TRUE is_scalar("a") # TRUE is_scalar(c(1, 2)) # FALSE
is_scalar(1) # TRUE is_scalar("a") # TRUE is_scalar(c(1, 2)) # FALSE
Test whether an object is of class "stanfit"
is_stanfit(obj)
is_stanfit(obj)
obj |
Object. |
Boolean
is_wholenumber()
uses base::round()
to test whether x
is a whole number,
it will therefore issue an error if x
is not of mode numeric.
If used in base::stopifnot()
for example, this won't be a problem but it may be in conditionals.
is_scalar_wholenumber()
comes with the additional argument check_numeric
to check whether x
is a numeric before checking it is a whole number.
is_wholenumber(x, tol = .Machine$double.eps^0.5) is_scalar_wholenumber(x, check_numeric = TRUE, ...)
is_wholenumber(x, tol = .Machine$double.eps^0.5) is_scalar_wholenumber(x, check_numeric = TRUE, ...)
x |
Object to be tested |
tol |
Tolerance |
check_numeric |
Whether to check whether |
... |
Arguments to pass to |
Logical
is_wholenumber(1) # TRUE is_wholenumber(1.0) # TRUE is_wholenumber(1.1) # FALSE is_scalar_wholenumber(1) # TRUE is_scalar_wholenumber(c(1, 2)) # FALSE
is_wholenumber(1) # TRUE is_wholenumber(1.0) # TRUE is_wholenumber(1.1) # FALSE is_scalar_wholenumber(1) # TRUE is_scalar_wholenumber(c(1, 2)) # FALSE
Logit and Inverse logit
logit(x) inv_logit(x)
logit(x) inv_logit(x)
x |
Numeric vector. |
Numeric vector.
logit(0.5) inv_logit(0)
logit(0.5) inv_logit(0)
Compute and plot posterior predictive p-value (Bayesian p-value) from samples of a distribution. The simulations and observations are first summarised into a test statistics, then the test statistic of the observations is compared to the test statistic of the empirical distribution.
post_pred_pval( yrep, y, test_statistic = mean, alternative = c("two.sided", "less", "greater"), plot = FALSE )
post_pred_pval( yrep, y, test_statistic = mean, alternative = c("two.sided", "less", "greater"), plot = FALSE )
yrep |
Matrix of posterior replications with rows corresponding to samples and columns to simulated observations. |
y |
Vector of observations. |
test_statistic |
Function of the test statistic to compute the p-value for. |
alternative |
Indicates the alternative hypothesis: must be one of |
plot |
Whether to output a plot visualising the distribution of the test statistic. |
List containing the p-value and (optionally) a ggplot.
post_pred_pval(matrix(rnorm(1e3), ncol = 10), rnorm(10), plot = TRUE)
post_pred_pval(matrix(rnorm(1e3), ncol = 10), rnorm(10), plot = TRUE)
Plot the distribution density of parameters within a same group from a single/multiple draw of the posterior distribution. In the case of a hierarchical model, we might look at the distribution of patient parameter and compare it to the prior for the population distribution.
PPC_group_distribution(obj, parName = "", nDraws = 1)
PPC_group_distribution(obj, parName = "", nDraws = 1)
obj |
Matrix (rows: samples, cols: parameter) or Stanfit object. |
parName |
Name of the observation-dependent (e.g. patient-dependent) parameter to consider (optional when |
nDraws |
Number of draws to plot |
Ggplot of the distribution
A. Gelman, J. B. B. Carlin, H. S. S. Stern, and D. B. B. Rubin, Bayesian Data Analysis (Chapter 6), Third Edition, 2014.
X <- matrix(rnorm(1e3), ncol = 10) PPC_group_distribution(X, "", 10)
X <- matrix(rnorm(1e3), ncol = 10) PPC_group_distribution(X, "", 10)
combine_prior_posterior
subsets and binds the prior and posterior dataframes.
plot_prior_posterior
plots posterior CI alongside prior CI.
compute_prior_influence
computes diagnostics of how the posterior is influenced by the prior.
plot_prior_influence
plots diagnostics from compute_prior_influence
.
check_model_sensitivity
is a deprecated alias of plot_prior_influence
.
combine_prior_posterior(prior, post, pars = NULL, match_exact = TRUE) plot_prior_posterior( prior, post, pars = NULL, match_exact = TRUE, lb = "5%", ub = "95%" ) compute_prior_influence( prior, post, pars = NULL, match_exact = TRUE, remove_index_prior = TRUE ) plot_prior_influence(prior, post, pars = NULL, match_exact = TRUE) check_model_sensitivity(prior, post, pars = NULL)
combine_prior_posterior(prior, post, pars = NULL, match_exact = TRUE) plot_prior_posterior( prior, post, pars = NULL, match_exact = TRUE, lb = "5%", ub = "95%" ) compute_prior_influence( prior, post, pars = NULL, match_exact = TRUE, remove_index_prior = TRUE ) plot_prior_influence(prior, post, pars = NULL, match_exact = TRUE) check_model_sensitivity(prior, post, pars = NULL)
prior |
Dataframe of prior parameter estimates.
The dataframe is expected to have columns |
post |
Dataframe of posterior parameter estimates, with same columns as |
pars |
Vector of parameter names to plot. Defaults to all parameters presents in |
match_exact |
Logical indicating whether parameters should be matched exactly (e.g. |
lb |
Name of the column in |
ub |
Name of the column in |
remove_index_prior |
Whether to remove the index variable for |
Posterior shrinkage (PostShrinkage = 1 - Var(Post) / Var(Prior)
), capturing how much the model is learning.
Shrinkage near 0 indicates that the data provides little information beyond the prior.
Shrinkage near 1 indicates that the data is much more informative than the prior.
'Mahalanobis' distance between the mean posterior and the prior (DistPrior
), capturing whether the prior "includes" the posterior.
combine_prior_posterior
returns a dataframe with the same columns as in prior and post and a column Distribution
.
compute_prior_influence
returns a dataframe with columns: Variable
, Index
, PostShrinkage
, DistPrior.
plot_prior_posterior
and plot_prior_influence
returns a ggplot object.
For plot_prior_posterior
, parameters with the same name but different indices are plotted together.
If their prior distribution is the same, it can be useful to only keep one index in prior
.
If not, we can use match_exact = FALSE
to plot parameter[1]
and parameter[2]
separately.
M. Betancourt, “Towards a Principled Bayesian Workflow”, 2018.
library(dplyr) prior <- data.frame(Variable = c("a", "b"), Mean = c(0, 0), sd = c(10, 5), Index = c(NA, NA)) %>% mutate(`5%` = qnorm(.05, mean = Mean, sd = sd), `95%` = qnorm(.95, mean = Mean, sd = sd)) post <- data.frame(Variable = c("a", "a", "b"), Mean = c(-1, 1, 2), sd = c(3, 4, 1), Index = c(1, 2, NA)) %>% mutate(`5%` = qnorm(.05, mean = Mean, sd = sd), `95%` = qnorm(.95, mean = Mean, sd = sd)) plot_prior_posterior(prior, post) plot_prior_influence(prior, post, pars = c("a", "b"))
library(dplyr) prior <- data.frame(Variable = c("a", "b"), Mean = c(0, 0), sd = c(10, 5), Index = c(NA, NA)) %>% mutate(`5%` = qnorm(.05, mean = Mean, sd = sd), `95%` = qnorm(.95, mean = Mean, sd = sd)) post <- data.frame(Variable = c("a", "a", "b"), Mean = c(-1, 1, 2), sd = c(3, 4, 1), Index = c(1, 2, NA)) %>% mutate(`5%` = qnorm(.05, mean = Mean, sd = sd), `95%` = qnorm(.95, mean = Mean, sd = sd)) plot_prior_posterior(prior, post) plot_prior_influence(prior, post, pars = c("a", "b"))
Extract posterior predictive distribution
process_replications( fit, idx = NULL, parName, bounds = NULL, type = c("continuous", "discrete", "eti", "hdi"), ... )
process_replications( fit, idx = NULL, parName, bounds = NULL, type = c("continuous", "discrete", "eti", "hdi"), ... )
fit |
Stanfit object. |
idx |
Dataframe for translating the indices of the parameters into more informative variable (can be NULL). |
parName |
Name of the parameter to extract. |
bounds |
NULL or vector of length 2 representing the bounds of the distribution if it needs to be truncated. |
type |
Indicates how the distribution is summarised. |
... |
Parameters to be passed to |
Dataframe.
Extract summary statistics
summary_statistics(fit, pars, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
summary_statistics(fit, pars, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
fit |
Stanfit object. |
pars |
Character vector of parameters to extract. Defaults to all parameters. |
probs |
Numeric vector of quantiles to extract. |
Dataframe of posterior summary statistics
The tidybayes package offers an alternative to this function, for example:
fit %>% tidy_draws() %>% gather_variables() %>% mean_qi()
.
However, this does not provide information about Rhat
or Neff
, nor does it process the indexes.
The tidybayes package is more useful for summarising the distribution of a handful of parameters (using tidybayes::spread_draws()
).