Title: | m-Out-of-n Bootstrap Functions |
---|---|
Description: | Functions and examples based on the m-out-of-n bootstrap suggested by Politis, D.N. and Romano, J.P. (1994) <doi:10.1214/aos/1176325770>. Additionally there are functions to estimate the scaling factor tau and the subsampling size m. For a detailed description and a full list of references, see Dalitz, C. and Lögler, F. (2024) <doi:10.48550/arXiv.2412.05032>. |
Authors: | Christoph Dalitz [aut, cre], Felix Lögler [aut] |
Maintainer: | Christoph Dalitz <[email protected]> |
License: | BSD 2-clause License + file LICENSE |
Version: | 0.9.2 |
Built: | 2025-01-09 14:24:55 UTC |
Source: | https://github.com/cdalitz/moonboot |
Density, distribution function, quantile function and random generation for a continuous distribution with the density (pow+1)*(x-min)^pow/(max-min)^(pow+1) for x in the range [min,max] and pow > -1.
dpower(x, pow, min = 0, max = 1)
ppower(x, pow, min = 0, max = 1)
qpower(p, pow, min = 0, max = 1)
rpower(n, pow, min = 0, max = 1)
x | vector of values at which to evaluate the density or CDF. |
pow | degree of the power law. |
min | minimum value of the support of the distribution. |
max | maximum value of the support of the distribution. |
p | vector of probabilities. |
n | number of observations. If |
dpower gives the density, ppower gives the cumulative distribution function (CDF), qpower gives the quantile function (i.e., the inverse of the CDF), and rpower generates random numbers.
The length of the result is determined by n for rpower, and is the length of x or p for the other functions.
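The CDF and the quantile function follow from integrating the density given above. The following base R sketch illustrates this relationship; the `*_sketch` names are hypothetical helpers written for this illustration and are not the package's own code.

```r
# Hypothetical helpers derived from the density
# (pow+1)*(x-min)^pow/(max-min)^(pow+1) on [min, max]; not the package source.
dpower_sketch <- function(x, pow, min = 0, max = 1) {
  d <- numeric(length(x))                 # density is 0 outside [min, max]
  inside <- x >= min & x <= max
  d[inside] <- (pow + 1) * (x[inside] - min)^pow / (max - min)^(pow + 1)
  d
}

# Integrating the density from min to x gives ((x-min)/(max-min))^(pow+1)
ppower_sketch <- function(x, pow, min = 0, max = 1) {
  z <- pmin(pmax((x - min) / (max - min), 0), 1)
  z^(pow + 1)
}

# Inverting the CDF gives the quantile function
qpower_sketch <- function(p, pow, min = 0, max = 1) {
  min + (max - min) * p^(1 / (pow + 1))
}

# Random numbers by inversion of uniform draws
rpower_sketch <- function(n, pow, min = 0, max = 1) {
  qpower_sketch(runif(n), pow, min, max)
}

x <- c(2.5, 3, 4)
p <- ppower_sketch(x, pow = 2, min = 2, max = 5)
qpower_sketch(p, pow = 2, min = 2, max = 5)   # recovers x up to rounding error
```

Inversion sampling is used here only because it falls out of the closed-form quantile function; how rpower generates its values internally is not stated in this manual.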
Estimates m using the selected method. Additional parameters can be passed to the underlying methods using params. It is also possible to pass parameters to the statistic using '...'.
estimate.m(data, statistic, tau = NULL, R = 1000, replace = FALSE,
           min.m = 3, method = "bickel", params = NULL, ...)
data | The data to be bootstrapped. |
statistic | The estimator of the parameter. |
tau | The convergence rate. |
R | The number of bootstrap replicates. Must be a positive integer. |
replace | Whether sampling should be done with replacement. Setting this value to TRUE requires a sufficiently smooth estimator. |
min.m | Minimum subsample size to be tried. Should be the minimum size for which the statistic makes sense. |
method | The method to be used, one of |
params | Additional parameters to be passed to the internal functions, see details for more information. |
... | Additional parameters to be passed to the statistic. |
The different methods have different parameters. Therefore, this wrapper function has a params argument, which can be used to pass method-specific arguments to the underlying methods. The specific parameters are described below.
Most of the provided methods need tau. If it is not provided, it is estimated using estimate.tau. Note that the method 'sherman' uses an alternative approach that does not rely on the scaling factor, so tau is not computed when 'sherman' is selected. Any non-NULL value of tau is ignored for the method 'sherman'.
Possible methods are:
The method from Goetze and Rackauskas is based on minimizing the distance between the CDFs of the bootstrap distributions for different subsampling sizes 'm'. The 'Kolmogorov distance' is used as the distance measure. The distance is minimized over the pairs 'm' and 'm/2'. As this would involve trying out all combinations of 'm' and 'm/2', this method has a running time of order Rn^2. To reduce the runtime in practical use, params can be used to pass a search.value, which is a list of the smallest and largest value for m to try.
This method, due to Bickel and Sakov (2008), works similarly to the previous one. The difference is that the subsample sizes to be compared are consecutive sizes generated by q^j*n for j = seq(2,n) and a chosen value q between zero and one. The parameter q can be set using params. The default value is q=0.75, as suggested in the corresponding paper.
This method, described by Politis et al. (1999), is also known as the 'minimum volatility method'. It is based on the idea that there should be some range of subsampling sizes within which the choice of m has little effect on the estimated confidence points. The algorithm starts by smoothing the endpoints of the intervals and then calculates their standard deviation. The h.ci parameter is the number of neighbors used for smoothing. The h.sigma parameter is the number of neighbors used in the standard deviation calculation. Both parameters can be set using params. Note that h.* neighbors from each side are used. To use five elements for smoothing, h.ci should therefore be set to 2.
This method (Sherman and Carlstein, 2004) is based on a 'double-bootstrap' approach. It estimates the coverage error for different subsampling sizes and chooses the subsampling size with the lowest one. As estimating the coverage error is computationally intensive, it is not practical to try all m values. Therefore, the beta parameter can be used to control which m values are tried. The values are calculated as ms = n^beta. The default is a sequence of 15 values between 0.3 and 0.9. This parameter can be set using params.
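The distance-based selection rules described above can be illustrated with a short base R sketch. It compares the scaled subsampling distributions of consecutive sizes generated by q^j*n and picks the size with the smallest Kolmogorov distance to its successor. This is a simplified illustration under stated assumptions (known tau(n) = n for the maximum, a truncated range of j, small R), not the implementation used by estimate.m; all helper names are hypothetical.

```r
set.seed(1)
data <- runif(500)
n <- length(data)
stat <- function(data, indices) max(data[indices])
tau <- function(n) n               # assumed known rate for the maximum
R <- 300                           # kept small so the sketch runs quickly
q <- 0.75

# Candidate sizes ceiling(q^j * n); the range of j is truncated for brevity
ms <- unique(ceiling(q^(2:12) * n))
ms <- ms[ms >= 3]

t0 <- stat(data, seq_len(n))

# Scaled subsampling distribution tau(m) * (T*_m - T_n) for one size m
scaled_replicates <- function(m) {
  t <- replicate(R, stat(data, sample.int(n, m, replace = FALSE)))
  tau(m) * (t - t0)
}

# Kolmogorov distance: largest gap between two empirical CDFs,
# evaluated at all observed points of both samples
kolmogorov <- function(a, b) {
  grid <- c(a, b)
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

reps <- lapply(ms, scaled_replicates)
dists <- vapply(seq_len(length(ms) - 1),
                function(j) kolmogorov(reps[[j]], reps[[j + 1]]),
                numeric(1))

# Hypothetical selection rule: first size of the closest consecutive pair
m_chosen <- ms[which.min(dists)]
print(m_chosen)
```

The result varies with the seed and with R, which is one reason a larger number of replicates is used by default in the package functions.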
Subsampling size m chosen by the selected method.
Götze F. and Rackauskas A. (2001) Adaptive choice of bootstrap sample sizes. Lecture Notes-Monograph Series, 36(State of the Art in Probability and Statistics):286-309.
Bickel P.J. and Sakov A. (2008) On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica, 18(3):967-985.
Politis D.N. et al. (1999) Subsampling, Springer, New York.
Sherman M. and Carlstein E. (2004) Confidence intervals based on estimators with unknown rates of convergence. Computational Statistics & Data Analysis, 46(1):123-136.
mboot, estimate.tau
data <- runif(1000)
estimate.max <- function(data, indices) {return(max(data[indices]))}
tau <- function(n){n} # convergence rate (usually sqrt(n), but n for max)
choosen.m <- estimate.m(data, estimate.max, tau, R = 1000, method = "bickel")
print(choosen.m)
This function estimates the convergence rate of the bootstrap estimator and returns it as a function of the form tau_n = n^a, where n is the input parameter.
estimate.tau(data, statistic, R = 1000, replace = FALSE, min.m = 3,
             beta = seq(0.2, 0.7, length.out = 5), method = "variance", ...)
data | The data to be bootstrapped. |
statistic | The estimator of the parameter. |
R | Number of bootstrap replicates used to estimate tau. |
replace | Whether sampling should be done with replacement. |
min.m | Minimal subsampling size used to estimate tau. Should be set to the minimum size for which the statistic makes sense. |
beta | The tested subsample sizes m are |
method | Method to estimate tau, can be one of |
... | Additional parameters to be passed to the |
There are two methods to choose from, variance and quantile. The provided beta values are used to select subsample sizes m by using ms = n^beta. Note that the choice of the beta values can impact the accuracy of the estimated tau (Dalitz & Lögler, 2024). For each selected subsample size m, a bootstrap with R replications is performed.
The method 'variance' then fits a linear function to log(variance) of the bootstrap statistics as a function of log(m).
The method 'quantile' averages over multiple quantile ranges Q and fits a linear function to log(Q) as a function of log(m).
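The idea behind the 'variance' method can be sketched in base R. If the variance of the statistic on a subsample of size m behaves like tau_m^(-2) = m^(-2a), then log(variance) is linear in log(m) with slope -2a, so the exponent a can be read off a linear fit. The following is a minimal illustration under that assumption, using the sample mean (for which a = 0.5 is expected); it is not the code of estimate.tau, and the variable names are chosen for this example only.

```r
set.seed(42)
data <- rnorm(1000)
n <- length(data)
stat <- function(data, indices) mean(data[indices])
R <- 1000

# Subsample sizes ms = n^beta, rounded to integers
beta <- seq(0.2, 0.7, length.out = 5)
ms <- round(n^beta)

# Variance of the statistic over R subsamples (without replacement) per size
boot_var <- vapply(ms, function(m) {
  var(replicate(R, stat(data, sample.int(n, m, replace = FALSE))))
}, numeric(1))

# Linear fit of log(variance) against log(m); the slope estimates -2a
fit <- lm(log(boot_var) ~ log(ms))
a <- -unname(coef(fit)[2]) / 2

# Estimated scaling factor as a function of n
tau_hat <- function(n) n^a
print(a)   # should be close to 0.5 for the sample mean
```

Because sampling is done without replacement from a finite data set, the variance at the larger m is slightly deflated, so the fitted exponent tends to come out a little above the theoretical value.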
A function for the square root of the convergence rate of the variance, i.e., f(n) = tau_n. This function can be passed directly to mboot.ci.
Bertail P. et al. (1999) On subsampling estimators with unknown rate of convergence. Journal of the American Statistical Association, 94(446):568-579.
Politis D.N. et al. (1999) Subsampling, Springer, New York.
Dalitz C. and Lögler F. (2024) moonboot: An R Package Implementing m-out-of-n Bootstrap Methods. doi:10.48550/arXiv.2412.05032
mboot.ci
data <- runif(1000)
estimate.max <- function(data, indices) {return(max(data[indices]))}
estimated.tau <- estimate.tau(data, estimate.max)
boot.out <- mboot(data, estimate.max, R = 1000, m = 2*sqrt(NROW(data)), replace = FALSE)
cis <- mboot.ci(boot.out, 0.95, estimated.tau, c("all"))
ci.basic <- cis$basic
print(ci.basic)
Generate R bootstrap replicates of the given statistic applied to the data. Sampling can be done with or without replacement. The subsample size m can either be chosen directly or estimated with estimate.m().
mboot(data, statistic, m, R = 1000, replace = FALSE, ...)
data | The data to be bootstrapped. If it is multidimensional, each row is considered as one observation passed to the |
statistic | A function returning the statistic of interest. It must take two arguments. The first argument passed will be the original data, the second will be a vector of indices. Any further arguments can be passed through the |
m | The subsampling size. |
R | The number of bootstrap replicates. |
replace | Whether sampling should be done with replacement or without replacement (the default). |
... | Additional parameters to be passed to the |
m needs to be a numeric value meeting the condition 2<=m<=n. It must be chosen such that m goes to infinity as n goes to infinity, but the ratio m/n must go to zero.
The m-out-of-n bootstrap without replacement, known as subsampling, was introduced by Politis and Romano (1994).
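The resampling step itself is simple and can be sketched in a few lines of base R. The function below is a hypothetical stand-in named `mboot_sketch`, written only to illustrate how a statistic with the (data, indices) interface is evaluated on R subsamples of size m; it is not the package's mboot and returns a plain list rather than an object of class "mboot".

```r
# Hypothetical illustration of m-out-of-n resampling; not the package source.
mboot_sketch <- function(data, statistic, m, R = 1000, replace = FALSE, ...) {
  n <- NROW(data)
  stopifnot(m >= 2, m <= n)

  # Statistic on the full data set
  t0 <- statistic(data, seq_len(n), ...)

  # One row per replicate: draw m of the n indices and apply the statistic
  t <- do.call(rbind, lapply(seq_len(R), function(r) {
    statistic(data, sample.int(n, m, replace = replace), ...)
  }))

  list(t0 = t0, t = t, m = m, n = n, replace = replace)
}

set.seed(3)
data <- runif(1000)
estimate.max <- function(data, indices) max(data[indices])
out <- mboot_sketch(data, estimate.max, m = floor(2 * sqrt(NROW(data))), R = 200)
dim(out$t)   # R rows, one column for a scalar statistic
```

For vector-valued data the indices select elements; for row-wise data a statistic would index rows itself, for example with `data[indices, ]`.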
The returned value is an object of the class "mboot" containing the following components:
t0: The observed value of statistic applied to the data.
t: A matrix with R rows where each is a bootstrap replicate of the result of calling statistic.
m,n: Selected subsample size and data size.
data: The data passed to mboot.
statistic: The statistic passed to mboot.
replace: Whether the bootstrap replicates were done with or without replacement.
Politis D.N. and Romano J.P. (1994) Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22(4):2031-2050, doi:10.1214/aos/1176325770
mboot.ci, estimate.m, estimate.tau
data <- runif(1000)
estimate.max <- function(data, indices) {return(max(data[indices]))}
boot.out <- mboot(data, estimate.max, R = 1000, m = 2*sqrt(NROW(data)), replace = FALSE)
Estimates the confidence interval using the methods provided by types. tau must be a function that calculates the scaling factor tau(n) for a given n. If tau is not provided, it is estimated with estimate.tau using the default settings of that function.
mboot.ci(boot.out, conf = 0.95, tau = NULL, types = "all", ...)
boot.out | The simulated bootstrap distribution from the |
conf | The confidence level. |
tau | Function that returns the scaling factor tau as a function of n. If |
types | The types of confidence intervals to be calculated. The value can be 'all' for all types, or a subset of |
... | When |
As estimating the scaling factor tau(n) can be unreliable, it is recommended to provide tau explicitly. Otherwise it is estimated with estimate.tau. To specify additional arguments for estimate.tau, call that function directly and pass its return value as the tau argument. For the type sherman, tau is not needed and its value is ignored.
The following methods to compute the confidence intervals are supported through the parameter types:
This method works for all estimators and computes the interval directly from the quantiles of the m-out-of-n bootstrap distribution.
This method only works for normally distributed estimators. It estimates the variance with the m-out-of-n bootstrap and then computes the interval with the quantiles of the standard normal distribution.
This method does not scale the interval with tau(m)/tau(n) and thus is too wide. To avoid over-coverage, this is compensated by centering it randomly around the point estimate of one of the m-out-of-n bootstrap samples. Although this results on average in the nominal coverage probability, the interval is less accurate than the other intervals and should be used only as a last resort if the scaling factor tau is neither known nor estimable.
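The role of the scaling factor in the quantile-based interval can be illustrated with a base R sketch in the spirit of Politis and Romano (1994): quantiles of the scaled subsampling distribution tau(m) * (T*_m - T_n) are rescaled by tau(n) and reflected around the point estimate. This is an assumption about the general construction, written for illustration with hypothetical variable names; the intervals returned by mboot.ci may differ in detail.

```r
set.seed(7)
data <- runif(1000)
n <- length(data)
m <- floor(2 * sqrt(n))
R <- 1000
conf <- 0.95
tau <- function(n) n               # known rate for the maximum
stat <- function(data, indices) max(data[indices])

# Point estimate and R subsample replicates (without replacement)
t0 <- stat(data, seq_len(n))
t <- replicate(R, stat(data, sample.int(n, m, replace = FALSE)))

# Upper and lower quantiles of the scaled subsampling distribution
alpha <- 1 - conf
cq <- quantile(tau(m) * (t - t0),
               probs = c(1 - alpha / 2, alpha / 2), names = FALSE)

# Rescale by tau(n) and reflect around the point estimate: (lower, upper)
ci_sketch <- t0 - cq / tau(n)
print(ci_sketch)
```

For the maximum, every subsample maximum is at most the full-sample maximum, so the interval lies at or above the point estimate, which matches the fact that the sample maximum underestimates the upper end of the support.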
A list of confidence intervals for the given types.
Politis D.N. and Romano J.P. (1994) Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22(4):2031-2050, doi:10.1214/aos/1176325770
Sherman M. and Carlstein E. (2004) Confidence intervals based on estimators with unknown rates of convergence. Computational Statistics & Data Analysis, 46(1):123-136.
Dalitz C. and Lögler F. (2024) moonboot: An R Package Implementing m-out-of-n Bootstrap Methods. doi:10.48550/arXiv.2412.05032
mboot, estimate.tau
data <- runif(1000)
estimate.max <- function(data, indices) {return(max(data[indices]))}
tau <- function(n){n} # convergence rate (usually sqrt(n), but n for max)
boot.out <- mboot(data, estimate.max, R = 1000, m = 2*sqrt(NROW(data)), replace = FALSE)
cis <- mboot.ci(boot.out, 0.95, tau, c("all"))
ci.basic <- cis$basic
print(ci.basic)
Calculates the mean of the data points in the shortest interval containing half of the data. The arguments of the function are such that it can be used directly as a statistic in the mboot() function.
shorth(data, indices = NULL)
data | the data as a numeric vector. |
indices | the selected indices of |
The mean of the data points in the shortest interval containing half of the data.
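The computation can be sketched in base R by sliding a window over the sorted data. The function `shorth_sketch` below is a hypothetical illustration, not the package's shorth; in particular, taking "half of the data" to mean ceiling(n/2) points is an assumption, and the package may round differently.

```r
# Hypothetical illustration of the shorth; not the package source.
shorth_sketch <- function(data, indices = NULL) {
  if (!is.null(indices)) data <- data[indices]
  x <- sort(data)
  n <- length(x)
  h <- ceiling(n / 2)                 # assumed number of points in "half"

  # Width of every window of h consecutive order statistics
  starts <- seq_len(n - h + 1)
  widths <- x[starts + h - 1] - x[starts]

  # Mean of the points in the narrowest window (first one on ties)
  i <- which.min(widths)
  mean(x[i:(i + h - 1)])
}

shorth_sketch(c(1, 2, 3, 10, 20, 30))              # 2: the window {1, 2, 3} is shortest
shorth_sketch(c(1, 2, 3, 10, 20, 30), c(4, 5, 6))  # 15: only 10, 20, 30 are used
```

The (data, indices) signature mirrors the interface that mboot() expects from a statistic.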
Andrews D.F. et al. (1972) Robust Estimates of Location. Princeton University Press, Princeton.
data <- rnorm(100)
shorth(data)
shorth(data, sample(1:100, size = 20))
# Calculating a CI for shorth using [mboot()]
data <- rnorm(100)
boot.out <- mboot(data, shorth, m = sqrt(length(data)))
basic.ci <- mboot.ci(boot.out, conf = 0.95, tau = function(n) return(n^(1/3)), types = "basic")$basic