When coding, especially for data
science, there are multiple ways to solve each problem. When presented
with two options, you want to pick the one that is faster and/or more
accurate. Comparing different code chunks on the same task can be
tedious. It often requires creating data, writing a for loop (or using
sapply
), then comparing.
The comparer package makes this comparison quick and simple:
- The same data can be given to each model.
- Various metrics can be used to judge results, including metrics based on the predicted errors from the code.
- The results are displayed in a table that lets you quickly judge them.
This document introduces the main function of the comparer package, mbc.
microbenchmark
The R package microbenchmark
provides the fantastic
eponymous function. It makes it simple to run different segments of code
and see which is faster. Borrowing an example from http://adv-r.had.co.nz/Performance.html, the following
shows how it gives a summary of how fast each ran.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  x <- runif(100)
  microbenchmark::microbenchmark(sqrt(x), x ^ .5)
} else {
  "microbenchmark not available on your computer"
}
## Unit: nanoseconds
## expr min lq mean median uq max neval
## sqrt(x) 401 471.0 539.30 501 530.5 3026 100
## x^0.5 1894 1998.5 2146.05 2034 2074.0 11792 100
However, it gives no summary of the output. For this example that is fine since the output is deterministic, but when working with randomness or model predictions we want some sort of summary or evaluation metric to see which is more accurate, or simply to see how the outputs differ.
mbc to the rescue
The function mbc in the comparer package was created to solve this problem, where a comparison of the output is desired in addition to the run time.
For example, we may wish to see how the sample size affects an estimate of the mean of a random sample. The following shows the results of finding the mean of 10 and 100 samples from a normal distribution.
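The code chunk that produced the output below is not shown in this document; judging from the Function column in the results, it was a call of this form (after loading the package):
library(comparer)
mbc(mean(rnorm(10)), mean(rnorm(100)))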
## Run times (sec)
## Function Sort1 Sort2 Sort3 Sort4
## 1 mean(rnorm(10)) 9.775162e-06 1.001358e-05 1.168251e-05 3.004074e-05
## 2 mean(rnorm(100)) 1.382828e-05 1.382828e-05 1.430511e-05 1.549721e-05
## Sort5 mean sd neval
## 1 4.005432e-05 2.031326e-05 1.392801e-05 5
## 2 1.764297e-05 1.502037e-05 1.617033e-06 5
##
## Output summary
## Func Stat Sort1 Sort2 Sort3 Sort4
## 1 mean(rnorm(10)) 1 -0.1385332 0.005776678 0.12611456 0.29728392
## 2 mean(rnorm(100)) 1 -0.1886518 -0.071591735 -0.04465929 0.02597593
## Sort5 mean sd
## 1 0.56415632 0.17095965 0.2718657
## 2 0.07643412 -0.04049856 0.1012739
By default it only runs 5 trials, but this can be changed with the times parameter. The first part of the output gives the run times. For 5 or fewer trials it shows all the values in sorted order; for more than 5 it shows summary statistics. The run times here are on the order of microseconds, so the differences between the two chunks are not very meaningful.
The second section of the output gives the summary of the output. This also will show summary stats for more than 5 trials, but for this small sample size it shows all the values in sorted order with the mean and standard deviation given. The first column shows the name of each, and the second column shows which output statistic is given. Since there is only one output for this code it is called “1”.
Setting times
changes the number of trials run. Below
the same example as above is run but for 100 trials.
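Again the code chunk is not shown; it would be the same call as above with the times argument added:
mbc(mean(rnorm(10)), mean(rnorm(100)), times=100)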
## Run times (sec)
## Function Min. 1st Qu. Median Mean
## 1 mean(rnorm(10)) 9.059906e-06 9.536743e-06 1.001358e-05 1.185179e-05
## 2 mean(rnorm(100)) 1.263618e-05 1.358986e-05 1.418591e-05 1.600504e-05
## 3rd Qu. Max. sd neval
## 1 1.382828e-05 2.217293e-05 3.029895e-06 100
## 2 1.746416e-05 4.982948e-05 4.361296e-06 100
##
## Output summary
## Func Stat Min. 1st Qu. Median Mean
## 1 mean(rnorm(10)) 1 -0.8159787 -0.25627116 -0.057128447 -0.01818848
## 2 mean(rnorm(100)) 1 -0.1839926 -0.07144344 0.005504562 0.00882492
## 3rd Qu. Max. sd
## 1 0.20917428 0.7309789 0.3200503
## 2 0.07699597 0.2467896 0.1014793
We see that the mean of both is around zero, but that the larger
sample size (mean(rnorm(100))
) has a tighter distribution
and a standard deviation a third as large as the other, which is about
what we expect for a sample that is 10 times larger (it should be $\sqrt{10} \approx 3.16$ times smaller on
average).
In this example each expression generated its own random input, but often we want to compare the code chunks on the same input for a fairer comparison. This can be done with the inputi argument, used in the examples further below; a minimal sketch is shown next.
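This sketch is illustrative only and does not appear in the original document: inputi is evaluated once per trial, and both code chunks then see the same sample x1.
mbc(mean(x1),
    median(x1),
    inputi={x1 <- rnorm(10)})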
The previous comparisons showed a summary of the outputs, but often we want to compare the output values to true values and then calculate a summary statistic, such as an average error. The argument target specifies the values the code chunks should give; summary statistics can then be calculated by specifying metric, which defaults to rmse (root mean squared error).
For example, suppose we have data from a linear function, and want to see how accurate the model is when the output values are corrupted with noise. Below we compare two linear models: the first with an intercept term, and the second without. The model with the intercept term should be much better since the data has an intercept of −0.6.
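The data setup for this example is not included in this document; a minimal sketch consistent with the description (the sample size and slope are assumptions, and the noise level follows the later inputi example) is:
n <- 20
x <- seq(0, 1, length.out = n)
y <- -0.6 + 2 * x              # underlying linear function with intercept -0.6
ynoise <- y + rnorm(n, 0, .2)  # outputs corrupted with noise

mbc(predict(lm(ynoise ~ x), data.frame(x)),
    predict(lm(ynoise ~ x - 1), data.frame(x)),
    target = y)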
We see that the output is different in a few ways now. The
Stat
column tells what the row is showing. These all say
rmse
, meaning they are giving the root mean squared error
of the predicted values compared to the true y
. There’s
also a new section at the bottom titled Compare. This
compares the rmse
values from the two methods, and does a
t-test to see if the difference is significant. However, since there is
no randomness, it fails to perform the t-test.
## Run times (sec)
## Function Sort1 Sort2
## 1 predict(lm(ynoise ~ x), data.frame(x)) 0.0008122921 0.0008203983
## 2 predict(lm(ynoise ~ x - 1), data.frame(x)) 0.0008349419 0.0008575916
## Sort3 Sort4 Sort5 mean sd neval
## 1 0.001039743 0.001181364 0.002338171 0.0012383938 0.0006341347 5
## 2 0.001019001 0.001045704 0.001048565 0.0009611607 0.0001058186 5
##
## Output summary
## Func Stat V1 V2
## 1 predict(lm(ynoise ~ x), data.frame(x)) rmse 0.03486556 0.03486556
## 2 predict(lm(ynoise ~ x - 1), data.frame(x)) rmse 0.31189957 0.31189957
## V3 V4 V5 mean sd
## 1 0.03486556 0.03486556 0.03486556 0.03486556 0
## 2 0.31189957 0.31189957 0.31189957 0.31189957 0
##
## Compare
## Func
## 1 predict(lm(ynoise ~ x), data.frame(x)) vs predict(lm(ynoise ~ x - 1), data.frame(x))
## Stat conf.low conf.up t p
## 1 rmse NA NA NA NA
To add randomness we can simply define ynoise
in the
inputi
argument, as shown below. Now there is randomness in
the data, so a paired t-test can be computed. It is paired since the
same ynoise
is given to each model. We see that even with
only 5 trials, the difference is highly significant (very small p-value).
mbc(predict(lm(ynoise ~ x), data.frame(x)),
    predict(lm(ynoise ~ x - 1), data.frame(x)),
    inputi={ynoise <- y + rnorm(n, 0, .2)},
    target = y)
## Run times (sec)
## Function Sort1 Sort2
## 1 predict(lm(ynoise ~ x), data.frame(x)) 0.0008070469 0.0008420944
## 2 predict(lm(ynoise ~ x - 1), data.frame(x)) 0.0007898808 0.0008242130
## Sort3 Sort4 Sort5 mean sd neval
## 1 0.0009193420 0.0009200573 0.0009806156 0.0008938313 6.906221e-05 5
## 2 0.0008368492 0.0010051727 0.0043790340 0.0015670300 1.574163e-03 5
##
## Output summary
## Func Stat V1 V2
## 1 predict(lm(ynoise ~ x), data.frame(x)) rmse 0.04772297 0.04205276
## 2 predict(lm(ynoise ~ x - 1), data.frame(x)) rmse 0.31425846 0.31135820
## V3 V4 V5 mean sd
## 1 0.04900406 0.08436103 0.02185191 0.04899855 0.022568322
## 2 0.31485779 0.31178120 0.31209022 0.31286917 0.001577832
##
## Compare
## Func
## t predict(lm(ynoise ~ x), data.frame(x))-predict(lm(ynoise ~ x - 1), data.frame(x))
## Stat V1 V2 V3 V4 V5 mean
## t rmse -0.2665355 -0.2693054 -0.2658537 -0.2274202 -0.2902383 -0.2638706
## sd t p
## t 0.02271818 -25.97183 1.305753e-05
evaluator
Many times the code chunks we want to compare only differ by a small
amount, such as a single argument. In the example above, the only
difference is the formula in the lm
command. With
mbc
, the evaluator
can be set to make these
cases easier. The argument for evaluator
should be an
expression including .
, which will be replaced with the
code chunks provided. The example below rewrites the above comparison
using evaluator
.
mbc(ynoise ~ x,
    ynoise ~ x - 1,
    evaluator=predict(lm(.), data.frame(x)),
    inputi={ynoise <- y + rnorm(n, 0, .2)},
    target = y)
## Run times (sec)
## Function Sort1 Sort2 Sort3 Sort4
## 1 ynoise ~ x 0.0008282661 0.0008506775 0.0008516312 0.0008559227
## 2 ynoise ~ x - 1 0.0008337498 0.0008339882 0.0008592606 0.0008661747
## Sort5 mean sd neval
## 1 0.0009188652 0.0008610725 3.405866e-05 5
## 2 0.0008778572 0.0008542061 1.971925e-05 5
##
## Output summary
## Func Stat V1 V2 V3 V4 V5
## 1 ynoise ~ x rmse 0.03486556 0.03486556 0.03486556 0.03486556 0.03486556
## 2 ynoise ~ x - 1 rmse 0.31189957 0.31189957 0.31189957 0.31189957 0.31189957
## mean sd
## 1 0.03486556 0
## 2 0.31189957 0
##
## Compare
## Func Stat V1 V2 V3 V4
## 1 ynoise ~ x-ynoise ~ x - 1 rmse -0.277034 -0.277034 -0.277034 -0.277034
## V5 mean sd t p
## 1 -0.277034 -0.277034 0 NA NA
kfold
K-fold cross validation can also be done with mbc using the kfold parameter. K-fold cross validation involves splitting N data points into k groups. kfold should specify what this N is, since it depends on the data. By default the number of folds, k, is set equal to times, so each replicate evaluates a single fold. Note that this will not run all k folds times times.
To make k different from times, pass in kfold as a vector whose second element is the number of folds. For example, suppose you have 100 data points, want to do 5 folds, and want to repeat this process twice (i.e., evaluate 10 folds). Then you should pass in kfold = c(100, 5) and times = 10. The first five trials would then be the five separate folds. The sixth through tenth trials would be a new partition of the data into five folds.
Then to use these folds you must use ki as part of an expression in the code chunk or inputi. The following shows how to use k-fold cross validation to fit a linear model to the cars dataset. Setting kfold=c(nrow(cars), 5) tells it that you want to use 5 folds on the cars data set. It has 50 rows, so in each trial ki is a 40-element subset of 1:50. Setting times=30 means that we are repeating the five folds six times. The code chunk fits the model, makes predictions on the hold-out data, and calculates the RMSE.
mbc({mod <- lm(dist ~ speed, data=cars[ki,])
     p <- predict(mod, cars[-ki,])
     sqrt(mean((p - cars$dist[-ki])^2))
    },
    kfold=c(nrow(cars), 5),
    times=30)
## Run times (sec)
## Function
## 1 { mod <- lm(dist ~ speed, data = cars[ki, ]) p <- predict(mod, cars[-ki, ]) sqrt(mean((p - cars$dist[-ki])^2)) }
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 0.0007576942 0.000770092 0.0007930994 0.0007980585 0.0008243918 0.0008850098
## sd neval
## 1 3.538891e-05 30
##
## Output summary
## Func
## 1 { mod <- lm(dist ~ speed, data = cars[ki, ]) p <- predict(mod, cars[-ki, ]) sqrt(mean((p - cars$dist[-ki])^2)) }
## Stat Min. 1st Qu. Median Mean 3rd Qu. Max. sd
## 1 1 8.929092 11.72902 13.85972 15.19293 19.13354 23.79548 4.322134
The following example simplifies this a little. Setting
targetin
tells it what the input to predict
should be and setting target="dist"
tells it that the
target is the dist
element from targetin
. You
cannot set target=cars$dist[-ki]
since target
cannot be evaluated as an expression.
mbc(lm(dist ~ speed, data=cars[ki,]),
    targetin=cars[-ki,], target="dist",
    kfold=c(nrow(cars), 5),
    times=30)
## Run times (sec)
## Function Min. 1st Qu. Median
## 1 lm(dist ~ speed, data = cars[ki, ]) 0.0004577637 0.0004643798 0.0004717112
## Mean 3rd Qu. Max. sd neval
## 1 0.0004784425 0.000487566 0.0005402565 1.942509e-05 30
##
## Output summary
## Func Stat Min. 1st Qu. Median Mean
## 1 lm(dist ~ speed, data = cars[ki, ]) rmse 10.51727 11.74167 14.4834 15.48358
## 3rd Qu. Max. sd
## 1 18.85401 24.05299 4.106909
In the previous example, the output shows that the “Stat” is “rmse”,
meaning that it calculated the root-mean-square error from the
predictions and target values. The metric, or statistic, calculated can
be changed using the metric
argument, which defaults to
rmse
. Three of the other options for metric
are t
, mis90
, and sr27
. These
three all compare target values (y) to predicted values (ŷ) and predicted errors (s). These only work for models that
give predicted errors, such as Gaussian process models.
metric=t
Using the target value, predicted value, and predicted error, we can calculate a t-score.
$$ t = \frac{\hat{y} - y}{s} $$
The output then shows the distribution of these t-scores via the six-number summary.
mbc(lm(dist ~ speed, data=cars[ki,]),
    targetin=cars[-ki,], target="dist",
    kfold=c(nrow(cars), 5),
    times=30,
    metric='t')
## Run times (sec)
## Function Min. 1st Qu. Median
## 1 lm(dist ~ speed, data = cars[ki, ]) 0.0004820824 0.0004858971 0.0004937649
## Mean 3rd Qu. Max. sd neval
## 1 0.0005023638 0.0005053878 0.0005953312 2.444571e-05 30
##
## Output summary
## Func Stat Min. 1st Qu.
## 1 lm(dist ~ speed, data = cars[ki, ]) Min. t -20.9075400 -11.5984276
## 2 lm(dist ~ speed, data = cars[ki, ]) 1st Qu. t -4.9313685 -2.7018594
## 3 lm(dist ~ speed, data = cars[ki, ]) Median t -2.8354889 -0.4966254
## 4 lm(dist ~ speed, data = cars[ki, ]) Mean t -3.2792170 -0.7899931
## 5 lm(dist ~ speed, data = cars[ki, ]) 3rd Qu. t 0.4225208 1.6802855
## 6 lm(dist ~ speed, data = cars[ki, ]) Max. t 3.3310282 6.0047720
## Median Mean 3rd Qu. Max. sd
## 1 -9.7004842 -9.8979129 -7.8246452 -0.08528217 5.402287
## 2 -2.0802868 -1.8818658 -0.8568723 1.61930520 1.752299
## 3 0.6155601 0.7373985 1.9882930 3.95020736 1.803065
## 4 0.3880399 0.1253952 0.9522826 3.31685915 1.650908
## 5 2.8136150 2.9367746 3.9946167 6.06310737 1.536431
## 6 7.1397249 7.0871224 8.8368563 9.65604663 1.942429
metric=mis90
The t-score metric is not very informative on its own because the same t-score can come from a large error with a large predicted error as from a small error with a small predicted error. mis90 is the mean interval score for 90% coverage intervals, as described by Gneiting and Raftery (2007, Equation 43):
$$ 3.28s + 20(\hat{y} - y - 1.64s)_+ + 20(y - \hat{y} - 1.64s)_+ $$
where $(\cdot)_+$ denotes the positive part of what is in the parentheses. Smaller values are better. This metric penalizes having large predicted errors and having actual errors different from the predicted errors, so it is very good for judging the accuracy of a prediction interval.
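No code example for this metric appears in the original text; the sketch below simply swaps metric='mis90' into the earlier cross-validation call (it assumes, following the metric='t' example above, that mbc can obtain predicted errors from the lm fit):
mbc(lm(dist ~ speed, data=cars[ki,]),
    targetin=cars[-ki,], target="dist",
    kfold=c(nrow(cars), 5),
    times=30,
    metric='mis90')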
metric=sr27
Equation 27 of Gneiting and Raftery (2007) gives another proper scoring rule:
$$ -\left( \frac{\hat{y} - y}{s} \right)^2 - \log s^2 $$
For this metric, larger values are better. A problem with this metric is that if s = 0, which can happen from numerical issues, then the score blows up to infinity, which does not happen with the mean interval score.
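As a quick illustration of the formula, the pointwise score can be written as a small helper (this function is not part of the comparer package):
# Scoring rule from Equation 27 of Gneiting and Raftery (2007); larger is better.
# Note how the -log(s^2) term blows up as the predicted error s approaches 0.
score27 <- function(y, yhat, s) {
  -((yhat - y) / s)^2 - log(s^2)
}
score27(y = 1.0, yhat = 1.1, s = 0.2)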
ffexp
The other main function of the package is ffexp
, an
abbreviation for full-factorial experiment. It will run a function using
all possible combinations of input parameters given. It is useful for
running experiments that take a long time to complete.
The first arguments given to ffexp$new
should give the
possible values for each input parameter. In the example below,
a
can be 1, 2, or 3, and b can be “a”, “b”, or “c”. Then an eval_func should be given that can operate on these parameters. For example, using eval_func = paste will paste together the value of a with the value of b.
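The construction code is not reproduced in this document; a sketch consistent with the description above (using the object name f1 referenced below) is:
f1 <- ffexp$new(a = 1:3,
                b = c("a", "b", "c"),
                eval_func = paste)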
After creating the ffexp object, we can call f1$run_all to run eval_func on every combination of a and b.
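For example (a sketch; the original chunk is not shown, and it is assumed no arguments are needed):
f1$run_all()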
Now to see the results in a clean format, look at f1$outcleandf.
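Printing that field, which presumably produced the table below:
f1$outcleandf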
## a b V1 runtime start_time end_time run_number
## 1 1 a 1 a 0 2024-11-08 02:42:03 2024-11-08 02:42:03 1
## 2 2 a 2 a 0 2024-11-08 02:42:04 2024-11-08 02:42:04 2
## 3 3 a 3 a 0 2024-11-08 02:42:04 2024-11-08 02:42:04 3
## 4 1 b 1 b 0 2024-11-08 02:42:04 2024-11-08 02:42:04 4
## 5 2 b 2 b 0 2024-11-08 02:42:04 2024-11-08 02:42:04 5
## 6 3 b 3 b 0 2024-11-08 02:42:04 2024-11-08 02:42:04 6
## 7 1 c 1 c 0 2024-11-08 02:42:04 2024-11-08 02:42:04 7
## 8 2 c 2 c 0 2024-11-08 02:42:04 2024-11-08 02:42:04 8
## 9 3 c 3 c 0 2024-11-08 02:42:04 2024-11-08 02:42:04 9