sample function in R-3.6.0+¶
Wenyu found a very interesting question about R programming. When he was preparing his tutorial, he cannot reproduce the code in the section 8.3 of An Introduction to Statistical Learning, with Applications in R (ISLR), but his sister can successfully produce the results last year.
The program is as follows:
library(tree) library(ISLR) attach(Carseats) High=ifelse(Sales<=8,"No","Yes") Carseats=data.frame(Carseats,High) set.seed(2) train=sample(1:nrow(Carseats), 200) Carseats.test=Carseats[-train,] High.test=High[-train] tree.carseats=tree(High~.-Sales,Carseats,subset=train) tree.pred=predict(tree.carseats,Carseats.test,type="class") table(tree.pred,High.test)
the book reports
## High.test ## tree.pred No Yes ## No 86 27 ## Yes 30 57
while we get
## High.test ## tree.pred No Yes ## No 104 33 ## Yes 13 50
That’s so strange!! There might be some changed things, packages or
All of the related packages are
ISLR, where the second one has not been updated for a long time, and so no need to consider it. For
tree, there are active updates in the recent years,
Version 1.0-39 2018-03-17 Allow more C-level space for labels Version 1.0-38 2018-03-14 Tweak to cv.tree. Version 1.0-37 2016-01-20 Include <stdio.h>, not relying on R.h Version 1.0-36 2015-06-29 Add tests/deparse.R . Update reference output. Tweak NAMESPACE imports.
but actually the last version is back to March 2018, which is already prior to Wenyu’s sister. Anyway, we have tried to install the old version package, and try to see whether there are some differences.
The command for installing a specified version would be
install.packages("https://cran.r-project.org/src/contrib/Archive/tree/tree_1.0-39.tar.gz", repos = NULL, type = "source")
in the R console. However, at the beginning, I am not familiar with it. So I use the interface of rstudio to manually install the old
tree‘s. Finally, I had tried
v1.0-39, but all of them yield the same results as the latest version.
Currently, we are using the latest R version,
v3.6.0. Firstly, I had tried the old version
v3.4.4 on my server rstudio. It can successfully reproduce the results of the book.
Now our goal is to determine the version from which we fail to obtain the same results. It requires us to install multiple R versions to test the above program.
The docker facilitates the multiple R versions easily. Here is an official images for r-base, including most old versions of R. We can install a specified R version with the following commands by tagging the version number,
docker pull r-base:3.6.0 docker run -it --rm r-base:3.6.0
As a result,
v3.5.3 (the largest version of
v3.5.x) can reproduce the results, while it failed in the R version from
v3.6.0. Then I found the changelog for
The first point is exactly what we want,
Changes to random number generation. R 3.6.0 changes the method used to generate random integers in the sample function. In prior versions, the probability of generating each integer could vary from equal by up to 0.04% (or possibly more if generating more than a million different integers). This change will mainly be relevant to people who do large-scale simulations, and it also means that scripts using the sample function will generate different results in R 3.6.0 than they did in prior versions of R. If you need to keep the results the same (for reproducibility or for automated testing), you can revert to the old behavior by adding
RNGkind(sample.kind="Rounding")) to the top of your script.
In a word, for the R starting from
v3.6.0, we can add
to reproduce the results that did in prior versions of R.