Random forest is a particularly prominent method from the field of
machine learning that is frequently used to make predictions and
decisions based on these predictions. Despite its many advantages, it is
important to recognise that random forest is inherently
non-deterministic. This means that repeated analyses, even with the same
data sets, can yield different prediction models due to the stochastic
nature of random forest. To assess the impact of this non-determinism on predictions and on the decisions derived from them, the opt_prediction function from the optRF package, which is described in detail in this vignette, can be used. This function models the relationship between the number of trees and the resulting stability of the random forest and uses this model to determine the optimal number of trees, beyond which adding more trees would increase the computation time while increasing the stability only marginally.
To demonstrate the functionality of the optRF package, we will use the SNPdata data set included in the package. Its first column contains the yield of 250 wheat individuals, and the remaining 5000 columns contain single nucleotide polymorphisms (SNPs) coded as either 0 or 2 (the value 1, which would indicate heterozygous alleles, does not occur here). While this example focuses on genomic research, the optRF package is versatile and can optimise random forest for any type of data set. Before proceeding, ensure the necessary packages are loaded into your R environment; the optRF package depends on the ranger package for an efficient random forest implementation.
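For example, the packages and the example data can be made available as follows (a minimal sketch, assuming both packages are installed):
library(optRF)    # provides opt_prediction() and the SNPdata example data
library(ranger)   # provides the ranger() function used to fit random forests
data(SNPdata)     # attach the example data set shipped with optRF
SNPdata[1:5, 1:4] # first column: yield; remaining columns: SNP markers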
Random forest can be used as a prediction model that describes the
relationship between the response (in this case the yield) and the
predictor variables (in this case the SNPs) in a training data set. Once
trained, the random forest model can be used to predict the response
variable for a test data set where only the predictor variables are
available. As an example, we will use the first 200 individuals from the
SNPdata
data set as the training set where both yield and
SNPs are known. The remaining 50 individuals will form the test set,
where only SNPs are available:
Training = SNPdata[1:200,] # Rows 1 to 200 as training data
Test = SNPdata[201:250,-1] # Rows 201 to 250 as test data, excluding the response column (column 1)
To model the relationship between the response and predictor
variables, we use the ranger
function. To ensure
reproducibility, we set a seed before running the analysis:
set.seed(123) # Set a seed for reproducibility
RF_model = ranger(y=Training[,1], x=Training[,-1], write.forest = TRUE)
Once the random forest model is built, it can be used to predict the yield for the individuals in the test data set:
predictions = predict(RF_model, data=Test)$predictions
predicted_Test = data.frame(ID = row.names(Test), predicted_yield = predictions)
head(predicted_Test)
#> ID predicted_yield
#> 1 ID_201 593.6063
#> 2 ID_202 596.8615
#> 3 ID_203 591.3695
#> 4 ID_204 589.3909
#> 5 ID_205 599.5155
#> 6 ID_206 608.1031
The resulting data frame predicted_Test
contains the IDs
of the individuals in the test data set along with their predicted
yield. Based on these predictions, a decision can be made. Suppose we
want to select the top five individuals from the test data set with the
highest predicted yields. To achieve this, we reorder the
predicted_Test
data frame in descending order of the
predicted yield and extract the IDs of the top five individuals:
predicted_Test = predicted_Test[order(predicted_Test$predicted_yield, decreasing=TRUE),]
selected_individuals = predicted_Test[1:5,1]
selected_individuals
#> [1] "ID_222" "ID_210" "ID_225" "ID_206" "ID_249"
The vector selected_individuals
now contains the IDs of
the five individuals from the test data set predicted to have the
highest yields.
Random forest is inherently non-deterministic, meaning that repeated analyses on the same data set can produce different prediction models. To illustrate this effect, let’s repeat the previous analysis with the exact same data but using a different random seed before deriving the prediction model:
set.seed(321) # Set a different seed for reproducibility
RF_model_2 = ranger(y=Training[,1], x=Training[,-1], write.forest = TRUE)
predictions_2 = predict(RF_model_2, data=Test)$predictions
predicted_Test_2 = data.frame(ID = row.names(Test), predicted_yield = predictions_2)
The predicted_Test_2
data frame now contains the
predicted yields for individuals in the test data set from the second
random forest run. To evaluate the similarity between the predictions
from the two runs, we can merge the two data frames and visualise the
relationship between their predictions:
M = merge(predicted_Test, predicted_Test_2, by="ID")
plot(M$predicted_yield.x, M$predicted_yield.y,
xlab="Predicted yield in the first run", ylab="Predicted yield in the second run")
The scatterplot reveals substantial differences between the predictions from the two runs. Despite being based on the same data, the predictions vary considerably: Pearson’s correlation coefficient of 0.67 indicates only moderate agreement between the two runs.
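For reference, Pearson’s correlation coefficient between the two prediction vectors can be computed directly from the merged data frame:
cor(M$predicted_yield.x, M$predicted_yield.y) # Pearson correlation between the two runs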
The impact of random forest’s non-determinism becomes even more apparent when decisions are based on these predictions. For example, let’s select the top five individuals with the highest predicted yields in the second run:
predicted_Test_2 = predicted_Test_2[order(predicted_Test_2$predicted_yield, decreasing=TRUE),]
selected_individuals_2 = predicted_Test_2[1:5,1]
We can now compare the selections from the two runs:
selected_individuals
#> [1] "ID_222" "ID_210" "ID_225" "ID_206" "ID_249"
selected_individuals_2
#> [1] "ID_239" "ID_206" "ID_233" "ID_240" "ID_213"
From the comparison, it’s evident that the selections differ substantially. Only ID_206 appears in both selections, meaning that 80% of the selected individuals changed between runs.
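The overlap between the two selections can also be quantified directly:
intersect(selected_individuals, selected_individuals_2)    # individuals selected in both runs
sum(selected_individuals %in% selected_individuals_2) / 5  # proportion of the selection shared between runs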
Random forest works by constructing multiple decision or regression trees and averaging their predictions. This ensemble approach reduces overfitting and improves robustness. However, because each tree contributes a unique prediction, the overall stability of random forest predictions improves as the number of trees increases. By default, R packages such as ranger, randomForest, and Rborist use 500 trees. However, for the SNPdata data set, the default of 500 trees was insufficient to achieve stable predictions and selections.
To evaluate how the number of trees affects the prediction stability (the correlation of predictions between two repeated runs of random forest), the selection stability (the proportion of individuals selected in both runs based on the highest predicted responses), and the computation time, we can test random forests with different tree counts: 500, 1000, 1500, 2000, 2500, and 3000 trees. The following code illustrates the process:
num.trees_values = c(500, 1000, 1500, 2000, 2500, 3000)
result = data.frame()
for(i in num.trees_values){
  start.time = Sys.time()
  set.seed(123)
  RF_model_1 = ranger(y=Training[,1], x=Training[,-1], write.forest = TRUE, num.trees=i)
  predictions_1 = predict(RF_model_1, data=Test)$predictions
  predicted_Test_1 = data.frame(ID = row.names(Test), predicted_yield = predictions_1)
  predicted_Test_1 = predicted_Test_1[order(predicted_Test_1$predicted_yield, decreasing=TRUE),]
  selected_individuals_1 = predicted_Test_1[1:5,1]
  set.seed(321)
  RF_model_2 = ranger(y=Training[,1], x=Training[,-1], write.forest = TRUE, num.trees=i)
  predictions_2 = predict(RF_model_2, data=Test)$predictions
  predicted_Test_2 = data.frame(ID = row.names(Test), predicted_yield = predictions_2)
  predicted_Test_2 = predicted_Test_2[order(predicted_Test_2$predicted_yield, decreasing=TRUE),]
  selected_individuals_2 = predicted_Test_2[1:5,1]
  end.time = Sys.time()
  M = merge(predicted_Test_1, predicted_Test_2, by="ID")
  result = rbind(result, data.frame(number_of_trees = i,
                                    prediction_stability = cor(M$predicted_yield.x, M$predicted_yield.y),
                                    selection_stability = sum(selected_individuals_1 %in% selected_individuals_2)/5,
                                    computation_time = end.time - start.time))
}
result
#> number_of_trees prediction_stability selection_stability computation_time
#> 1 500 0.6690023 0.2 1.202866 secs
#> 2 1000 0.8050173 0.6 1.862012 secs
#> 3 1500 0.8804327 0.6 2.601142 secs
#> 4 2000 0.9170951 0.4 3.191468 secs
#> 5 2500 0.9312791 0.6 3.901001 secs
#> 6 3000 0.9287753 0.8 4.332426 secs
As expected, the stability improves with a higher number of trees. Here, the prediction stability rises from 0.67 at 500 trees to 0.93 at 3000 trees. However, while the computation time increases linearly with the number of trees, the relationship between the number of trees and the prediction stability is non-linear. We can plot the stability and computation time to better understand these trends:
par(mfrow=c(1,2))
plot(prediction_stability ~ number_of_trees, data=result)
plot(computation_time ~ number_of_trees, data=result)
abline(lm(result$computation_time ~ result$number_of_trees), lwd=2, col="grey")
Initially, the stability increases steeply with additional trees but eventually begins to plateau. This diminishing return indicates that beyond a certain number of trees, the stability gains become marginal. In contrast, the computation time increases linearly with the number of trees, a predictable outcome since each additional tree contributes directly to the processing time. Because the stability plateaus while the computation time keeps growing, an optimal trade-off point can be identified beyond which adding more trees to the random forest increases the computation time but improves the stability only marginally.
The optimal number of trees in a random forest model is highly
data-dependent and must be estimated for each new data set. To automate
this process, the optRF
package provides a convenient
method for determining the optimal number of trees.
To do so, it determines the stability of random forest with 250, 500, 750, 1,000, and 2,000 trees, repeating the random forest ten times for each of these values. If the response variable is numeric, the prediction stability is measured as the intraclass correlation between the predictions from the ten repetitions of random forest. If the response variable is categorical, the prediction stability is measured using Fleiss’ Kappa to assess the agreement between the predicted classes across the ten repetitions. Similarly, in each repetition of random forest, the individuals with the highest predicted response are classified as selected and the remaining individuals as rejected. The selection stability is then measured using Fleiss’ Kappa to assess the agreement among these classifications across the ten repetitions of random forest.
After the stability of random forest has been calculated for these numbers of trees, the relationship between the number of trees and the corresponding stability is modelled using a two-parameter logistic regression model, and the increase in stability gained with each additional tree is calculated from this model. The optimal number of trees is then defined as the point at which adding further trees increases the stability by only \(10^{-6}\) or less.
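To make this criterion concrete, the following rough sketch fits a two-parameter logistic curve to the stability values measured in the manual experiment above and locates the point where the estimated gain per additional tree falls below \(10^{-6}\). The particular parameterisation (a logistic function of the logarithm of the number of trees), the starting values, and the evaluation grid are assumptions for illustration and may differ from the model used internally by optRF:
# Fit an assumed two-parameter logistic curve to the measured stabilities
logistic_fit = nls(prediction_stability ~ 1 / (1 + exp(-(a + b * log(number_of_trees)))),
                   data = result, start = list(a = -5, b = 1))
trees = seq(500, 200000, by = 500)                # grid of tree counts to evaluate
stab = predict(logistic_fit, newdata = data.frame(number_of_trees = trees))
gain_per_tree = diff(stab) / diff(trees)          # approximate increase in stability per additional tree
trees[min(which(gain_per_tree < 1e-6)) + 1]       # first grid point where the gain drops below 1e-6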
To determine the optimal number of trees regarding the prediction stability, the opt_prediction function from the optRF package can be used, where the response variable from the training data set is passed to the y argument, the predictor variables from the training data set are passed to the X argument, and the predictor variables from the test data set are passed to the X_Test argument. In this example, the goal is to select the top five individuals (10% of the test population). By default, the function selects the top 15% of individuals from the test data set, but this can be customised via the alpha argument:
set.seed(123) # Set a seed for reproducibility
optRF_result = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test,
alpha=0.1)
Once the opt_prediction
function has completed, the
recommended number of trees and the expected stability of the random
forest model can be easily accessed using the summary
function:
summary(optRF_result)
#> Recommended number of trees: 19000
#> Expected prediction stability: 0.9780626
#> Expected selection stability: 0.7137072
#> Expected computation time (sec): 12.59672
Here, the optimal number of trees to obtain stable predictions is 19,000. Using random forest with this number of trees, one can expect a prediction stability of 0.98 and a selection stability of 0.71.
Now that the optimal number of trees has been determined, we can repeat the analysis from earlier to examine how the stability of the random forest model changes when using 19,000 trees.
set.seed(123)
RF_model_1 = ranger(y=Training[,1], x=Training[,-1],
write.forest = TRUE, num.trees=19000)
predictions_1 = predict(RF_model_1, data=Test)$predictions
predicted_Test_1 = data.frame(ID = row.names(Test), predicted_yield = predictions_1)
predicted_Test_1 = predicted_Test_1[order(predicted_Test_1$predicted_yield, decreasing=TRUE),]
selected_individuals_1 = predicted_Test_1[1:5,1]
set.seed(321)
RF_model_2 = ranger(y=Training[,1], x=Training[,-1],
write.forest = TRUE, num.trees=19000)
predictions_2 = predict(RF_model_2, data=Test)$predictions
predicted_Test_2 = data.frame(ID = row.names(Test), predicted_yield = predictions_2)
predicted_Test_2 = predicted_Test_2[order(predicted_Test_2$predicted_yield, decreasing=TRUE),]
selected_individuals_2 = predicted_Test_2[1:5,1]
M = merge(predicted_Test_1, predicted_Test_2, by="ID")
plot(M$predicted_yield.x, M$predicted_yield.y,
xlab="Predicted yield in the first run", ylab="Predicted yield in the second run")
With 19,000 trees, the correlation between the predictions in the two repetitions is 0.98, a dramatic improvement over the correlation of 0.67 observed with 500 trees. The scatter plot of the predictions from the two runs shows a tightly clustered pattern, confirming the consistency of the results, and the high correlation aligns with the stability value predicted by opt_prediction, demonstrating the reliability of the function’s recommendation.
While stable predictions are essential, it is equally important to ensure stability in the decisions based on these predictions. In this case, the goal was to select the five individuals from the test dataset with the highest predicted response. We now examine how the optimised number of trees impacts the consistency of these selections across two runs:
selected_individuals_1
#> [1] "ID_206" "ID_225" "ID_214" "ID_249" "ID_210"
selected_individuals_2
#> [1] "ID_225" "ID_206" "ID_214" "ID_210" "ID_213"
The selected individuals from the two runs are now largely similar. Among the five individuals chosen in each run, four were selected in both repetitions, demonstrating improved, though not perfect, selection stability. The remaining inconsistency arises because the optimal number of trees was defined based on prediction stability, with the criterion that adding more trees would increase the prediction stability by only \(10^{-6}\) or less. As shown above, the prediction stability at 19,000 trees is 0.98, but the selection stability is only 0.71, which might be insufficient for applications requiring highly consistent selections.
To enhance selection stability specifically, we can adjust the
opt_prediction
function to recommend the optimal number of
trees based on selection stability instead of prediction stability. This
can be achieved by modifying the recommendation
parameter
in the function:
set.seed(123) # Set a seed for reproducibility
optRF_result_2 = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test,
alpha=0.1, recommendation="selection")
summary(optRF_result_2)
#> Recommended number of trees: 73000
#> Expected prediction stability: 0.9931893
#> Expected selection stability: 0.837301
#> Expected computation time (sec): 49.47346
From the updated output of opt_prediction, we see that the recommended number of trees has increased substantially to 73,000. At this point, the expected selection stability is 0.84, and adding more trees would improve the selection stability by only \(10^{-6}\) or less.
In addition to determining the optimal number of trees such that
adding more trees would increase either the prediction or selection
stability by \(10^{-6}\) or less, it is
also possible to identify the number of trees required to achieve a
specific level of stability. This can be done using the
estimate_numtrees
and plot_stability
functions.
For instance, by running the plot_stability function for an interval of 0 to 200,000 trees, setting the add.recommendation argument to TRUE and the measure argument to "selection", we can visualise the selection stability achieved with the recommended 73,000 trees (displayed as a red line) compared to the stability across the entire interval:
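The corresponding call could look like the following sketch (the measure and add.recommendation arguments are named above; the argument names for the interval bounds, assumed here to be from and to, may differ in your version of optRF):
plot_stability(optRF_result_2, measure="selection",
               from=0, to=200000, add.recommendation=TRUE)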
From the plot, it is evident that there is still room for improvement
in selection stability. For example, running random forest with 73,000
trees yields a selection stability of 0.84. If a higher level of
stability is required, we can determine the number of trees necessary to
achieve specific selection stability thresholds (e.g., 0.9, 0.95, or
0.99) using the estimate_numtrees
function:
estimate_numtrees(optRF_result_2, measure="selection", for_stability=0.9)
#> selection_stability opt_numtrees computation_time
#> (Intercept) 0.9 206116 139.4819
estimate_numtrees(optRF_result_2, measure="selection", for_stability=0.95)
#> selection_stability opt_numtrees computation_time
#> (Intercept) 0.95 825538 558.3174
estimate_numtrees(optRF_result_2, measure="selection", for_stability=0.99)
#> selection_stability opt_numtrees computation_time
#> (Intercept) 0.99 17701448 11969.31
The output of estimate_numtrees provides the desired selection stability, the required number of trees, and the estimated computation time for running random forest with that number of trees. Notably, increasing the selection stability from 0.9 to 0.95 roughly quadruples the computation time, while increasing it further from 0.95 to 0.99 increases the computation time by a factor of about 20. This highlights the steeply increasing computational cost of higher stability.
It is important to note that the computation time estimated by
opt_prediction
is specific to the machine on which the
function was run. On a different system, the computation time may vary.
Additionally, these estimates are based on the ranger
function which can execute random forest computations in parallel using
multiple threads. To provide realistic estimates, the number of threads
can be specified using the num.threads
argument in both the
ranger
and opt_prediction
functions.
Assume that after analysing the results from
estimate_numtrees
, we decide to use 250,000 trees for
prediction. Based on earlier findings, this would yield a selection
stability between 0.9 and 0.95. However, to publish exact stability
metrics, we can use the estimate_stability
function with
the chosen number of trees:
estimate_stability(optRF_result_2, with_num.trees=250000)
#> num.trees prediction_stability selection_stability computation_time
#> (Intercept) 250000 0.9976854 0.9089737 169.1556
The results indicate that when making predictions with 250,000 trees and this data set, the prediction stability is 0.998 and the selection stability is 0.909. These values allow others to assess the reproducibility of the results and demonstrate that the analysis is reliable. Therefore, the prediction and selection stability should always be reported when publishing decisions based on random forest predictions.
This vignette illustrated an application of the opt_prediction function to predict the yield of 50 wheat plants and to select the five individuals with the highest predicted yield.
However, the function is highly flexible and can accommodate different selection goals, such as selecting the individuals with the lowest predicted response by setting the select_for argument to "low", or selecting the individuals whose predicted response (which can be negative or positive) is closest to zero by setting the select_for argument to "zero".
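For example, to select the individuals with the lowest predicted yield instead, only the select_for argument needs to change (a sketch; the result name optRF_result_low is chosen for illustration):
set.seed(123)
optRF_result_low = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test,
                                  alpha=0.1, select_for="low")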
In this example, the optimal number of trees was defined as the number of trees at which adding more trees would increase the prediction stability by \(10^{-6}\) or less. This threshold is arbitrary and can be customised using the rec.thresh argument to fit specific requirements. In addition to "prediction" (the default) and "selection", the recommendation argument can also be set to "none" to analyse the relationship between the number of trees and stability without receiving a specific recommendation. The user can then manually determine the optimal number of trees using the estimate_numtrees, estimate_stability, and plot_stability functions.
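For instance, a stricter recommendation threshold or a purely exploratory run without a recommendation could be requested as follows (the value \(10^{-7}\) for rec.thresh and the result names are arbitrary illustrations):
optRF_result_strict = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test,
                                     alpha=0.1, rec.thresh=1e-7)
optRF_result_explore = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test,
                                      alpha=0.1, recommendation="none")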
Users can also adjust computational settings to tailor the opt_prediction function to their needs. The number.repetitions argument controls how many repetitions are performed; fewer repetitions reduce the computation time but may lead to more variable results. The num.trees_values argument specifies which numbers of trees are evaluated; smaller ranges or fewer values speed up the computation but might reduce the precision of the stability analysis. We recommend not adjusting these arguments.
In addition to the arguments described here, the user can also pass arguments of the ranger function to customise the application of opt_prediction. This allows the user to adjust the num.threads argument to enable parallel computing or to set specific values for random forest hyperparameters such as mtry, min.node.size, or sample.fraction.
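As an illustration, such arguments can be passed directly in the opt_prediction call (the specific values shown here are arbitrary and chosen only for demonstration):
set.seed(123)
optRF_result_tuned = opt_prediction(y=Training[,1], X=Training[,-1], X_Test=Test,
                                    alpha=0.1, num.threads=4,
                                    mtry=1000, min.node.size=5)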
While random forest has many advantages and can be easily used to
enable data-driven decision-making processes, it must be taken into
consideration that random forest is a non-deterministic method that can
lead to different prediction models and, thus, different decisions even
when analysing the same data set. The stability of random forest depends strongly on the number of trees: increasing the number of trees improves the stability but also increases the computation time. The optRF package provides an
efficient solution for this problem by modelling the relationship
between the number of trees and the corresponding stability of random
forest. With its opt_prediction
function, users can
systematically evaluate this relationship for a specific data set and
determine the optimal number of trees required to achieve satisfactory
stability and improve reproducibility. Researchers are encouraged to
evaluate the stability of the chosen number of trees to ensure reliable
and reproducible results when using random forest.