Jackknife variance estimates for random forest


In statistics, jackknife variance estimates for random forests are a way to estimate the variance of the predictions of random forest models, in order to eliminate bootstrap effects.

Jackknife variance estimates

The sampling variance of a bagged learner $\hat{\theta}(x)$ evaluated at a point $x$ is:

$V(x) = \mathrm{Var}[\hat{\theta}(x)]$
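
This target is directly observable only in simulation, where the training data can be redrawn at will. As a minimal sketch (the sample size, replicate counts, and the use of a bagged sample mean as a stand-in for $\hat{\theta}(x)$ are all assumptions chosen for illustration), the following approximates $V(x)$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B, reps = 50, 200, 1000

def bagged_mean(data, rng):
    # Average of B bootstrap-sample means: a toy stand-in for the
    # bagged prediction theta_hat(x) at a fixed query point x.
    idx = rng.integers(0, len(data), size=(B, len(data)))
    return data[idx].mean(axis=1).mean()

# Redraw the training set many times to approximate Var[theta_hat(x)].
preds = [bagged_mean(rng.normal(size=n), rng) for _ in range(reps)]
print(np.var(preds))  # ground truth available only in simulation
```

In practice the data cannot be redrawn, which is why the jackknife estimators below are needed.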

Jackknife estimates can be used to eliminate bootstrap effects. The jackknife variance estimator is defined as:

$\hat{V}_J = \frac{n-1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(-i)} - \bar{\theta} \right)^2$

where $\hat{\theta}_{(-i)}$ is the estimate computed with the $i$th observation deleted and $\bar{\theta}$ is the average of the leave-one-out estimates.
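
As a quick worked example (the five data points are made up, and the sample mean stands in for a generic $\hat{\theta}$), the following computes the leave-one-out estimates and plugs them into the formula above:

```python
import numpy as np

# Plain jackknife for a generic estimator, here the sample mean.
x = np.array([2.1, 3.4, 1.9, 4.0, 2.8])
n = len(x)

# Leave-one-out estimates theta_(-i), one per deleted observation.
theta_loo = np.array([np.delete(x, i).mean() for i in range(n)])

# Jackknife variance estimate from the formula above.
V_J = (n - 1) / n * np.sum((theta_loo - theta_loo.mean()) ** 2)
print(V_J)  # for the mean this equals the usual s^2 / n
```

For the sample mean the jackknife reproduces the familiar $s^2/n$ estimate of the variance of the mean exactly, which makes it a convenient sanity check.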

In classification problems where a random forest is used to fit the model, the jackknife estimate of variance is defined as:

$\hat{V}_J = \frac{n-1}{n} \sum_{i=1}^{n} \left( \bar{t}_{(-i)}(x) - \bar{t}(x) \right)^2$

Here, $t$ denotes a decision tree after training, $\bar{t}(x)$ is the ensemble's average prediction at $x$, and $\bar{t}_{(-i)}(x)$ denotes the average prediction over only those trees whose bootstrap samples do not contain the $i$th observation.
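
The estimator can be computed directly from the per-tree predictions and the bootstrap indices. In the sketch below (the sizes, and a bagged sample mean standing in for the trees $t_b(x)$ at a fixed query point, are assumptions for illustration), the key step is averaging only over the replicates whose bootstrap sample avoids observation $i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 50, 2000
data = rng.normal(size=n)

# B bootstrap replicates; t[b] plays the role of tree b's prediction.
idx = rng.integers(0, n, size=(B, n))   # bootstrap indices per replicate
t = data[idx].mean(axis=1)              # t_b(x), b = 1..B
t_bar = t.mean()                        # ensemble prediction

# Jackknife-after-bootstrap: average only the replicates whose
# bootstrap sample does NOT contain observation i (about B/e of them,
# so B must be reasonably large).
terms = []
for i in range(n):
    out = ~(idx == i).any(axis=1)
    terms.append((t[out].mean() - t_bar) ** 2)

V_J = (n - 1) / n * np.sum(terms)
print(V_J)
```

With a real random forest the same computation applies, using each tree's prediction and its recorded bootstrap sample in place of `t` and `idx`.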

Examples

The e-mail spam problem is a common classification problem in which 57 features are used to classify spam and non-spam e-mail. The IJ-U variance formula can be applied to evaluate the accuracy of random forests with m = 5, 19, and 57 features sampled at each split. The results reported in the paper "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife" show that the m = 57 random forest appears to be quite unstable, while predictions made by the m = 5 random forest appear to be quite stable. This agrees with the evaluation by error rate, in which the accuracy of the m = 5 model is high and that of the m = 57 model is low.

Here, accuracy is measured by error rate, which is defined as:

$\mathrm{ErrorRate} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij},$

Here, $N$ is the number of samples, $M$ is the number of classes, and $y_{ij}$ is the indicator that equals 1 when the $i$th observation is in class $j$ and 0 otherwise. No predicted probability is involved here. There is another accuracy metric, similar to the error rate:

$\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$

Here, $N$, $M$, and $y_{ij}$ are as above, and $p_{ij}$ is the predicted probability that the $i$th observation is in class $j$. This metric is used in Kaggle competitions. The two metrics are very similar, except that the log loss also takes the predicted probabilities into account.
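
As a minimal sketch of both metrics (the arrays below are made-up examples; the error rate is implemented as the standard misclassification rate with hard 0/1 predictions):

```python
import numpy as np

# One-hot true labels: y[i, j] = 1 iff observation i is in class j.
y = np.array([[1, 0], [0, 1], [1, 0]])                 # N=3, M=2
p = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])     # predicted probs

# Hard predictions: pick the most probable class, no probabilities kept.
pred = np.eye(y.shape[1])[p.argmax(axis=1)]
error_rate = 1.0 - (y * pred).sum(axis=1).mean()

# Multi-class log loss, clipping probabilities to avoid log(0).
eps = 1e-15
logloss = -(y * np.log(np.clip(p, eps, 1 - eps))).sum(axis=1).mean()

print(error_rate, logloss)  # 0.333..., ~0.41
```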

Modification for bias

When using Monte Carlo MSEs to estimate $V_{IJ}^{\infty}$ and $V_{J}^{\infty}$, the Monte Carlo bias should be taken into account; for a fixed number of bootstrap replicates $B$, it grows with $n$:

$E[\hat{V}_{IJ}^{B}] - \hat{V}_{IJ}^{\infty} \approx \frac{n}{B^2} \sum_{b=1}^{B} (t_b - \bar{t})^2$

To eliminate this influence, bias-corrected modifications are suggested:

$\hat{V}_{IJ-U}^{B} = \hat{V}_{IJ}^{B} - \frac{n}{B^2} \sum_{b=1}^{B} (t_b - \bar{t})^2$

$\hat{V}_{J-U}^{B} = \hat{V}_{J}^{B} - (e - 1) \frac{n}{B^2} \sum_{b=1}^{B} (t_b - \bar{t})^2$
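
The correction term is cheap to compute from the per-replicate predictions. The sketch below continues the bagged-mean toy example; the empirical-covariance form of the infinitesimal jackknife follows Wager, Hastie and Efron (2014), and all sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 50, 500
data = rng.normal(size=n)

idx = rng.integers(0, n, size=(B, n))
t = data[idx].mean(axis=1)                       # replicate predictions t_b
N = np.stack([(idx == i).sum(axis=1)             # N[b, i]: times obs. i
              for i in range(n)], axis=1)        # appears in sample b

# Infinitesimal jackknife: V_IJ^B = sum_i Cov_b(N_bi, t_b)^2.
cov = ((N - N.mean(axis=0)) * (t - t.mean())[:, None]).mean(axis=0)
V_IJ_B = np.sum(cov ** 2)

# Monte Carlo bias correction: subtract n/B^2 * sum_b (t_b - t_bar)^2.
correction = n / B**2 * np.sum((t - t.mean()) ** 2)
V_IJ_U = V_IJ_B - correction
print(V_IJ_B, V_IJ_U)
```

The jackknife-after-bootstrap estimate $\hat{V}_{J}^{B}$ would be corrected the same way, with the extra factor $(e - 1)$.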

References

Wager, S.; Hastie, T.; Efron, B. (2014). "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife". Journal of Machine Learning Research. 15: 1625–1651.