Overview Bagging

All datasets are sampling of entire data. Let’s create various classifier with various sampling dataset. Also, with various sampling, we can make robust classifier.

sampling

As above, entire data can be separated to several samples. After ensemble of those, we can make robust classifer as third graph. Also, we call it that several weak classifier makes string classifier.

sampling2

Sampling reduces overfitting problem. Even though results of samples make overfitting, Ensemble of those has good learning result. Or, sampling reduces variance.

bootstrap

What is the bootstrapping?

sampling with replacement in learning dataset
- sampling n subset learning data
bootstrapping means sampling without adding data
.632 bootstrap
- when sampling d times, each datum is sampled with 0.632 probability.

What is bagging?

Bagging is Bootstrap Aggregation.
Ensemble with n subsample
Assign various data to one model.
Proper to high variance model.(likely to get overfitting)
- Bagging is proper ensemble for models likely to getting overfitting well.
- Combination of each overfitting lessens this problem and works out.
supports regressor, classifier

bagging

what is Out of bag error

OOB is error estimation
When using Bagging, validate the performance with not included data in the bag.
- It’s similar to deal with validation set
Good standard for evaluating Bagging performance

How to use Bagging in sklearn

bagging_sklearn

base_estimator : One estimator
- Concept of bagging is using one classifier and many different data
n_estimator : #bootstrap
max_samples : data ratio to use
max_features : feature ratio to use

bagging_regressor

BaggingRegressor is same with BaggingClassifier to use.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

clf = DecisionTreeClassifier(random_state=1)
eclf = BaggingClassifier(clf, oob_score=True)

cross_val_score(eclf, X, y, cv = 10).mean()

template