Poggi analytics - ensamble - 1b

Buenos Aires, March 2016. Eduardo Poggi

Transcript of Poggi analytics - ensamble - 1b

Page 1: Poggi   analytics - ensamble - 1b

Buenos Aires, March 2016. Eduardo Poggi

Page 2: Poggi   analytics - ensamble - 1b

Topics

Ensembles, Bagging, Boosting, Random Forest

Page 3: Poggi   analytics - ensamble - 1b

Topics

Ensembles, Bagging, Boosting, Random Forest

Page 4: Poggi   analytics - ensamble - 1b

Ensembles

Page 5: Poggi   analytics - ensamble - 1b

Ensembles

Ensemble: a set of models used together as a "meta-model". The underlying idea is familiar: use knowledge from different sources when making decisions.

Page 6: Poggi   analytics - ensamble - 1b

Ensembles

Committee of experts: many members, all with deep knowledge of the same topic; they all vote.

Advisory cabinet: experts in different areas, each with deep knowledge of their own field; a head decides who knows about the matter at hand.

Flat ensembles:

-Fusion

-Bagging

-Boosting

-Random Forest

Divisive ensembles:

-Mixture of experts

-Stacking

Crowding decision?

Page 7: Poggi   analytics - ensamble - 1b

Ensembles

Two basic components:

A method for selecting or building the members: same or different areas? different datasets x different models x different configurations.

A method for combining the decisions: simple voting, weighted voting, averaging, a problem-specific function, selective combination, ...
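A toy R illustration, with made-up votes and per-member weights, of how a simple majority vote and a weighted vote can disagree:

votes <- c("A", "A", "B", "A", "B")       # hypothetical predictions from 5 ensemble members
w     <- c(0.5, 0.8, 1.2, 0.3, 0.9)       # hypothetical per-member weights (e.g. accuracies)
names(which.max(table(votes)))            # simple majority vote: "A"
names(which.max(tapply(w, votes, sum)))   # weighted vote: "B"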

Page 8: Poggi   analytics - ensamble - 1b

Ensembles

Flat ensembles: many experts, all good. Each member needs to be as good as possible individually; otherwise the ensemble usually does not help. But they also need to disagree in some cases: if they always give the same answer, I might as well keep just one!

Page 9: Poggi   analytics - ensamble - 1b

Ensembles

Divisive ensembles: split the problem into a series of subproblems with minimal overlap. A divide-and-conquer strategy, useful for attacking large problems. A function is needed that decides which classifier has to act.

Page 10: Poggi   analytics - ensamble - 1b

Ensembles

If a good "learner" produces a good classifier, could many "learners" together produce something better?

Why not learn {h1, h2, h3} and take h*(x) = majority{h1(x), h2(x), h3(x)}? If the hi make independent errors, h* is more accurate: if Error(hi) = ε, then Error(h*) ≈ 3ε^2 (e.g. ε = 0.01 → 0.0003).
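A quick R check of that arithmetic, under the independence assumption (the exact majority-vote error is just below 3ε^2):

eps <- 0.01
3 * eps^2 * (1 - eps) + eps^3   # P(at least 2 of the 3 classifiers err) = 0.000298, roughly 3*eps^2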

Page 11: Poggi   analytics - ensamble - 1b

Ensembles

Page 12: Poggi   analytics - ensamble - 1b

Ensembles

Why do ensembles work?

1. Subsample the training set (Bagging, Boosting)
2. Manipulate the input features
3. Manipulate the output targets (ECOC)
4. Inject randomness (into the data or into the algorithm)
5. Algorithm-specific methods
Other combinations

Page 13: Poggi   analytics - ensamble - 1b

Ensembles

Subsampling

Page 14: Poggi   analytics - ensamble - 1b

Ensembles

Manipulate Input Features

Page 15: Poggi   analytics - ensamble - 1b

Ensembles

Manipulate Output Targets

Page 16: Poggi   analytics - ensamble - 1b

Ensembles

A learner is said to be unstable if the classifier it produces undergoes significant changes under small variations in the training data.

Unstable: decision trees, neural networks, ... Stable: linear regression, nearest neighbor, ...

Subsampling works best with unstable learners.

Page 17: Poggi   analytics - ensamble - 1b

Ensembles

Voting algorithms: take an inducer and a training set, run the inducer multiple times while changing the distribution of the training-set instances, combine the generated classifiers, ... and use the combination to classify.

Page 18: Poggi   analytics - ensamble - 1b

Ensembles

Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting methods) and those that do not (as in Bagging).

Page 19: Poggi   analytics - ensamble - 1b

Topics

Ensembles, Bagging, Boosting, Random Forest

Page 20: Poggi   analytics - ensamble - 1b

Bagging Algorithm

Bootstrap aggregating (Breiman 96) votes classifiers generated from different bootstrap samples (replicates).

Each bootstrap sample is built by uniformly sampling m instances from the training set with replacement. T bootstrap samples B1, B2, ..., BT are generated and a classifier Ci is built from each bootstrap sample Bi.

A final classifier C* is built from C1, C2, ..., CT whose output is the class predicted most often by its subclassifiers, with ties broken arbitrarily.
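A minimal R sketch of this procedure, using rpart trees as the inducer and the Pima.tr / Pima.te data from the MASS package; the dataset and T = 25 are illustrative choices, not from the slides:

library(rpart)
library(MASS)

set.seed(1)
T <- 25                                  # number of bootstrap replicates
m <- nrow(Pima.tr)
models <- vector("list", T)

for (t in 1:T) {
  idx <- sample(m, m, replace = TRUE)    # bootstrap sample B_t: m instances drawn with replacement
  models[[t]] <- rpart(type ~ ., data = Pima.tr[idx, ])   # classifier C_t built from B_t
}

# C*: predict the class voted most often by the subclassifiers (ties broken by which.max)
bagged_predict <- function(models, newdata) {
  votes <- sapply(models, function(mod) as.character(predict(mod, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

table(predicted = bagged_predict(models, Pima.te), actual = Pima.te$type)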

Page 21: Poggi   analytics - ensamble - 1b

Bagging Algorithm

Page 22: Poggi   analytics - ensamble - 1b

Bagging Algorithm

Page 23: Poggi   analytics - ensamble - 1b

Bagging Algorithm

Page 24: Poggi   analytics - ensamble - 1b

Bagging Algorithm

Page 25: Poggi   analytics - ensamble - 1b

Bagging Algorithm

An instance in the training set has probability 1 − (1 − 1/m)^m of being selected at least once when m instances are drawn at random with replacement.

For large m this is about 1 − 1/e ≈ 63.2%, which means that each bootstrap sample contains only about 63.2% of the distinct instances of the training set.
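A quick numeric check of the 63.2% figure (illustrative R):

m <- 10000
1 - (1 - 1/m)^m                              # analytic fraction of distinct instances per bootstrap, ~0.632
mean(1:m %in% sample(m, m, replace = TRUE))  # simulated fraction, also ~0.632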

If the inducer is unstable (e.g. a neural network or a decision tree), performance can improve.

If the inducer is stable (e.g. k-nearest neighbor), bagging may slightly degrade performance.

Page 26: Poggi   analytics - ensamble - 1b

Topics

Ensembles, Bagging, Boosting, Random Forest

Page 27: Poggi   analytics - ensamble - 1b

Adaboost Algorithm

Boosting (Schapire 90), AdaBoost M1 (Freund & Schapire 96).

Generates the classifiers sequentially, whereas Bagging can generate them in parallel.

AdaBoost also changes the weights of the training instances provided as input to each inducer, based on the classifiers that were previously built.

The goal is to force the inducer to minimize expected error over different input distributions.

C* = weighted voting: the weight of each classifier depends on its performance on the training set used to build it.
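A compact sketch of AdaBoost.M1 for a two-class problem, in R, using rpart stumps as the weak learner and the resampling variant mentioned on the next slide; the function names and T = 50 are illustrative, not from the slides:

library(rpart)

adaboost_m1 <- function(formula, data, T = 50) {
  n <- nrow(data)
  y <- model.response(model.frame(formula, data))   # true class labels
  w <- rep(1 / n, n)                                # initial uniform distribution over instances
  models <- list(); alphas <- numeric(0)
  for (t in 1:T) {
    idx  <- sample(n, n, replace = TRUE, prob = w)  # resample the training set by the current weights
    fit  <- rpart(formula, data = data[idx, ], control = rpart.control(maxdepth = 1))
    pred <- predict(fit, data, type = "class")
    err  <- sum(w[pred != y])                       # weighted error on the original training set
    if (err == 0 || err >= 0.5) break               # weak-learning assumption violated: stop
    beta <- err / (1 - err)
    w[pred == y] <- w[pred == y] * beta             # shrink the weights of correctly classified instances
    w <- w / sum(w)                                 # renormalize to a distribution
    models[[t]] <- fit
    alphas[t]   <- log(1 / beta)                    # voting weight of classifier t
  }
  list(models = models, alphas = alphas)
}

# C* as a weighted vote of the generated classifiers
adaboost_predict <- function(ens, newdata) {
  votes <- sapply(ens$models, function(m) as.character(predict(m, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(tapply(ens$alphas, v, sum))))
}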

Page 28: Poggi   analytics - ensamble - 1b

Adaboost Algorithm

The incorrect instances are weighted by a factor inversely proportional to the error on the training set, i.e., 1/(2Ei). Small training set errors, such as 0.1%, will cause weights to grow by several orders of magnitude.

The AdaBoost algorithm requires a weak learning algorithm whose error is bounded by a constant strictly less than 1/2. In practice, the inducers we use provide no such guarantee.

The original algorithm aborted when this error bound was breached.

Resampling + reweighting. Its success is attributed (???) to the distribution of the "margins".

Page 29: Poggi   analytics - ensamble - 1b

Adaboost Algorithm

Page 30: Poggi   analytics - ensamble - 1b

Adaboost Algorithm

Page 31: Poggi   analytics - ensamble - 1b

Adaboost Algorithm

Page 32: Poggi   analytics - ensamble - 1b

Adaboost Algorithm

Page 33: Poggi   analytics - ensamble - 1b

Adaboost Algorithm

Page 34: Poggi   analytics - ensamble - 1b

Adaboost Algorithm

Page 35: Poggi   analytics - ensamble - 1b

Adaboost : How Will Test Error Behave? (Guess!)

Expect... training error to continue to drop (or reach 0), and test error to increase when h* becomes "too complex" ("Occam's razor": overfitting).

Page 36: Poggi   analytics - ensamble - 1b

Adaboost : How Will Test Error Behave? (Real!)

But... test error does not increase, even after 1000 rounds; test error continues to drop, even after training error reaches 0!

Occam's razor ("the simpler rule is better")... appears not to apply!

Page 37: Poggi   analytics - ensamble - 1b

Adaboost : Margins

Key idea: training error only measures whether classifications are right or wrong; we should also consider the confidence of the classifications.

Confidence is measured by the margin = strength of the vote = (weighted fraction voting correctly) − (weighted fraction voting incorrectly).
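A small R helper (with hypothetical inputs) that computes this margin for one example, given the classifier weights and a flag saying whether each classifier voted correctly:

margin <- function(alpha, correct) {
  # (weighted fraction voting correctly) - (weighted fraction voting incorrectly), in [-1, 1]
  (sum(alpha[correct]) - sum(alpha[!correct])) / sum(alpha)
}
margin(alpha = c(0.9, 0.4, 0.7), correct = c(TRUE, FALSE, TRUE))   # 0.6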

Page 38: Poggi   analytics - ensamble - 1b

Adaboost : Margins

Key idea: training error only measures whether classifications are right or wrong; we should also consider the confidence of the classifications.

Page 39: Poggi   analytics - ensamble - 1b

Adaboost : Application detecting Faces [Viola & Jones]

Problem: find faces in a photograph or movie. Weak classifiers: detect light/dark rectangles in the image.

Many clever tricks make it extremely fast and accurate.

Page 40: Poggi   analytics - ensamble - 1b

Adaboost : practical advantages

Fast, simple, and easy to program. No parameters to tune (except T, sometimes). Flexible: can be combined with any learning algorithm. No prior knowledge needed about the weak learner. Provably effective, given a weak classifier.

Shift in mindset: the goal now is merely to find classifiers barely better than random guessing.

Versatile: works with data that is textual, numeric, discrete, etc., and has been extended to learning problems well beyond binary classification.

Page 41: Poggi   analytics - ensamble - 1b

Adaboost : warnings

Performance of AdaBoost depends on data and weak learner.

Consistent with theory, AdaBoost can fail if the weak classifiers are too complex (overfitting), or if the weak classifiers are too weak (γt → 0 too quickly), which leads to underfitting, low margins, and in turn overfitting.

Empirically, AdaBoost seems especially susceptible to uniform noise.

Page 42: Poggi   analytics - ensamble - 1b

Adaboost : Conclusions

Boosting is a practical tool for classification and other learning problems

It is grounded in rich theory, performs well experimentally, is often (but not always!) resistant to overfitting, and has many applications and extensions.

Page 43: Poggi   analytics - ensamble - 1b

Recognizing Handwritten Numbers

“Obvious” approach: learn F: Scribble → {0,1,2,...,9}

...doesn’t work very well (too hard!)

Or... “decompose” the learning task into 6 “subproblems”

Learn 6 classifiers, one for each "sub-problem". To classify a new scribble:

Run each classifier, then predict the class whose code word is closest (in Hamming distance) to the predicted code.
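A toy R sketch of the decoding step, with a randomly generated 6-bit code matrix (the slide's actual code words are not reproduced here, so this is purely an assumed illustration):

set.seed(7)
code_words <- matrix(sample(0:1, 10 * 6, replace = TRUE), nrow = 10,
                     dimnames = list(0:9, paste0("clf", 1:6)))    # one 6-bit code word per digit
bits <- c(1, 0, 1, 1, 0, 0)                                       # outputs of the 6 sub-classifiers
hamming <- rowSums(code_words != matrix(bits, 10, 6, byrow = TRUE))
names(which.min(hamming))                                         # class with the closest code word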

Page 44: Poggi   analytics - ensamble - 1b

Recognizing Handwritten Numbers

Predict the class whose code-word is closest (Hamming distance) to the predicted code

Page 45: Poggi   analytics - ensamble - 1b

Topics

Ensembles, Bagging, Boosting, Random Forest

Page 46: Poggi   analytics - ensamble - 1b

Random Forest: Bagging + trees

Using bootstraps generates diversity, but the trees remain highly correlated: the same variables tend to occupy the first splits every time.

Example: two trees grown with rpart from bootstrap samples of the Pima.tr dataset end up with the same variable at the root.
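A sketch that reproduces the experiment (assumes the MASS and rpart packages; the seed is arbitrary):

library(MASS); library(rpart)
set.seed(2016)
for (i in 1:2) {
  b   <- Pima.tr[sample(nrow(Pima.tr), replace = TRUE), ]   # bootstrap replicate of Pima.tr
  fit <- rpart(type ~ ., data = b)
  print(as.character(fit$frame$var[1]))                     # variable chosen at the root split
}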

Page 47: Poggi   analytics - ensamble - 1b

Random Forest

Add a little randomness to the tree growing: at each node, select a small group of variables at random and evaluate only those.

This adds no bias (in the long run every variable comes into play). It does add variance, but that is easily fixed by averaging models. It is effective at decorrelating the trees.

Page 48: Poggi   analytics - ensamble - 1b

Random Forest

Page 49: Poggi   analytics - ensamble - 1b

Random Forest

Trees are grown until everything is separated: no pruning, no stopping criterion.

The value of m (mtry in R) matters. The default, sqrt(p), is usually good. With m = p we recover bagging.

The number of trees is not critical as long as there are many: 500, 1000, 2000.
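A minimal randomForest call illustrating these points (assumes the randomForest and MASS packages; the data and parameter values are just examples):

library(randomForest); library(MASS)
set.seed(1)
p  <- ncol(Pima.tr) - 1                    # number of predictor variables
rf <- randomForest(type ~ ., data = Pima.tr,
                   mtry = floor(sqrt(p)),  # the usual default for classification
                   ntree = 500)            # "many" trees; 500, 1000 or 2000 all work
rf                                         # printing the object shows the OOB error estimate
# setting mtry = p would reduce this to plain bagging of unpruned trees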

Page 50: Poggi   analytics - ensamble - 1b

Random Forest

(Figure: typical order.)

Page 51: Poggi   analytics - ensamble - 1b

RF: Example

Page 52: Poggi   analytics - ensamble - 1b

Random Forest

Summary: an improvement on bagging, for trees only. Better predictions than Bagging. Widely used, almost automatic. Results comparable to the best current methods. Useful by-products, above all the OOB error estimate and the variable importance measures.

Page 53: Poggi   analytics - ensamble - 1b

Discover main color

Page 54: Poggi   analytics - ensamble - 1b

Bagging or Boosting: the bias-variance dilemma

Unbiased predictors have high variance (and vice versa).

There are two ways to resolve the dilemma:

Reduce the variance of the unbiased predictors: build many predictors and average them (Bagging and Random Forest).

Reduce the bias of the stable predictors: build a sequence such that the combination has less bias (Boosting).

Page 55: Poggi   analytics - ensamble - 1b

Bias and Variance

Which functions should we use?

Rigid functions: good estimation of the optimal parameters, but little flexibility (bias error).

Flexible functions: good fit, but poor estimation of the optimal parameters (variance error).

Page 56: Poggi   analytics - ensamble - 1b

What now?

Ensemble tools have been shown to improve the performance of the atomic techniques that make them up.

There are theorems showing that AdaBoost does better as long as the boosted model has certain weakness characteristics (limited, not complex).

Page 57: Poggi   analytics - ensamble - 1b

What now?

Corollary: the voters do not need to be intelligent, well trained, experts, etc.; it is enough that they are diverse and true to their limited abilities.

"A committee of fools works better than a single expert..."

What would a parliament of boosted legislators look like?

Page 58: Poggi   analytics - ensamble - 1b

[email protected]

eduardo-poggi

http://ar.linkedin.com/in/eduardoapoggi

https://www.facebook.com/eduardo.poggi

@eduardoapoggi

Page 59: Poggi   analytics - ensamble - 1b

Bibliography

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm