Adabag 2.0 presentacion

download Adabag 2.0 presentacion

of 17

Transcript of Adabag 2.0 presentacion

  • 8/3/2019 Adabag 2.0 presentacion

    1/17

    Adabag 2.0: una librera de R

    para Adaboost.M1 y Bagging

    s e an aro or sMatas Gmez Martnez

    Noelia Garca Rubio

    orna as e suar os e17 y 18 de Noviembre de 2011

    Escuela de Organizacin IndustrialUniv. de Castilla-La Mancha

    F. C.C. Econmicas y EmpresarialesAlbacete

    a r

    ndice

    Albacete

    Albacete

    1. Introduccin

    esariales

    esarialesdd

    2. Algoritmos

    Adaboost

    cas

    casyyEmp

    r

    Emp

    r

    Bagging

    asasEconmi

    Econmi .

    Funciones de Boosting

    dddedeCienci

    Cienci

    Funciones de Bagging

    Funcin Margen

    Fac

    ulta

    Fac

    ulta

    4. Ejemplo5. Conclusiones

  • 8/3/2019 Adabag 2.0 presentacion

    2/17

    Albacete

    Albacete . n ro ucc n

    En las ltimas dcadas se han desarrollado muchos mtodos de

    sariales

    sarialesdd clasificacin combinados basndose en rboles de clasificacin.

    En la librera que presentamos se implementan dos de los ms

    cas

    casyyEmpr

    Empr , .

    Bagging.

    asasEconmi

    Econmi

    Shapire, 1996] construye sus clasificadores basesecuencialmente, actualizando la distribucin sobre el conjunto

    dd

    dedeCienci

    Cienci

    de entrenamiento para crear cada uno de ellos, Bagging[Breiman, 1996] combina los clasificadores individuales

    Faculta

    Faculta

    entrenamiento.

    1. Introduccin

    Albacete

    Albacete

    Adabag 2.1 es una actualizacin de Adabag que aade una nuevafuncin margins para calcular los mrgenes de los

    esariales

    esarialesdd clasificadores. La primera versin fue publicada en junio de 2006

    y se ha utilizado ampliamente en tareas de clasificacin en campos

    cas

    casyyEmp

    r

    Emp

    r

    Kriegler, B. & Berk, R. (2007). Estimacin de las personas sin hogar en Losngeles (estimacin espacial en pequeas reas).

    asasEconmi

    Econmi

    Stewart, B. & Zhukov, Y. (2009). Anlisis del contenido para clasificacin dedocumentos militares.

    dddedeCienci

    Cienci De Bock, K.W. , Coussement, K. & Van den Poel, D. (2010). Clasificacin

    conjunta basada en Modelos Aditivos Generalizados.

    De Bock, K.W. , & Van den Poel, D. (2011). Prediccin de la fu a de

    Fac

    ulta

    Fac

    ulta

    clientes.

    Y en paquetes de R como digeR, GUI para analizar datos DIGEerence n- e ectrop ores s mens ona es, rea za o por:

    Fan, Y., Murphy, T.B. & Watson, W.G. (2009): digeR Manual.

  • 8/3/2019 Adabag 2.0 presentacion

    3/17

    2. Al oritmos: AdaBoost

    Albacete

    Albacete

    sariales

    sarialesdd

    ,

    Utiliza reponderacin en lugar de remuestreo

    cas

    casyyEmpr

    Empr

    El clasificador dbil anterior tiene slo un 50% deprecisin sobre el conjunto de entrenamiento con los

    asasEconmi

    Econmi

    nuevos pesos

    dd

    dedeCienci

    Cienci

    La clasificacin final se basa en el voto ponderado de losFaculta

    Faculta c as ca ores n v ua es

    AdaBoost.M1

    Albacete

    Albacete

    AdaBoost.M1 se aplica de la siguiente manera:

    esariales

    esarialesdd .

    (Tb), inicialmente todos los pesos son 1/n

    2. Re ita el rocedimiento ara b=1 2 .... B

    cas

    casyyEmp

    r

    Emp

    r

    a) Construir sobre Tb un clasif. individual Cb(xi) ={1,, k} n e1

    asasEconmi

    Econmi

    b) Calcular:

    ==b

    biiibe

    ywe1

    b nyx

    dddedeCienci

    Cienci

    =

    c) Actualizar y normalizar los pesos: )))(I(Cexp( b iibb

    i

    1b

    i yww =+ x

    Fac

    ulta

    Fac

    ulta

    B

    ==

    . ,,

    b

    1

    f i

    bYj

    i

    =

  • 8/3/2019 Adabag 2.0 presentacion

    4/17

    Albacete

    Albacete

    Boosting (AdaBoost)Boosting (AdaBoost)

    sariales

    sarialesdd reun an c ap re, Cf

    1 B

    Voto Ponderado

    cas

    casyyEmpr

    Empr

    . . .

    asasEconmi

    Econmi

    1 2 B

    dd

    dedeCienci

    Cienci

    T1

    T2

    TB

    . . .actualizacin

    Faculta

    Faculta

    Distribucin de probabilidad

    Tootstrap o pon era a

    2. Al oritmos: Ba in

    Albacete

    Albacete

    Bootstrapping and aggregating (Breiman, 1996)

    esariales

    esarialesdd

    Para b = 1, 2, .., B

    cas

    casyyEmp

    r

    Emp

    r

    ea zar con reemp azam ento n =n muestras eEntrenar el clasificador C

    b

    asasEconmi

    Econmi

    El clasificador final se obtiene por mayora simple

    dddedeCienci

    Cienci

    1..

    B

    Aumenta la estabilidad del clasificador/disminuye

    Fac

    ulta

    Fac

    ulta

    su varianza

  • 8/3/2019 Adabag 2.0 presentacion

    5/17

    Albacete

    Albacete

    sariales

    sarialesdd Cf(Breiman, 1996) Voto Simple

    cas

    casyyEmpr

    Empr

    . . .

    asasEconmi

    Econmi

    1 B2

    dd

    dedeCienci

    Cienci

    TB

    T2

    T1

    . . .

    Faculta

    Faculta

    Seleccin aleatoria

    Tcon reemp azam en o

    3. Funciones: Boosting

    Albacete

    Albacete adaboost.M1 Applies the Adaboost.M1 algorithm to a data set

    Description

    esariales

    esarialesdd Fits the Adaboost.M1 algorithm proposed by Freund and Schapire in 1996 using

    classification trees as single classifiers.Usage

    cas

    casyyEmp

    r

    Emp

    r a a oos . ormu a, a a, oos = , m na = ,

    coeflearn = 'Breiman', control)

    Arguments

    asasEconmi

    Econmi ormu a a ormu a, as n e m unc on.

    data a data frame in which to interpret the variables named in formula.

    boos if TRUE (by default), a bootstrap sample of the training set is drawn using the

    dddedeCienci

    Cienci . ,

    its weights.

    mfinal an integer, the number of iterations for which boosting is run or the number of

    =

    Fac

    ulta

    Fac

    ulta . .

    coeflearn ifBreiman

    (by default),alpha=1/2ln((1-err)/err)

    is used. IfFreund alpha=ln((1-err)/err) is used. Where alpha is the weight

    u datin coefficient.

    control options that control details of the rpart algorithm. See rpart.control for

    more details.

  • 8/3/2019 Adabag 2.0 presentacion

    6/17

    3. Funciones: Boosting

    Albacete

    Albacete adaboost.M1 Applies the Adaboost.M1 algorithm to a data set

    sariales

    sarialesdd

    Details

    Adaboost.M1 is a simple generalization of Adaboost for more than two classes.

    cas

    casyyEmpr

    Empr

    ValueAn object of class adaboost.M1, which is a list with the following components:

    asasEconmi

    Econmi

    formula the formula used.

    trees the trees grown along the iterations.

    dd

    dedeCienci

    Cienci .

    votes a matrix describing, for each observation, the number of trees that assigned it to

    each class, weighting each tree by its alpha coefficient.

    class the class redicted b the ensemble classifier.

    Faculta

    Faculta

    importance returns the relative importance of each variable in the classification task.

    This measure is the number of times each variable is selected to split.

    3. Funciones: Boosting

    Albacete

    Albacete

    predict.boosting Predicts from a fitted adaboost.M1 object

    esariales

    esarialesdd

    DescriptionClassifies a data frame using a fitted adaboost.M1 object.

    cas

    casyyEmp

    r

    Emp

    r

    Usage## S3 method for class 'boosting'

    asasEconmi

    Econmi pre c o ec , new a a, ...

    Arguments

    dddedeCienci

    Cienci . .

    some function that produces an object with the same named components as that returned

    by the adaboost.M1 function.

    Fac

    ulta

    Fac

    ulta .

    predictors referred to in the right side offormula(object)

    must be present by name innewdata.

    further ar uments assed to or from other methods.

  • 8/3/2019 Adabag 2.0 presentacion

    7/17

    3. Funciones: Boosting

    Albacete

    Albacete

    predict.boosting Predicts from a fitted adaboost.M1 object

    sariales

    sarialesdd

    cas

    casyyEmpr

    Empr

    ValueAn object of class predict.boosting, which is a list with the followingcomponents:

    asasEconmi

    Econmi

    formula the formula used.

    votes a matrix describing, for each observation, the number of trees that assigned it to

    dd

    dedeCienci

    Cienci

    each class, weighting each tree by its alpha coefficient.

    class the class predicted by the ensemble classifier.

    confusion the confusion matrix which compares the real class with the predicted one.

    Faculta

    Faculta

    error returns the average error.

    3. Funciones: Boosting

    Albacete

    Albacete

    . - .

    DescriptionThe data are divided into v non-overlapping subsets of roughly equal size. Then, adaboost.M1 is

    esariales

    esarialesdd applied on (v-1) of the subsets. Finally, predictions are made for the left out subsets, and the

    process is repeated for each of the v subsets.

    Usage= = =

    cas

    casyyEmp

    r

    Emp

    r . , , , , ,

    coeflearn = "Breiman", control)Argumentsformula a formula, as in the lm function.

    asasEconmi

    Econmi

    data a data frame in which to interpret the variables named in formula.

    boos ifTRUE (by default), a bootstrap sample of the training set is drawn using the weights for

    each observation on that iteration. IfFALSE, every observation is used with its weights.

    v- v

    dddedeCienci

    Cienci , . .

    of observations, leave-one-out cross validation is carried out. Besides this, every value between two

    and the number of observations is valid and means that roughly every v-th observation is left out.

    mfinal an integer, the number of iterations for which boosting is run or the number of trees to use.

    Fac

    ulta

    Fac

    ulta Defaults to mfinal=100 iterations.

    coeflearn if Breiman(by default), alpha=1/2ln((1-err)/err) is used. If Freund alpha=ln((1-err)/err) is used. Where alpha is the weight updating

    .control options that control details of the rpart algorithm. See rpart.control for more

    details.

  • 8/3/2019 Adabag 2.0 presentacion

    8/17

    3. Funciones: Boosting

    Albacete

    Albacete

    boosting.cv Runs v-fold cross validation with adaboost.M1

    sariales

    sarialesdd

    cas

    casyyEmpr

    Empr

    Value

    asasEconmi

    Econmi . ,

    class the class predicted by the ensemble classifier.

    dd

    dedeCienci

    Cienci

    confusion the confusion matrix which compares the real class with the predicted one.

    error returns the average error.

    Faculta

    Faculta

    3. Funciones: Bagging

    Albacete

    Albacete

    bagging Applies the Bagging algorithm to a data set

    esariales

    esarialesdd

    Description

    cas

    casyyEmp

    r

    Emp

    r

    Fits the Bagging algorithm proposed by Breiman in 1996 using classification trees assingle classifiers.Usage

    asasEconmi

    Econmi agg ng ormu a, a a, m na = , con ro

    Argumentsformula a formula, as in the lm function.

    dddedeCienci

    Cienci .

    mfinal an integer, the number of iterations for which boosting is run or the number of

    trees to use. Defaults to mfinal=100 iterations.

    Fac

    ulta

    Fac

    ulta . .

    more details.

  • 8/3/2019 Adabag 2.0 presentacion

    9/17

    3. Funciones: Bagging

    Albacete

    Albacete

    bagging Applies the Bagging algorithm to a data set

    sariales

    sarialesdd

    Details

    Unlike boosting, individual classifiers are independent among them in bagging.

    cas

    casyyEmpr

    Empr

    ValueAn object of class bagging, which is a list with the following components:

    asasEconmi

    Econmi

    formula the formula used.

    trees the trees grown along the iterations.

    dd

    dedeCienci

    Cienci , ,

    each class.

    class the class predicted by the ensemble classifier.

    sam les the bootstra sam les used alon the iterations.

    Faculta

    Faculta

    importance returns the relative importance of each variable in the classification task.

    This measure is the number of times each variable is selected to split.

    3. Funciones: Bagging

    Albacete

    Albacete

    predict.bagging Predicts from a fitted bagging object

    esariales

    esarialesdd

    Description

    cas

    casyyEmp

    r

    Emp

    r

    ass es a ata rame us ng a tte agg ng o ect.Usage## S3 method for class 'bagging'

    asasEconmi

    Econmi pre c o ec , new a a, ...

    Argumentsobject fitted model object of class bagging. This is assumed to be the result of some

    dddedeCienci

    Cienci

    bagging function.

    newdata data frame containing the values at which predictions are required. The

    Fac

    ulta

    Fac

    ulta

    newdata. further arguments passed to or from other methods.

  • 8/3/2019 Adabag 2.0 presentacion

    10/17

    3. Funciones: Bagging

    Albacete

    Albacete

    predict.bagging Predicts from a fitted bagging object

    sariales

    sarialesdd

    cas

    casyyEmpr

    Empr

    ValueAn object of class predict.bagging, which is a list with the following components:

    asasEconmi

    Econmi

    formula the formula used.

    votes a matrix describing, for each observation, the number of trees that assigned it to each

    dd

    dedeCienci

    Cienci .

    class the class predicted by the ensemble classifier.

    confusion the confusion matrix which compares the real class with the predicted one.

    Faculta

    Faculta .

    3. Funciones: Bagging

    Albacete

    Albacetebagging.cv Runs v-fold cross validation with Bagging

    esariales

    esarialesdd Description

    The data are divided into v non-overlapping subsets of roughly equal size. Then,bagging is applied on (v-1) of the subsets. Finally, predictions are made for the

    cas

    casyyEmp

    r

    Emp

    r e ou su se s, an e process s repea e or eac o e v su se s.Usage

    bagging.cv(formula, data, v = 10, mfinal = 100, control)

    asasEconmi

    Econmi rgumen s

    formula a formula, as in the lm function.

    data a data frame in which to interpret the variables named in formula.

    dddedeCienci

    Cienci , - . .

    the number of observations, leave-one-out cross validation is carried out. Besides this,

    every value between two and the number of observations is valid and means that roughly

    -

    Fac

    ulta

    Fac

    ulta .

    mfinal an integer, the number of iterations for which boosting is run or the number oftrees to use. Defaults to mfinal=100 iterations.

    control o tions that control details of the r art al orithm. See r art.control for

    more details.

  • 8/3/2019 Adabag 2.0 presentacion

    11/17

    3. Funciones: Bagging

    Albacete

    Albacete

    bagging.cv Runs v-fold cross validation with Bagging

    sariales

    sarialesdd

    cas

    casyyEmpr

    Empr

    Value

    asasEconmi

    Econmi n o ec o c ass agg ng.cv, w c s a s w e o ow ng componen s:

    class the class predicted by the ensemble classifier.

    dd

    dedeCienci

    Cienci

    confusion the confusion matrix which compares the real class with the predicted one.

    Faculta

    Faculta .

    3. Funciones: Mrgenes

    Albacete

    Albacete

    margins Calculates the margins

    esariales

    esarialesdd

    Descri tion

    cas

    casyyEmp

    r

    Emp

    r

    Calculates the margins of an adaboost.M1 or bagging classifier for a data frame.

    Usa e

    asasEconmi

    Econmi

    margins(object, newdata, ...)

    Arguments

    dddedeCienci

    Cienci

    object This object must be the output of one of the functions bagging,

    adaboost.M1, predict.bagging or predict.boosting. This is assumed to

    be the result of some function that produces an object with two components named

    Fac

    ulta

    Fac

    ulta

    formula and class, as those returned for instance by the bagging function.

    newdata The same data frame used for building the object.

  • 8/3/2019 Adabag 2.0 presentacion

    12/17

    3. Funciones: Mrgenes

    Albacete

    Albacete

    margins Calculates the margins

    sariales

    sarialesdd

    cas

    casyyEmpr

    Empr

    Details

    Intuitively, the margin for an observation is related to the certainty of its

    asasEconmi

    Econmi

    classification. It is calculated as the difference between the support of thecorrect class and the maximum support of a wrong class.

    dd

    dedeCienci

    Cienci Value

    An object of class margins, which is a list with only a component:

    Faculta

    Faculta marg ns vec or w e marg ns.

    4. Ejemplo: Vehicle (4 clases)

    Albacete

    Albacete library(adabag)

    data(Vehicle)

    esariales

    esarialesdd

    - ,

    sub

  • 8/3/2019 Adabag 2.0 presentacion

    13/17

    4. Ejemplo: Vehicle (4 clases)

    Albacete

    Albacete

    sariales

    sarialesdd > Vehicle.bagging Vehicle.bagging.pred Vehicle.bagging.pred$confusion

    Observed Class

    asasEconmi

    Econmi

    Predicted Class bus opel saab van

    bus 67 10 9 1

    opel 1 32 11 0

    dd

    dedeCienci

    Cienci

    saab 1 25 40 0

    van 0 5 12 68

    Faculta

    Faculta . .

    [1] 0.2659574

    4. Ejemplo: Vehicle (4 clases)

    Albacete

    Albacete> Vehicle.bagging$importance

    Comp Circ D.Circ Rad.Ra Pr.Axis.Ra Max.L.Ra Scat.Ra Elong

    9.433962 2.744425 4.459691 2.572899 3.087479 14.236707 4.459691 9.777015

    esariales

    esarialesdd

    Pr.Axis.Rect Max.L.Rect Sc.Var.Maxis Sc.Var.maxis Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis

    0.000000 12.006861 4.116638 9.777015 1.372213 2.229846 6.861063 1.715266

    Kurt.Maxis Holl.Ra

    5.660377 5.488851

    cas

    casyyEmp

    r

    Emp

    r

    > barplot(sort(Vehicle.bagging$importance,dec=T),col="blue",horiz=T,las=1)

    asasEconmi

    Econmi

    dddedeCienci

    Cienci

    Fac

    ulta

    Fac

    ulta

  • 8/3/2019 Adabag 2.0 presentacion

    14/17

    4. Ejemplo: Vehicle (4 clases)

    Albacete

    Albacete

    sariales

    sarialesdd> Vehicle.bagging.cv names(Vehicle.bagging.cv)1 "class" "confusion" "error"

    cas

    casyyEmpr

    Empr

    > Vehicle.bagging.cv$confusion

    Clase real

    asasEconmi

    Econmi Clase estimada bus opel saab van

    bus 206 20 28 1

    opel 3 106 72 0

    dd

    dedeCienci

    Cienci

    saab 0 70 91 1

    van 9 16 26 197

    > Vehicle.ba in .cv$error

    Faculta

    Faculta

    [1] 0.2907801

    > mar ins Vehicle.ba in Vehicle[sub ] ->Vehicle.ba in .mar ins trainin set

    Albacete

    Albacete

    . . .

    > margins(Vehicle.bagging.pred,Vehicle[-sub,])->Vehicle.bagging.predmargins # test set

    > margins.test margins.train plot(sort(margins.train), (1:length(margins.train))/length(margins.train),type="l", xlim=c(-1,1),

    + main=Bagging, Margin cumulative distribution graph",xlab="m,ylab="% observations", col="blue3",lty=2)

    > abline(v=0, col="red",lty=2)

    =" " = =" "

    cas

    casyyEmp

    r

    Emp

    r . , . . , , . ,

    asasEconmi

    Econmi

    dddedeCienci

    Cienci

    Fac

    ulta

    Fac

    ulta

  • 8/3/2019 Adabag 2.0 presentacion

    15/17

    4. Ejemplo: Vehicle (4 clases)

    Albacete

    Albacete

    > Vehicle.adaboost Vehicle.adaboost.pred error.adaboost error.adaboost

    [[1] 0.2304965

    Albacete

    Albacete> Vehicle.adaboost$importance

    Comp Circ D.Circ Rad.Ra Pr.Axis.Ra Max.L.Ra Scat.Ra Elong

    7.100592 3.550296 6.213018 3.254438 9.171598 11.538462 6.804734 5.917160

    esariales

    esarialesdd

    Pr.Axis.Rect Max.L.Rect Sc.Var.Maxis Sc.Var.maxis Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis

    0.000000 10.946746 5.029586 5.325444 6.213018 4.733728 5.325444 3.254438

    Kurt.Maxis Holl.Ra

    3.254438 2.366864

    cas

    casyyEmp

    r

    Emp

    r

    > barplot(sort(Vehicle.adaboost$importance,dec=T),col="blue",horiz=T,las=1)

    asasEconmi

    Econmi

    dddedeCienci

    Cienci

    Fac

    ulta

    Fac

    ulta

  • 8/3/2019 Adabag 2.0 presentacion

    16/17

    4. Ejemplo: Vehicle (4 clases)

    Albacete

    Albacete

    sariales

    sarialesdd

    . . - . ~., =

    > names(Vehicle.adaboost.cv)

    [1] "class" "confusion" "error"

    cas

    casyyEmpr

    Empr > Vehicle.adaboost.cv$confusion

    Observed Class

    Predicted Class bus opel saab van

    asasEconmi

    Econmi

    bus 213 2 3 1

    opel 3 116 74 2

    dd

    dedeCienci

    Cienci

    van 1 6 6 192

    > Vehicle.adaboost.cv$error

    Faculta

    Faculta[1] 0.2257683

    [

    Albacete

    Albacete

    . , u , - . .

    > margins(Vehicle.adaboost.pred,Vehicle[-sub,])->Vehicle.adaboost.predmargins # test set

    > margins.test margins.train plot(sort(margins.train), (1:length(margins.train))/length(margins.train),type="l", xlim=c(-1,1),

    +main="AdaBoost, Margin cumulative distribution graph",xlab="m,ylab="% observations", col="blue3",lty=2)

    > abline(v=0, col="red",lty=2)

    > lines(sort(mar ins.test), (1:len th(mar ins.test))/len th(mar ins.test),t pe="l", cex=.5,col=" reen")

    cas

    casyyEmp

    r

    Emp

    r

    asasEconmi

    Econmi

    dddedeCienci

    Cienci

    Fac

    ulta

    Fac

    ulta

  • 8/3/2019 Adabag 2.0 presentacion

    17/17

    5. Conclusiones

    Albacete

    Albacete

    sariales

    sarialesdd

    Las funciones incluidas en esta librera permiten:

    cas

    casyyEmpr

    Empr 1. Resolver problemas de clasificacin con ms de dos clases.

    asasEconmi

    Econmi . .

    3. Medir la im ortacia relativa de las variables.

    dd

    dedeCienci

    Cienci

    4. Calcular los mrgenes de los clasificadores combinados.Faculta

    Faculta

    5. Estimar el error mediante validacin cruzada.

    Gracias por su atencin!

    e-mail: [email protected]

    . .

    [email protected]