An 15 DM Caracterizacion

download An 15 DM Caracterizacion

of 15

Transcript of An 15 DM Caracterizacion

  • 8/20/2019 An 15 DM Caracterizacion

    1/38

    Data Mining:Characterization

  • 8/20/2019 An 15 DM Caracterizacion

    2/38

    Concept Description:

    Characterization and

    Comparison What is concept description?  Data generalization and summarization-based

    characterization

    Analytical characterization: Analysis of attribute

    relevance

    Mining class comparisons: Discriminating between

    dierent classes Mining descriptive statistical measures in large

    databases

    ummary

  • 8/20/2019 An 15 DM Caracterizacion

    3/38

    What is Concept

    Description?

    Descriptive vs! predictive data mining Descriptive mining: describes concepts or tas"-

    relevant data sets in concise# summarative#

    informative# discriminative forms $redictive mining: %ased on data and analysis#

    constructs models for the database# and predictsthe trend and properties of un"nown data

    &oncept description: &haracterization: provides a concise and succinctsummarization of the given collection of data

    &omparison: provides descriptions comparing twoor more collections of data

  • 8/20/2019 An 15 DM Caracterizacion

    4/38

    Concept Description:

    Characterization and

    Comparison What is concept description? Data generalization and summarization-based

    characterization

    Analytical characterization: Analysis of attribute

    relevance

    Mining class comparisons: Discriminating between

    dierent classes Mining descriptive statistical measures in large

    databases

    ummary

  • 8/20/2019 An 15 DM Caracterizacion

    5/38

    Data Generalization and

    Summarization-based

    Characterization Data generalization

    A process which abstracts a large set of tas"-relevant data in a database from a low

    conceptual levels to higher ones!12

    3

    4

    5Conceptual levels

  • 8/20/2019 An 15 DM Caracterizacion

    6/38

    Attribute-Oriented

    Induction $roposed in '()( *+DD ,)( wor"shop .ot con/ned to categorical data nor particular

    measures!

    0ow it is done? &ollect the tas"-relevant data* initial relation using a

    relational database 1uery

    $erform generalization by attribute removal or

    attribute generalization! Apply aggregation by merging identical# generalized

    tuples and accumulating their respective counts!

    2nteractive presentation with users!

  • 8/20/2019 An 15 DM Caracterizacion

    7/38

    asic !rinciples o"

    Attribute-Oriented

    Induction Data focusing: tas"-relevant data# including dimensions# andthe result is the initial relation!

    Attribute-removal: remove attribute A if there is a large set of

    distinct values for A but *' there is no generalization

    operator on A# or *3 A4s higher level concepts are e5pressedin terms of other attributes!

    Attribute-generalization: 2f there is a large set of distinct

    values for A# and there e5ists a set of generalization operators

    on A# then select an operator and generalize A!

    Attribute-threshold control: typical 3-)# speci/ed6default!

    7eneralized relation threshold control: control the /nal

    relation6rule size!

  • 8/20/2019 An 15 DM Caracterizacion

    8/38

    #$ample

    Describe general characteristics of graduatestudents in the %ig-8niversity database

    use %ig98niversity9D%

     mine characteristics as cience9tudents;

    in relevance to name# gender# ma statement:Select name# gender# ma

  • 8/20/2019 An 15 DM Caracterizacion

    9/38

    Class Characterization: An

    #$ampleName Gender Major Birth-Place Birth_date Residence Phone # GPAJimoodman

      M !" anco$%er&B!&

    !anada

      '-12-() 3511 Main "t*&

    Richmond

    )'(-45+' 3*)(

    "cott

    ,achance

      M !" Montreal& $e&

    !anada

    2'-(-(5 345 1st A%e*&

    Richmond

    253-+1.) 3*(.

    ,a$ra ,ee

    /

      0

    /

    Phsics

    /

    "eattle& A& "A

    /25-'-(.

    /

    125 A$stin A%e*&

    B$rna

    /

    42.-5232

    /

    3*'3

    /

    Remo%ed Retained "ci&n&B$s

    !o$ntr Ae rane !it Remo%ed 6cl&G&**

    Gender Major Birth_region Age_range Residence GPA Count

      M Science Canada 20-25 Richond !er"-good #$

      % Science %oreign 25-&0 Burna'" ()cellent 22

      * * * * * * *

      Birth_Region

    Gender 

    Canada %oreign +otal

      M #$ #, &0

      % #0 22 &2

      +otal 2$ &$ $2

    Prime

    Generali7ed

    Relation

    8nitial

    Relation

  • 8/20/2019 An 15 DM Caracterizacion

    10/38

    Concept Description:

    Characterization and

    Comparison What is concept description? Data generalization and summarization-based

    characterization

    Analytical characterization: Analysis of attribute

    relevance

    Mining class comparisons: Discriminating between

    dierent classes Mining descriptive statistical measures in large

    databases

    ummary

  • 8/20/2019 An 15 DM Caracterizacion

    11/38

    Attribute %ele&ance

    Anal'sis Why? Which dimensions should be included?

    0ow high level of generalization?

    Automatic vs! interactive Beduce = attributesC easy to understand patterns

    What? statistical method for preprocessing data

    /lter out irrelevant or wea"ly relevant attributes

    retain or ran" the relevant attributes

    relevance related to dimensions and levels

    analytical characterization# analytical comparison

  • 8/20/2019 An 15 DM Caracterizacion

    12/38

    Attribute rele&ance

    anal'sis (cont)d* 0ow? Data &ollection

    Analytical 7eneralization

    8se information gain analysis *e!g!# entropy or othermeasures to identify highly relevant dimensions andlevels!

    Belevance Analysis ort and select the most relevant dimensions and levels!

    Attribute-oriented 2nduction for class description n selected dimension6level

    A$ operations *e!g! drilling# slicing on relevancerules

  • 8/20/2019 An 15 DM Caracterizacion

    13/38

    %ele&ance Measures

    >uantitative relevance measuredetermines the classifying power of

    an attribute within a set of data! Methods

    information gain *2DE

    gain ratio *&F!G   χ3 contingency table statistics

    uncertainty coeHcient

  • 8/20/2019 An 15 DM Caracterizacion

    14/38

    In"ormation-+heoretic

    Approach Decision tree each internal node tests an attribute

    each branch corresponds to attribute value

    each leaf node assigns a classi/cation

    2DE algorithm build decision tree based on training ob

  • 8/20/2019 An 15 DM Caracterizacion

    15/38

    +op-Do,n Induction o"

    Decision +reeAttri'utes ./utloo1 +eperature1 uidit"1 3ind4

    /utloo 

    uidit"   3ind

    sunn" rainovercast

    "es

    no "es

    high noral

    no

    strong   ea 

    "es

    Pla"+ennis ."es1 no4

  • 8/20/2019 An 15 DM Caracterizacion

    16/38

    #$ample: Anal'tical

    Characterization  Ias"

    Mine general characteristics describing graduatestudents using analytical characterization

    7iven attributes name, gender, major, birth_place,

    birth_date, phone## and gpa

    Gen(ai ) J concept hierarchies on ai Ui J attribute analytical thresholds for ai T i J attribute generalization thresholds for ai R J attribute relevance threshold

  • 8/20/2019 An 15 DM Caracterizacion

    17/38

    #$ample: Anal'tical

    Characterization (cont)d*

    '! Data collection target class: graduate student

    contrasting class: undergraduate student

    3! Analytical generalization using 8i attribute removal

    remove name and phone#

    attribute generalization

     generalize major # birth_place# birth_date and gpa accumulate counts

    candidate relation: gender # major #birth_country # age_range and gpa

  • 8/20/2019 An 15 DM Caracterizacion

    18/38

    #$ample: Anal'tical

    characterization (*ender major irth_co$ntr ae_rane 9a co$ntM Science Canada 20-25 !er"_good #$

    % Science %oreign 25-&0 ()cellent 22

    M (ngineering %oreign 25-&0 ()cellent #6

    % Science %oreign 25-&0 ()cellent 25

    M Science Canada 20-25 ()cellent 2#

    % (ngineering Canada 20-25 ()cellent #6

    Candidate relation for Target class: Graduate students (  

    =120)

    ender major irth_co$ntr ae_rane 9a co$nt

    M Science %oreign 720 !er"_good #6

    % Business Canada 720 %air 20M Business Canada 720 %air 22

    % Science Canada 20-25 %air 2,

    M (ngineering %oreign 20-25 !er"_good 22

    % (ngineering Canada 720 ()cellent 2,

    Candidate relation for Contrasting class: Undergraduate students (  =130)

  • 8/20/2019 An 15 DM Caracterizacion

    19/38

    #$ample: Anal'tical

    characterization (.*

    E! Belevance analysis &alculate e5pected info re1uired to classify an

    arbitrary tuple

    &alculate entropy of each attribute: e!g! major 

    88660250

    #&0

    250

    #&0

    250

    #20

    250

    #20#&0#20   222#   .log log  ) , I(  ) s , I(s   =−−==

    %or major=”Science”9   S##6,   S2#,2 :;s##1s2#

  • 8/20/2019 An 15 DM Caracterizacion

    20/38

    #$ample: Anal'tical

    Characterization (/* &alculate e5pected info re1uired to classify agiven sample if is partitioned according to theattribute

    &alculate information gain for each attribute

    2nformation gain for all attributes

    B6B&0250

    ,2

    250

    62

    250

    #2$2&22#22###   . ) s , s(  I  ) s , s(  I  ) s , s(  I  E(major)   =++=

    2##502#   . E(major) ) s , I(s )Gain(major    =−=

    Gain;gender< 0=000&

    Gain;'irth_countr"< 0=0,0

    Gain;ajor< 0=2##5

    Gain;gpa< 0=,,80

    Gain;age_range< 0=58#

  • 8/20/2019 An 15 DM Caracterizacion

    21/38

    #$ample: Anal'tical

    characterization (0* F! 2nitial wor"ing relation *WK derivation B J K!'

    remove irrelevant6wea"ly relevant attributes from candidaterelation JL drop gender # birth_country 

    remove contrasting class candidate relation

    G! $erform attribute-oriented induction on WK using Ii

    major ae_rane 9a co$nt

    Science 20-25 !er"_good #$

    Science 25-&0 ()cellent ,

    Science 20-25 ()cellent 2#

    (ngineering 20-25 ()cellent #6

    (ngineering 25-&0 ()cellent #6

    8nitial taret class :or;in relation .< Grad$ate st$dents

  • 8/20/2019 An 15 DM Caracterizacion

    22/38

    Concept Description:

    Characterization and

    Comparison What is concept description? Data generalization and summarization-based

    characterization

    Analytical characterization: Analysis of attribute

    relevance

    Mining class comparisons: Discriminating between

    dierent classes

    Mining descriptive statistical measures in large

    databases

    ummary

  • 8/20/2019 An 15 DM Caracterizacion

    23/38

    Mining Class Comparisons

    &omparison: &omparing two or more classes! Method:

    $artition the set of relevant data into the target classand the contrasting class*es

    7eneralize both classes to the same high levelconcepts

    &ompare tuples with the same high level descriptions $resent for every tuple its description and two

    measures:

    support - distribution within single class comparison - distribution between classes

    0ighlight the tuples with strong discriminant features

    Belevance Analysis: ind attributes *features which best distinguish

    dierent classes!

  • 8/20/2019 An 15 DM Caracterizacion

    24/38

    #$ample: Anal'tical

    comparison

     Ias" &ompare graduate and undergraduate

    students using discriminant rule!

    DM> 1uery

    $se Big_niversit"_DBmine com9arison as @grad_vs_undergrad_students

    in rele%ance to name, gender, major, birth_place, birth_date, residence, phone, gpa

    =or @graduate_students

    :here status in @graduate%ers$s @undergraduate_students

    :here status in @undergraduateanal7e countE

    =rom student

  • 8/20/2019 An 15 DM Caracterizacion

    25/38

    #$ample: Anal'tical

    comparison (* 7iven

    attributes name, gender, major, birth_place,birth_date, residence, phone# and gpa

    Gen(ai ) J concept hierarchies on attributes ai

    Ui J attribute analytical thresholds for

    attributes ai

    T i J attribute generalization thresholds forattributes ai

    R J attribute relevance threshold

  • 8/20/2019 An 15 DM Caracterizacion

    26/38

    #$ample: Anal'tical

    comparison (.*

    '! Data collection target and contrasting classes

    3! Attribute relevance analysis remove attributes name, gender, major, phone#

    E! ynchronous generalization controlled by user-speci/ed dimension thresholds

    prime target and contrasting class*esrelations6cuboids

  • 8/20/2019 An 15 DM Caracterizacion

    27/38

    #$ample: Anal'tical

    comparison (/*Birth_co$ntr Ae_rane G9a !o$nt>Canada 20-25 Good 5=5&E

    Canada 25-&0 Good 2=&2E

    Canada /ver_&0 !er"_good 5=6$E

    * * * *

    /ther /ver_&0 ()cellent ,=$6E

    Prime enerali7ed relation =or the taret class< Grad$ate st$dents

    Birth_co$ntr Ae_rane G9a !o$nt>

    Canada #5-20 %air 5=5&E

    Canada #5-20 Good ,=5&E

    * * * *

    Canada 25-&0 Good 5=02E

    * * * *

    /ther /ver_&0 ()cellent 0=$6E

    Prime enerali7ed relation =or the contrastin class< nderrad$ate st$dents

  • 8/20/2019 An 15 DM Caracterizacion

    28/38

    #$ample: Anal'tical

    comparison (0*

    F! Drill down# roll up and other A$ operationson target and contrasting classes to ad

  • 8/20/2019 An 15 DM Caracterizacion

    29/38

    Concept Description:

    Characterization and

    Comparison What is concept description? Data generalization and summarization-based

    characterization

    Analytical characterization: Analysis of attributerelevance

    Mining class comparisons: Discriminating between

    dierent classes

    Mining descriptive statistical measures in large

    databases

    ummary

  • 8/20/2019 An 15 DM Caracterizacion

    30/38

    Mining Data Dispersion

    Characteristics Motivation  Io better understand the data: central tendency# variation

    and spread

    Data dispersion characteristics

    median# ma5# min# 1uantiles# outliers# variance# etc!

    .umerical dimensions correspond to sorted intervals

    Data dispersion: analyzed with multiple granularities of

    precision

    %o5plot or 1uantile analysis on sorted intervals

    Dispersion analysis on computed measures

    olding measures into numerical dimensions

    %o5plot or 1uantile analysis on the transformed cube

  • 8/20/2019 An 15 DM Caracterizacion

    31/38

    Measuring the Central

    +endenc' Mean

    Weighted arithmetic mean

    Median: A holistic measure

    Middle value if odd number of values# or average ofthe middle two values otherwise

    estimated by interpolation

    Mode

    Palue that occurs most fre1uently in the data

    8nimodal# bimodal# trimodal

    Qmpirical formula:

    ∑=

    =n

    i

    i !

    n !

    #

    #

    =

    ==n

    i

    i

    n

    i

    ii

    "

     !"

     !

    #

    #

    c #  

    l  #  n $median

    median

    <

  • 8/20/2019 An 15 DM Caracterizacion

    32/38

    Measuring the Dispersion o"

    Data >uartiles# outliers and bo5plots

    >uartiles: >' *3Gth percentile# >E *RGth percentile

    2nter-1uartile range: 2>B J >E S >'

    ive number summary: min# >'# M# >E# ma5

    %o5plot: ends of the bo5 are the 1uartiles# median is mar"ed#

    whis"ers# and plot outlier individually

    utlier: usually# a value higher6lower than '!G 5 2>B

    Pariance and standard deviation

    Pariance s2: *algebraic# scalable computation

    tandard deviation s is the s1uare root of variance s2

    ∑ ∑∑= ==

    −−

    =−−

    =n

    i

    n

    i

    ii

    n

    i

    i  !

    n !

    n ! !

    n s

    # #

    22

    #

    22 G

  • 8/20/2019 An 15 DM Caracterizacion

    33/38

     o$plot Anal'sis

    ive-number summary of a distribution:

    Minimum# >'# M# >E# Ma5imum

    %o5plot

    Data is represented with a bo5

     Ihe ends of the bo5 are at the /rst and third1uartiles# i!e!# the height of the bo5 is 2B>

     Ihe median is mar"ed by a line within thebo5

    Whis"ers: two lines outside the bo5 e5tendto Minimum and Ma5imum

  • 8/20/2019 An 15 DM Caracterizacion

    34/38

    A o$plot

    A 'o)plot

  • 8/20/2019 An 15 DM Caracterizacion

    35/38

    Concept Description:

    Characterization and

    Comparison What is concept description? Data generalization and summarization-based

    characterization

    Analytical characterization: Analysis of attributerelevance

    Mining class comparisons: Discriminating between

    dierent classes

    Mining descriptive statistical measures in large

    databases

    ummary

  • 8/20/2019 An 15 DM Caracterizacion

    36/38

    Summar'

    &oncept description: characterization and

    discrimination

    A$-based vs! attribute-oriented induction

    QHcient implementation of A2 Analytical characterization and comparison

    Mining descriptive statistical measures in large

    databases

    Discussion

    2ncremental and parallel mining of description

    Descriptive mining of comple5 types of data

  • 8/20/2019 An 15 DM Caracterizacion

    37/38

    %e"erences

     T! &ai# .! &ercone# and U! 0an! Attribute-oriented induction in relationaldatabases! 2n 7! $iatets"y-hapiro and W! U! rawley# editors# +nowledgeDiscovery in Databases# pages 3'E-33)! AAA26M2I $ress# '(('!

    ! &haudhuri and 8! Dayal! An overview of data warehousing and A$technology! A&M 27MD Becord# 3V:VG-RF# '((R

    &! &arter and 0! 0amilton! QHcient attribute-oriented generalization for"nowledge discovery from large databases! 2QQQ Irans! +nowledge and DataQngineering# 'K:'(E-3K)# '(()!

    W! &leveland! Pisualizing Data! 0obart $ress# ummit .U# '((E!

     U! ! Devore! $robability and tatistics for Qngineering and the cience# Fth ed!Du5bury $ress# '((G!

     I! 7! Dietterich and B! ! Michals"i! A comparative review of selected methodsfor learning from e5amples! 2n Michals"i et al!# editor# Machine earning: AnArti/cial 2ntelligence Approach# Pol! '# pages F'-)3! Morgan +aufmann# '()E!

     U! 7ray# ! &haudhuri# A! %osworth# A! ayman# D! Beichart# M! Pen"atrao# !$ellow# and 0! $irahesh! Data cube: A relational aggregation operatorgeneralizing group-by# cross-tab and sub-totals! Data Mining and +nowledgeDiscovery# ':3(-GF# '((R!

     U! 0an# T! &ai# and .! &ercone! Data-driven discovery of 1uantitative rules inrelational databases! 2QQQ Irans! +nowledge and Data Qngineering# G:3(-FK#'((E!

  • 8/20/2019 An 15 DM Caracterizacion

    38/38

    %e"erences (cont1*

     U! 0an and T! u! Q5ploration of the power of attribute-oriented induction in datamining! 2n 8!M! ayyad# 7! $iatets"y-hapiro# $! myth# and B! 8thurusamy#editors# Advances in +nowledge Discovery and Data Mining# pages E((-F3'!AAA26M2I $ress# '((V!

    B! A! Uohnson and D! A! Wichern! Applied Multivariate tatistical Analysis# Erded! $rentice 0all# '((3!

    Q! +norr and B! .g! Algorithms for mining distance-based outliers in large

    datasets! PD%()# .ew Tor"# .T# Aug! '(()! 0! iu and 0! Motoda! eature election for +nowledge Discovery and Data

    Mining! +luwer Academic $ublishers# '(()!

    B! ! Michals"i! A theory and methodology of inductive learning! 2n Michals"i etal!# editor# Machine earning: An Arti/cial 2ntelligence Approach# Pol! '# Morgan+aufmann# '()E!

     I! M! Mitchell! Persion spaces: A candidate elimination approach to rule

    learning! 2U&A2(R# &ambridge# MA!  I! M! Mitchell! 7eneralization as search! Arti/cial 2ntelligence# '):3KE-33V#

    '()3!

     I! M! Mitchell! Machine earning! Mc7raw 0ill# '((R!

     U! B! >uinlan! 2nduction of decision trees! Machine earning# ':)'-'KV# '()V!

    D! ubramanian and U! eigenbaum! actorization in e5periment generation!AAA2)V# $hiladelphia# $A# Aug! '()V!