Machine Learning on
Large Volumes of Data

Class 5

Pablo Ariel Duboue, PhD

Universidad Nacional de Córdoba,
Facultad de Matemática, Astronomía y Física

1 Fifth Class: Feature Engineering

1.1 Previous class

Reading material
Questions
Logistic Regression: estimating the probability p_i
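\begin{align*}
Y_i \mid x_{1,i}, \ldots, x_{m,i} &\sim \operatorname{Bernoulli}(p_i) \\
\mathbb{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}] &= p_i \\
\Pr(Y_i = y_i \mid x_{1,i}, \ldots, x_{m,i}) &=
  \begin{cases}
    p_i & \text{if } y_i = 1 \\
    1 - p_i & \text{if } y_i = 0
  \end{cases} \\
\Pr(Y_i = y_i \mid x_{1,i}, \ldots, x_{m,i}) &= p_i^{y_i} (1 - p_i)^{(1 - y_i)}
\end{align*}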
Estimating p_i and the coefficients
\[
\operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}
\]
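The relationship between the logit, the coefficients, and the Bernoulli likelihood can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the coefficient values in `beta` are hypothetical, chosen for the example, and are not taken from the course material.

```python
import math

def sigmoid(z):
    """Inverse of the logit: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predict_proba(beta, x):
    """p_i such that logit(p_i) = beta_0 + beta_1*x_1 + ... + beta_m*x_m."""
    z = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    return sigmoid(z)

def bernoulli_likelihood(y, p):
    """Pr(Y = y) = p^y * (1 - p)^(1 - y), for y in {0, 1}."""
    return p ** y * (1.0 - p) ** (1 - y)

# Hypothetical coefficients (beta_0, beta_1, beta_2) and one instance.
beta = [-1.0, 0.5, 2.0]
x = [2.0, 0.0]
p = predict_proba(beta, x)  # linear score is -1 + 0.5*2 + 2*0 = 0
```

Here the linear score is 0, so the estimated probability is 0.5, and applying `logit` to any value returned by `predict_proba` recovers the linear score.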
Some Comments
Review: Decision Trees
Figure: CART_tree_titanic_survivors.png

(CC-BY-SA Stephen Milborrow, from Wikipedia)
Decision tree example
ARFF file
@relation adult
@attribute age real
@attribute workclass { Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked }
@attribute education { Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool }
% ...
@data
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
Using Weka from the command line (1)
$ java -cp weka.jar weka.classifiers.trees.Id3 -t adult.arff
weka.core.UnsupportedAttributeTypeException: weka.classifiers.trees.Id3: Cannot handle numeric attributes!
$ java -cp weka.jar weka.classifiers.trees.Id3 -t adult-no-num.arff
weka.core.NoSupportForMissingValuesException: weka.classifiers.trees.Id3: Cannot handle missing values!
$ java -cp weka.jar weka.classifiers.trees.Id3 -t adult-no-missing.arff
relationship = Wife
| occupation = Tech-support
| | education = Bachelors: >50K
| | education = Some-college
| | | workclass = Private
| | | | race = White: >50K
| | | | race = Black: <=50K
| | | workclass = Federal-gov: <=50K (...)
Correctly Classified Instances 23471 77.8165 %
Incorrectly Classified Instances 5297 17.5618 %
Using Weka from the command line (2)
$ java -cp weka.jar weka.classifiers.trees.J48 -t adult-no-missing.arff
marital-status = Married-civ-spouse
| education = Bachelors
| | relationship = Wife: >50K (269.0/84.0)
| | relationship = Own-child: <=50K (11.0/4.0)
| | relationship = Husband: >50K (2298.0/718.0)
| | relationship = Not-in-family: <=50K (1.0)
| | relationship = Other-relative: <=50K (20.0/3.0)
| | relationship = Unmarried: >50K (0.0)
| education = Some-college
| | occupation = Tech-support: >50K (109.0/43.0)
| | occupation = Craft-repair: <=50K (527.0/214.0)
| | occupation = Other-service: <=50K (123.0/17.0) (...)
Correctly Classified Instances 24803 82.2326 %
Incorrectly Classified Instances 5359 17.7674 %
Information Gain
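\[
\operatorname{Entropy}(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}
\]
\[
\operatorname{Entropy}(S) = \sum_{i=1}^{m} -p_i \log_2 p_i
\]
\[
\operatorname{Gain}(S, A) = \operatorname{Entropy}(S) - \sum_{v \in \operatorname{Values}(A)} \frac{|S_v|}{|S|} \operatorname{Entropy}(S_v)
\]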
Example
Full set entropy: 0.809565832961416
Attribute: workclass Gain: 0.0171044796229905
Attribute: education Gain: 0.0933939854773693
Attribute: marital-status Gain: 0.1581659232038
Attribute: occupation Gain: 0.0933446245279624
Attribute: relationship Gain: 0.203664907965332
Attribute: race Gain: 0.0603373246980876
Attribute: sex Gain: 0.64386930448363
Attribute: native-country Gain: 0.00932901407433484
Attribute: target Gain: 0.809565832961416
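
The entropy and gain definitions can be sketched directly in Python. This is a minimal sketch, not the script that produced the numbers above: the function names are hypothetical, and the toy data uses only three attributes from the four instances shown in the ARFF excerpt, so its values will not match the full-dataset figures.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    partitions = defaultdict(list)
    for row in rows:
        # Group the target labels by the value of the attribute.
        partitions[row[attribute]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in partitions.values())
    return entropy([row[target] for row in rows]) - remainder

# Toy data: race, sex, and class of the four instances in the ARFF excerpt.
rows = [
    {"race": "Black", "sex": "Female", "target": "<=50K"},
    {"race": "White", "sex": "Female", "target": "<=50K"},
    {"race": "White", "sex": "Male", "target": ">50K"},
    {"race": "Black", "sex": "Female", "target": "<=50K"},
]

full_entropy = entropy([row["target"] for row in rows])
for attribute in ("race", "sex", "target"):
    print("Attribute:", attribute, "Gain:", information_gain(rows, attribute, "target"))
```

As in the listing above, the gain of the target attribute with respect to itself equals the entropy of the full set, since every partition it induces is pure.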

1.2 An example

keywords4bytecodes
Java Bytecodes
k4w: Debian
k4w: Data
Learning cycle

1.3 Feature engineering

Pattern features
Feature filtering
Filtering heuristics
Missing values
Normalization
Binning
Arithmetic combinations
Combinations across instances
Binarization / thresholding
Particular limitations
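
Several of the transformations named above can be illustrated with a short Python sketch. The slides only list the techniques, so the specific choices here (mean imputation, min-max scaling, equal-width bins, a strict greater-than threshold) are assumptions made for illustration, and all function names are hypothetical. The ages and weekly hours come from the ARFF excerpt, except that one age is replaced by a hypothetical missing value.

```python
def impute_mean(values):
    """Missing values: replace None with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Normalization: rescale values linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Binning: map each value to the index of an equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls on the upper edge, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def binarize(values, threshold):
    """Thresholding: 1 if the value is strictly above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

ages = [28, 37, None, 49]          # third age is a hypothetical missing value
hours_per_week = [40, 40, 45, 16]

filled = impute_mean(ages)
scaled = min_max_normalize(filled)
binned = equal_width_bins(filled, 3)
overtime = binarize(hours_per_week, 40)
```

Each function works on a single column at a time. In a real pipeline the statistics (mean, minimum, maximum, bin edges) would normally be computed on the training set only and then reused on the test set.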