Overview feature engineering - gaussian37

Overview feature engineering

Let’s study about overview of feature engineering. I will talk about the shadow color subject below.

Feature Engineering

Generation
- Binarization, Quantization
- Scaling (normalization)
- Interaction features
- Log transformation
- Dimension reduction
- Clustering
- …
Selection
- Univariate feature selection
- Model-based selection
- Iterative feature selection
- Feature removal
- …

Interaction features

create new features with combination of existing features.
need to pre-knowledge and understanding of existing features.
- ex) height, width → square (square = height * width)
- ex) sensor1 + sensor2 → new sensor feature

Log transformations

Data distribution is merged extremely in a point.(ex. Poisson)
- login count, sales rate, searched word, the number of friends
- Log transformation practice
Normal distribution is suitable for linear model.
- If data have Poisson distribution, make them to fit the linear model. Then, data will fit to normal distribution. (Poission → Normal dist)

Dimension reduction

is used when existing feature space is much large.
algorithm reduces the space.
ex) PCA, t-SNE, LDA, …
- PCA practice

Clustering

without Y, create the component of dataset
- created components are used as features for classification
is useful when you know the relationship between data to some degree.
ex) K-means …

Feature selection

All features are not necessary to learn a model.
Some features are malignant to learn.
Too many features causes overfitting.
According to learning model, select necessary features and remove unnecessary features.

Univariate feature selection

selects optimal feature based on statistical model.
- ex) Chi square, F-test, ANOVA …
- univariate feature selection
analyzes the relation between y and one feature.
is useful for linear model.
is fast-applicable and simple technique.

Model based feature selection

A model find proper features in the learning process
- ex) L1 penalty, Tree-based model
- Model based feature selection
Possible to select the features based on Feature importance
Useful to use Model-based feature selection as preprocessor of feature selection for other model
Tree-based ensemble also has characteristics similar to Model-based feature selection
Mainly used for explanation of which features are important.