Overview of feature engineering

2018, Sep 15    

Let's go through an overview of feature engineering. The outline below lists the main techniques, and I will cover some of them in more detail.

Feature Engineering

  • Generation
    • Binarization, Quantization
    • Scaling (normalization)
    • Interaction features
    • Log transformation
    • Dimension reduction
    • Clustering
  • Selection
    • Univariate feature selection
    • Model-based selection
    • Iterative feature selection
    • Feature removal

Interaction features

  • Create new features by combining existing features.
  • Requires prior knowledge and understanding of the existing features.
    • ex) height, width → area (area = height * width)
    • ex) sensor1 + sensor2 → new sensor feature
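
As a minimal sketch of the height/width example above, the new feature can be built with plain NumPy. The toy values and the variable names (`X`, `area`, `X_new`) are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: each row is (height, width) of one rectangle.
X = np.array([[2.0, 3.0],
              [4.0, 5.0],
              [1.5, 2.0]])

height, width = X[:, 0], X[:, 1]

# Interaction feature: area = height * width.
area = height * width

# Append the new feature as an extra column next to the originals.
X_new = np.column_stack([X, area])
print(X_new)
```

If you want every pairwise product instead of one hand-picked combination, scikit-learn's `PolynomialFeatures(interaction_only=True)` generates them, at the cost of many more columns.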

Log transformations

  • The data distribution is heavily concentrated around one point (ex. Poisson).
  • A normal distribution is well suited to linear models.
    • If the data follow a Poisson distribution, apply a log transformation before fitting a linear model so that the data are closer to a normal distribution. (Poisson → Normal dist)

      (Figure: linear-gaussian)
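
The effect can be sketched on synthetic data. The setup here is an assumption for illustration: Poisson counts whose rate varies a lot between samples, which gives a strongly right-skewed feature. `np.log1p` (log(1 + x)) is used instead of a plain log so that zero counts are handled, and `skewness` is a small hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic data: Poisson counts with a rate that differs per sample.
rate = 10 * np.exp(rng.normal(size=10_000))
counts = rng.poisson(lam=rate)

# log(1 + x) keeps zero counts valid and compresses the long right tail.
log_counts = np.log1p(counts)

def skewness(x):
    """Sample skewness: 0 for a symmetric distribution, > 0 for a right tail."""
    centered = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

skew_before = skewness(counts)
skew_after = skewness(log_counts)
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformed feature is not exactly normal, but it is much closer to symmetric, which is usually what a linear model needs.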

Dimension reduction

  • Used when the existing feature space is very large.
  • An algorithm reduces the feature space to fewer dimensions.
  • ex) PCA, t-SNE, LDA, …
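
Taking PCA as one example, a minimal sketch with scikit-learn looks like this. The data are synthetic and built on the assumption that 10 observed features are driven by only 2 latent directions, so 2 components should keep almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Assumed synthetic data: 200 samples, 10 features generated from
# 2 latent directions plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Project the 10-dimensional data onto its 2 main directions of variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance kept by the 2 components.
explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, round(explained, 4))
```

On real data the right number of components is not known in advance, so `explained_variance_ratio_` is a common way to choose it.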

Clustering

  • Without Y (the labels), group the dataset into components (clusters).
    • The created components are used as features for classification.
  • Useful when you already know the relationships in the data to some degree.
  • ex) K-means
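
A minimal sketch of using K-means output as features, with scikit-learn. The three well-separated blobs are made-up data, and using the distance to each centroid as the new features is one possible choice (the cluster label itself is another).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Assumed toy data: three well-separated 2-D blobs, 50 points each, no labels.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster index for each sample

# transform() returns the distance from each sample to each centroid.
distances = kmeans.transform(X)

# Original features plus the 3 cluster-distance features.
X_aug = np.column_stack([X, distances])
print(X_aug.shape)
```

`X_aug` can then be passed to a classifier in place of `X`.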

Feature selection

  • Not all features are necessary to train a model.
  • Some features are harmful to learning.
  • Too many features can cause overfitting.
  • Depending on the learning model, select the necessary features and remove the unnecessary ones.

Univariate feature selection

  • Selects the best features based on a statistical test.
  • Analyzes the relation between y and each feature, one feature at a time.
  • Useful for linear models.
  • A fast and simple technique to apply.
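
As a sketch of the idea, scikit-learn's `SelectKBest` scores each feature against y separately and keeps the top k. The data are synthetic and assume that only features 0 and 3 influence y. `f_regression` (an F-test on the linear relation) is chosen because the target is continuous. For classification, `f_classif` would be the analogous choice.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Assumed synthetic data: 6 features, but only features 0 and 3 affect y.
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=300)

# Score every feature against y on its own and keep the 2 best.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

chosen = selector.get_support(indices=True)   # indices of the kept features
print(chosen, X_selected.shape)
```

Because each feature is tested alone, this approach cannot see features that only matter in combination with others.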

Model-based feature selection

  • A model finds the relevant features during the learning process.
  • Features can be selected based on feature importance.
  • Model-based feature selection is useful as a preprocessing step that selects features for another model.
  • Tree-based ensembles also have characteristics similar to model-based feature selection.
  • Mainly used to explain which features are important.
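
A minimal sketch with scikit-learn's `SelectFromModel`, using a random forest as the model that supplies feature importances. The data are synthetic and assume that only features 1 and 4 determine the label. The `"median"` threshold (keep features whose importance is at least the median) is just one possible setting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)

# Assumed synthetic data: 8 features, label depends only on features 1 and 4.
X = rng.normal(size=(500, 8))
y = (X[:, 1] + X[:, 4] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Fit the forest, then keep features whose importance is >= the median.
selector = SelectFromModel(forest, threshold="median")
X_selected = selector.fit_transform(X, y)

importances = selector.estimator_.feature_importances_
chosen = selector.get_support(indices=True)
print(chosen, np.round(importances, 3))
```

`X_selected` can then be used to train a different model, and `importances` gives a rough picture of which features the forest relied on.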