You might be
familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and
Binarizer. If you look at the docs for these preprocessing tools, you'll see that
they implement both the fit() and transform() methods. What I find pretty cool is that some
estimators can also be used as transformation steps, e.g. LinearSVC!
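A minimal sketch of the fit()/transform() pattern these preprocessing tools share (the toy documents are illustrative assumptions):

```python
# Both preprocessing tools expose fit() and transform().
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Binarizer

docs = ["the cat sat", "the dog ran", "the cat ran"]

vec = TfidfVectorizer()
vec.fit(docs)             # learn the vocabulary and idf weights
X = vec.transform(docs)   # produce the tf-idf matrix

binarizer = Binarizer(threshold=0.5)
X_bin = binarizer.fit_transform(X.toarray())  # values above 0.5 become 1
```

Because every such tool follows the same two-method contract, they can be swapped in and out of a pipeline freely.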
Estimators are classes that implement both fit() and predict(). You'll find that
many of the classifiers and regression models implement both of these methods, so
you can readily test many different models. It is also possible to use another
transformer as the final step of a pipeline (i.e., the final step doesn't necessarily
have to implement predict(), but it must implement fit()). All this means is that you
wouldn't be able to call predict() on the pipeline.
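A minimal sketch of a pipeline whose final step is a transformer rather than a predictor (the scaler-plus-PCA combination is an assumed example): fit() and transform() work, but predict() is unavailable.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(20, 5)

# The final step (PCA) is a transformer, not a predictor.
pipe = Pipeline([("scale", StandardScaler()),
                 ("reduce", PCA(n_components=2))])
Z = pipe.fit_transform(X)  # fine: every step implements fit()/transform()

# pipe.predict(X) would raise an AttributeError, because PCA has no predict()
```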
Like data preparation, feature extraction procedures must be restricted to the data
in your training dataset; otherwise information from the test set can leak into the model.
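A minimal sketch of this rule, using scaling as an assumed stand-in for any feature extraction step: the extractor is fit on the training split only, and the test split reuses the learned statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(40, dtype=float).reshape(20, 2)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                  # statistics come from the training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)  # test data reuses the training statistics
```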
The pipeline provides a handy tool called the FeatureUnion, which allows the results
of multiple feature selection and extraction procedures to be combined into a
larger dataset on which a model can be trained. Importantly, all of the feature
extraction and the feature union occur within each fold of the cross-validation
procedure.
The example below demonstrates the pipeline defined with four steps:
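A hedged sketch of what such a four-step pipeline might look like; the particular steps chosen here (PCA, SelectKBest, a FeatureUnion combining them, and a logistic regression model on the iris data) are illustrative assumptions, not a fixed recipe.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline

X, y = load_iris(return_X_y=True)

# Steps 1 and 2: two feature extraction procedures, combined in step 3.
features = FeatureUnion([
    ("pca", PCA(n_components=3)),
    ("select_best", SelectKBest(k=2)),
])

# Step 4: learn a model on the combined feature set.
pipe = Pipeline([
    ("feature_union", features),
    ("logistic", LogisticRegression(max_iter=1000)),
])

# The extraction and the union are refit inside every cross-validation fold,
# so no information leaks across folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(pipe, X, y, cv=kfold)
print(scores.mean())
```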
They are an extremely simple yet very useful tool for managing machine learning
workflows.
Maybe your preprocessing requires only one of these transformations, such as some
form of scaling. But maybe you need to string a number of transformations together
and ultimately finish off with an estimator of some sort. This is where scikit-learn
Pipelines can be helpful.
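A minimal sketch of that simplest case, assuming scaling as the single transformation and a logistic regression model on the iris data as the final estimator:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(max_iter=200))])
pipe.fit(X, y)           # fits the scaler, then fits the model on scaled data
preds = pipe.predict(X)  # scales X with the fitted scaler, then predicts
```

Calling fit() and predict() on the pipeline runs each step in order, so the scaling logic never has to be repeated by hand.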