
Transformers are classes that implement both fit() and transform().

You might be
familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and
Binarizer. If you look at the docs for these preprocessing tools, you'll see that
they implement both of these methods. What I find pretty cool is that some
estimators can also be used as transformation steps, e.g. LinearSVC!

Estimators are classes that implement both fit() and predict(). You'll find that
many of the classifiers and regression models implement both these methods, and as
such you can readily test many different models. It is possible to use another
transformer as the final step of a pipeline (i.e., a class that doesn't necessarily
implement predict(), but definitely implements fit()). All this means is that you
wouldn't be able to call predict() on the pipeline.
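As a minimal sketch of the two interfaces (the choice of Binarizer and
LogisticRegression here is illustrative, and the tiny toy array is made up):

```python
import numpy as np
from sklearn.preprocessing import Binarizer
from sklearn.linear_model import LogisticRegression

X = np.array([[0.2, 1.5], [0.8, 0.3], [1.2, 2.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])

# A transformer: implements fit() and transform()
binarizer = Binarizer(threshold=0.5)
X_bin = binarizer.fit(X).transform(X)  # values > 0.5 become 1, else 0

# An estimator: implements fit() and predict()
clf = LogisticRegression()
clf.fit(X_bin, y)
preds = clf.predict(X_bin)
print(X_bin)
print(preds)
```

Calling fit() learns whatever the class needs from the data; transform() then
reshapes the features, while predict() produces labels.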

Pipeline 2: Feature Extraction and Modeling


Feature extraction is another procedure that is susceptible to data leakage.

Like data preparation, feature extraction procedures must be restricted to the data
in your training dataset.

The pipeline provides a handy tool called the FeatureUnion which allows the results
of multiple feature selection and extraction procedures to be combined into a
larger dataset on which a model can be trained. Importantly, all the feature
extraction and the feature union occurs within each fold of the cross validation
procedure.

The example below demonstrates the pipeline defined with four steps:

Feature Extraction with Principal Component Analysis (3 features)
Feature Extraction with Statistical Selection (6 features)
Feature Union
Learn a Logistic Regression Model

The pipeline is then evaluated using 10-fold cross validation.
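The four steps above can be sketched as follows (the breast cancer dataset is an
illustrative stand-in; any classification dataset would do):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Steps 1-3: PCA (3 components) and statistical selection (6 best features),
# combined into a single larger feature set by FeatureUnion
features = FeatureUnion([
    ("pca", PCA(n_components=3)),
    ("select_best", SelectKBest(k=6)),
])

# Step 4: the combined features feed a logistic regression model.
# Because everything sits inside the Pipeline, the feature extraction and
# the union are re-fit within each cross-validation fold, avoiding leakage.
model = Pipeline([
    ("features", features),
    ("logistic", LogisticRegression(max_iter=1000)),
])

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())
```

Note that cross_val_score clones and refits the whole pipeline per fold, which is
exactly what keeps the held-out fold out of the feature extraction.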

They are an extremely simple yet very useful tool for managing machine learning
workflows.

A typical machine learning task generally involves data preparation to varying
degrees. We won't get into the wide array of activities which make up data
preparation here, but there are many. Such tasks are known for taking up a large
proportion of the time spent on any given machine learning task.

After a dataset is cleaned up from a potential initial state of massive disarray,
however, there are still several less-intensive yet no less-important
transformative data preprocessing steps such as feature extraction, feature
scaling, and dimensionality reduction, to name just a few.

Maybe your preprocessing requires only one of these transformations, such as some
form of scaling. But maybe you need to string a number of transformations together,
and ultimately finish off with an estimator of some sort. This is where Scikit-
learn Pipelines can be helpful.
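For instance, here is a minimal sketch chaining a scaling step with a final
estimator (the iris dataset and the particular scaler and model are illustrative
choices):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# String a transformation together with a final estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])

# fit() runs fit/transform on the scaler, then fit on the estimator;
# score() transforms the test data using the training statistics first
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

The key point is that the test data is scaled with statistics learned from the
training data only, which is the leakage-avoidance the Pipeline enforces.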

Scikit-learn's Pipeline class is designed as a manageable way to apply a series of
data transformations followed by the application of an estimator. In fact, that's
really all it is:

Pipeline of transforms with a final estimator.

That's it. Ultimately, this simple tool is useful for:

Convenience in creating a coherent and easy-to-understand workflow
Enforcing workflow implementation and the desired order of step applications
Reproducibility
Value in persistence of entire pipeline objects (goes to reproducibility and
convenience)
So let's have a quick look at Pipelines. Specifically, here is what we will do.
