top of page

Should we fear AutoML to rule the world of automated machine learning anytime soon?

  • Writer: Sivasathivel Kandasamy
    Sivasathivel Kandasamy
  • Jun 27, 2019
  • 3 min read

Automated Machine Learning

Automated machine learning is the process of automating the end-to-end machine learning process. The steps involved in developing machine learning (ML) applications usually include:

  • data pre-processing

  • feature engineering

  • model selection

  • model tuning

Automated ML aims to automate this process end-to-end.


These processes need seasoned data scientists, with domain knowledge. Such data scientists are in demand and command premium salaries. Automated ML seems to obviate the need for such data scientists [1,2]. This seems to be the major attraction to many of the businesses - pay less get more[3].


AutoML

AutoML is one of the popular frameworks currently, for automated machine learning. Some of the features of AutoML include:

  • Analytics - the relationship between features and targets

  • Feature Engineering - around dates

  • Robust Scaling

  • Feature Selection

  • Data Formatting

  • Model Selection

  • Hyperparameter Tuning

  • Ensembling - SubPredictors and Weak Predictors

AutoML is a native python library. You can install it using the python package manager:


ree

The usage of this framework is also very simple:

ree

A sample training output is given below:

ree

ree

Deep Dive into the AutoML Code

The GitHub repo of the code shows the following python scripts:


ree

The class Predictor is the entry point. It takes in two parameters, the type of estimator and the column description. The first step of the Predictor is to call its method set_params_and_defaults.

set_params_and_defaults first removes all rows in the data frame with missing target values.

To compare different models then set the parameter ‘compare_all_models’ to ‘True’. Then AutoML will select 'RANSACRegressor', 'RandomForestRegressor', 'LinearRegression', 'AdaBoostRegressor', 'ExtraTreesRegressor', and 'GradientBoostingRegressor', for regressions. For classification, AutoML selects 'GradientBoostingClassifier', 'LogisticRegression' and 'RandomForestClassifier' by default. When 'compare_all_models' is not set, AutoML selects 'GradientBoostingRegressor' (regression) and 'GradientBoostingClassifier' (classification)


After this, it creates model pipelines and performs a search for models. The default search algorithms is GridSearchCV. Yet, when the number of combinations seems to be more than 50(magic number!) AutoML uses Evolutionary Algorithm Search with CV (EASCV) by default. AutoML also uses a python multiprocessing module for parallel processing. It doesn’t use EASCV for 'CatBoostClassifier' and 'CatBoostRegressor' as,

“For some reason, EASCV doesn't play nicely with CatBoost. It blows up the memory hugely, and takes forever to train”.

While I can go line by line of the code, I don’t want this post to become a tome or a post of static code analysis. The above discussions gives a birds’ eye view of the flow of logic in the library.

These algorithms are also among the top choices for many data scientists.


So when to use it?

Interestingly, these algorithms are also among the top choices for many data scientists. They also automate their parameter tuning using GridSearchCV or other algorithms. However, their approach would be more specific to their industry and focus on the present and future directions of the company.


AutoML is a very handy tool when one has no idea of the direction one has to take or a newbie. Or to come with a low effort study on the viability of a project.

That is, AutoML replaces experience and knowledge with the computational cost.



If you are adopting some automated ML frameworks beware of these:

  • AutoML reduces time by parallelizing the workloads. This may cause problems in architecting for production

  • The default parameters search space provided may not be optimal for your application. There is no quantitative proof, that this search space is relevant for all problems.

  • Might help a data scientist/student, who wants to compare a new algorithm.

  • You got to take steps to prevent these frameworks from stifling innovation at your place.



Data scientists do a lot more than trying different models. I spent a lot time comparing time series models during my post-doctoral research. The works would show the effort we took in understanding the data uncertainties. The knowledge of these uncertainties play a great role in building a robust system. These frameworks, at the most replace experience with computational cost. In platforms like AWS and Google, the cost of this search could be exorbitant.





Comments


© 2023 by Skyline

bottom of page