PCA and Feature Selection Strategies in AutoML

The performance of an automated machine learning (AutoML) workflow depends on how we process and feed different types of variables to the model, because most machine learning models accept only numerical variables. Categorical feature encoding therefore becomes a necessary step for any automated machine learning approach. It not only improves model quality but also supports better feature engineering.

There are two major feature reduction strategies in AutoML: principal component analysis (PCA) and feature selection.

PCA is widely used in current AutoML frameworks, because it is often used to reduce the dimensionality of a large dataset so that it becomes…


The latest version (0.1.10) of OptimalFlow adds a “no-code” web app as an application demo built on OptimalFlow. The web app allows simple click-and-select configuration of all the parameters inside OptimalFlow, which means users can build an end-to-end automated machine learning workflow without coding at all! (Documentation).

OptimalFlow was designed to be highly modular from the beginning, which makes it easy to keep developing and lets users build applications on top of it. The web app of OptimalFlow is a user-friendly tool that lets people without coding experience build an Omni-ensemble automated machine learning workflow simply and quickly.


Easy Way with Simple Code to Select the Optimal Model

Model selection is an essential step in creating the baseline model in a machine learning workflow. This step is usually time-consuming and requires many model tuning experiments.

So I wrote a handy package called OptimalFlow with an ensemble model selection module, autoCV, which goes through popular supervised modeling algorithms with cross-validation and applies a ‘lazy’ search over hyperparameters to select the optimal model.

Why use OptimalFlow? You can read its introduction in another story: An Omni-ensemble Automated Machine Learning — OptimalFlow.
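To make the idea concrete, here is a minimal sketch of cross-validated model selection written directly against scikit-learn. It is not the autoCV API: the candidate list, the small grids, the Iris dataset, and the helper name select_model are all assumptions chosen for illustration, and a plain grid search stands in for the ‘lazy’ search mentioned above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative candidates, each with a deliberately small hyperparameter grid.
CANDIDATES = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, None]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
}


def select_model(X, y, cv=5):
    """Grid-search every candidate with k-fold CV and return the best one.

    Returns (best_name, fitted GridSearchCV of the winner, {name: best CV score}).
    """
    searches = {}
    scores = {}
    for name, (estimator, grid) in CANDIDATES.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
        search.fit(X, y)
        searches[name] = search
        scores[name] = search.best_score_
    best_name = max(scores, key=scores.get)
    return best_name, searches[best_name], scores


X, y = load_iris(return_X_y=True)
best_name, best_search, scores = select_model(X, y)
print(best_name, best_search.best_params_, round(scores[best_name], 3))
```

The winner is picked by mean cross-validated accuracy only; in practice you would also keep a held-out test set to check the selected model.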


Easy Way with Simple Code to Select top Features

Feature selection is a crucial part of the machine learning workflow. How well the features are selected directly affects the model’s performance. Data scientists usually face two pain points:

  • Which feature selection algorithm is better?
  • How many columns from the input dataset need to be kept?

So I wrote a handy Python library called OptimalFlow with an ensemble feature selection module in it, called autoFS, to simplify this process.

OptimalFlow is an Omni-ensemble Automated Machine Learning toolkit, based on the Pipeline Cluster Traversal Experiment (PCTE) approach, that helps data scientists build optimal models…
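One way to picture an ensemble approach to both pain points is to let several selectors vote. The sketch below uses plain scikit-learn and is not the autoFS API: the three selectors, the value k=5, the breast cancer dataset, and the helper name vote_features are assumptions made for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def vote_features(X, y, feature_names, k=5):
    """Let three selectors each pick k features, then rank features by votes."""
    # Scale once so the logistic regression inside RFE converges cleanly.
    X_scaled = StandardScaler().fit_transform(X)
    selectors = [
        # Univariate ANOVA F-test.
        SelectKBest(f_classif, k=k),
        # Recursive feature elimination around a linear model.
        RFE(LogisticRegression(max_iter=1000), n_features_to_select=k),
        # Top-k random forest importances (threshold disabled via -inf).
        SelectFromModel(
            RandomForestClassifier(n_estimators=50, random_state=0),
            threshold=-np.inf,
            max_features=k,
        ),
    ]
    votes = Counter()
    for selector in selectors:
        selector.fit(X_scaled, y)
        mask = selector.get_support()
        votes.update(name for name, keep in zip(feature_names, mask) if keep)
    # (feature name, vote count) pairs, most-voted first.
    return votes.most_common(k)


data = load_breast_cancer()
top = vote_features(data.data, data.target, list(data.feature_names), k=5)
for name, count in top:
    print(f"{name}: {count} vote(s)")
```

Features chosen by several different algorithms are less likely to be an artifact of one method, which is the intuition behind voting; the number of columns to keep, k, is still a parameter you would tune.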


Formula E Laps Prediction — Part 2

In Part 1 of this tutorial, we discussed how to implement data engineering to prepare suitable datasets for the modeling steps that follow. Now we will focus on how to use the OptimalFlow library (Documentation | GitHub) to implement Omni-ensemble automated machine learning.

Why use OptimalFlow? You can read its introduction in another story: An Omni-ensemble Automated Machine Learning — OptimalFlow.


Formula E Laps Prediction — Part 1

In this end-to-end tutorial, we will illustrate how to use OptimalFlow (Documentation | GitHub), an Omni-ensemble automated machine learning toolkit, to predict the number of laps a driver will need to complete in an FIA Formula E race. This is a typical regression problem, and it affects the performance of the team’s racing and energy strategy.

Why use OptimalFlow? You can read its introduction in another story: An Omni-ensemble Automated Machine Learning — OptimalFlow.


OptimalFlow is an Omni-ensemble Automated Machine Learning toolkit, based on the Pipeline Cluster Traversal Experiment approach, that helps data scientists build optimal models in an easy way and automate the machine learning workflow with simple code.

OptimalFlow wraps the Scikit-learn supervised learning framework to automatically create a collection of machine learning pipelines (a Pipeline Cluster) based on permutations of the algorithms in each framework component.
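As a rough picture of what a Pipeline Cluster could look like, the sketch below enumerates every combination of one option per component using only the standard library. The component names, the option labels, and the helper build_pipeline_cluster are hypothetical, and reading “permutation” as one algorithm chosen per component is my assumption, not OptimalFlow’s documented behavior.

```python
from itertools import product

# Hypothetical component options; these are illustrative labels only.
COMPONENTS = {
    "encoder": ["onehot", "label"],
    "scaler": ["standard", "minmax", "none"],
    "selector": ["kbest", "rfe"],
    "model": ["logistic_regression", "random_forest"],
}


def build_pipeline_cluster(components):
    """Return one dict per pipeline, covering every combination of options."""
    stages = list(components)
    return [dict(zip(stages, combo)) for combo in product(*components.values())]


cluster = build_pipeline_cluster(COMPONENTS)
print(len(cluster))  # 2 * 3 * 2 * 2 = 24 pipelines
print(cluster[0])
```

The cluster size is the product of the option counts, so it grows quickly as components gain alternatives; traversing and scoring every pipeline is what makes the search exhaustive and also what makes it expensive.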

It includes feature engineering methods in its preprocessing module such as missing value imputation, categorical feature encoding, numeric feature standardization, and outlier winsorization. The models inherit algorithms from Scikit-learn and XGBoost estimators for classification and regression problems. …
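Two of the preprocessing steps named above, outlier winsorization and numeric standardization, are easy to show on a single column. This is a standard-library sketch under my own assumptions (5th and 95th percentile cut-offs, linear-interpolation percentiles, population standard deviation, and the helper names); it is not the preprocessing module’s API.

```python
from statistics import mean, pstdev


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def winsorize(values, lower_q=5, upper_q=95):
    """Clip values to the [lower_q, upper_q] percentile range."""
    ordered = sorted(values)
    low = percentile(ordered, lower_q)
    high = percentile(ordered, upper_q)
    return [min(max(v, low), high) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


raw = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0]  # 100.0 is an obvious outlier
clipped = winsorize(raw)
scaled = standardize(clipped)
print(clipped)
print([round(v, 3) for v in scaled])
```

Winsorizing before standardizing keeps a single extreme value from dominating the mean and standard deviation used for scaling.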


OptimalFlow is a high-level API toolkit that helps data scientists build models in an ensemble way and automate the machine learning workflow with simple code.

Compared with other popular “AutoML or Automated Machine Learning” APIs, OptimalFlow is designed as an Omni-ensemble ML workflow optimizer with a higher-level API that aims to avoid manual, repetitive train-along-evaluate experiments in general pipeline building.

It rebuilds the automated machine learning framework by switching the focus from automating single pipeline components to a higher workflow level, creating automated ensemble pipeline (Pipeline Cluster) traversal experiments and evaluation mechanisms. In other words, OptimalFlow jumps out of a single pipeline’s…

Tony Dong

Healthcare & Pharmaceutical Data Scientist | Big Data Analytics & AI Enthusiast
