An Omni-ensemble Automated Machine Learning Toolkit — OptimalFlow

Tony Dong
Aug 12, 2020 · 4 min read

OptimalFlow is a high-level API toolkit that helps data scientists build models in an ensemble way and automate their machine learning workflows with simple code.

Compared with other popular “AutoML or Automated Machine Learning” APIs, OptimalFlow is designed as an omni-ensemble ML workflow optimizer with a higher-level API that aims to avoid the manual, repetitive train-and-evaluate experiments of general pipeline building.

It rebuilds the automated machine learning framework by shifting the focus from automating single pipeline components to a higher workflow level: it creates automated traversal experiments and evaluation mechanisms over ensemble pipelines (a Pipeline Cluster). In other words, OptimalFlow steps outside the scope of a single pipeline, treats each whole pipeline as an entity, and automatically produces all possible pipelines for assessment until one of them leads to the optimal model. Thus, if a pipeline represents an automated workflow, OptimalFlow is designed to assemble all of these workflows and find the optimal one. That is also why it is named OptimalFlow.

Fig 1, OptimalFlow’s workflow

To achieve this, OptimalFlow creates Pipeline Cluster Traversal Experiments that assemble all cross-matching pipelines covering the major tasks of a machine learning workflow, then applies traversal experiments to search for the optimal baseline model. In addition, by modularizing all key pipeline components into reusable packages, it allows every component to be customized and updated with high scalability.

The common machine learning workflow is automated with a “single pipeline” strategy, first introduced and well supported by the scikit-learn library. In practice, data scientists have to run repetitive experiments on each component within one pipeline, adjusting algorithms and parameters, to reach the optimal baseline model. I call this operating mechanism “Single Pipeline Repetitive Experiments”. Whether in classic machine learning or in today’s popular AutoML libraries, it is hard to avoid this single-pipeline-focused experimentation, which is the biggest pain point in the supervised modeling workflow.
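As a concrete illustration of that pattern (plain scikit-learn, not OptimalFlow), the minimal sketch below keeps all experiments inside one fixed pipeline layout; any new idea for the scaler, selector, or estimator means editing this single pipeline and re-running the search by hand.

```python
# A minimal scikit-learn "single pipeline": tuning stays confined to this one
# pipeline layout, and structural changes require manual edits and re-runs.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Grid search only tunes parameters *within* this single pipeline structure.
param_grid = {"select__k": [5, 10, 20], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```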

Fig 2, Single Pipeline Repetitive Experiments

The core concept/improvement in OptimalFlow is Pipeline Cluster Traversal Experiments, a framework first proposed by Tony Dong at Genpact’s 2020 GVector Conference to optimize and automate the machine learning workflow using an ensemble-pipelines algorithm.

Compared with the repetitive single-pipeline experiments of other automated or classic machine learning workflows, Pipeline Cluster Traversal Experiments are more powerful because they expand the workflow from one dimension to two by ensembling all possible pipelines (a Pipeline Cluster) and automating the experiments. With this larger coverage they can find the best model without manual intervention, and the ensemble design of each component also makes them more flexible and elastic in coping with unseen data. Pipeline Cluster Traversal Experiments therefore give data scientists a more convenient, “omni-automated” alternative approach to machine learning.
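Conceptually, the traversal can be pictured as a cross-product of candidate components. The toy sketch below (plain scikit-learn, not OptimalFlow’s internal implementation) enumerates every combination of scaler, feature selector, and estimator, evaluates each resulting pipeline, and keeps the best one.

```python
# Toy illustration of Pipeline Cluster traversal: build every combination of
# candidate components, evaluate each pipeline, and keep the best one.
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

scalers = [("standard", StandardScaler()), ("minmax", MinMaxScaler())]
selectors = [("k10", SelectKBest(f_classif, k=10)), ("k20", SelectKBest(f_classif, k=20))]
models = [("logreg", LogisticRegression(max_iter=5000)),
          ("rf", RandomForestClassifier(random_state=13))]

best = None
for (s_name, s), (f_name, f), (m_name, m) in product(scalers, selectors, models):
    pipe = Pipeline([("scale", s), ("select", f), ("model", m)])
    score = cross_val_score(pipe, X, y, cv=5).mean()
    if best is None or score > best[0]:
        best = (score, (s_name, f_name, m_name))

print(f"Best pipeline {best[1]} with CV accuracy {best[0]:.3f}")
```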

Fig 3, Pipeline Cluster Traversal Experiments

OptimalFlow consists of the 6 modules below; you can find more details about each module in the Documentation here:

  • autoPP for feature preprocessing
  • autoFS for classification/regression feature selection
  • autoCV for classification/regression model selection and evaluation
  • autoPipe for Pipeline Cluster Traversal Experiments
  • autoViz for Pipeline Cluster visualization
  • autoFlow for logging & tracking
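To give a sense of how these modules chain together, here is a rough sketch condensed from the pattern in the documentation’s classification demo. The dataset path, the ‘Survived’ label column, and the custom_pp search-space dict are placeholder assumptions, and the exact class signatures (including the pipeline_splitting_rule helper) should be verified against the docs.

```python
# Rough sketch of chaining OptimalFlow's modules inside autoPipe, based on the
# documented classification example; verify exact signatures against the docs.
import pandas as pd
from optimalflow.autoPP import dynaPreprocessing
from optimalflow.autoFS import dynaFS_clf
from optimalflow.autoCV import dynaClassifier, evaluate_model
from optimalflow.autoPipe import autoPipe
from optimalflow.utilis_func import pipeline_splitting_rule

df = pd.read_csv("./data/train.csv")            # placeholder: any labeled dataset
custom_pp = {"scaler": ["standard", "minmax"]}  # placeholder search space; see autoPP docs

pipe = autoPipe([
    ("autoPP", dynaPreprocessing(custom_parameters=custom_pp, label_col="Survived", model_type="cls")),
    ("datasets_splitting", pipeline_splitting_rule(val_size=0.2, test_size=0.2, random_state=13)),
    ("autoFS", dynaFS_clf(fs_num=5, random_state=13, cv=5, in_pipeline=True, input_from_file=False)),
    ("autoCV", dynaClassifier(random_state=13, cv_num=5, in_pipeline=True, input_from_file=False)),
    ("model_evaluate", evaluate_model(model_type="cls")),
])

# pipe.fit(df) returns dictionaries covering preprocessed datasets, model
# evaluation reports, split data, and the explored models for every pipeline.
dict_prep_df, dict_models_eval, dict_data, dict_models_explore = pipe.fit(df)
```

The evaluation output can then be fed to the autoViz module to produce a model retrieval diagram like the one shown in Fig 4.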
Fig 4, Model Retrieval Diagram Generated by autoViz Module

There are some live notebooks (on Binder) and demos in the documentation.

Using OptimalFlow, data scientists, from beginners to experienced users, can easily build optimal models without tedious experiments and devote more attention to translating their industry domain knowledge into practical, deployed implementations.

In summary, OptimalFlow offers a few useful properties for data scientists:

  • Easy & less code — High-level APIs to implement Pipeline Cluster Traversal Experiments, with each ML component highly automated and modularized;
  • Well-ensembled — Each key component is an ensemble of popular algorithms with hyperparameter tuning included;
  • Omni-coverage — Pipeline Cluster Traversal Experiments are designed to cross-experiment with all key ML components, such as combined permuted input datasets, feature selection, and model selection;
  • Scalable & consistent — Each module can add new algorithms easily thanks to its ensemble and reusable design, with no need to modify existing code;
  • Adaptable — Pipeline Cluster Traversal Experiments make it easier to adapt to unseen datasets with the right pipeline;
  • Custom modification welcomed — Supports custom settings to add/remove algorithms or modify hyperparameters for elastic requirements.

As this is the initial stable release, all support is welcome! Please feel free to share your feedback, report issues, or join as a contributor at OptimalFlow’s GitHub here.


Tony Dong

Healthcare & Pharmaceutical Data Scientist | Big Data Analytics & AI Enthusiast