Demonstrations of scalable sklearn with dask for out-of-core computation.

PythonWorkshop/scalable-sklearn

Exploring scikit-learn with dask for scaling out computation on large data

tl;dr

Here you'll find demonstrations of scaling scikit-learn with dask for out-of-core computation on large, complex datasets. Dask represents a computation as a task graph (which can itself be modified) and executes it chunk by chunk, so data that does not fit in memory can stay on disk (out-of-core). This lets both the computation and the amount of data grow beyond what fits in RAM, which is useful for ML.
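As a minimal sketch of the chunked, task-graph execution described above (not code from this repository), the snippet below builds a lazy dask array and reduces it block by block. The array shape, chunk size, and variable names are illustrative assumptions, and random in-memory data stands in for a dataset that would normally live on disk.

```python
import dask.array as da

# Assumption: random data stands in for a large on-disk dataset.
# The array is split into 4 blocks of 1000 rows each.
x = da.random.random((4000, 1000), chunks=(1000, 1000))

# Nothing is computed yet: this only records tasks in a graph.
col_means = x.mean(axis=0)

# Count the tasks dask has planned for this reduction.
n_tasks = len(dict(col_means.__dask_graph__()))

# compute() walks the graph, processing one block at a time.
result = col_means.compute()

print(x.numblocks, n_tasks, result.shape)
```

Calling `col_means.visualize()` should render the same graph as an image, provided the optional graphviz dependency is installed.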

Blurb

It's becoming increasingly important to scale up machine learning and deep learning computation, either with the common solution of a cluster of GPUs or with out-of-core computation on a single machine that has enough local disk storage, which is rarely a problem these days. Dask is a new Python library that uses out-of-core processing driven by task graphs to handle large datasets (GBs to TBs) for resource-hungry computation. It can do all this on a single PC or laptop, given enough disk.

Outline

  • skimage to convert images to numeric data
  • Standard scaling of data
  • (Optional) clean up noise
  • Image classification with:
    • MLP setup (using sklearn 0.18.dev0)
    • dask and `partial_fit` for the MLP
  • Visualize task graph
  • Try it with GridSearchCV for hyperparameter tuning
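The scaling and `partial_fit` steps in the outline could look roughly like the sketch below. This is an illustration under stated assumptions, not the repository's notebook code: the synthetic two-class data stands in for flattened image pixels, and the chunk size, hidden layer size, and epoch count are hypothetical choices.

```python
import numpy as np
import dask.array as da
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Assumption: synthetic features stand in for flattened image pixels.
n_samples, n_features, chunk_rows = 2000, 64, 500
y_np = rng.randint(0, 2, size=n_samples)
X_np = rng.normal(size=(n_samples, n_features)) + 2.0 * y_np[:, None]

# Wrap the data in chunked dask arrays (4 row blocks of 500 samples).
X = da.from_array(X_np, chunks=(chunk_rows, n_features))
y = da.from_array(y_np, chunks=chunk_rows)

# Pass 1: learn the mean and variance incrementally, one chunk at a time.
scaler = StandardScaler()
for start in range(0, n_samples, chunk_rows):
    scaler.partial_fit(X[start:start + chunk_rows].compute())

# Pass 2: train the MLP incrementally on scaled chunks.
clf = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
classes = np.array([0, 1])  # partial_fit needs all labels up front
for epoch in range(5):
    for start in range(0, n_samples, chunk_rows):
        rows = slice(start, start + chunk_rows)
        X_chunk = scaler.transform(X[rows].compute())
        y_chunk = y[rows].compute()
        clf.partial_fit(X_chunk, y_chunk, classes=classes)

accuracy = clf.score(scaler.transform(X_np), y_np)
print("training accuracy:", accuracy)
```

Only one chunk is materialized in memory at a time inside each loop, which is what makes the same pattern usable when `X` is backed by data on disk rather than an in-memory array.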
