GSoC 2016 Final Report: The Shogun Detox

Cactusinhand edited this page Jun 13, 2020 · 5 revisions

Name: Pan Deng
Organization: Shogun Machine Learning Toolbox
Mentors: Heiko Strathmann, lambday, Viktor

Synopsis

Shogun is a powerful machine learning toolkit built through the efforts of many developers. That history is also the source of its problems: some parts of the code are outdated or poorly optimized, and the code is not consistent across modules. These problems dampen the developer experience and can obstruct further development. My project therefore focused on cleaning up and refactoring Shogun's code, concentrating on two important modules: the linear algebra library, which is the computational core for the machine learning libraries, and the serialization framework, which provides fast and easy serialization of Shogun data.

Refactor linear algebra library

Shogun's internal linear algebra library (referred to as the linalg library below) serves as the computational core for Shogun's machine learning libraries. However, the old linalg library is not well organized, and many operations that belong in linalg are implemented in individual classes. This project works out a new linalg framework and migrates the linear algebra methods back into the linalg library. We also aim to make the new linalg library plugin-based, which will allow developers to add external linalg libraries easily.

Framework

The new linalg library supports CPU and GPU backends for linear algebra operations through the Eigen3 and ViennaCL libraries, and makes it easy to plug in other linear algebra libraries. However, users can register at most one CPU backend and one GPU backend at a time, because the linalg backend class SGLinalg is designed as a singleton.
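The one-CPU, one-GPU constraint can be illustrated with a minimal sketch. Backend and BackendRegistry are hypothetical names for illustration and are not SGLinalg's actual interface. The sketch also assumes that registering a second backend for the same device replaces the first one, which the report does not state.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a linalg backend; not Shogun's actual class.
struct Backend
{
    std::string name;
    bool on_gpu;
};

// Hypothetical registry: one process-wide instance with one slot per device type.
class BackendRegistry
{
public:
    // Function-local static: constructed on first use and shared afterwards.
    static BackendRegistry& instance()
    {
        static BackendRegistry registry;
        return registry;
    }

    BackendRegistry(const BackendRegistry&) = delete;
    BackendRegistry& operator=(const BackendRegistry&) = delete;

    // Assumption: a new backend replaces whatever occupied the same slot.
    void register_backend(std::unique_ptr<Backend> backend)
    {
        assert(backend != nullptr);
        std::unique_ptr<Backend>& slot = backend->on_gpu ? m_gpu : m_cpu;
        slot = std::move(backend);
    }

    // Return nullptr when no backend of that kind has been registered.
    const Backend* cpu_backend() const { return m_cpu.get(); }
    const Backend* gpu_backend() const { return m_gpu.get(); }

private:
    BackendRegistry() = default;

    std::unique_ptr<Backend> m_cpu;
    std::unique_ptr<Backend> m_gpu;
};
```

Because the constructor is private and copying is deleted, every caller goes through BackendRegistry::instance(), so there is never more than one CPU slot and one GPU slot in the process.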

The linalg library provides a unified interface for all the methods implemented in the linalg namespace. Users include the shogun/mathematics/linalg/LinalgNameSpace.h header file and call linalg::method(arg1, arg2, ..) to run an operation. The library infers which backend to use from the location (CPU/GPU) where the data is stored.
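Choosing the backend from the data location could look roughly like the sketch below. The sketch namespace, the Vector type and the last_backend() helper are hypothetical and exist only for illustration. The "GPU" path is computed on the host so the example stays runnable, and falling back to the CPU when the operands live in different places is an assumption.

```cpp
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

namespace sketch
{
// Hypothetical vector type that records where its data lives.
struct Vector
{
    std::vector<double> values;
    bool on_gpu;
};

// Records which backend served the last call; for illustration only.
inline std::string& last_backend()
{
    static std::string name;
    return name;
}

inline double dot_cpu(const Vector& a, const Vector& b)
{
    last_backend() = "cpu";
    return std::inner_product(a.values.begin(), a.values.end(), b.values.begin(), 0.0);
}

// Stand-in for a GPU kernel; it runs on the host in this sketch.
inline double dot_gpu(const Vector& a, const Vector& b)
{
    last_backend() = "gpu";
    return std::inner_product(a.values.begin(), a.values.end(), b.values.begin(), 0.0);
}

// Unified entry point: callers never name a backend themselves.
inline double dot(const Vector& a, const Vector& b)
{
    assert(a.values.size() == b.values.size());
    // Assumption: mixed locations fall back to the CPU implementation.
    return (a.on_gpu && b.on_gpu) ? dot_gpu(a, b) : dot_cpu(a, b);
}
} // namespace sketch
```

A caller only writes sketch::dot(a, b); moving both vectors to the GPU changes which implementation runs without changing the call site.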

The operations are implemented in each backend class. All backend classes share the same base class (LinalgBackendBase.h) and override its methods. GPU backends must also implement the to_gpu and from_gpu methods, as required by the base class LinalgBackendGPUBase.h.
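The class layout described above can be sketched as follows. All names here (BackendBase, GpuBackendBase, DeviceBuffer, CpuBackend, FakeGpuBackend) are hypothetical, and the default "not implemented" behaviour of the base methods is an assumption rather than a description of Shogun's headers.

```cpp
#include <cassert>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical handle for data that lives on a device.
struct DeviceBuffer
{
    std::vector<double> data;
};

// Hypothetical common base: each operation has a default that reports
// "not implemented", and backends override what they support.
class BackendBase
{
public:
    virtual ~BackendBase() = default;

    virtual double sum(const std::vector<double>&) const
    {
        throw std::logic_error("sum() is not implemented by this backend");
    }
};

// Hypothetical GPU base: the transfer methods are pure virtual, so a GPU
// backend cannot be instantiated without providing them.
class GpuBackendBase : public BackendBase
{
public:
    virtual DeviceBuffer to_gpu(const std::vector<double>& host) const = 0;
    virtual std::vector<double> from_gpu(const DeviceBuffer& device) const = 0;
};

class CpuBackend : public BackendBase
{
public:
    double sum(const std::vector<double>& v) const override
    {
        return std::accumulate(v.begin(), v.end(), 0.0);
    }
};

// Simulated GPU backend: the "device" memory is ordinary host memory here.
class FakeGpuBackend : public GpuBackendBase
{
public:
    DeviceBuffer to_gpu(const std::vector<double>& host) const override
    {
        return DeviceBuffer{host};
    }

    std::vector<double> from_gpu(const DeviceBuffer& device) const override
    {
        return device.data;
    }
};
```

In this layout a backend that has not overridden an operation fails at the call, while a GPU backend that omits to_gpu or from_gpu fails at compile time.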

The framework of the new linalg library was created in PR3317 and PR3348. Minor updates were made in the following PRs: #3346, #3351, #3363, #3367, #3369, #3383, #3392, #3404.

Linear algebra operations

Currently the linalg library supports the following linear algebra operations on SGVector and SGMatrix using the Eigen3 or ViennaCL libraries:

Pull requests             Description
3335, 3359, 3387, 3391    linalg::add(). In-place add is available.
3334                      linalg::mean()
3336                      linalg::max()
3340                      linalg::range_fill(). Only works for Eigen3 library.
3344, 3382, 3400, 3403    linalg::sum(), linalg::rowwise_sum(), linalg::colwise_sum(). The sum methods can operate with matrix blocks and have flag parameter no_diag.
3350                      linalg::set_const()
3358                      linalg::scale()
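To make the sum variants and the no_diag flag concrete, here is a minimal plain C++ sketch. The Matrix type and the functions in the sketch namespace are hypothetical stand-ins that mirror the operation names above. They do not reproduce the real signatures, and block support is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch
{
// Hypothetical column-major matrix; a stand-in for illustration, not SGMatrix.
struct Matrix
{
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data; // element (i, j) is stored at data[j * rows + i]

    double at(std::size_t i, std::size_t j) const
    {
        assert(i < rows && j < cols);
        return data[j * rows + i];
    }
};

// Sum of all entries; with no_diag set, entries with i == j are skipped.
inline double sum(const Matrix& m, bool no_diag = false)
{
    double total = 0.0;
    for (std::size_t j = 0; j < m.cols; ++j)
        for (std::size_t i = 0; i < m.rows; ++i)
            if (!(no_diag && i == j))
                total += m.at(i, j);
    return total;
}

// One sum per row, so the result has m.rows entries.
inline std::vector<double> rowwise_sum(const Matrix& m, bool no_diag = false)
{
    std::vector<double> result(m.rows, 0.0);
    for (std::size_t j = 0; j < m.cols; ++j)
        for (std::size_t i = 0; i < m.rows; ++i)
            if (!(no_diag && i == j))
                result[i] += m.at(i, j);
    return result;
}

// One sum per column, so the result has m.cols entries.
inline std::vector<double> colwise_sum(const Matrix& m, bool no_diag = false)
{
    std::vector<double> result(m.cols, 0.0);
    for (std::size_t j = 0; j < m.cols; ++j)
        for (std::size_t i = 0; i < m.rows; ++i)
            if (!(no_diag && i == j))
                result[j] += m.at(i, j);
    return result;
}
} // namespace sketch
```

For the 2x3 matrix with rows (1, 3, 5) and (2, 4, 6), the full sum is 21; with no_diag set, the entries 1 and 4 are skipped and the sum is 16.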

Future work

  • Migrate other linalg methods to the new library and remove the old methods.
  • Migrate linalg methods in SGVector, SGMatrix and other classes to the new linalg library and remove the old methods.
  • Enable the linalg operations with other CPU or GPU backends.

Refactor serialization framework

The old Shogun serialization framework is redundant and hard to read. We want to switch to a new framework that is light and fast, built on the Cereal serialization library. For this project, I first modified the CMake files to enable automatic download and installation of the Cereal library in Shogun. I also integrated the Cereal library into Shogun classes together with the new Tag-parameter framework (work of sanuj and lisitsyn).

To integrate the Cereal serialization library into Shogun, I added a Cereal check to CMakeLists and provided the download path of Cereal in the .cmake files: #3202 and #3397.

Most classes in Shogun are based on the SGObject class, which defines methods for registering the parameters of a class in a parameter list, as well as for serializing the data. To replace the serialization framework, I implemented serialization wrapper methods and serialization functions in SGObject.cpp and Any.h in PR3375; Any.h stores the parameter values registered by an SGObject class. Some basic data structures in Shogun are not SGObject-based, such as SGVector and SGMatrix. I therefore also implemented serialization methods in SGVector and the SGReferencedData class (Shogun's version of a smart pointer object that works with C++0x) in PR3375, and in SGMatrix in PR3412. The unit tests for SGObject, Any, SGVector and SGReferencedData can be found in PR3375.

With these implementations, one can serialize SGObject-based classes in Shogun to XML, JSON or binary files as follows (using JSON as an example):

// Save the object's registered parameters to a JSON file.
SGObject obj_save;
obj_save.save_json(filename);

// Load them back into a fresh object.
SGObject obj_load;
obj_load.load_json(filename);

A detailed introduction to the serialization framework can be found in the README file.
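The idea of registering parameters by name and then saving and loading exactly those parameters can be sketched without Cereal. The example below swaps in a plain "name value" text format on standard streams in place of Cereal's JSON, XML or binary archives, and ParamObject is a hypothetical class, not SGObject.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for an SGObject-like class; not Shogun's API.
class ParamObject
{
public:
    // Register a named parameter so that save() and load() know about it.
    void register_param(const std::string& name, double value)
    {
        m_params[name] = value;
    }

    // Throws std::out_of_range if the name was never registered.
    double get(const std::string& name) const { return m_params.at(name); }

    // Write one "name value" pair per line, using the stream's default precision.
    void save(std::ostream& out) const
    {
        for (const auto& entry : m_params)
            out << entry.first << ' ' << entry.second << '\n';
    }

    // Read pairs back; names that were not registered beforehand are ignored.
    void load(std::istream& in)
    {
        std::string name;
        double value = 0.0;
        while (in >> name >> value)
        {
            auto it = m_params.find(name);
            if (it != m_params.end())
                it->second = value;
        }
    }

private:
    std::map<std::string, double> m_params;
};
```

A round trip then consists of registering the same parameter names on both objects, calling save() on one with a std::stringstream (or a file stream), and calling load() on the other with the same stream.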

The serialization project still has unfinished work. One remaining task is to support serialization of all data types and data structures that can appear in the SGObject parameter list through Any.h, as shown in PR3418. The explicit listing strategy I currently use is too verbose to read.

Add cookbook for Shogun

To explain Shogun's functionality to users, I worked on the Shogun cookbook project, writing API examples that cover major Shogun machine learning algorithms in all target languages, using Shogun's meta language and a Sphinx-based API documentation system. The goal is to have a cookbook that covers all algorithms in Shogun, and the current one looks like this.

Submitted cookbook pages

I submitted the following cookbook pages with integration test datasets:

Cookbook page PR    Test dataset PR    Description

Clustering
3183                91, 94, 101        K-means clustering
3207                87                 Hierarchical clustering

Binary classifiers
-                   105                Linear SVM

Multi-class classifiers
3208                89, 93             Quadratic discriminant analysis
3242                97                 Multi-class linear machine
3244                95                 Multi-class logistic regression
3280, 3296          98                 ECOC random
3286                100                Relaxed tree classifier
3287, 3318          103                Shareboost classifier
3326                112                Multi-class LDA

Gaussian processes
3311                108                Gaussian process classifier

I also refactored the structure of the cookbook in PRs #3297 and #104.

Unfinished cookbook pages

Two cookbook pages are still in progress: the CHAID tree classifier (PR3303, dataset PR119) and the CARTree classifier (PR3282, dataset PR120). Both examples translate from the meta language to C++ and generate the correct results, but fail in Java and some other languages. I am still looking into the reason for the failure.

I will also continue to add cookbook pages for other algorithms, such as kernels and regression.

Other contributions

  • Removed HAVE_EIGEN3 macros Shogun-wide (PR3092).
  • Fixed shogun/mathematics/ warnings (PR3185).
  • Added an assertion in the CCHAIDTree class (PR3395).
  • Added a new CQDA class constructor (PR3233).

Appendix: timeline

Week1: May 23rd – May 29th

  • Download and installation of Cereal serialization library to Shogun.
  • The prototype of the new linalg library with a CPU dot method on vectors.
  • Cookbook: hierarchical clustering and quadratic discriminant analysis.

Week2: May 30th – Jun 5th

  • SGVector dot operation with CPU Eigen3 library and GPU ViennaCL library.
  • Added singleton for Linalg class in init.h and init.cpp.
  • Cookbook: multiclass logistic regression and multiclass linear machine.

Week3: Jun 6th – Jun 12th

  • SGVector sum operation with CPU Eigen3 library and GPU ViennaCL library.
  • Benchmark of new linalg methods.
  • Cookbook: ECOC and CARTree.

Week4: Jun 13th – Jun 19th

  • Integrated CPU and GPU vector data structure in SGVector class and GPU data storage modules.
  • Cookbook: shareboost, relaxed tree and K-means.

Week5: Jun 20th – Jun 26th

  • Finished the implementation modules for new linalg methods, with the vector dot method using the Eigen3 and ViennaCL libraries.
  • Cookbook: Gaussian process classifier and CHAIDTree.
  • Cookbook: split classifiers into binary and multi-class.

Week6: Jun 27th – Jul 3rd

  • Finished the to_gpu and from_gpu methods with the ViennaCL library.
  • Doxygen documentation for the new linalg library.
  • Refactored the linalg vector add, mean, sum, max, range_fill and scale methods.
  • Cookbook: Multi-class linear discriminant analysis classifier.

Week7: Jul 4th – Jul 10th

  • Unit-tests of new linalg library.
  • Integrated SGMatrix to the new linalg library.

Week8: Jul 11th – Jul 17th

  • Checked the current parameter serialization framework and the newly added tags framework.
  • Serialization of class SGVector with Cereal library.
  • Serialization of class Any with Cereal library.

Week9: Jul 18th – Jul 24th

  • Finished the add, sum, rowwise sum and colwise sum methods for SGVector and SGMatrix in the new linalg library.
  • Serialization of class SGObject with Cereal library.

Week10: Jul 25th – Jul 31st

  • Had SGObject-Any-SGVector-SGReferencedData serialization working locally.
  • Finished the in-place add and mean methods for SGVector and SGMatrix in the new linalg library.

Week11: Aug 1st – Aug 7th

  • Had SGObject-Any-SGVector-SGReferencedData serialization working on Travis.
  • Added unit-tests for serialization.
  • Merged the block sum, scale, set_const and range_fill methods for SGVector and SGMatrix into the new linalg library.

Week12: Aug 8th – Aug 14th

  • Added SGMatrix serialization methods.
  • READMEs for linalg library and serialization framework.
  • Cookbook: revisited CHAIDTree and CARTree.

Week13: Aug 15th – Aug 23rd

  • Peer review: code and README of the Tag-parameter framework and plugin module by Sanuj.
  • GSoC16 summary.
