GSoC 2016 Final Report: The Shogun Detox
Name: Pan Deng
Organization: Shogun Machine Learning Toolbox
Mentors: Heiko Strathmann, lambday, Viktor
Shogun is a powerful machine learning toolkit built through the efforts of many developers. That history also causes problems: some parts of the code are outdated or poorly optimized, and the code is not unified across modules. These problems dampen the developer experience and can obstruct further implementation work. My project therefore focused on cleaning up and refactoring Shogun's code. I concentrated on two important modules: the linear algebra library, which is the computational core of the machine learning libraries, and the serialization framework, which provides fast and easy serialization of Shogun data.
- Refactor linear algebra library
- Refactor serialization framework
- Add cookbook for Shogun
- Other contributions
- Appendix: timeline
Shogun's internal linear algebra library (referred to as the `linalg` library below) serves as the computational core for Shogun's machine learning libraries. However, the old `linalg` is not well organized, and many operations that belong in `linalg` are implemented in individual classes. This project works out a new `linalg` framework and migrates the linear algebra methods back into the `linalg` library. We also aim to make the new `linalg` library plugin-based, so that developers can add external `linalg` libraries easily.
The new `linalg` library supports CPU and GPU backends for linear algebra operations through the `Eigen3` and `ViennaCL` libraries, and makes it easy to plug in other linear algebra libraries. However, users can register only one CPU backend and/or one GPU backend at a time, because the `linalg` library backend class `SGLinalg` is designed as a singleton.
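The one-backend-per-device rule can be pictured with a minimal sketch. Everything here is hypothetical (`ToyBackend`, `ToyLinalgRegistry` and its methods are illustration names, not the `SGLinalg` interface); it only shows how a singleton with one CPU slot and one GPU slot makes a second registration replace the first.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a backend; it only carries a name.
struct ToyBackend
{
    std::string name;
};

// Toy registry, not Shogun's SGLinalg: one process-wide instance holding
// exactly one CPU backend and at most one GPU backend.
class ToyLinalgRegistry
{
public:
    // Constructed on first use and shared by every caller afterwards.
    static ToyLinalgRegistry& instance()
    {
        static ToyLinalgRegistry registry;
        return registry;
    }

    ToyLinalgRegistry(const ToyLinalgRegistry&) = delete;
    ToyLinalgRegistry& operator=(const ToyLinalgRegistry&) = delete;

    // Registering again replaces whatever was in the CPU slot.
    void set_cpu_backend(std::unique_ptr<ToyBackend> backend)
    {
        assert(backend != nullptr); // this sketch always keeps a CPU backend
        cpu_ = std::move(backend);
    }

    // Registering again replaces the GPU slot; nullptr clears it.
    void set_gpu_backend(std::unique_ptr<ToyBackend> backend)
    {
        gpu_ = std::move(backend);
    }

    const ToyBackend& cpu_backend() const { return *cpu_; }
    bool has_gpu_backend() const { return gpu_ != nullptr; }
    const ToyBackend* gpu_backend() const { return gpu_.get(); }

private:
    ToyLinalgRegistry() : cpu_(new ToyBackend{"default-cpu"}) {}

    std::unique_ptr<ToyBackend> cpu_;
    std::unique_ptr<ToyBackend> gpu_;
};
```

Because every caller goes through `ToyLinalgRegistry::instance()`, a backend registered in one place is the one seen everywhere else, which is the trade-off the singleton design accepts.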
The `linalg` library provides a unified interface for all the `linalg` methods implemented in the `linalg` namespace. To run a `linalg` operation, include the `shogun/mathematics/linalg/LinalgNameSpace.h` header file in the class and call `linalg::method(arg1, arg2, ..)`. The `linalg` library infers which backend to use from the location (CPU or GPU) where the data is stored.
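As a rough illustration of backend inference, the sketch below dispatches a dot product on a flag stored with the data. All names (`ToyVector`, `toy_linalg`, `toy_cpu`, `toy_gpu`, `last_backend`) are assumptions made for this example, and the "GPU" path is a host loop standing in for a device kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical vector type: the flag records where the data currently lives.
struct ToyVector
{
    std::vector<double> data;
    bool on_gpu;
};

// Records which code path ran last, so the dispatch can be observed.
static std::string last_backend;

// Plain host loop shared by both stand-in backends.
static double dot_on_host(const ToyVector& a, const ToyVector& b)
{
    double total = 0.0;
    for (std::size_t i = 0; i < a.data.size(); ++i)
        total += a.data[i] * b.data[i];
    return total;
}

namespace toy_cpu
{
    double dot(const ToyVector& a, const ToyVector& b)
    {
        last_backend = "cpu";
        return dot_on_host(a, b);
    }
}

namespace toy_gpu
{
    // Stand-in only: a real GPU backend would run this on the device.
    double dot(const ToyVector& a, const ToyVector& b)
    {
        last_backend = "gpu";
        return dot_on_host(a, b);
    }
}

namespace toy_linalg
{
    // Single entry point: the caller never names a backend.
    double dot(const ToyVector& a, const ToyVector& b)
    {
        assert(a.data.size() == b.data.size()); // operands must match in length
        assert(a.on_gpu == b.on_gpu);           // and live in the same place
        return a.on_gpu ? toy_gpu::dot(a, b) : toy_cpu::dot(a, b);
    }
}
```

Calling code stays identical whether the operands are on the host or the device; only the flag on the data changes which implementation runs.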
The operations are implemented in the individual backend classes, which share the base class declared in `LinalgBackendBase.h` and override its methods. GPU backends must also implement the `to_gpu` and `from_gpu` methods, as required by the base class declared in `LinalgBackendGPUBase.h`.
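A minimal sketch of that class layout follows, assuming a GPU base class that derives from the common base and declares the two transfer methods as pure virtual. The class names, the `DeviceBuffer` handle and the method signatures are all hypothetical; the actual declarations are in the headers named above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical handle for data owned by a GPU backend.
struct DeviceBuffer
{
    std::vector<double> storage; // stands in for device memory
};

// Toy common base: every backend overrides the operations it supports.
class ToyBackendBase
{
public:
    virtual ~ToyBackendBase() {}
    virtual double sum(const std::vector<double>& v) const = 0;
};

// Toy GPU base: the transfer methods are pure virtual, so a GPU backend
// that omits them cannot be instantiated.
class ToyGpuBackendBase : public ToyBackendBase
{
public:
    virtual std::unique_ptr<DeviceBuffer> to_gpu(const std::vector<double>& host) const = 0;
    virtual std::vector<double> from_gpu(const DeviceBuffer& device) const = 0;
};

// Adds up the elements on the host; used by both toy backends below.
static double host_sum(const std::vector<double>& v)
{
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}

class ToyCpuBackend : public ToyBackendBase
{
public:
    double sum(const std::vector<double>& v) const override { return host_sum(v); }
};

class FakeGpuBackend : public ToyGpuBackendBase
{
public:
    double sum(const std::vector<double>& v) const override { return host_sum(v); }

    std::unique_ptr<DeviceBuffer> to_gpu(const std::vector<double>& host) const override
    {
        std::unique_ptr<DeviceBuffer> buffer(new DeviceBuffer);
        buffer->storage = host; // a real backend would copy to the device here
        return buffer;
    }

    std::vector<double> from_gpu(const DeviceBuffer& device) const override
    {
        return device.storage; // a real backend would copy back to the host
    }
};

// Usage: send host data to the backend and fetch it back again.
std::vector<double> round_trip(const ToyGpuBackendBase& backend, const std::vector<double>& host)
{
    std::unique_ptr<DeviceBuffer> device = backend.to_gpu(host);
    assert(device != nullptr);
    return backend.from_gpu(*device);
}
```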
The framework of the new `linalg` library was created in PR3317 and PR3348. Minor updates were made in the following PRs: #3346, #3351, #3363, #3367, #3369, #3383, #3392, #3404.
Currently the `linalg` library supports the following linear algebra operations on `SGVector` and `SGMatrix` using the `Eigen3` or `ViennaCL` libraries:
Pull requests | Descriptions |
---|---|
3335, 3359, 3387, 3391 | `linalg::add()`. In-place add is available. |
3334 | `linalg::mean()` |
3336 | `linalg::max()` |
3340 | `linalg::range_fill()`. Only works for the `Eigen3` library. |
3344, 3382, 3400, 3403 | `linalg::sum()`, `linalg::rowwise_sum()`, `linalg::colwise_sum()`. The sum methods can operate on matrix blocks and have a flag parameter `no_diag`. |
3350 | `linalg::set_const()` |
3358 | `linalg::scale()` |
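To make the sum variants concrete, here is a small standalone sketch. It assumes that `no_diag` means "skip the entries on the main diagonal" and uses a hypothetical column-major `ToyMatrix`; the functions share names with the operations in the table but are plain loops, not the Shogun implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix stored column-major.
struct ToyMatrix
{
    std::size_t rows;
    std::size_t cols;
    std::vector<double> values; // element (i, j) is values[j * rows + i]

    double at(std::size_t i, std::size_t j) const
    {
        assert(i < rows && j < cols);
        return values[j * rows + i];
    }
};

namespace toy
{
    // Sum of all entries; with no_diag set, diagonal entries are skipped
    // (an assumed reading of the flag).
    double sum(const ToyMatrix& m, bool no_diag = false)
    {
        double total = 0.0;
        for (std::size_t j = 0; j < m.cols; ++j)
            for (std::size_t i = 0; i < m.rows; ++i)
                if (!(no_diag && i == j))
                    total += m.at(i, j);
        return total;
    }

    // One total per row.
    std::vector<double> rowwise_sum(const ToyMatrix& m, bool no_diag = false)
    {
        std::vector<double> totals(m.rows, 0.0);
        for (std::size_t j = 0; j < m.cols; ++j)
            for (std::size_t i = 0; i < m.rows; ++i)
                if (!(no_diag && i == j))
                    totals[i] += m.at(i, j);
        return totals;
    }

    // One total per column.
    std::vector<double> colwise_sum(const ToyMatrix& m, bool no_diag = false)
    {
        std::vector<double> totals(m.cols, 0.0);
        for (std::size_t j = 0; j < m.cols; ++j)
            for (std::size_t i = 0; i < m.rows; ++i)
                if (!(no_diag && i == j))
                    totals[j] += m.at(i, j);
        return totals;
    }
}
```

For the 2x2 matrix with columns (1, 2) and (3, 4), the full sum is 10 and the sum without the diagonal is 5.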
- Migrate other `linalg` methods to the new library and remove the old methods.
- Migrate `linalg` methods in `SGVector`, `SGMatrix` and other classes to the new `linalg` library and remove the old methods.
- Enable the `linalg` operations with other CPU or GPU backends.
The old Shogun serialization framework is redundant and hard to read. We want to switch to a new serialization framework that is light and fast, based on the `Cereal` serialization library. For this project, I first modified the `CMake` files to enable automatic download and installation of the `Cereal` library in Shogun. I also integrated the `Cereal` library into Shogun classes that use the new `Tag`-parameter framework (the work of sanuj and lisitsyn).
To integrate the `Cereal` serialization library into Shogun, I added a `Cereal` check to `CMakeLists` and provided the download path of `Cereal` in the `.cmake` files: #3202 and #3397.
Most classes in Shogun derive from the `SGObject` class, which defines methods for registering the parameters of a class in its parameter list, as well as for serializing the data. To replace the serialization framework, I implemented serialization wrapper methods and serialization functions in `SGObject.cpp` and `Any.h` in PR3375; the latter stores the parameter values registered by an `SGObject` class. Some basic data structures in Shogun, such as `SGVector` and `SGMatrix`, are not `SGObject`-based. I therefore also implemented serialization methods in `SGVector` and the `SGReferencedData` class (the Shogun version of a smart pointer object that works with C++0x) in PR3375, and in `SGMatrix` in PR3412. The unit tests for `SGObject`-`Any`-`SGVector`-`SGReferencedData` can be found in PR3375.
With these implementations, one can serialize `SGObject`-based classes in Shogun to XML, JSON or binary files as follows (using JSON as the example):

    SGObject obj_save;
    obj_save.save_json(filename);
    SGObject obj_load;
    obj_load.load_json(filename);

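The pattern behind such save and load wrappers can be sketched without any external dependency. In Cereal, a type typically exposes a `serialize(Archive&)` member that lists its fields, and the same member serves both saving and loading. The example below imitates that idea with two toy text archives; `ToyOutputArchive`, `ToyInputArchive`, `Params`, `save_text` and `load_text` are hypothetical names, and this is neither Cereal itself nor the code from PR3375.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Toy output archive: writes each field on its own line, in call order.
class ToyOutputArchive
{
public:
    explicit ToyOutputArchive(std::ostream& out) : out_(out) {}

    template <class T, class... Rest>
    void operator()(const T& first, const Rest&... rest)
    {
        out_ << first << '\n';
        (*this)(rest...);
    }
    void operator()() {} // end of the field list

private:
    std::ostream& out_;
};

// Toy input archive: reads the fields back in the same order.
class ToyInputArchive
{
public:
    explicit ToyInputArchive(std::istream& in) : in_(in) {}

    template <class T, class... Rest>
    void operator()(T& first, Rest&... rest)
    {
        in_ >> first;
        (*this)(rest...);
    }
    void operator()() {} // end of the field list

private:
    std::istream& in_;
};

// Hypothetical parameter set; one serialize() serves both directions.
struct Params
{
    int num_clusters;
    double tolerance;

    template <class Archive>
    void serialize(Archive& archive)
    {
        archive(num_clusters, tolerance);
    }
};

// Wrapper in the spirit of a save method: object to text.
std::string save_text(Params params)
{
    std::ostringstream out;
    ToyOutputArchive archive(out);
    params.serialize(archive);
    return out.str();
}

// Wrapper in the spirit of a load method: text to object.
Params load_text(const std::string& text)
{
    std::istringstream in(text);
    ToyInputArchive archive(in);
    Params params{};
    params.serialize(archive);
    assert(!in.fail()); // every field must have been read successfully
    return params;
}
```

Listing the fields once keeps the save and load paths in sync, which is one reason this style is lighter than maintaining separate read and write routines.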
A detailed introduction to the serialization framework can be found in the README file.
Some work on the serialization project remains unfinished. One task is to support serialization of all data types and data structures in the `SGObject` parameter list, `Any.h`, as shown in PR3418. The explicit listing strategy I am currently using is too verbose to read.
To explain Shogun's functionality to users, I worked on the Shogun cookbook project, writing API examples that cover the major Shogun machine learning algorithms in all target languages, using Shogun's meta-language and a Sphinx-based API documentation system. The goal is a cookbook that covers all algorithms in Shogun, and the current one looks like this.
I submitted the following cookbook pages with integration test datasets:
Cookbook page PR | Test dataset PR | Descriptions |
---|---|---|
Clustering | ||
3183 | 91, 94, 101 | K-means clustering |
3207 | 87 | Hierarchical clustering |
Binary classifiers | ||
- | 105 | Linear SVM |
Multi-class classifiers | ||
3208 | 89, 93 | Quadratic discriminant analysis |
3242 | 97 | Multi-class linear machine |
3244 | 95 | Multi-class logistic regression |
3280, 3296 | 98 | ECOC random |
3286 | 100 | Relaxed tree classifier |
3287, 3318 | 103 | Shareboost classifier |
3326 | 112 | Multi-class LDA |
Gaussian processes | ||
3311 | 108 | Gaussian process classifier |
I also refactored the structure of the cookbook in PRs #3297 and #104.
Finish the two cookbook pages still in progress: CHAID tree classifier (PR3303, dataset PR119) and CARTree classifier (PR3282, dataset PR120). Both examples translate from the meta-language to C++ and generate the correct results, but they fail in Java and some other languages. I am still looking into the reason for the failure.
I will also continue to add cookbook pages for other algorithms, such as kernels and regression.
- Removed `HAVE_EIGEN3` macros Shogun-wide (PR3092).
- Fixed `shogun/mathematics/` warnings (PR3185).
- Added assertion in `CCHAIDTree` class (PR3395).
- Added new `CQDA` class constructor (PR3233).
Week1: May 23rd – May 29th
- Download and installation of `Cereal` serialization library to Shogun.
- The prototype of new `linalg` library with CPU dot method on vectors.
- Cookbook: hierarchical clustering and quadratic discriminant analysis.
Week2: May 30th – Jun 5th
- `SGVector` dot operation with CPU `Eigen3` library and GPU `ViennaCL` library.
- Added singleton for `Linalg` class in `init.h` and `init.cpp`.
- Cookbook: multiclass logistic regression and multiclass linear machine.
Week3: Jun 6th – Jun 12th
- `SGVector` sum operation with CPU `Eigen3` library and GPU `ViennaCL` library.
- Benchmark of new `linalg` methods.
- Cookbook: ECOC and CARTree.
Week4: Jun 13th – Jun 19th
- Integrated CPU and GPU vector data structure in `SGVector` class and GPU data storage modules.
- Cookbook: shareboost, relaxed tree and k-means.
Week5: Jun 20th – Jun 26th
- Finished new `linalg` method implementation modules with vector dot method with `Eigen3` and `ViennaCL` libraries.
- Cookbook: Gaussian process classifier, CHAIDTree.
- Cookbook: split classifiers into binary and multi-class.
Week6: Jun 27th – Jul 3rd
- Finished `to_gpu` and `from_gpu` methods with `ViennaCL` library.
- Doxygen documentation of new `linalg` library.
- Refactored linalg vector `add`, `mean`, `sum`, `max`, `range_fill`, `scale` methods.
- Cookbook: multi-class linear discriminant analysis classifier.
Week7: Jul 4th – Jul 10th
- Unit tests of new `linalg` library.
- Integrated `SGMatrix` into the new `linalg` library.
Week8: Jul 11th – Jul 17th
- Checked the current parameter serialization framework and the newly added tags framework.
- Serialization of class `SGVector` with `Cereal` library.
- Serialization of class `Any` with `Cereal` library.
Week9: Jul 18th – Jul 24th
- Finished `add` and `colwise sum`/`rowwise sum`/`sum` methods working with `SGVector` and `SGMatrix` in the new `linalg` library.
- Serialization of class `SGObject` with `Cereal` library.
Week10: Jul 25th – Jul 31st
- Had SGObject-Any-SGVector-SGReference data serialization working locally.
- Finished `inplace add` and `mean` methods working with `SGVector` and `SGMatrix` in the new `linalg` library.
Week11: Aug 1st – Aug 7th
- Had SGObject-Any-SGVector-SGReference data serialization working on Travis.
- Added unit tests for serialization.
- Merged `block sum`, `scale`, `set_const` and `range_fill` methods working with `SGVector` and `SGMatrix` in the new `linalg` library.
Week12: Aug 8th – Aug 14th
- Added `SGMatrix` serialization methods.
- READMEs for `linalg` library and serialization framework.
- Cookbook: `CHAIDTree` and `CARTree` revisit.
Week13: Aug 15th – Aug 23rd
- Peer review: code and README of `Tag`-parameter framework and plugin module by Sanuj.
- GSoC16 summary.