
Longitudinal preprocessor parallelization #246

Closed

Conversation

MaryanMorel (Member)
Make the longitudinal preprocessors serializable and parallelize them using Python's multiprocessing module.

To avoid sending the whole instance of each class to spawned or forked processes, I used an initializer in the process Pool to create one C++ object instance per process. I also used higher-order functions for the same purpose.
The drawback of doing so is that the pool initializers (the _inject_cpp_object methods) use global variables to store the C++ objects. This should not have dire consequences in the multiprocessing context, as each process has its own namespace. However, such methods could cause trouble if called outside of this context (rogue/monkey user). A minimal sketch of the pattern is shown below.
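For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the per-process initializer described above. The initializer and global-variable idea mirror the PR description, but `_FakeCppLagger`, `_lag_one`, and the driver code are hypothetical placeholders rather than tick's actual implementation.

```python
from multiprocessing import Pool

# Per-process slot for the C++ object: filled by the Pool initializer in each
# worker, so the (non-picklable) object never has to be pickled and shipped.
_cpp_preprocessor = None


class _FakeCppLagger:
    """Stand-in for the real C++ extension object (hypothetical here)."""

    def __init__(self, n_intervals, n_lags):
        self.n_intervals = n_intervals
        self.n_lags = n_lags

    def dense_lag_preprocessor(self, features, censoring):
        return features  # placeholder for the real lagging work


def _inject_cpp_object(n_intervals, n_lags):
    """Pool initializer: build one C++ object in each worker process."""
    global _cpp_preprocessor
    _cpp_preprocessor = _FakeCppLagger(n_intervals, n_lags)


def _lag_one(args):
    """Worker function: uses the per-process object set by the initializer."""
    features, censoring = args
    return _cpp_preprocessor.dense_lag_preprocessor(features, censoring)


if __name__ == "__main__":
    tasks = [([1, 2, 3], 3), ([4, 5], 2)]
    with Pool(processes=2, initializer=_inject_cpp_object,
              initargs=(10, 2)) as pool:
        print(pool.map(_lag_one, tasks))
```

Since every worker builds its own instance, only the lightweight constructor arguments cross the process boundary; no state is shared and no locking is needed.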

MaryanMorel force-pushed the longitudinal-preprocessor-parallelization branch 2 times, most recently from 2a5945a to ab57615, on May 29, 2018, 13:48
MaryanMorel self-assigned this on May 29, 2018
MaryanMorel force-pushed the longitudinal-preprocessor-parallelization branch 2 times, most recently from 94e0ff2 to 3d2275a, on May 29, 2018, 14:41
MaryanMorel force-pushed the longitudinal-preprocessor-parallelization branch from 3d2275a to 04f35d6 on May 31, 2018, 12:42
@@ -355,7 +355,8 @@ def fit_kfold_cv(self, features, labels, censoring, C_tv_range: tuple = (),
features, labels, censoring)
# split the data with stratified KFold
kf = StratifiedKFold(n_folds, shuffle, self.random_state)
labels_interval = np.nonzero(p_labels)[1]
# labels_interval = np.nonzero(p_labels)[1]
Member: Is there a reason to keep this?

Mbompr (Contributor) left a comment

Just a few comments about the code writing (I trust you on its validity :) )

Also, did you try to make your old tests faster?

@@ -68,14 +73,15 @@ void LongitudinalFeaturesLagger::sparse_lag_preprocessor(ArrayULong &row,
ArrayULong &out_col,
ArrayDouble &out_data,
ulong censoring) const {
// TODO: add checks here ? Or do them in Python ?
Contributor: Checks must be done in C++ to avoid segfault possibilities!

@@ -14,9 +15,14 @@ class LongitudinalPreprocessor(ABC, Base):
set to the number of cores.
"""

def __init__(self, n_jobs=-1):
_attrinfos = {'n_jobs': {'writable': True}}
Contributor: This _attrinfos is not necessary.

caution.
"""
global _cpp_preprocessor
_cpp_preprocessor = _LongitudinalFeaturesLagger(n_intervals, n_lags)
Contributor: Here is a clever way (using __call__ in a class) to avoid the global variable, did you give it a try?

https://stackoverflow.com/a/37232918/2015762
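For reference, one possible reading of that suggestion is sketched below: a picklable callable class carries only plain parameters and builds the (non-picklable) C++ object lazily in the process where the call actually runs, so no module-level global is needed. `_FakeCppLagger` and `LagWorker` are hypothetical illustrations, not code from this PR or the linked answer.

```python
from multiprocessing import Pool


class _FakeCppLagger:
    """Stand-in for the non-picklable C++ extension object."""

    def __init__(self, n_intervals, n_lags):
        self.n_intervals = n_intervals
        self.n_lags = n_lags

    def dense_lag_preprocessor(self, features, censoring):
        return features  # placeholder for the real lagging work


class LagWorker:
    """Callable worker: picklable because it stores only plain parameters."""

    def __init__(self, n_intervals, n_lags):
        self.n_intervals = n_intervals
        self.n_lags = n_lags
        self._cpp = None  # still unset when the instance is pickled

    def __call__(self, args):
        if self._cpp is None:  # built lazily in the process running the call
            self._cpp = _FakeCppLagger(self.n_intervals, self.n_lags)
        features, censoring = args
        return self._cpp.dense_lag_preprocessor(features, censoring)


if __name__ == "__main__":
    worker = LagWorker(n_intervals=10, n_lags=2)
    tasks = [([1, 2, 3], 3), ([4, 5], 2)]
    with Pool(processes=2) as pool:
        print(pool.map(worker, tasks))
```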

PhilipDeegan added the future label on Jan 7, 2019

PhilipDeegan (Member): Has this been obsoleted by #373?

MaryanMorel (Member, Author) commented on Jul 12, 2019 via email
