Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace sheduler works as stack, not a queue. #70

Open
SkorikGG opened this issue Jan 5, 2020 · 5 comments
Open

Trace sheduler works as stack, not a queue. #70

SkorikGG opened this issue Jan 5, 2020 · 5 comments

Comments

@SkorikGG
Copy link

SkorikGG commented Jan 5, 2020

For example:

...
runPar $ do
    ...
    fork f1 >>  fork f2  >> fork f3 >> fork f4 >> fork f5 >>  fork f6  >> fork f7 >> fork f8
    fork f9 >>  fork f10  >> fork f11 >> fork f12 >> fork f13 >>  fork f14  >> fork f15 >> fork f16
    ...

On a quad-core processor, execution of forks will occur in the order of f1,f2,f3,f4,f16,f15,14,..., f5.
Thus, the time to execute f5 will reach when the remaining threads are processed.
This behavior can degrade performance and reduce parallelism due to the long wait for f5 results.

@SkorikGG
Copy link
Author

SkorikGG commented Jan 6, 2020

Pull request #71 added.

@simonmar
Copy link
Owner

simonmar commented Jan 6, 2020

Could you explain why this "degrades performance"? In your example it doesn't matter which order we execute things in, so long as we keep all the cores busy.

@SkorikGG
Copy link
Author

SkorikGG commented Jan 6, 2020

A few percent performance drop can be seen in this example:

data IVList a = Nil | Cons a (IVar (IVList a))
type Stream a = IVar (IVList a)
parFoldMapBuffer :: Int ->  (c -> b -> Par c) -> (a -> Par b) -> c -> [a] -> Par c
parFoldMapBuffer n f g z0 xs0 = do
    sr <- new
    mswxs<- start n sr xs0
    folder sr z0 mswxs
  where
    start _ sw [] = put_ sw Nil >> return Nothing
    start 0 sw xs = return $ Just (sw,xs)
    start !m sw (x:xs) = do
        sw'<-new
        ivy<- new
        put_ sw (Cons ivy sw')
        fork $ g x >>= put_ ivy
        start (m-1) sw' xs
    folder sr !z mswxs = do
        se <-get sr
        case se of
            Nil -> return z
            Cons ivy sr' -> do
                y<-get ivy
                z'<- f z y
                mswxs'<- case mswxs of
                    Nothing -> return Nothing
                    Just (sw,[]) -> put_ sw Nil >> return Nothing
                    Just (sw,x:xs) -> do
                        sw'<-new
                        ivy'<- new
                        put_ sw (Cons ivy' sw')
                        fork $ g x >>= put_ ivy'
                        return $ Just (sw',xs)
                folder sr' z' mswxs'

fn :: Int -> Double -> Double
fn 0 !x = x
fn !n !x = fn (n-1) $ sin $ 3* x

g :: Double -> Double
g x = fn 10000 x

test :: Int -> Int ->  Double
test m n = runPar $ parFoldMapBuffer 128 (\x y -> return $ fn 2500 (x+y)) (return . g . fromIntegral)  0.0 [m..n]

main = do
    t1<- getTime Realtime
    print $ test 0 50000
    t2<- getTime Realtime
    print $ fromIntegral (sec t2 - sec t1) + 1e-9 * fromIntegral  (nsec t2 - nsec t1)

Some uneven activity of the processor cores is noticeable in eventlog.
Eventlog.zip

@SkorikGG
Copy link
Author

SkorikGG commented Jan 7, 2020

In addition, the new commit in the pull request #71 making scheduler behavior more intuitively expected. With this change, the following function works properly (with full parallelism):

data IVList a = Nil | Cons a (IVar (IVList a))
type Stream a = IVar (IVList a)
parFoldMap :: (c -> b -> Par c) -> (a -> Par b) -> c -> [a] -> Par c
parFoldMap f g z0 xs0 = do
    sr <- new
    fork $ producer sr xs0
    consumer sr z0
  where
    producer sw [] = put_ sw Nil
    producer sw (x:xs) = do
        sw'<-new
        ivy<- new
        put_ sw (Cons ivy sw')
        fork $ g x >>= put_ ivy
        producer sw' xs
    consumer sr !z = do
        se <-get sr
        case se of
            Nil -> return z
            Cons ivy sr' -> do
                y<-get ivy
                z'<- f z y
                consumer sr' z'

For comparison, the work of spark pool:

parFoldMap2 :: (c -> b -> c) -> (a -> b) -> c -> [a] -> c
parFoldMap2 f g z xs = ys `pseq` foldl' f z ys where
    ys = start ys' ys'
    ys'= map g xs
    start [] ys = ys
    start (x:xs) ys = x `par` start xs ys

@simonmar
Copy link
Owner

It's been a while since I looked at monad-par in detail so apologies if I'm being a bit slow here. Could you explain why we get more parallelism with your change?

My intuition is that there isn't any clearly "better" scheduling strategy without knowing something about the workload, which the scheduler can't have any prior knowledge of. So demonstrating an improvement on one example isn't sufficient evidence to change the scheduling strategy - there will be other examples that get slower. You have to make the case either that

  • the examples that benefit from this strategy are more common, or
  • somehow you've made a consistent improvement due to bugs or inefficiencies in the original implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants