Non-parallel access on multi-thread #23

Open
jeremiedb opened this issue Aug 11, 2021 · 0 comments

I'm looking for an option to use a DataLoader with guaranteed non-shuffled ordering in a multi-threaded Julia session.
I noticed that the parallel keyword argument has been deprecated, though I'm not sure whether that option was ever enforced in a multi-threaded session.

The issue is that when running inference on a large dataset, it is desirable to guarantee the iteration order so that the predictions come out in the same order as the original data. However, this apparently cannot be achieved when Julia runs with more than one thread (num_threads > 1).

MWE:

using DataLoaders
import LearnBase: nobs, getobs

struct MyContainer{S <: AbstractArray}
    x::S
    length::Int
end

nobs(data::MyContainer) = ceil(Int, size(data.x, 1) / data.length)

function getobs(data::MyContainer, idx::Int)
    println("get obs MyContainer - idx: ", idx)
    x = if idx < nobs(data)
        data.x[((idx - 1) * data.length + 1):(idx * data.length), :]
    else
        data.x[((idx - 1) * data.length + 1):end, :]
    end
    return x
end

x = rand(10,2)
data = MyContainer(x, 4)
dloader = DataLoaders.DataLoader(data, nothing)

Then, randomness can be observed in the batch order:

julia> for x in dloader
           println("size(x): ", size(x))
       end
get obs MyContainer - idx: 3
get obs MyContainer - idx: 2
size(x): (2, 2)
get obs MyContainer - idx: 1
size(x): (4, 2)
size(x): (4, 2)

julia> for x in dloader
           println("size(x): ", size(x))
       end
get obs MyContainer - idx: 1
get obs MyContainer - idx: 3
get obs MyContainer - idx: 2
size(x): (4, 2)
size(x): (2, 2)
size(x): (4, 2)

Is it possible to enforce that the returned idx is always 1, 2, 3? Having the option to disable the multi-threaded fetch would do it. Alternatively, would it be feasible to keep the multi-threaded loading in place but hold back each result until the one for the previous idx has been returned?
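
To illustrate the second idea, here is a minimal sketch using only Base Julia. It is not DataLoaders.jl code: `ordered_map`, its `lookahead` keyword and the `rows` function are hypothetical names made up for this example. Items are loaded on worker threads, but the task handles are consumed in index order, so the results come back ordered regardless of which task finishes first.

```julia
# Hypothetical helper, not part of DataLoaders.jl: load items on worker
# threads but hand them back strictly in index order.
function ordered_map(f, n::Integer; lookahead::Integer = 4)
    # The producer spawns one task per index and queues the task handles
    # in index order. The bounded channel limits how far loading can run
    # ahead of the consumer.
    tasks = Channel{Task}(lookahead) do ch
        for idx in 1:n
            put!(ch, Threads.@spawn f(idx))
        end
    end
    # Fetching the handles in queue order blocks until each item is ready,
    # so results come out as 1, 2, ..., n even if later items finish first.
    return (fetch(t) for t in tasks)
end

# Stand-in for getobs on the container from the MWE (batches of 4 rows).
x = rand(10, 2)
rows(idx) = x[((idx - 1) * 4 + 1):min(idx * 4, size(x, 1)), :]

batches = collect(ordered_map(rows, 3))
println(size.(batches))
```

With the container from the MWE, the equivalent call would be something like `ordered_map(i -> getobs(data, i), nobs(data))`. The trade-off is that one slow item stalls everything queued behind it, which is presumably why the loader returns batches as they complete.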
