Submit draft of RFC for ForEach-Object -Parallel proposal #194
Will `ForEach-Object` implement pipeline backpressure? If not, how will memory usage for pipelines like `Get-Content giantfile.txt | ForEach-Object -Parallel {'CPU-bound operation'}` be limited? If so, how will backpressure work?
---
I think the default `ThrottleLimit` would be the way to limit how many parallel jobs will run, no?
---
@daxian-dbw That's what I understand. The concern I'm raising here isn't about the number of parallel jobs. Rather, it is about how much of the output from `Get-Content giantfile.txt` is read into the input buffer of `ForEach-Object -Parallel` at any one time. As I currently read the RFC, the behavior in that respect is not specified. If no backpressure is implemented, then the amount of memory that could be consumed by `ForEach-Object -Parallel` is unbounded. If backpressure is implemented, then the particulars of that behavior will affect how performant `ForEach-Object -Parallel` is under different conditions.
---
Will the creation of a new scriptblock apply only during `$using:` and/or parameter passing? Or will it also occur when the runspace encounters a scriptblock from another runspace at large? An example of such an encounter I'm thinking of is sketched below, where the scriptblock produced by `$g.GetNextScriptBlock()` has affinity to the runspace where `Get-SomeScriptBlockGenerator` is invoked.
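A minimal sketch of that encounter (hypothetical; `Get-SomeScriptBlockGenerator` and `GetNextScriptBlock()` are illustrative names from this comment, not real commands):

```powershell
# $g is created in the caller's runspace, so every scriptblock it hands out
# has affinity to that runspace, not to the parallel worker runspaces.
$g = Get-SomeScriptBlockGenerator
1..10 | ForEach-Object -Parallel { & ($using:g).GetNextScriptBlock() }
```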
---
Actually, this has changed. Since it is impossible to cover all cases where script block variables can be passed to the foreach script block, it will not be supported (i.e., no attempt will be made to detect and pass script blocks via AST or string). An attempt will be made to catch this and error out when script block variables are passed directly through piped input or `$using:` variables, but not all possible variations will be checked (since that is not really possible).

Documentation will warn users not to do this, along with other unsafe uses of variables while running in parallel, such as assigning to passed-in variables that are not thread safe.
---
@PaulHigin: Were you planning for this to return a single job object, with one child thread job for each thread where the script block is invoked, or were you planning for this to return one job object for each thread where the script block is invoked? I presume the latter, but since it wasn't specified I wanted to ask to make sure.
---
Yes, I intend to follow PowerShell job convention and return a single job object with multiple child jobs, one for each pipeline iteration.
---
In that case, please review the other comments on this RFC related to why having `-AsJob` as a parameter is a bad idea, and how dropping `-AsJob` allows you to support `-Begin` and `-End` without complication.
---
Looking at the unsupported scenarios I see an inconsistency: we can send variables in via `$using`, but we can return results only via `return`. Maybe discard `$using` and only use `-ArgumentList` and `param`? That seems clearer for users (one entry point, one exit point), and it is one step closer to being thread-safe. It would also resolve problems like "Value types not passed by reference" and "Passed-in reference variables should not be assigned to".
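A sketch of the contrast (note that `param()`/`-ArgumentList` on `ForEach-Object -Parallel` is what this comment proposes, not an existing parameter):

```powershell
$prefix = 'item'

# As drafted in the RFC: values flow in via $using:
1..3 | ForEach-Object -Parallel { "$using:prefix $_" }

# As proposed here (hypothetical): a single entry point via param()/-ArgumentList
1..3 | ForEach-Object -Parallel { param($p) "$p $_" } -ArgumentList $prefix
```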
---
Start-Parallel (https://www.powershellgallery.com/packages/start-parallel/1.3.0.0) goes down the `-ArgumentList` and `param()` route. `$using` variables are read-only copies. When you run a script block on another computer I don't think there is any expectation that it sets variables on the computer that invokes the command. And in DSC a `Using` variable is written into the MOF file; when DSC applies the MOF file it wouldn't make sense to try to go back to the runspace which compiled it and change a variable, because the application phase will often happen when that runspace has ceased to exist.

Changing from the read-only copy paradigm means:
(a) the runspace using the variable must read the current value "live" from the calling runspace;
(b) the child runspace must have a mechanism to write back, and the parent must track the variables each runspace can read and write, and authenticate that whatever is asking to read or write is really that runspace;
(c) because there may be multiple runspaces writing back to the same variable, there must be some locking mechanism. One runspace locking a variable while it performs multiple updates may block others.
One of the things I keep saying is that if the desire is to make a parallel version of `foreach-object`, then:
- it should keep `begin` and `end` blocks (returning jobs very much optional);
- `-Parallel` should not simply bolt onto a `-Process` block written for sequential use, i.e. changing `Foreach-object {$variable}` to `Foreach-object {stuff $variable} -parallel` should not be syntactically valid. Requiring `Foreach-object -parallel {...}` tells us, when reading or writing the code, that the block runs in its own runspace with read-only access to local variables, and then only if `$using` is specified.

`Param()` and `-ArgumentList` mean a greater change to the scriptblock than supporting `$using`, but the parameter changes when the block runs in a different runspace; try the sketch below. The first invocation has a deserialized object without the methods associated with a directory. The second has a "normal" directory object.
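A minimal reconstruction of that comparison (assumes the ThreadJob module is installed for the in-process case):

```powershell
$dir = Get-Item $PSHOME

# Start-Job crosses a process boundary: $using:dir arrives serialized,
# as a property bag without DirectoryInfo's methods.
Start-Job -ScriptBlock { ($using:dir).pstypenames[0] } |
    Receive-Job -Wait -AutoRemoveJob    # Deserialized.System.IO.DirectoryInfo

# Start-ThreadJob stays in-process: $using:dir is the live object.
Start-ThreadJob -ScriptBlock { ($using:dir).pstypenames[0] } |
    Receive-Job -Wait -AutoRemoveJob    # System.IO.DirectoryInfo
```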
`foreach-object` already supports `-ArgumentList` when used with `-MemberName`, e.g. `ForEach-Object -MemberName Split -ArgumentList "."`. But it does not work with a `-Process` block; whether it should is another matter.
---
The parser does add some restrictions to `$using` variables, mainly because they were used for remoting. But I think we want those restrictions, given the undefined behavior of assignable `$using` variables.

A new parameter set was added to `ForEach-Object` specifically to indicate that the `-Parallel` operation is not the same as the sequential operation using the traditional parameter set. I thought about creating a new cmdlet, but I feel a new parameter set is sufficient to differentiate.
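As a usage sketch of that split (parameter names as proposed in this RFC):

```powershell
# Traditional, sequential parameter set:
1..5 | ForEach-Object -Process { $_ * 2 }

# Proposed parallel parameter set, with its own supporting parameters:
1..5 | ForEach-Object -Parallel { $_ * 2 } -ThrottleLimit 5
```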
---
What if instead of looking at `ForEach-Object` for parallelism we scaled this out to allow parallelism to be used with any command?

For example, if we...

- added a `CmdletBinding` attribute called `SupportsParallelism`;
- added `-Parallel`, `-ThrottleLimit`, `-TimeoutSecs`, and `-AsJob` common parameters to any advanced function or cmdlet that has `SupportsParallelism` set to `$true` in the `CmdletBinding` attribute (supporting the different common parameter sets that are proposed above, such that `-AsJob` could not be used with `-TimeoutSecs`);
- updated the `$using:` variable prefix so that it can be used in any process block inside of a command that has `SupportsParallelism` set to `$true`. (Note: this would also need to allow `$using:` to work when the command is invoked without parallelism, so that the functionality would only need to be written once; the `$using:` prefix would simply identify a variable that is pulled into the parallel runspace/job when the command is invoked in parallel. The `$using:` prefix should probably not be required for parameters, though; that should be implicit, and maybe `$using` could be implicit for variables defined in the function/command scope. Again, that makes things much easier, and users would only need to worry about `$using` in commands like `ForEach-Object` that accept script blocks.)

Then we could...

- add the `SupportsParallelism` attribute to `ForEach-Object` to get the functionality we're talking about here;
- add the `SupportsParallelism` attribute to any other command that is written to support parallelism;
- limit `ForEach-Object` use to where it is really needed;
- keep `begin` and `end` blocks (which are still useful; even with parallelism, some resources such as connections, credentials, or collections to accumulate results may be needed and shared among the different threads/jobs).

Taking the examples provided above, they could be simplified along the lines of the sketch below.
For folks who are going to look at this and complain about the lack of downlevel support in this functionality, I'd like to draw your attention to this RFC. That won't help projects that require a downlevel version, but PowerShell needs to start offering a lot more motivation for users to upgrade to newer releases, and this would be some pretty great motivation.
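A sketch of the proposal (everything here is hypothetical: `SupportsParallelism` and the `-Parallel`/`-ThrottleLimit` common parameters do not exist, and `Invoke-Work` is an illustrative name):

```powershell
# Hypothetical: opt in to parallelism via the proposed CmdletBinding attribute.
function Invoke-Work {
    [CmdletBinding(SupportsParallelism = $true)]
    param(
        [Parameter(ValueFromPipeline)]
        $InputObject
    )
    process {
        # $using: would also be permitted here when invoked with -Parallel.
        $InputObject * 2
    }
}

# The proposed common parameters would then light up on the command:
1..100 | Invoke-Work -Parallel -ThrottleLimit 8
```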
---
An interesting idea, but it would be a lot of work to cover all built-in cmdlets. A variation is to build on @BrucePay's work with the `&` operator to force execution as a background job. Maybe this can be extended to optionally fan out using ThreadJobs.
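For reference, the `&` background operator already runs a pipeline as a background job (PowerShell 6+); the fan-out-to-thread-jobs variation is only an idea at this point:

```powershell
# '&' at the end of a pipeline starts it as a background job today:
$job = Get-ChildItem -Recurse &
$job | Receive-Job -Wait -AutoRemoveJob
```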
This is outside the scope of this RFC, but is something we could look into in the future.
---
You wouldn't have to cover all built-in cmdlets from the start. For the first cut you could just lay the foundation and update `ForEach-Object` to use `SupportsParallelism` (or just call that `SupportsParallel`, for short and to match the `-Parallel` common parameter name). The other built-in cmdlets could be updated over time. That keeps focus on what this RFC is trying to do (add parallelism to `ForEach-Object`) while doing so in such a way that it can scale to any cmdlet/advanced function.

The initial work shouldn't be that much greater than what would be required to implement what was originally proposed in this RFC (a specific modification of `ForEach-Object`), and it should be in scope for this RFC, at least listed in the alternate proposals and considerations section. The `[CmdletBinding(SupportsParallelism)]` approach feels like a much better strategy to achieve the desired end goal for all of PowerShell than simply updating `ForEach-Object`.
---
Also, regarding the `&` background operator and thread jobs comment, see PowerShell issue #9873.
---
Are you thinking of this in the same simple sense as `SupportsShouldProcess`, where all you're doing is opting in to a few common parameters and magic variables which you may, or may not, actually use (properly)?
---
I've already started a separate RFC for this to see where it leads. Thanks for sharing these thoughts @Jaykul; I had some of them already, but not quite all. Automatic implementation is the goal, for consistency and performance, but I'm still thinking it would be bound to specific named parameter sets, or to all parameter sets if you don't name them. Part of what I hope happens here is to "corral the troops", so to speak, and bring a consistent implementation to the table that commands can adopt.
---
What about adding a new `processasync`/`parallelprocess` block in cmdlets/functions, mutually exclusive with `process`? It's easy to write that out (first sketch below). Additional control might be handled by extra properties in `[CmdletBinding()]`, such as `ThrottleLimit` etc., which could also be overridden on a per-invocation basis by adding `ThrottleLimit` etc. as common parameters. Or maybe, even the alternative in the second sketch below.
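Hypothetical sketches of both variants (neither a `processasync` keyword nor `AsyncProcess`/`ThrottleLimit` properties on `[CmdletBinding()]` exist today):

```powershell
# Variant 1 (hypothetical): a dedicated block, mutually exclusive with process {}.
function Invoke-Work {
    [CmdletBinding(ThrottleLimit = 8)]           # hypothetical property
    param([Parameter(ValueFromPipeline)] $InputObject)

    processasync {                               # hypothetical keyword
        $InputObject * 2                         # each item handled on its own thread
    }
}

# Variant 2 (hypothetical): keep process {}, opt in via a CmdletBinding property.
function Invoke-Work2 {
    [CmdletBinding(AsyncProcess = $true)]        # hypothetical property
    param([Parameter(ValueFromPipeline)] $InputObject)

    process { $InputObject * 2 }
}
```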
Perhaps this is a little easier to do and renders any function parallelizable (thar be dragons always, but I digress). If `AsyncProcess` is true, then it lights up the parallel/job common parameters (`ThrottleLimit` etc.).

I'm not tied to "async" versus "parallel", by the way; perhaps the latter is a better choice to imply "jobs" and not "tasks".
---
@oising The last thing you suggested (CmdletBinding configuration) is exactly what started this part of the discussion, albeit with different terminology (`[CmdletBinding(SupportsParallel=...)]`). I'm also considering `processasync` and the corresponding `ProcessRecordAsync` while I work through the complementary RFC.

See: #194 (comment)
---
Cool. I'm leaning towards the idea of just leaving the `process` keyword alone (i.e., no new keywords like `processasync`), but surfacing `ProcessRecordAsync` in the .NET API and having those materialize as thread jobs in the console.
---
Here's the complementary RFC I wrote up today: #206.