-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Submit draft of RFC for ForEach-Object -Parallel proposal #194
Submit draft of RFC for ForEach-Object -Parallel proposal #194
Conversation
@PaulHigin I'm interested in commenting on this proposal. I'm new to the PowerShell RFC process and I have the following questions:
|
@alx9r judging from recent patterns, at least, you can go ahead and comment immediately. From what I can tell, the team want to try to make this happen for the PS7 release, so we probably want to get this ironed out sooner rather than later. Comments are usually submitted by reviewing the RFC diff as you would any kind of code on Github; point out bits you think should be different or want to discuss further, request they be elaborated on, etc. Helps to have all the comments listed in the context of the RFC section you're actually addressing. If there's bits you think are missing that should be in the RFC, go ahead and just write a standard comment and mention Paul so he can address that as well. |
|
||
# PowerShell ForEach-Object -Parallel Cmdlet | ||
|
||
This RFC proposes a new parameter set for the existing ForEach-Object cmdlet to parallelize script block executions, instead of running them sequentially as it does now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it complicates the ForEach-Object cmdlet too much and limits future development opportunities.
From previous discussion #174 (comment)
From iSazonov
Yesterday I accidentally saw GNU utility like "parallel --pipe". Also we could find "Where-Object -Parallel" useful too. And "Select-String". And others. So this suggests that maybe we need something more general for pipe parallelization (Parallel-Pipe)
$hugeArray | Parallel-Pipe { Where-Object Name -eq "test" }
$hugeArray | Parallel-Pipe {Select-String -Pattern "test" }
$hugeArray | Parallel-Pipe -AsJob { ... }
From rjmholt
A rose by any other name would smell as sweet:
$hugeArray | ForEach-Object -Parallel { $_ | Where-Object Name -eq "test" }
$hugeArray | ForEach-Object -Parallel { $_ | Select-String -Pattern "test" }
$hugeArray | ForEach-Object -Parallel -AsJob { ... }
I think in the name of single-purpose, composible cmdlets having a single parallel cmdlet (rather than trying to parallelise each individually) is the right way to go. But I think ForEach-Object has that brief of Select-like enumeration. And just like LINQ has .AsParallel(), I think -Parallel makes sense to do this. But that's just a brief and not-strongly-held opinion :)
From iSazonov
There is a problem in depth with "ForEach-Object -Parallel" - there is a lot of parameters and how will "-Begin/Process/End" and etc work - in "ForEach-Object -Parallel" context or scriptblock context? I think we will have to duplicate parameters. In the case Parallel-Pipe is more simple and more safely.
Parallel-Pipe -InputOblect $data
-Begin { … }
-Process { Foreach-Object -Begin { … } -Process { … } -End { … } }
-End { … }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-Begin, -Process, -End switches will not be part of the new parameter set, since they don't make sense in this parallel case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a meaning. Especially in binary cmdlets It should work just like in ForEach-Object (why do we have the blocks there?). Rather, there is a difficulty in implementation.
Current design looks like a workaround - it is simple implement but save all current limitations (see @alx9r comments below). My suggestion is to implement natively parallel pipelines that give significant flexibility, performance and extensibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it does make sense to keep the
before any {do this }
{do this for each one}
after all of them {do this}
that exists in the sequential for each.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're removing -Begin
, -Process
, and -End
then why are you proposing to overload ForEach-Object
instead of simply adding a new command? There's literally nothing else in this command, so you're really creating an entirely new command!
ForEach-Object
is currently one of the worst performing cmdlets in PowerShell (compare $power = 1..1e5 | foreach { $_ * 2 }
to $power = 1..1e5 | & { process { $_ * 2 } }
) and although I doubt this change will make it slower, it will make it more complicated -- reducing the possibility of ever fixing it 🤨
From the RFC we have
And says that Implementation will be similar to the ThreadJob module which should mean lightweight runspaces, not "traditional" PowerShell jobs with an instance of PowerShell.exe for each. Which is good. But it goes on Things are made unnecessarily complicated by the concept of fully asynchronous jobs (which don't return the result itself, but something used to get a result) and parallel but synchronous jobs (which return delayed results). It is not helpful that the distinction between them is Implemented properly (see https://www.powershellgallery.com/packages/start-parallel/1.3.0.0 ) a parallelizing command
This is done in Start-parallel (and similar things others have written) with a structure like this
A good test is can a command like this Building on this example from the RFC
this should also work
And it should write to the log file before any non-responsive computer has timed out. Syntactically, it should be possible to simply write |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full Disclosure:
I've got a working implementation (under PowerShell 6) of Invoke-Parallel that I built to support a larger PowerShell testing framework story I've been experimenting with. It shares a lot with what this RFC describes. I have a lot of notes and experimental results from that work. Those are the basis for most of my review comments below.
I like this RFC and I'm looking forward to seeing how it evolves. Hopefully I can learn from the implementation of this RFC so I can overcome some of the rough edges I encountered during my implementation of Invoke-Parallel.
## Motivation | ||
|
||
As a PowerShell User, | ||
I can do simple fan-out concurrency with the PowerShell ForEach-Object cmdlet, without having to obtain and load a separate module, or deal with PowerShell jobs unless I want to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering what is meant by "simple" here. It seems to me that "simple" from the user's perspective means something like "works as I'd expect in common use cases". Vaguely speaking, I would suggest that such behavior would include the following:
- Speedup is achieved to the degree that the workload in the scriptblock allows.
- Scriptblocks can safely be passed as variables to the command.
I'm highlighting these behaviors in particular because they seem like obvious goals yet require overcoming substantial underlying implementation challenges to achieve. In particular,
- Runspace.Open() sometimes completes without the specified ImportPSModules becoming available PowerShell#7377, #7153, #7131, and #7035 all relate to various roadblocks to opening, resetting, and reusing, runspaces in parallel. The summary is that each runspace will need to load the modules it needs and there currently does not seem to be a way to open modules in parallel in a manner that results in speedup.
- Concurrently invoking a ScriptBlock in two runspaces causes InvalidOperationException and pwsh.exe termination PowerShell#7626 demonstrates that PowerShell currently can crash (with, I think, possibly undefined behavior) when scriptblocks are passed amongst runspaces.
I think overcoming these challenges would be great for PowerShell, but I don't think they should be underestimated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, scriptblocks have runspace affinity and we need resolve this before we can go any further. Perhaps it will be Vnext runspaces.
For the synchronous case, the `ForEach-Object` cmdlet will not return until all parallel executions complete. | ||
For the asynchronous case, the `ForEach-Object` cmdlet will immediately return a PowerShell job object that contains child jobs of each parallel execution. | ||
|
||
### Implementation details |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will ForEach-Object implement pipeline backpressure? If not, how will memory usage for pipelines like Get-Content giantfile.txt | ForEach-Object -Parallel {'CPU-bound operation'}
be limited? Is so, how will backpressure work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the default ThrottleLimit
would be the way to limit how many parallel jobs will run, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think the default ThrottleLimit would be the way to limit how many parallel jobs will run, no?
@daxian-dbw That's what I understand. The concern I'm raising here isn't about the number of parallel jobs. Rather, it is about how much of the output from Get-Content giantfile.txt
is read into the input buffer of ForEach-Object -Parallel
at any one time. The way I read the RFC currently, the behavior that respect is not specified. If no backpressure is implemented, then the amount of memory that could be consumed by ForEach-Object -Parallel
is unbounded. If backpressure is implemented, then the particularities of that behavior will affect how performant ForEach-Object -Parallel
is under different conditions.
### Implementation details | ||
|
||
Implementation will be similar to the ThreadJob module. | ||
Script block execution will be run for each piped input on a separate thread and runspace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will module loading for the new runspaces be managed? Will the runspaces be reused? If they will be reused, will state be reset between uses? If so, to what degree? Just by way of ResetRunspaceState()? Or something more elaborate?
Will ForEach-Object attempt to open modules in parallel? If so, is it within the scope of this RFC to resolve the contention reported in PowerShell/PowerShell#7035? If not, how will module loading occurring in the runspaces be serialized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Module loading will work as normal, either explicitly loaded using 'Import-Module', or implicitly loaded via command discovery. Runspaces and threads will not be reused in the initial implementation, due to concerns of stale state causing problems, but it is something that can be looked at later.
Module loading contention is a concern but is not something that will be addressed in this work.
For synchronous operation, a `-Timeout` parameter will be available that terminates the wait for completion after a specified time. | ||
Without a `-Timeout` parameter, the cmdlet will wait indefinitely for completion. | ||
|
||
### Synchronous parameter set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to specify the following behavior for this parameter set:
- How is an exception from one but not all scriptblock invocations handled? Are the other non-throwing scriptblocks allowed to complete? Is the exception ever thrown? Is it converted to an error instead?
- How are exceptions from multiple scriptblock invokations handled? Are they combined into an AggregateException?
- How are Verbose/Warning/Error streams from scriptblock invocations handled? Are they multiplexed and passed through so they appear to the user? If so, how can the user correlate those streams by invocation? Some sort of ID prefix?
- Does ForEach-Object write its progress to the Progress stream?
- How are Progress streams from scriptblock invocations handled? Do they become child activities to a ForEach-Object parent activity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in current design it will work "as is" - based on runspace affinity. I expect we will get sometimes "interesting" side effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runspace affinity can be completely avoid in the implementation, by passing the AST instead of the ScriptBlock to another Runspace, then get the script block by Ast.GetScriptBlock()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are good questions that should be answered in this RFC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I'll add answers to the RFC for these questions:.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runspace affinity can be completely avoid in the implementation, by passing the AST instead of the ScriptBlock to another Runspace, then get the script block by Ast.GetScriptBlock()
I don't see how this is true, in general. For example, in
$g = Get-SomeScriptBlockGenerator
1..10 | ForEach-Object -Parallel { . $using:g.GetNextScriptBlock() }
would the the scriptblock returned by .GetNextScriptBlock() be silently converted to its .Ast counterpart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser will not allow the example shown above, the $using directive only supports variable expressions. And generating a new script block from the Ast will only be done for ScriptBlock using variable types.
But it is still possible to pass in a script block variable aligned to a different runspace. e.g.,
ForEach-Parallel -Parallel { $g = $using:g; $g.GetNextScriptBlock() }
We won't be trying to prevent all possible cases of passing a variable reference to a script block and using it unsafely. But we can document safe and unsafe ways to use passed in variable references.
|
||
- `-ThrottleLimit` : parameter takes an integer value that determines the maximum number threads | ||
|
||
- `-TimeoutSecs` : parameter takes an integer that specifies the maximum time to wait for completion in seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens to the executing scriptblock when the timeout is reached? Is a stop signal sent? If so, what happens if stopping takes a long time? Does ForEach-Object wait for execution of all scriptblocks to stop before it returns from EndProcessing()? Is there another timeout for stopping? What happens if that timeout is reached? Is the thread killed? Or is the runspace and associated powershell merely abandoned?
|
||
- `-AsJob` : parameter switch returns a job object | ||
|
||
### Variable passing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are variables serialized and deserialized like happens when using Start-Job? Or are references to the original object instances available in the scriptblocks?
Will scriptblocks be allowed to be passed as variables? If so, is it within the scope of this RFC to resolve PowerShell/PowerShell#7626? If not, will the prohibition on passing scriptblocks be enforced or merely documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or are references to the original object instances available in the scriptblocks?
I believe it's this one. We are talking about Runspace's in the same process, so no need to serialize/deserialize.
Passing a script block as variable is not a problem. $sb.Ast
can be passed to the target Runspace, where we call ast.GetScriptBlock()
to get back a script block that is associated with the target Runspace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...$sb.Ast can be passed to the target Runspace...
This suggests to me that ForEach-Object -Parallel would rely on users correctly converting every scriptblock to its .Ast prior to passing. Is that the intent? That doesn’t seem like a great strategy to me because it’s error prone. I also don’t think that is possible to do in general. For example, there can be modules that emit objects with scriptblock properties which are bound to a SessionState in the parent runspace. For such modules the user would not have the opportunity to intervene.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current plan is to not do any serialization and to pass object copies (for value types) and references to each running scriptblock via the $using keyword and the $_ variable for piped input. The parser already disallows assignment to a $using variable, however one can assign a reference value to another variable and assign to it, with undefined behavior. I tried to call this out in the RFC as unsupported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the cmdlet can pull the AST off the scriptblock internally, and then pass that along.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for scriptblock variables. I was talking about variable types in general.
@jhoneill I really wonder if this is indeed the intention of the RFC. (I was under the impression that the synchronous parameter set would output the plain objects from the scriptblocks until I read your post.) It seems to me that at least one of the parameter sets should output objects from the scriptblocks as they become available. I've done that in my own implementation of Invoke-Parallel and I did not come across any problems with such an arrangement. I can confirm that, as you suggest, getting objects from Invoke-Parallel is much more ergonomic than unpacking from something like a Jobs object.
I agree with your last statement. It seems like that ability might be beyond the scope of this RFC, though. AFAICT the following two issues complicate achieving that example:
I can think of some ways of overcoming those issues by projecting a specialization of TPL.Dataflow into PowerShell. I've experimented with doing something similar and had positive results, but it's not obvious that it could be made as user-friendly as your |
@alx9r I think what I've done with start-parallel and you've done with Invoke-parallel are very similar :-) |
@jhoneill ForEach-Object -Parallel will returns data as generated when run in synchronous mode (i.e., without the -asjob switch). However, the cmdlet doesn't complete or return control to the user until all parallel tasks complete or a provided timeout occurs. Getting a job object back is optional and opt-in by using the -asjob switch. It is for users who want control back immediately while the parallel script blocks run (asynchronous mode). PowerShell has rich support for job objects and I feel this is the best way to handle the asynchronous case for the majority of users. |
It is impossible to efficiently allocate threads without knowing what the workload will be. For high compute work, the number of threads should be limited to system number of cores. For work that is low compute and essentially waits can use more threads. Consequently I leave it to the user to decide the number of threads via the -ThrottleLimit parameter. Runspace reuse is definitely something to consider as an implementation detail, maybe opt-in. One concern I have is state getting corrupted between uses. I like the idea of providing a module list parameter so that all script block execution instances are initialized consistently, and it may make sense to reuse Runspaces in this case especially to minimize initialization time. But in general, if Runspace initialization time is a concern then I feel the workload is not right for parallelization and the user should batch up work before passing it to ForeEach-Object -Parallel. |
OK so -asjob gives them way of doing foreach {start-threadjob} a single command ? I think that is a "nice to have", but the core scenario is synchronous
That is the right way to go. For sending an audit job request to many servers that can take 1-2 minutes you can have 1000 threads waiting for a response and that's OK . If it is local compute threads=cores is good. Sending AD requests IIRC there was a protection mechanism so you don't get 100 simultaneous operations from one place, but you might still improve things with 2 threads.
Strongly agree :-) |
Regarding the need for But let me know if there is a particular scenario that requires |
Well not |
Here's a grander vision: instead of just adding parallelism to foreach-object, why not introduce a parallel pipeline syntax given that threadjob ships in the box. Example : ps> dir c:\temp |< get-filehash
# equivalent to
Ps> dir c:\temp | foreach-object -parallel { $_ | get-filehash} I like the idea of |
@PaulHigin I think the problem of runspace recycling (clearing state without creating a brand new runspace) was tackled and largely solved in PowerShell workflow. |
The reason it needs to be a command and not an operator is that for some (perhaps many) tasks - including getting a file hash - you may want to limit the number of threads. We don't want 1000 concurrent operations being sent to a spinning disk, and computing a hash can only go as fast as the number of cores. Other operations need us to specify a timeout if one of the threads should hang. |< command-A ; command-b | command-c >| looks nice , but a >| operator needs to know which threads it is waiting for so the two operators would need to communicate. |
I'm starting to think it would be quicker and easier to simply add a slightly more refined version of this to the threadjob module. It
As shown it doesn't return results while threads are still being created, and doesn't have a time out (which adds a condition to the while loop) both of those are in start-parallel .
|
Not a good reason at all to force a command syntax. Some sensible runtime defaults (i.e. max degree of parallelism = number of cores) and also having global PS-prefixed variables that may allow users to override session defaults. If Anyway, |
Defaults with overrides yes. But threads = cores is not a good constraint if it is "ping every ip address". And a thread for everything is bad if it is read file; compute hash of it. One's idea of sensible changes with what the expected use is. Using environment variables to pass parameters to what is essentially a command... it works but it's being done the wrong way. I can't think of an operator which changes what it does in response to variables. Taking the reservation off < has a number of things going for it the one thing against it which its use in other shells for redirect in, and the parser splits at | making |< a valid alias needs more work than say %% |
} | ||
``` | ||
|
||
## Alternate Proposals and Considerations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if instead of looking at ForEach-Object
for parallelism we scaled this out to allow parallelism to be used with any command?
For example, if we...
- ...had a new
CmdletBinding
attribute calledSupportsParallelism
. - ...added
-Parallel
,-ThrottleLimit
,-TimeoutSecs
, and-AsJob
common parameters to any advanced function or cmdlet that hasSupportsParallelism
set to$true
in theCmdletBinding
attribute (supporting the different common parameter sets that are proposed above, such that-AsJob
could not be used with-TimeoutSecs
). - ...updated the
$using:
variable prefix so that it can be used in any process block inside of a command that hasSupportsParallelism
set to$true
(Note: This would also need to allow$using:
to work when the command is invoked without parallelism, so that the functionality would only need to be written once; the$using:
prefix would simply identify a variable that is pulled into the parallel runspace/job when the command is invoked in parallel. The$using:
prefix should probably not be required for parameters, though -- that should be implicit, and maybe$using
could be implicit for variables defined in the function/command scope -- again, makes things much easier, and users would only need to worry about using in commands likeForEach-Object
that accept script blocks).
Then we could...
- ...add the
SupportsParallelism
attribute toForEach-Object
invocations to get the functionality we're talking about here. - ...add the
SupportsParallelism
attribute to any other command that is written to support parallelism. - ...simplify command invocation when you want parallelism a great deal, which would keep
ForEach-Object
use to where it is really needed. - ...maintain support for
begin
andend
blocks (which are still useful -- even with parallelism, some resources such as connections, credentials, or collections to accumulate results may be needed and shared among the different threads/jobs). - ...discover commands that support parallelism (and therefore those that do not).
- ...make multitasking much more of a "thing" in PowerShell.
Taking the examples provided above, they could be simplified as follows:
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logNames = 'System','SQL'
# Get logs from computers in parallel, throttled with a timeout
$logs = Get-Logs -ComputerName $computerNames -Parallel -ThrottleLimit 10 -TimeoutSecs 1800
# Get logs from computers in parallel using background jobs
$logs = Get-Logs -ComputerName $computerNames -Parallel -ThrottleLimit 10 -AsJob | Receive-Job -Wait
# Get System and SQL logs from computers in parallel with no throttle or timeout
$logs = Get-Logs -ComputerName $computerNames -LogNames $logNames -Parallel
For folks who are going to look at this and complain about the lack of downlevel support in this functionality, I'd like to draw your attention to this RFC. That won't help projects that require a downlevel version, but PowerShell needs to start offering a lot more motivation for users to upgrade to newer releases, and this would be some pretty great motivation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An interesting idea, and would be a lot of work to cover all built-in cmdlets. A variation is to build on @BrucePay work with the &
operator to force execution as background job.
Get-Process -id $pid &
Id Name PSJobTypeName State HasMoreData Location Command
-- ---- ------------- ----- ----------- -------- -------
5 Job5 BackgroundJob Running True localhost Microsoft.PowerShell.Man…
Maybe this can be extended to optionally fan-out using ThreadJobs.
This is outside the scope of this RFC, but is something we could look into in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You wouldn't have to cover all built-in cmdlets from the start. For the first cut you could just lay the foundation and update ForEach-Object
to use SupportsParallelism
(or just call that SupportsParallel
for short and to match the -Parallel
common parameter name). The other built-in cmdlets could be updated over time. That keeps focus on what this RFC is trying to do (add parallelism to ForEach-Object) while doing so in such a way that it can scale to any cmdlet/advanced function.
The initial work shouldn't be that much greater than what would be required to implement what was originally proposed on this RFC (a specific modification of ForEach-Object
, and that should be in scope in this RFC, at least listed in the alternate proposals and considerations section. The [CmdletBinding(SupportsParallelism)]
approach feels like a much better strategy to achieve the desired end goal for all of PowerShell than simply updating ForEach-Object
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also regarding the &
background operator and thread jobs comment, see PowerShell Issue #9873.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking of this in the same simple sense as SupportsShouldProcess
where all you're doing is opting in to a few common parameters and magic variables which you may --or may not, actually-- use (properly)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've already started a separate RFC for this to see where that leads. Thanks for sharing these thoughts @Jaykul, I had some of them already, but not quite all. Automatic implementation is the goal, for consistency and performance, but I'm still thinking it would be bound to specific named parameter sets, or all parameter sets if you don't name them. Part of what I hope happens here is to "corral the troops" so to speak, and bring a consistent implementation to the table that commands can adopt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding a new, mutually exclusive [to process] processasync/parallelprocess block in cmdlets/functions?
function foo {
begin { }
processasync { }
end { }
}
It's so easy to write that out. Additional control might be handled by extra properties in [CmdletBinding()]
, such as ThrottleLimit etc, which could also be overridden on a per-invocation basis, by adding ThrottleLimit etc as common parameters.
Or maybe, even:
function foo {
[CmdletBinding(AsyncProcess=true)]
begin {}
process {}
end {}
}
Perhaps this is a little easier to do and renders any function parallelizable (thar be dragons always but I digress.) IF AsyncProcess is true, then it lights up the parallel/job common parameters (ThrottleLimit etc)
I'm not tied to "async" versus "parallel" btw; perhaps the latter is a better choice to imply "jobs" and not "task".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oising The last thing you suggested (CmdletBinding configuration) is exactly what started this part of the discussion, albeit with different terminology ([CmdletBinding(SupportsParallel=...)]
). I'm also considering processasync
and the corresponding ProcessRecordAsync
while I work through the complementary RFC.
See: #194 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. I'm leaning towards the idea of just leaving the process keyword alone (e.g. no new keywords like processasync), but in the .NET API, surface ProcessRecordAsync, and have them materialize as threadjobs in the console.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the complementary RFC I wrote up today: #206.
See this comment. |
I understand how begin/end blocks could be used with foreach-object -parallel, but it seems more like a 'nice to have' feature and not really needed. You can easily do any processing before and after the call to foreach and pass any computed variables with the |
They are a nice to have. Worst case if you don't have them is that you have to split a long pipeline. Now... this is already harder to read than
So the impact of losing them is small; the primary reason for keeping them is to make the parallel and sequential versions as similar as possible, since I think the work to keep them is also small. |
I guess the real question is whether threads and runspaces stay open and handle multiple inputs, or whether you're spinning up a clean runspace every time. On the one hand, we could specify that On the other hand, if the runspace is thrown out after each input, then the initialization might as well happen in the process block, since it has to happen every time anyway. I'm sure that most of the time, people do not use the begin or end blocks in I still think that if that's the case, we're needlessly overloading the |
In general, the flow of control of user code is not the same for these two styles. If you have a longer pipeline with multiple 'a','b' |
% -Begin {'begin1'} -Process {"process1$_"} -End {'end1'} |
% -Begin {'begin2'} -Process {"process2$_"} -End {'end2'} outputs
which demonstrates that the output of the first It is difficult to assess the implications of precluding the flow of control afforded by |
@Jaykul For parallel runspaces I haven't seen anything to suggest otherwise. Because each invocation of the |
### Asynchronous parameter set | ||
|
||
Asynchronous ForEach-Object -Parallel immediately returns a job object for monitoring parallel script block execution | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PaulHigin: Were you planning for this to return a single job object, with one child thread job for each thread where the script block is invoked, or were you planning for this to return one job object for each thread where the script block is invoked? I presume the latter, but since it wasn't specified I wanted to ask to make sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I intend to follow PowerShell job convention and return a single job object with multiple child jobs for each pipeline iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, please review the other comments on this RFC related to why having -AsJob
as a parameter is a bad idea, and how dropping -AsJob
allows you to support -Begin
and -End
without complication.
It is worthwhile thinking about, but I feel it is unlikely we can have universal PowerShell parallel execution without a great deal of effort and a high probability of breaking existing functionality. There is over a decade of legacy behavior that needs to be accounted for. Consequently I want to keep the scope of this RFC fairly narrow. The goal of this RFC is to provide a straightforward solution to execute script blocks in parallel threads, without having to resort to external module or PowerShell APIs, e.g., a more performant version of Windows PowerShell workflow 'foreach -parallel'. I feel begin/process/end blocks are not needed, however, having an initialization parameter to perform common session initialization is very worthwhile. Implementation details such as re-using runspaces and thread pools will be considered. In addition I feel an asynchronous option is very important and consistent with existing PowerShell behavior. I think that instead of creating a new cmdlet, I will add a new parameter set to the existing 'foreach-object' cmdlet to support parallel execution. I'll update this RFC to clarify all of this and add supported scenarios and scenarios specifically not supported. |
We may be far from this goal, but we all want it - the world has changed since the advent of PowerShell. In Russia we say - the eyes are afraid, the hands are doing :-) (My comment is out of the RFC) |
@iSazonov Sounds good. I am looking forward to discussing a wider ranging concurrency model for PowerShell. |
|
||
- `-ThrottleLimit` parameter takes an integer value that determines the maximum number of script blocks running at the same time | ||
|
||
- `-TimeoutSecs` parameter takes an integer that specifies the maximum time to wait for completion before the command is aborted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Windows PowerShell we use "Timeout" name
gcm -ParameterName Timeout
CommandType Name Version Source
----------- ---- ------- ------
Cmdlet Register-WmiEvent 3.1.0.0 Microsoft.PowerShell.Management
Cmdlet Restart-Computer 3.1.0.0 Microsoft.PowerShell.Management
Cmdlet Start-Transaction 3.1.0.0 Microsoft.PowerShell.Management
Cmdlet Wait-Event 3.1.0.0 Microsoft.PowerShell.Utility
Cmdlet Wait-Job 3.0.0.0 Microsoft.PowerShell.Core
Cmdlet Wait-Process 3.1.0.0 Microsoft.PowerShell.Management
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is important to be specific about the time units, maybe 'TimeoutSeconds'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. And assuming we don't have multiple such parameters, the name can be shortened to -Timeout
if it has units appended to the parameter name anyway.
It will also support the `$using:` directive for passing variables from script scope into the parallel executed script block scope. | ||
If the passed in variable is a value type, a copy of the value is passed to the script block. | ||
If the passed in variable is a reference type, the reference is passed and each running script block can modify it. | ||
Since the script blocks are running in different threads, modifying a reference type that is not thread safe will result in undefined behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note. PSSA could catch the cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can (and are) also caught within PowerShell via examining AST. My point is that the behaviors are not supported.
} | ||
``` | ||
|
||
### Unsupported scenarios |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking on the unsupported scenarios I see an inconsistence - we can send variables by $using
but can return result only by return
. Maybe discard $using
and only use -ArgumentList
and param
? It seems it will be more clear for users (one entry point - one exit point) and it will be one step closer to being thread-safe. It also resolve problems like "Value types not passed by reference"/"Passed in reference variables should not be assigned to".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Start-Parallel (https://www.powershellgallery.com/packages/start-parallel/1.3.0.0) goes down the -argumentlist
and param()
route.
$using
variables are read-only copies.. When you run a script block on on another computer I don't think there is any expectation that it sets variables on the computer that invokes the command. And in DSC a Using
variable is written into the MOF file, - when DSC applies the MOF file it wouldn't make sense to try to go back to the runspace which compiled it and change a variable - the application phase will often happen when that runspace has ceased to exist.
Changing form the the read-only copy paradigm means
(a) The runspace using the variable must read the current value "live" from the calling runspace
(b) The child runspace must have a mechanism to write back, and the parent must track the variables the each runspace can read and write and authenticate that whatever is asking to read or write is really that runspace
(c) because there may be multiple runspaces writing back to the same variable there must be some locking mechanism. One runspace locking a variable while it performs multiple updates may block others.
One of the things I keep saying is if the desire is to make a parallel version of foreach-object
then:
- Differences from existing sequential foreach should be avoided where possible (hence keeping
begin
andend
blocks; returning jobs very much optional). - Many existing script blocks will fail if run unmodified in parallel runspaces. Therefore one difference is to reserve
-process
for sequential use. i.e. changingForeach-object {$variable}
toForeach-object {stuff $variable} -parallel
should not be syntactically valid. RequiringForeach-object -parallel {...}
tells us when reading or writing the code that the block runs in its own runspace with read only access to local variables, only if$Using
is specified. - Switching to
Param()
and-argumentList
means a greater change to the scriptblock than supporting$using
but the parameter changes when the block runs in a different runspace, try this:
$foo = get-item .
invoke-command -ScriptBlock {param ($foo) $foo | gm } -ArgumentList $foo -ComputerName localhost -EnableNetworkAccess
invoke-command -ScriptBlock {param ($foo) $foo | gm } -ArgumentList $foo
The first invoke has a deserialized object without the methods associated with a directory. The second has a "normal" directory object.
- Things are complicated by
foreach-object
already supporting-argumentlist
when used with-memberName
e.g.
ForEach-Object -MemberName Split -ArgumentList "."
But it does not work with a-process
block, whether it should is another matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser does add some restrictions to $using variables, mainly because they were used for remoting. But I think we want those restrictions given undefined behavior of assignable $using variables.
A new parameter set was added to foreach-object specifically to indicate that the -Parallel operation is not the same as the sequential operation using the traditional parameter set. I thought about creating a new cmdlet, but I feel a new parameter set is sufficient to differentiate.
I have updated the RFC to answer some of the questions raised here. |
|
||
- `-Parallel` : parameter switch specifies fan-out parallel script block execution | ||
Script block variables will be special cased because they have runspace affinity. | ||
Therefore script block variables will not be passed by reference and instead a new script block object instance will be created from the original script block variable Ast (abstract syntax tree). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the creation of a new scriptblock apply only during $using:
and/or parameter passing? Or will it also occur when the runspace encounters a scriptblock from another runspace at large? An example of such an encounter I'm thinking of is
$g = Get-SomeScriptBlockGenerator
1..10 | ForEach-Object -Parallel { . $using:g.GetNextScriptBlock() }
where the scriptblock produced by $g.GetNextScriptBlock() has affinity to the runspace where Get-SomeScriptBlockGenerator is invoked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this has changed. Since it is impossible to cover all cases where script block variables can be passed to the foreach script block, it will not be supported (i.e., no attempt will be made to detect and pass script blocks via Ast or string). An attempt will be made to catch this and error out when script block variables are passed directly through piped input or using variables, but not all possible variations will be checked (since it is not really possible).
Documentation will warn users not to do this, along with other unsafe uses of variables while running in parallel, such as assigning to passed in variables that are not thread safe.
Can you please update the language about assigning values to variables passed by reference in that it's no longer an undefined behavior, rather it emits an error? |
The below script runs faster without
It took 18 secs to finish with |
There are similar reports... Thomas Lee put up this case where it was slower. https://twitter.com/doctordns/status/1164461929093566465 I ran this it took 9 seconds The new equivalent is 31 seconds I ran yours as 14.7 seconds; Without the -parallel 7.2 seconds; Using start-parallel 4.3 seconds (both are fractionally faster if I use 300 threads ) |
Foreach parallel is not some magic that makes scripts run faster :). It uses PowerShell APIs to run script concurrently. It runs each concurrent scriptblock in a separate thread and that means system overhead for the thread as well as a PowerShell runspace context in which it runs. Running trivial script will almost certainly run slower with the -Parallel switch. But there are cases where it can be very beneficial to run scripts concurrently, depending on the system and number of cores. I am working on a blog article to help clarify this. |
That is understood. However this
Uses threadjob (which I think you wrote @PaulHigin ) and completes in 12 seconds This So it's a valid question why a for , starting a bunch of thread jobs, and processing the jobs afterwards is 2 1/2 times faster than the cmdlet to do the same thing |
The RFC PR is not for performance discussion - please open new tracking issue in PowerShell Core repository with repo steps and scenarios. |
@PowerShell/powershell-committee merging as accepted given @PaulHigin's latest updates to the examples. Thanks! |
No description provided.