# Submit draft of RFC for ForEach-Object -Parallel proposal #194
---
RFC: RFCnnnn
Author: Paul Higinbotham
Status: Draft
SupercededBy: N/A
Version: 1.0
Area: Engine
Comments Due: July 18, 2019
Plan to implement: Yes
---

# PowerShell ForEach-Object -Parallel Cmdlet

This RFC proposes a new parameter set for the existing ForEach-Object cmdlet to parallelize script block executions, instead of running them sequentially as it does now.

## Motivation

As a PowerShell user,
I can execute ForEach-Object piped input in script blocks running in parallel threads, either synchronously or asynchronously, while limiting the number of threads running at a given time.

## Specification

A new `-Parallel` parameter set will be added to the existing ForEach-Object cmdlet that supports running piped input concurrently in a provided script block.

- `-Parallel` parameter switch specifies parallel script block execution
- `-ScriptBlock` parameter takes a script block that is executed in parallel for each piped input variable
- `-ThrottleLimit` parameter takes an integer value that determines the maximum number of script blocks running at the same time
- `-TimeoutSecs` parameter takes an integer that specifies the maximum time in seconds to wait for completion before the command is aborted

> In Windows PowerShell we use the "Timeout" name:
>
>     gcm -ParameterName Timeout
>
>     CommandType Name              Version Source
>     ----------- ----              ------- ------
>     Cmdlet      Register-WmiEvent 3.1.0.0 Microsoft.PowerShell.Management
>     Cmdlet      Restart-Computer  3.1.0.0 Microsoft.PowerShell.Management
>     Cmdlet      Start-Transaction 3.1.0.0 Microsoft.PowerShell.Management
>     Cmdlet      Wait-Event        3.1.0.0 Microsoft.PowerShell.Utility
>     Cmdlet      Wait-Job          3.0.0.0 Microsoft.PowerShell.Core
>     Cmdlet      Wait-Process      3.1.0.0 Microsoft.PowerShell.Management

> I think it is important to be specific about the time units, maybe 'TimeoutSeconds'.

> Agreed. And assuming we don't have multiple such parameters, the name can be shortened to

- `-AsJob` parameter switch indicates that a job is returned, which represents the command running asynchronously

> @PaulHigin: Were you planning for this to return a single job object, with one child thread job for each thread where the script block is invoked, or were you planning for this to return one job object for each thread where the script block is invoked? I presume the latter, but since it wasn't specified I wanted to ask to make sure.

> Yes, I intend to follow PowerShell job convention and return a single job object with multiple child jobs for each pipeline iteration.

> In that case, please review the other comments on this RFC related to why having

The `ForEach-Object -Parallel` command will return only after all piped input has been processed, unless the `-AsJob` switch is used, in which case a job object is returned immediately that monitors the ongoing execution state and collects generated data.

The returned job object can be used with all PowerShell cmdlets that manipulate jobs.
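
The following is an illustrative sketch only, not a normative example from this RFC: it assumes the parameter names described above, and `Get-Logs` is the hypothetical command used throughout this document. It shows how the job returned by `-AsJob` might be managed with the standard job cmdlets.

```powershell
# Sketch only: assumes the proposed -Parallel/-AsJob parameter set.
$computerNames = 'computer1','computer2','computer3'
$job = $computerNames | ForEach-Object -Parallel -AsJob -ScriptBlock {
    Get-Logs -ComputerName $_    # Get-Logs is a hypothetical command used throughout this RFC
}

$job.ChildJobs | Select-Object Name, State   # one child job per pipeline iteration
$logs = $job | Wait-Job | Receive-Job        # block until all iterations finish, then collect output
$job | Remove-Job                            # clean up the finished job
```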

### Implementation details

The implementation will be similar to the ThreadJob module in that each script block execution will be contained within a PSThreadChildJob object.
The jobs will run concurrently on separate runspaces/threads up to the ThrottleLimit value, with the remainder queued to wait for an available runspace/thread.
The initial implementation will not attempt to reuse threads and runspaces when running queued items, due to concerns about stale state breaking script execution.
For example, PowerShell uses thread-local storage to store per-thread default runspaces, and even though there is a runspace `ResetRunspaceState` API method, it only resets session variables and debug/transaction managers; imported modules and function definitions are not affected.
A script that defines a constant function would therefore fail if the function is already defined.
The initial assumption is that runspace/thread creation time is insignificant compared to the time needed to execute the script block, either because of high compute needs or because of long wait times for results.
If this assumption does not hold, the user should consider batching the workload into each ForEach-Object iteration (see the sketch below), or simply use the sequential/non-parallel form of the cmdlet.
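
As an illustration of the batching suggestion (a sketch under assumed inputs, not part of the RFC; the item set and chunk size are arbitrary):

```powershell
# Sketch only: group many small items into batches so each parallel
# script block does enough work to amortize runspace/thread startup cost.
$items = 1..10000
$batchSize = 1000
$batches = for ($i = 0; $i -lt $items.Count; $i += $batchSize) {
    # The leading comma keeps each slice together as a single pipeline object
    , $items[$i..([Math]::Min($i + $batchSize, $items.Count) - 1)]
}

$results = $batches | ForEach-Object -Parallel -ThrottleLimit 5 -ScriptBlock {
    foreach ($item in $_) {
        $item * 2    # placeholder for real per-item work
    }
}
```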

The `-TimeoutSecs` parameter will attempt to halt all script block executions after the timeout has elapsed; however, it may not be immediately successful if the running script is calling a native command or API, in which case it must wait for that call to return before it can halt the running script.

### Variable passing

> Are variables serialized and deserialized like happens when using Start-Job? Or are references to the original object instances available in the scriptblocks? Will scriptblocks be allowed to be passed as variables? If so, is it within the scope of this RFC to resolve PowerShell/PowerShell#7626? If not, will the prohibition on passing scriptblocks be enforced or merely documented?

> I believe it's this one. We are talking about runspaces in the same process, so no need to serialize/deserialize.

> This suggests to me that ForEach-Object -Parallel would rely on users correctly converting every scriptblock to its .Ast prior to passing. Is that the intent? That doesn't seem like a great strategy to me because it's error prone. I also don't think that is possible to do in general. For example, there can be modules that emit objects with scriptblock properties which are bound to a SessionState in the parent runspace. For such modules the user would not have the opportunity to intervene.

> The current plan is to not do any serialization and to pass object copies (for value types) and references to each running script block via the `$using` keyword and the `$_` variable for piped input. The parser already disallows assignment to a `$using` variable; however, one can assign a reference value to another variable and assign to that, with undefined behavior. I tried to call this out in the RFC as unsupported.

> No, the cmdlet can pull the AST off the scriptblock internally, and then pass that along.

> Yes, for scriptblock variables. I was talking about variable types in general.

ForEach-Object -Parallel will support the PowerShell `$_` current piped item variable within each script block.
It will also support the `$using:` directive for passing variables from the script scope into the parallel executed script block scope.
If the passed-in variable is a value type, a copy of the value is passed to the script block.
If the passed-in variable is a reference type, the reference is passed and each running script block can modify it.
Since the script blocks run in different threads, modifying a reference type that is not thread safe will result in undefined behavior.

> Side note. PSSA could catch these cases.

> This can (and is) also caught within PowerShell by examining the AST. My point is that the behaviors are not supported.

### Supported scenarios

```powershell
# Ensure needed module is installed on local system
if (! (Get-Module -Name MyLogsModule -ListAvailable)) {
    Install-Module -Name MyLogsModule -Force
}
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logs = $computerNames | ForEach-Object -Parallel -ThrottleLimit 10 -TimeoutSecs 1800 -ScriptBlock {
    Get-Logs -ComputerName $_
}
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$job = ForEach-Object -Parallel -ThrottleLimit 10 -InputObject $computerNames -TimeoutSecs 1800 -AsJob -ScriptBlock {
    Get-Logs -ComputerName $_
}
$logs = $job | Wait-Job | Receive-Job
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logNames = 'System','SQL'
$logs = ForEach-Object -Parallel -InputObject $computerNames -ScriptBlock {
    Get-Logs -ComputerName $_ -LogNames $using:logNames
}
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logNames = 'System','SQL','AD','IIS'
$logResults = ForEach-Object -Parallel -InputObject $computerNames -ScriptBlock {
    Get-Logs -ComputerName $_ -LogNames $using:logNames
} | ForEach-Object -Parallel -ScriptBlock {
    Process-Log $_
}
```

### Unsupported scenarios

> Looking at the unsupported scenarios I see an inconsistency - we can send variables by

> Start-Parallel (https://www.powershellgallery.com/packages/start-parallel/1.3.0.0) goes down the

> One of the things I keep saying is if the desire is to make a parallel version of

> The first invoke has a deserialized object without the methods associated with a directory. The second has a "normal" directory object.

> The parser does add some restrictions to `$using` variables, mainly because they were used for remoting. But I think we want those restrictions given the undefined behavior of assignable `$using` variables. A new parameter set was added to ForEach-Object specifically to indicate that the -Parallel operation is not the same as the sequential operation using the traditional parameter set. I thought about creating a new cmdlet, but I feel a new parameter set is sufficient to differentiate.

```powershell
# Variables must be passed in via $using: keyword
$LogNameToUse = "IISLogs"
$computers | ForEach-Object -Parallel -ScriptBlock {
    # This will fail because $LogNameToUse has not been defined in this scope
    Get-Log -ComputerName $_ -LogName $LogNameToUse
}
```

```powershell
# Passed in reference variables should not be assigned to
$MyLogs = @()
$computers | ForEach-Object -Parallel -ScriptBlock {
    # Not thread safe, undefined behavior
    # Cannot assign to using variable
    $using:MyLogs += Get-Logs -ComputerName $_
}

$dict = [System.Collections.Generic.Dictionary[string,object]]::New()
$computers | ForEach-Object -Parallel -ScriptBlock {
    $dict = $using:dict
    $logs = Get-Logs -ComputerName $_
    # Not thread safe, undefined behavior
    $dict.Add($_, $logs)
}
```
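
By contrast, a thread-safe collection type avoids the undefined behavior shown above. This is an illustrative sketch only (not an example from the RFC); `Get-Logs` remains the hypothetical command used earlier.

```powershell
# Sketch only: ConcurrentDictionary is safe to update from multiple threads.
$safeDict = [System.Collections.Concurrent.ConcurrentDictionary[string,object]]::new()
$computers | ForEach-Object -Parallel -ScriptBlock {
    $dict = $using:safeDict
    $logs = Get-Logs -ComputerName $_
    # TryAdd is a thread-safe insert; it returns $false if the key already exists
    $null = $dict.TryAdd($_, $logs)
}
```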

```powershell
# Value types not passed by reference
$count = 0
$computers | ForEach-Object -Parallel -ScriptBlock {
    # Can't assign to using variable
    $using:count += 1
    $logs = Get-Logs -ComputerName $_
    return @{
        ComputerName = $_
        Count = $count
        Logs = $logs
    }
}
```
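
An alternative sketch (illustration only, not RFC content): rather than mutating shared state such as a counter, each script block can simply emit its result as pipeline output and the caller aggregates the collected results afterwards.

```powershell
# Sketch only: aggregate in the caller's scope instead of sharing a counter.
$results = $computers | ForEach-Object -Parallel -ScriptBlock {
    [pscustomobject]@{
        ComputerName = $_
        Logs         = Get-Logs -ComputerName $_   # hypothetical command from the earlier examples
    }
}
@($results).Count   # number of computers processed, computed after the fact
```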

## Alternate Proposals and Considerations

> What if instead of looking at
> For example, if we...
> Then we could...
> Taking the examples provided above, they could be simplified as follows:
>
>     $computerNames = 'computer1','computer2','computer3','computer4','computer5'
>     $logNames = 'System','SQL'
>
>     # Get logs from computers in parallel, throttled with a timeout
>     $logs = Get-Logs -ComputerName $computerNames -Parallel -ThrottleLimit 10 -TimeoutSecs 1800
>
>     # Get logs from computers in parallel using background jobs
>     $logs = Get-Logs -ComputerName $computerNames -Parallel -ThrottleLimit 10 -AsJob | Receive-Job -Wait
>
>     # Get System and SQL logs from computers in parallel with no throttle or timeout
>     $logs = Get-Logs -ComputerName $computerNames -LogNames $logNames -Parallel
>
> For folks who are going to look at this and complain about the lack of downlevel support in this functionality, I'd like to draw your attention to this RFC. That won't help projects that require a downlevel version, but PowerShell needs to start offering a lot more motivation for users to upgrade to newer releases, and this would be some pretty great motivation.

> An interesting idea, and it would be a lot of work to cover all built-in cmdlets. A variation is to build on @BrucePay's work with the
>
>     Get-Process -id $pid &
>
>     Id Name PSJobTypeName State   HasMoreData Location  Command
>     -- ---- ------------- -----   ----------- --------  -------
>     5  Job5 BackgroundJob Running True        localhost Microsoft.PowerShell.Man…
>
> Maybe this can be extended to optionally fan out using ThreadJobs. This is outside the scope of this RFC, but is something we could look into in the future.

> You wouldn't have to cover all built-in cmdlets from the start. For the first cut you could just lay the foundation and update
> The initial work shouldn't be that much greater than what would be required to implement what was originally proposed on this RFC (a specific modification of

> Also regarding the

> Are you thinking of this in the same simple sense as

> I've already started a separate RFC for this to see where that leads. Thanks for sharing these thoughts @Jaykul, I had some of them already, but not quite all. Automatic implementation is the goal, for consistency and performance, but I'm still thinking it would be bound to specific named parameter sets, or all parameter sets if you don't name them. Part of what I hope happens here is to "corral the troops" so to speak, and bring a consistent implementation to the table that commands can adopt.

> What about adding a new, mutually exclusive [to process] processasync/parallelprocess block in cmdlets/functions?
> It's so easy to write that out. Additional control might be handled by extra properties in
> Or maybe, even:
> Perhaps this is a little easier to do and renders any function parallelizable (thar be dragons always, but I digress). If AsyncProcess is true, then it lights up the parallel/job common parameters (ThrottleLimit etc.). I'm not tied to "async" versus "parallel" btw; perhaps the latter is a better choice to imply "jobs" and not "task".

> @oising The last thing you suggested (CmdletBinding configuration) is exactly what started this part of the discussion, albeit with different terminology. See: #194 (comment)

> Cool. I'm leaning towards the idea of just leaving the process keyword alone (e.g. no new keywords like processasync), but in the .NET API, surface ProcessRecordAsync, and have them materialize as thread jobs in the console.

> Here's the complementary RFC I wrote up today: #206.

Another option (and a previous RFC proposal) is to resurrect the PowerShell Windows workflow script `foreach -parallel` keyword for use in normal PowerShell script to perform parallel execution of foreach loop iterations.
However, the majority of the community felt it would be more useful to update the existing ForEach-Object cmdlet with a `-Parallel` parameter set.
We may want to eventually implement both solutions.

There are currently other proposals to create a more general framework for running arbitrary scripts and cmdlets in parallel, by marking them as able to support parallelism (see RFC #206).
That is outside the scope of this RFC, which focuses on extending just the ForEach-Object cmdlet to support parallel execution, and is intended to allow users to do parallel script/command execution without having to resort to PowerShell APIs.

> I think it complicates the ForEach-Object cmdlet too much and limits future development opportunities.
> From previous discussion #174 (comment):
> Yesterday I accidentally saw a GNU utility like "parallel --pipe". Also we could find "Where-Object -Parallel" useful too. And "Select-String". And others. So this suggests that maybe we need something more general for pipe parallelization (Parallel-Pipe):
>
>     $hugeArray | Parallel-Pipe { Where-Object Name -eq "test" }
>     $hugeArray | Parallel-Pipe { Select-String -Pattern "test" }
>     $hugeArray | Parallel-Pipe -AsJob { ... }

> A rose by any other name would smell as sweet:
>
>     $hugeArray | ForEach-Object -Parallel { $_ | Where-Object Name -eq "test" }
>     $hugeArray | ForEach-Object -Parallel { $_ | Select-String -Pattern "test" }
>     $hugeArray | ForEach-Object -Parallel -AsJob { ... }
>
> I think in the name of single-purpose, composable cmdlets, having a single parallel cmdlet (rather than trying to parallelise each individually) is the right way to go. But I think ForEach-Object has that brief of Select-like enumeration. And just like LINQ has .AsParallel(), I think -Parallel makes sense to do this. But that's just a brief and not-strongly-held opinion :)

> There is a deeper problem with "ForEach-Object -Parallel" - it has a lot of parameters, and how will "-Begin/Process/End" etc. work - in the "ForEach-Object -Parallel" context or in the script block context? I think we will have to duplicate parameters. In that case Parallel-Pipe is simpler and safer:
>
>     Parallel-Pipe -InputObject $data
>         -Begin { … }
>         -Process { Foreach-Object -Begin { … } -Process { … } -End { … } }
>         -End { … }

> -Begin, -Process, and -End switches will not be part of the new parameter set, since they don't make sense in this parallel case.

> There's a meaning. Especially in binary cmdlets it should work just like in ForEach-Object (why do we have the blocks there?). Rather, there is a difficulty in implementation.
> The current design looks like a workaround - it is simple to implement but keeps all the current limitations (see @alx9r's comments below). My suggestion is to implement natively parallel pipelines that give significant flexibility, performance and extensibility.

> I think it does make sense to keep the
> that exists in the sequential for each.

> If we're removing `-Begin`, `-Process`, and `-End`, then why are you proposing to overload `ForEach-Object` instead of simply adding a new command? There's literally nothing else in this command, so you're really creating an entirely new command! `ForEach-Object` is currently one of the worst performing cmdlets in PowerShell (compare `$power = 1..1e5 | foreach { $_ * 2 }` to `$power = 1..1e5 | & { process { $_ * 2 } }`) and although I doubt this change will make it slower, it will make it more complicated -- reducing the possibility of ever fixing it 🤨
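
For reference, a sketch of how the performance comparison mentioned in the last comment could be measured locally (an illustration only; results vary by machine and PowerShell version, and these are not figures from the RFC or the discussion):

```powershell
# Sketch only: compare ForEach-Object against a bare process block.
Measure-Command { $power = 1..1e5 | ForEach-Object { $_ * 2 } }
Measure-Command { $power = 1..1e5 | & { process { $_ * 2 } } }
```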