Submit draft of RFC for ForEach-Object -Parallel proposal #194

Merged: 8 commits, Sep 9, 2019
149 changes: 116 additions & 33 deletions 1-Draft/RFCnnnn-ForEach-Parallel-Cmdlet.md

This RFC proposes a new parameter set for the existing ForEach-Object cmdlet to run a script block in parallel for each piped input item.
## Motivation

As a PowerShell User,
I can do simple fan-out concurrency with the PowerShell ForEach-Object cmdlet, without having to obtain and load a separate module or deal with PowerShell jobs unless I want to.
I can execute ForEach-Object piped input in script blocks running on parallel threads, either synchronously or asynchronously, while limiting the number of threads running at a given time.

## Specification

A new `-Parallel` parameter set will be added to the existing ForEach-Object cmdlet to support running piped input concurrently in a provided script block.

- `-Parallel` parameter switch specifies parallel script block execution

- `-ScriptBlock` parameter takes a script block that is executed in parallel for each piped input variable

- `-ThrottleLimit` parameter takes an integer value that determines the maximum number of script blocks running at the same time

- `-TimeoutSeconds` parameter takes an integer that specifies the maximum time to wait for completion before the command is aborted

- `-AsJob` parameter switch indicates that a job is returned, which represents the command running asynchronously

The `ForEach-Object -Parallel` command will return only after all piped input has been processed, unless the `-AsJob` switch is used, in which case a job object is returned immediately that monitors the ongoing execution state and collects generated data.
The returned job object can be used with all PowerShell cmdlets that manipulate jobs.
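
For example, the returned job can be driven with the standard job cmdlets (a minimal sketch using the syntax proposed above; the script block body is placeholder work):

```powershell
$job = 1..10 | ForEach-Object -Parallel -AsJob -ScriptBlock {
    # Placeholder per-item workload
    Start-Sleep -Seconds 1
    "Processed item $_"
}

$job | Wait-Job | Receive-Job   # block until all child jobs complete, then collect output
$job | Remove-Job               # clean up the finished job
```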

### Implementation details

Will ForEach-Object implement pipeline backpressure? If not, how will memory usage for pipelines like `Get-Content giantfile.txt | ForEach-Object -Parallel {'CPU-bound operation'}` be limited? If so, how will backpressure work?


I think the default ThrottleLimit would be the way to limit how many parallel jobs will run, no?


> I think the default ThrottleLimit would be the way to limit how many parallel jobs will run, no?

@daxian-dbw That's what I understand. The concern I'm raising here isn't about the number of parallel jobs. Rather, it is about how much of the output from `Get-Content giantfile.txt` is read into the input buffer of `ForEach-Object -Parallel` at any one time. As I read the RFC currently, the behavior in that respect is not specified. If no backpressure is implemented, then the amount of memory that could be consumed by `ForEach-Object -Parallel` is unbounded. If backpressure is implemented, then the particularities of that behavior will affect how performant `ForEach-Object -Parallel` is under different conditions.


Implementation will be similar to the ThreadJob module in that each thread's script block execution will be contained within a PSThreadChildJob object.
The jobs will run concurrently on separate runspaces/threads, up to the `-ThrottleLimit` value, and the remainder will be queued to wait for an available runspace/thread to run on.
Initial implementation will not attempt to reuse threads and runspaces when running queued items, due to concerns of stale state breaking script execution.
For example, PowerShell uses thread local storage to store per thread default runspaces.
And even though there is a runspace 'ResetRunspaceState' API method, it only resets session variables and debug/transaction managers.
Imported modules and function definitions are not affected.
A script that defines a constant function would fail if the function is already defined.
The initial assumption will be that runspace/thread creation time is insignificant compared to the time needed to execute the script block, either because of high compute needs or because of long wait times for results.
If this assumption is not true, then the user should consider batching the workload in each ForEach-Object iteration, or simply use the sequential/non-parallel form of the cmdlet.
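
For comparison, roughly equivalent fan-out behavior can be approximated today with the ThreadJob module, which the proposed implementation resembles (a sketch, assuming the ThreadJob module is available; `Get-Logs` is the hypothetical command used throughout this RFC):

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'

# One thread job per input item; at most 5 run at a time,
# the rest are queued until a runspace/thread becomes available.
$jobs = foreach ($computer in $computerNames) {
    Start-ThreadJob -ThrottleLimit 5 -ArgumentList $computer -ScriptBlock {
        param($name)
        Get-Logs -ComputerName $name
    }
}

$logs = $jobs | Wait-Job | Receive-Job
```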

The `-TimeoutSeconds` parameter will attempt to halt all script block executions after the timeout has elapsed; however, it may not be immediately successful if a running script is calling a native command or API, in which case it has to wait for that call to return before it can halt the running script.

### Variable passing

ForEach-Object -Parallel will support the PowerShell `$_` current piped item variable within each script block.
It will also support the `$using:` directive for passing variables from script scope into the parallel executed script block scope.
If the passed in variable is a value type, a copy of the value is passed to the script block.
If the passed in variable is a reference type, the reference is passed and each running script block can modify it.
Since the script blocks are running in different threads, modifying a reference type that is not thread safe will result in undefined behavior.
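
For example (a sketch using the proposed syntax; `Get-Logs` and its `-RetryCount` parameter are hypothetical):

```powershell
$computers = 'computer1','computer2','computer3'
$retryCount = 3   # value type: each script block receives its own copy

$logs = $computers | ForEach-Object -Parallel -ScriptBlock {
    # $_ is the current piped item; $using: imports a variable from the caller's scope
    Get-Logs -ComputerName $_ -RetryCount $using:retryCount
}
```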

Script block variables will be special cased because they have runspace affinity.
Therefore, script block variables will not be passed by reference; instead a new script block object instance will be created from the original script block variable's Ast (abstract syntax tree).

Will the creation of a new scriptblock apply only during $using: and/or parameter passing? Or will it also occur when the runspace encounters a scriptblock from another runspace at large? An example of such an encounter I'm thinking of is

    $g = Get-SomeScriptBlockGenerator

    1..10 | ForEach-Object -Parallel { . $using:g.GetNextScriptBlock() }

where the scriptblock produced by $g.GetNextScriptBlock() has affinity to the runspace where Get-SomeScriptBlockGenerator is invoked.


Actually, this has changed. Since it is impossible to cover all cases where script block variables can be passed to the foreach script block, it will not be supported (i.e., no attempt will be made to detect and pass script blocks via Ast or string). An attempt will be made to catch this and error out when script block variables are passed directly through piped input or using variables, but not all possible variations will be checked (since it is not really possible).

Documentation will warn users not to do this, along with other unsafe uses of variables while running in parallel, such as assigning to passed in variables that are not thread safe.


### Exceptions

For critical exceptions, such as out of memory or stack overflow, the CLR will crash the process.
Since all parallel running script blocks run in different threads in the same process, all running script blocks will terminate, and queued script blocks will never run.
This is different from PowerShell jobs (Start-Job), where each job script runs in a separate child process and therefore has better isolation from crashes.
The lack of process isolation is one of the costs of better performance while using threads for parallelization.

For all other catchable exceptions, PowerShell will catch them from each thread and write them as non-terminating error records to the error data stream.
If the `-ErrorAction` parameter is set to 'Stop', then the cmdlet will attempt to stop the parallel execution on any error.

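A sketch of how this might look with the proposed syntax (with `-ErrorAction Stop`, the first error becomes terminating and the remaining parallel work is asked to stop):

```powershell
try {
    1..5 | ForEach-Object -Parallel -ErrorAction Stop -ScriptBlock {
        if ($_ -eq 3) { throw "Item $_ failed" }
        "Item $_ succeeded"
    }
}
catch {
    # Running script blocks are signaled to stop; queued ones are dequeued
    Write-Warning "Parallel execution stopped: $_"
}
```
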
### Stop behavior


@PaulHigin: Were you planning for this to return a single job object, with one child thread job for each thread where the script block is invoked, or were you planning for this to return one job object for each thread where the script block is invoked? I presume the latter, but since it wasn't specified I wanted to ask to make sure.


Yes, I intend to follow PowerShell job convention and return a single job object with a child job for each pipeline iteration.


In that case, please review the other comments on this RFC related to why having -AsJob as a parameter is a bad idea, and how dropping -AsJob allows you to support -Begin and -End without complication.

Whenever a timeout, a terminating error (-ErrorAction Stop), or a stop command (Ctrl+C) occurs, a stop signal will be sent to all running script blocks, and any queued script block iterations will be dequeued.
This does not guarantee that a running script will stop immediately if that script is running a native command or making an API call.
So it is possible for a stop command to be ineffective if one running thread is busy or hung.

We can consider including some kind of 'forcetimeout' parameter that would kill any threads that did not end in a specified time.

If a job object is returned (`-AsJob`), the child jobs that were dequeued by the stop command will remain in the 'NotStarted' state.
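
A sketch of the expected stop behavior with the proposed `-AsJob` syntax (`$computers` and `Get-Logs` as in the earlier examples):

```powershell
$job = $computers | ForEach-Object -Parallel -ThrottleLimit 2 -AsJob -ScriptBlock {
    Get-Logs -ComputerName $_
}

Stop-Job -Job $job    # signal running script blocks to stop and dequeue the rest

# Iterations that never got a thread should remain 'NotStarted'
$job.ChildJobs | Group-Object -Property State
```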

### Data streams

Warning, Error, Debug, and Verbose data streams will be written to the cmdlet's data streams as they are received from each running parallel script block.
Progress data streams will not be supported, but can be added later if desired.
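
For example, messages written inside each parallel script block would surface on the corresponding streams of the `ForEach-Object` call itself (a sketch using the proposed syntax; `$computers` and `Get-Logs` as above):

```powershell
$computers | ForEach-Object -Parallel -ScriptBlock {
    Write-Verbose "Collecting logs from $_" -Verbose
    $logs = Get-Logs -ComputerName $_
    if (-not $logs) {
        Write-Warning "No logs returned from $_"
    }
    $logs
}
```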

### Supported scenarios

```powershell
# Ensure needed module is installed on local system
if (! (Get-Module -Name MyLogsModule -ListAvailable)) {
    Install-Module -Name MyLogsModule -Force
}
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logs = $computerNames | ForEach-Object -Parallel -ThrottleLimit 10 -TimeoutSeconds 1800 -ScriptBlock {
    Get-Logs -ComputerName $_
}
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$job = ForEach-Object -Parallel -ThrottleLimit 10 -InputObject $computerNames -TimeoutSeconds 1800 -AsJob -ScriptBlock {
    Get-Logs -ComputerName $_
}
$logs = $job | Wait-Job | Receive-Job
```

```powershell
$logs = ForEach-Object -Parallel -InputObject $computerNames -ScriptBlock {
    Get-Logs -ComputerName $_
}
```

```powershell
$computerNames = 'computer1','computer2','computer3','computer4','computer5'
$logNames = 'System','SQL','AD','IIS'
$logResults = ForEach-Object -Parallel -InputObject $computerNames -ScriptBlock {
    Get-Logs -ComputerName $_ -LogNames $using:logNames
} | ForEach-Object -Parallel -ScriptBlock {
    Process-Log $_
}
```

### Unsupported scenarios

Looking at the unsupported scenarios I see an inconsistency: we can send variables in via `$using:` but can return results only via return. Maybe discard `$using:` and only use `-ArgumentList` and `param()`? It seems it would be clearer for users (one entry point, one exit point) and it would be one step closer to being thread safe. It also resolves problems like "Value types not passed by reference" / "Passed in reference variables should not be assigned to".


Start-Parallel (https://www.powershellgallery.com/packages/start-parallel/1.3.0.0) goes down the -argumentlist and param() route.

`$using:` variables are read-only copies. When you run a script block on another computer I don't think there is any expectation that it sets variables on the computer that invokes the command. And in DSC a `Using` variable is written into the MOF file; when DSC applies the MOF file it wouldn't make sense to try to go back to the runspace which compiled it and change a variable, as the application phase will often happen when that runspace has ceased to exist.
Changing from the read-only copy paradigm means
(a) The runspace using the variable must read the current value "live" from the calling runspace.
(b) The child runspace must have a mechanism to write back, and the parent must track the variables that each runspace can read and write and authenticate that whatever is asking to read or write is really that runspace.
(c) Because there may be multiple runspaces writing back to the same variable, there must be some locking mechanism. One runspace locking a variable while it performs multiple updates may block others.

One of the things I keep saying is if the desire is to make a parallel version of foreach-object then:

  1. Differences from existing sequential foreach should be avoided where possible (hence keeping begin and end blocks; returning jobs very much optional).
  2. Many existing script blocks will fail if run unmodified in parallel runspaces. Therefore one difference is to reserve -process for sequential use. i.e. changing Foreach-object {$variable} to Foreach-object {stuff $variable} -parallel should not be syntactically valid. Requiring Foreach-object -parallel {...} tells us when reading or writing the code that the block runs in its own runspace with read only access to local variables, only if $Using is specified.
  3. Switching to `param()` and `-ArgumentList` means a greater change to the script block than supporting `$using:`, but the parameter changes when the block runs in a different runspace; try this:
        $foo = get-item .
        invoke-command -ScriptBlock {param ($foo) $foo | gm }  -ArgumentList $foo  -ComputerName localhost -EnableNetworkAccess

        invoke-command -ScriptBlock {param ($foo) $foo | gm }  -ArgumentList $foo

The first invoke has a deserialized object without the methods associated with a directory. The second has a "normal" directory object.

  4. Things are complicated by ForEach-Object already supporting `-ArgumentList` when used with `-MemberName`, e.g. `ForEach-Object -MemberName Split -ArgumentList "."`, but it does not work with a `-Process` block; whether it should is another matter.


The parser does add some restrictions to `$using:` variables, mainly because they were used for remoting. But I think we want those restrictions, given the undefined behavior of assignable `$using:` variables.

A new parameter set was added to foreach-object specifically to indicate that the -Parallel operation is not the same as the sequential operation using the traditional parameter set. I thought about creating a new cmdlet, but I feel a new parameter set is sufficient to differentiate.


```powershell
# Variables must be passed in via $using: keyword
$LogNameToUse = "IISLogs"
$computers | ForEach-Object -Parallel -ScriptBlock {
    # This will fail because $LogNameToUse has not been defined in this scope
    Get-Log -ComputerName $_ -LogName $LogNameToUse
}
```

```powershell
# Passed in reference variables should not be assigned to
$MyLogs = @()
$computers | ForEach-Object -Parallel -ScriptBlock {
    # Not thread safe, undefined behavior
    # Cannot assign to using variable
    $using:MyLogs += Get-Logs -ComputerName $_
}

$dict = [System.Collections.Generic.Dictionary[string,object]]::New()
$computers | ForEach-Object -Parallel -ScriptBlock {
    $dict = $using:dict
    $logs = Get-Logs -ComputerName $_
    # Not thread safe, undefined behavior
    $dict.Add($_, $logs)
}
```
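
If shared state is genuinely required, a thread-safe collection is one possible workaround (a sketch, not part of the proposal; `ConcurrentDictionary` performs its own locking):

```powershell
$dict = [System.Collections.Concurrent.ConcurrentDictionary[string,object]]::new()
$computers | ForEach-Object -Parallel -ScriptBlock {
    $localDict = $using:dict
    $logs = Get-Logs -ComputerName $_
    # TryAdd is safe to call concurrently from multiple threads
    $null = $localDict.TryAdd($_, $logs)
}
```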

```powershell
# Value types not passed by reference
$count = 0
$computers | ForEach-Object -Parallel -ScriptBlock {
    # Can't assign to using variable
    $using:count += 1
    $logs = Get-Logs -ComputerName $_
    return @{
        ComputerName = $_
        Count = $count
        Logs = $logs
    }
}
```
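
The supported pattern for the scenario above is to let each script block emit its results as pipeline output and aggregate on the caller side once all parallel work has completed (a sketch using the proposed syntax):

```powershell
$results = $computers | ForEach-Object -Parallel -ScriptBlock {
    @{
        ComputerName = $_
        Logs         = Get-Logs -ComputerName $_
    }
}

# Aggregate after the fact; no shared mutable state is needed
$count = $results.Count
```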

## Alternate Proposals and Considerations

@KirkMunro, Jun 25, 2019:

What if instead of looking at ForEach-Object for parallelism we scaled this out to allow parallelism to be used with any command?

For example, if we...

- ...had a new CmdletBinding attribute called SupportsParallelism.
- ...added -Parallel, -ThrottleLimit, -TimeoutSecs, and -AsJob common parameters to any advanced function or cmdlet that has SupportsParallelism set to $true in the CmdletBinding attribute (supporting the different common parameter sets that are proposed above, such that -AsJob could not be used with -TimeoutSecs).
- ...updated the $using: variable prefix so that it can be used in any process block inside of a command that has SupportsParallelism set to $true (Note: This would also need to allow $using: to work when the command is invoked without parallelism, so that the functionality would only need to be written once; the $using: prefix would simply identify a variable that is pulled into the parallel runspace/job when the command is invoked in parallel. The $using: prefix should probably not be required for parameters, though -- that should be implicit, and maybe $using could be implicit for variables defined in the function/command scope -- again, makes things much easier, and users would only need to worry about using in commands like ForEach-Object that accept script blocks).

Then we could...

- ...add the SupportsParallelism attribute to ForEach-Object invocations to get the functionality we're talking about here.
- ...add the SupportsParallelism attribute to any other command that is written to support parallelism.
- ...simplify command invocation when you want parallelism a great deal, which would keep ForEach-Object use to where it is really needed.
- ...maintain support for begin and end blocks (which are still useful -- even with parallelism, some resources such as connections, credentials, or collections to accumulate results may be needed and shared among the different threads/jobs).
- ...discover commands that support parallelism (and therefore those that do not).
- ...make multitasking much more of a "thing" in PowerShell.

Taking the examples provided above, they could be simplified as follows:

    $computerNames = 'computer1','computer2','computer3','computer4','computer5'
    $logNames = 'System','SQL'

    # Get logs from computers in parallel, throttled with a timeout
    $logs = Get-Logs -ComputerName $computerNames -Parallel -ThrottleLimit 10 -TimeoutSecs 1800

    # Get logs from computers in parallel using background jobs
    $logs = Get-Logs -ComputerName $computerNames -Parallel -ThrottleLimit 10 -AsJob | Receive-Job -Wait

    # Get System and SQL logs from computers in parallel with no throttle or timeout
    $logs = Get-Logs -ComputerName $computerNames -LogNames $logNames -Parallel

For folks who are going to look at this and complain about the lack of downlevel support in this functionality, I'd like to draw your attention to this RFC. That won't help projects that require a downlevel version, but PowerShell needs to start offering a lot more motivation for users to upgrade to newer releases, and this would be some pretty great motivation.


An interesting idea, and it would be a lot of work to cover all built-in cmdlets. A variation is to build on @BrucePay's work with the `&` operator to force execution as a background job.

    Get-Process -id $pid &
    Id     Name            PSJobTypeName   State         HasMoreData     Location             Command
    --     ----            -------------   -----         -----------     --------             -------
    5      Job5            BackgroundJob   Running       True            localhost            Microsoft.PowerShell.Man…

Maybe this can be extended to optionally fan-out using ThreadJobs.

This is outside the scope of this RFC, but is something we could look into in the future.

@KirkMunro, Jun 25, 2019:

You wouldn't have to cover all built-in cmdlets from the start. For the first cut you could just lay the foundation and update ForEach-Object to use SupportsParallelism (or just call that SupportsParallel for short and to match the -Parallel common parameter name). The other built-in cmdlets could be updated over time. That keeps focus on what this RFC is trying to do (add parallelism to ForEach-Object) while doing so in such a way that it can scale to any cmdlet/advanced function.

The initial work shouldn't be that much greater than what would be required to implement what was originally proposed in this RFC (a specific modification of ForEach-Object), and that should be in scope for this RFC, at least listed in the alternate proposals and considerations section. The [CmdletBinding(SupportsParallelism)] approach feels like a much better strategy to achieve the desired end goal for all of PowerShell than simply updating ForEach-Object.

@KirkMunro, Jun 25, 2019:

Also regarding the & background operator and thread jobs comment, see PowerShell Issue #9873.


Are you thinking of this in the same simple sense as SupportsShouldProcess where all you're doing is opting in to a few common parameters and magic variables which you may --or may not, actually-- use (properly)?


I've already started a separate RFC for this to see where that leads. Thanks for sharing these thoughts @Jaykul, I had some of them already, but not quite all. Automatic implementation is the goal, for consistency and performance, but I'm still thinking it would be bound to specific named parameter sets, or all parameter sets if you don't name them. Part of what I hope happens here is to "corral the troops" so to speak, and bring a consistent implementation to the table that commands can adopt.

@oising, Jun 26, 2019:

What about adding a new, mutually exclusive [to process] processasync/parallelprocess block in cmdlets/functions?

    function foo {
        begin { }
        processasync { }
        end { }
    }

It's so easy to write that out. Additional control might be handled by extra properties in [CmdletBinding()], such as ThrottleLimit etc, which could also be overridden on a per-invocation basis, by adding ThrottleLimit etc as common parameters.

Or maybe, even:

    function foo {
        [CmdletBinding(AsyncProcess=true)]
        begin {}
        process {}
        end {}
    }

Perhaps this is a little easier to do and renders any function parallelizable (thar be dragons always, but I digress). If AsyncProcess is true, then it lights up the parallel/job common parameters (ThrottleLimit etc.).

I'm not tied to "async" versus "parallel" btw; perhaps the latter is a better choice to imply "jobs" and not "task".

@KirkMunro, Jun 26, 2019:

@oising The last thing you suggested (CmdletBinding configuration) is exactly what started this part of the discussion, albeit with different terminology ([CmdletBinding(SupportsParallel=...)]). I'm also considering processasync and the corresponding ProcessRecordAsync while I work through the complementary RFC.

See: #194 (comment)


Cool. I'm leaning towards the idea of just leaving the process keyword alone (e.g. no new keywords like processasync), but in the .NET API, surface ProcessRecordAsync, and have them materialize as threadjobs in the console.


Here's the complementary RFC I wrote up today: #206.


Another option (and a previous RFC proposal) is to resurrect the PowerShell Windows workflow script `foreach -parallel` keyword to be used in normal PowerShell script to perform parallel execution of foreach loop iterations.
However, the majority of the community felt it would be more useful to update the existing ForEach-Object cmdlet with a `-Parallel` parameter set.
We may want to eventually implement both solutions.
But the ForEach-Object -Parallel proposal in this RFC should be implemented first since it is currently the most popular.

There are currently other proposals to create a more general framework to support running arbitrary scripts and cmdlets in parallel, by marking them as able to support parallelism (see RFC #206).
That is outside the scope of this RFC, which focuses on extending just the ForEach-Object cmdlet to support parallel execution, and is intended to allow users to do parallel script/command execution without having to resort to PowerShell APIs.