
Fixes performance problem caused by deep hierarchies, ResolveAll, and unusable constructors #158

Open
wants to merge 2 commits into base: master

Conversation


@JoshuaRogers JoshuaRogers commented Dec 15, 2022

Hi, sorry for the wall of text. I've tried to break this into a few sections and for each one show how the performance problem manifests. I felt that it was necessary to understand the reasoning for the approach at the end. If you prefer to skip to the end, the TL;DR is that "CanResolve, Resolve, and ResolveAll recalculate the same values repeatedly for some structures, causing multiple second delays."

When resolving an object, TinyIoC first needs to verify that its dependencies can be resolved. The way it handles this is to reflect over the dependency's constructors and check, for each one, whether all of its parameters can themselves be constructed. This continues recursively until a satisfiable constructor is found for each dependency and for the original object. After finding that an object can be resolved, TinyIoC then begins construction, a process that requires performing the previous calculation again to determine the correct constructor, as no data is cached.
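In code terms, the check behaves roughly like the following (a hypothetical sketch for illustration, not TinyIoC's actual implementation):

```csharp
using System;
using System.Linq;

static class NaiveResolver
{
    // Roughly the behavior described above: walk the constructors from most
    // parameters to fewest, and accept the first one whose parameter types can
    // all themselves be constructed. Nothing is remembered between calls, so a
    // type shared by several constructors is re-checked once per appearance.
    public static bool CanConstruct(Type type)
    {
        return type.GetConstructors()
            .OrderByDescending(c => c.GetParameters().Length)
            .Any(c => c.GetParameters().All(p => CanConstruct(p.ParameterType)));
    }
}
```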

This alone is unnoticeable, but it is a small example of a problem that grows at scale in the presence of large dependency trees, ResolveAll, or constructors that cannot be successfully resolved. In all three cases, the performance problem is the result of repeatedly recalculating constructability for the same types.

Large Dependency Tree

Consider the below as an example of a small "large dependency tree".

class Foo
{
    public Foo(Bar bar, Baz baz, Qux qux) { }
}

class Bar
{
    public Bar(Baz baz, Qux qux) { }
}

class Baz
{
    public Baz(Qux qux) { }
}

class Qux { }

Due to the lack of caching, a request to resolve Foo leads to the following:

  • Check if Foo can be resolved
    • Check if Bar can be resolved
      • Check if Baz can be resolved
        • Check if Qux can be resolved
      • Check if Qux can be resolved
    • Check if Baz can be resolved
      • Check if Qux can be resolved
    • Check if Qux can be resolved

This totals 1 check for Bar, 2 for Baz, and 4 for Qux (Qux is checked once per appearance in the tree), for a total of 7 eligibility tests. As the same logic must be performed again during the resolution stage, after we are certain that a constructor chain exists, the effective number of tests is doubled, coming in at 14.
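The growth is easy to demonstrate by instrumenting a naive recursive check with a counter (again a hypothetical sketch, not TinyIoC's code):

```csharp
using System;
using System.Linq;

class Foo { public Foo(Bar bar, Baz baz, Qux qux) { } }
class Bar { public Bar(Baz baz, Qux qux) { } }
class Baz { public Baz(Qux qux) { } }
class Qux { }

static class CheckCounter
{
    public static int Checks;

    public static bool CanConstruct(Type type)
    {
        Checks++; // one increment per node in the tree shown above
        return type.GetConstructors()
            .OrderByDescending(c => c.GetParameters().Length)
            .Any(c => c.GetParameters().All(p => CanConstruct(p.ParameterType)));
    }
}
```

Calling `CheckCounter.CanConstruct(typeof(Foo))` increments the counter once for every node in the tree above; in particular, Qux is counted once per appearance, and the total doubles again if the same walk is repeated during construction.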

Multiple Resolutions

Another possibility is that we have a number of event handlers all of which need to be resolved and which share dependencies, such as one might expect of a database or logger.

public class FooEventHandler1 : IFooEventHandler
{
    public FooEventHandler1(Foo foo) { }
}

public class FooEventHandler2 : IFooEventHandler
{
    public FooEventHandler2(Foo foo) { }
}

public class FooEventHandler3 : IFooEventHandler
{
    public FooEventHandler3(Foo foo) { }
}

public class FooEventHandler4 : IFooEventHandler
{
    public FooEventHandler4(Foo foo) { }
}

Assuming the above classes are all registered with TinyIoC as IFooEventHandler, then a call to ResolveAll<IFooEventHandler>() would check that the common dependency Foo could be resolved on each handler resolution, for a total of four checks of the same type.

If, continuing with the class definitions from the previous example, we remember that each check of Foo causes 7 tests to occur, followed by another 7 during the actual construction of the object, then we could expect this call to ResolveAll to cause 7 * (4 + 1) tests, for a total of 35.

Unusable Constructors

There's still one more case that leads to test multiplication: when the most preferable constructor (the one requiring the most arguments) has a dependency that cannot be resolved via IoC, such as when it is produced by a factory or a deserializer.

class Foo
{
    public Foo(Bar bar, Baz baz, Qux qux, Bang bang) { }

    public Foo(Bar bar, Baz baz, Qux qux) { }
}

If we were to update our original Foo class to have a constructor that requires an unresolvable type "Bang", the effect is that we manage to double the tests performed: TinyIoC will test Bar, Baz, and Qux successfully before finding that Bang cannot be constructed. Knowing that this constructor cannot be resolved, TinyIoC will then try the next constructor in line, attempting to resolve Bar, Baz, and Qux a second time, not remembering that we've already found them to be resolvable. The effect here is (4 + 1) * (7 + 7 + 1): each of the 4 IFooEventHandlers requires the chain be tested, followed by one final test during construction to find the appropriate constructor. The result, given these simplistic classes and their hierarchy, is 75 tests even though there are only 10 constructors defined between the 9 classes.

This last case is the easiest to handle, as an attribute on the constructor can tell TinyIoC which constructor to use, but that only works for first-party code and in many cases gives the constructed objects an awareness of TinyIoC that they did not previously have.

Summary

Given these three very real and very common patterns, the effective number of checks can be in the millions in a larger codebase, taking 20-30 seconds per request, depending on the number of handlers needing construction.

Resolution

The resolution for this issue is rather straightforward: after testing a Type or constructor, the result should be stored so that it does not have to be recalculated.

My first approach was to store the value within TinyIoC so that separate calls to Resolve or ResolveAll would be able to use a common cache. This turned out to be impractical, though, since calls to these methods can specify extra parameters that affect resolution, such as named resolution. Using a global cache would require either that named resolutions were not cached or that the logic be complicated to cache them while preventing a memory leak.

Given that limitation, I took the approach of creating a cache that gets passed around internally over the lifetime of the Resolve call and gets populated as tests occur. This means that repeated calls to Resolve will perform the same checks, but each check will only be performed a single time within a Resolve, changing what could be quadratic growth in time into linear time.
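The per-call cache can be sketched as a memoization dictionary created at the top of the call and threaded through the recursion (a simplified illustration of the approach, not the actual PR code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CachingResolver
{
    // Public entry point: a fresh cache lives only for this call, so extra
    // resolution parameters (e.g. named resolution) can never poison it.
    public static bool CanConstruct(Type type) =>
        CanConstruct(type, new Dictionary<Type, bool>());

    static bool CanConstruct(Type type, Dictionary<Type, bool> cache)
    {
        if (cache.TryGetValue(type, out bool known))
            return known;

        // Seed with false so a circular dependency terminates instead of
        // recursing forever.
        cache[type] = false;

        bool result = type.GetConstructors()
            .OrderByDescending(c => c.GetParameters().Length)
            .Any(c => c.GetParameters().All(p => CanConstruct(p.ParameterType, cache)));

        cache[type] = result;
        return result;
    }
}
```

With the cache in place, each type in the tree is tested at most once per top-level call, which is what turns the quadratic-looking growth back into linear growth.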

I verified that this fixed the issue that I was seeing and I also ran the unit tests to make sure that this did not introduce a regression.

@gkarabin
Contributor

It would be interesting to add a commit before this one with new performance tests to illustrate the issue, then run again after the change to show the difference. Is that feasible?

@JoshuaRogers
Author

Sure thing! I won’t be able to practically create a hierarchy as large as a production one, but I should be able to make one large enough to get a reasonably interesting test.

I’ll start on this first thing tomorrow.

Joshua Rogers and others added 2 commits December 16, 2022 12:33
@JoshuaRogers
Author

I've added a test using a deep (though quite ugly) hierarchy. Locally, I had to change the project to run on .NET 4.6.1 due to missing SDK dependencies for 4.5, but I did not check that change in (in case you are curious why the table shows .NET 4.6.1 instead of 4.5).

== Results Start Here ==

```
// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.20348
Intel Xeon CPU E3-1505M v6 3.00GHz, 4 CPU, 4 logical and 4 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4515.0), X64 RyuJIT
  .NET 4.6.1 : .NET Framework 4.8 (4.8.4515.0), X64 RyuJIT
  .NET 4.8   : .NET Framework 4.8 (4.8.4515.0), X64 RyuJIT
  RyuJitX64  : .NET Framework 4.8 (4.8.4515.0), X64 RyuJIT

Jit=RyuJit  Platform=X64
```

| Method | Job | Runtime | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Original_DeepHierarchy | .NET 4.6.1 | .NET 4.6.1 | 13,937.3 ms | 271.60 ms | 333.54 ms | 13,883.4 ms | 969000.0000 | 8000.0000 | - | 5819.09 MB |
| New_DeepHierarchy | .NET 4.6.1 | .NET 4.6.1 | 354.8 ms | 8.29 ms | 24.06 ms | 351.6 ms | 16000.0000 | - | - | 98.34 MB |
| Original_DeepHierarchy | .NET 4.8 | .NET 4.8 | 14,441.1 ms | 252.93 ms | 236.59 ms | 14,504.5 ms | 969000.0000 | 8000.0000 | - | 5819.09 MB |
| New_DeepHierarchy | .NET 4.8 | .NET 4.8 | 350.7 ms | 6.99 ms | 15.20 ms | 353.3 ms | 16000.0000 | - | - | 98.34 MB |
| Original_DeepHierarchy | RyuJitX64 | .NET 4.8 | 14,805.7 ms | 313.62 ms | 894.79 ms | 14,856.4 ms | 969000.0000 | 8000.0000 | - | 5819.09 MB |
| New_DeepHierarchy | RyuJitX64 | .NET 4.8 | 358.1 ms | 9.21 ms | 27.02 ms | 349.2 ms | 16000.0000 | - | - | 98.34 MB |

@JoshuaRogers
Author

@gkarabin : I've attached the benchmark, as requested. Is there anything else that I need to do?

@gkarabin
Copy link
Contributor

gkarabin commented Jan 5, 2023

I can’t speak for the maintainers, like @niemyjski or @grumpydev - they would have to accept/merge.

I want to run the benchmarks myself and look for regressions. I also want to spend some time digging into the code and understanding it. Unfortunately, with the holidays and deadlines for my job I haven’t made time for it. It’s on my to-do list, but I can’t make a firm commitment.

@JoshuaRogers
Author

That makes complete sense. I'm just a little over eager. I'll tone it down a little. :)

@niemyjski
Collaborator

niemyjski commented Jan 12, 2023

@JoshuaRogers thanks for the PR, I'm in the same boat as @gkarabin when it comes to time. Really appreciate the effort you have put into this.

At first, I was a bit concerned seeing all the new instances of ResolutionCache being created but then read your thread.

Some questions that I had:

  1. Have you found any need to lazily create the resolution cache or its dictionaries to prevent extra allocations? If so, did this add more overhead or greatly increase the complexity for negligible gain?
  2. Is there any way to determine the correct size for the dictionaries so they aren't resized often?
  3. What's the lifetime of this resolution cache? How are these disposed and cleaned up?

@niemyjski
Collaborator

niemyjski commented Jan 12, 2023

> I want to run the benchmarks myself and look for regressions. I also want to spend some time digging into the code and understanding it. Unfortunately, with the holidays and deadlines for my job I haven’t made time for it. It’s on my to-do list, but I can’t make a firm commitment.

@gkarabin @grumpydev Would love any feedback that you have; it does look ok from the files view, but I haven't run the code locally.

@niemyjski
Collaborator

Looks like the build failed for this; we may need to update the build. I'm fine with cutting unsupported build targets that are EOL.

@gkarabin
Contributor

gkarabin commented Mar 25, 2023

I finally had a chance to take a closer look at the benchmark today. I wanted to see how the existing benchmarks behave after your change. I noticed that, at least in your PR, you hadn't updated TinyIoCOriginal.cs (which you need to do, or your benchmarks would be comparing against an outdated baseline).

I bundled up the results of 3 runs: the main branch of the grumpydev repo, the main branch of your repo, and the main branch of your repo with TinyIoCOriginal.cs updated to the contents of the grumpydev main branch:

results.zip

I only used .NET Core 3.1, and did not use .NET Framework (which is how your tests were done). I suspect that it does not matter, though.

Your new tests are an order of magnitude faster when run using your PR compared to the updated TinyIoCOriginal.cs from the grumpydev main branch. The other tests perform worse, but within the same order of magnitude.

So I guess the question is, is the performance improvement on the new tests worth the performance drop on the old tests? That's a hard question to answer. Which test set is more like typical workloads? I don't have a good answer, at least not yet. I wonder what others think - @niemyjski , @grumpydev , @Yortw - thoughts?

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-10750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=7.0.102
  [Host]        : .NET Core 3.1.32 (CoreCLR 4.700.22.55902, CoreFX 4.700.22.56512), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.32 (CoreCLR 4.700.22.55902, CoreFX 4.700.22.56512), X64 RyuJIT

Jit=RyuJit  Runtime=.NET Core 3.1

Method Job Platform Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Original_DeepHierarchy .NET Core 3.1 X64 2,075.9 ms 40.44 ms 35.85 ms 238000.0000 1000.0000 - 1493174640 B
New_DeepHierarchy .NET Core 3.1 X64 163.3 ms 2.90 ms 3.56 ms 15666.6667 - - 98616792 B
Original_AutoRegister .NET Core 3.1 X64 10.56 ms 0.137 ms 0.121 ms 1.00 0.00 187.5000 62.5000
New_AutoRegister .NET Core 3.1 X64 11.04 ms 0.206 ms 0.161 ms 1.05 0.02 187.5000 93.7500
Original_Resolve_Singleton .NET Core 3.1 X64 127.8 ns 2.50 ns 2.22 ns 1.00 0.00 0.0203 -
New_Resolve_Singleton .NET Core 3.1 X64 197.8 ns 1.96 ns 1.74 ns 1.55 0.03 0.0892 -
Original_Resolve_Instance_Without_Dependencies .NET Core 3.1 X64 307.6 ns 4.62 ns 3.61 ns 1.00 0.00 0.0367 -
New_Resolve_Instance_Without_Dependencies .NET Core 3.1 X64 656.2 ns 6.21 ns 5.51 ns 2.13 0.02 0.2174 -
Original_Resolve_Instance_With_Singleton_Dependency .NET Core 3.1 X64 725.8 ns 8.09 ns 7.57 ns 1.00 0.00 0.1097 -
New_Resolve_Instance_With_Singleton_Dependency .NET Core 3.1 X64 1,058.6 ns 15.99 ns 14.96 ns 1.46 0.03 0.2899 -
Original_Resolve_OpenGeneric .NET Core 3.1 X64 704.1 ns 4.57 ns 3.57 ns 1.00 0.00 0.1526 -
New_Resolve_OpenGeneric .NET Core 3.1 X64 1,060.2 ns 14.78 ns 13.10 ns 1.50 0.02 0.3338 -
Original_Resolve_Unregistered .NET Core 3.1 X64 724.4 ns 14.32 ns 21.43 ns 1.00 0.00 0.1078 -
New_Resolve_Unregistered .NET Core 3.1 X64 983.5 ns 5.82 ns 5.16 ns 1.35 0.05 0.2193 -
Original_Resolve_AutomaticFuncFactory .NET Core 3.1 X64 409.4 ns 4.73 ns 4.19 ns 1.00 0.00 0.0672 -
New_Resolve_AutomaticFuncFactory .NET Core 3.1 X64 474.7 ns 6.25 ns 5.85 ns 1.16 0.02 0.1364 -
```

@zaddo67

zaddo67 commented Mar 25, 2023

These results are very concerning. I use TinyIoC in game development and my projects don't have deep hierarchies. The Resolve_Instance_Without_Dependencies result is the most concerning: due to the high frequency with which this is called in my projects, it would most likely impact frame rate.

Selfishly, I hope this change does not get merged. One shoe does not fit all. Perhaps this could be made a feature that can be turned on for projects that could benefit?

@JoshuaRogers
Author

That makes sense. For us, this is used in our web server with DeepHierarchy being the most common case, making TinyIoC the source of most of our time.

I believe that most of the degradation could be avoided by initializing all of the objects I added up front. The only reason I didn't was that doing so removed my ability to add caching for named resolution (a feature that I don't actually use myself). If we were okay with named resolution not seeing any increase in speed for deep hierarchies, I could make that change instead.

@Yortw
Contributor

Yortw commented Mar 26, 2023

Hi all,

I'm coming to this late, and haven't actually looked at the code yet (limited time, sorry). I have read the thread here and looked at the benchmarks posted, which all seem like good information.

I think my use case sits between that of @JoshuaRogers and @zaddo67 - I use TinyIoC mostly in native client apps (Xamarin Forms, WinForms, sometimes console apps, etc). For the Xamarin apps, I have some that run on very underpowered ARM devices where hardware choice is limited and time between screens is a major UX factor, so milliseconds do matter. I tend not to use IoC for classes that have no dependencies, but I have a lot of singletons (or effectively singletons within a scope, though my scopes probably last a lot longer than a typical HTTP request), so I'm concerned about the degradation there. I also share @zaddo67's concern about the resolution of items with no dependencies: I suspect (with no evidence) that's pretty common and not a case we'd like to degrade. On that basis I'm worried about merging the code here as is.

That said, the web case is definitely a common one too, and the improvement in performance for the deep hierarchy handling is significant, so I think it's worth solving. I also just think the caching idea is a good one. I thought I'd done some of this in my own PR for performance improvements (possibly just at a different level?).

For my 2 cents' worth, I would like to re-evaluate this statement from earlier in the thread:

> My first approach was to store the value within TinyIoC so that separate calls to Resolve or ResolveAll would be able to use a common cache. This turned out to be impractical, though, since calls to these methods can specify extra parameters that affect resolution, such as named resolution. Using a global cache would require either that named resolutions were not cached or that the logic be complicated to cache them while preventing a memory leak.

I stand to be proven wrong but I think that if the cache persisted across multiple calls then probably all the calls would be faster (at least in the benchmark, but probably most of them in the real world too), with the noted exception of the named arguments. We'd also potentially eliminate/reduce the issues raised by @niemyjski around the dictionary resizing/lifetime etc. Given the comments above, the easy path would seem to be to cache everything that isn't named (or ignore the cache when a named parameter is present, I'm not sure which since I haven't dived into the code yet).

Obviously this leaves the original problem of the named arguments not benefitting, but:

  • Maybe we just can't have everything, and this is still the best compromise?
  • I'm sure named arguments are excellent if you need them, and I don't suggest dropping them, but they seem like a code smell to me - I don't use them, and I'd like to think their use is rare across the TinyIoC user base. I don't know that it is, but if someone else does, or we can assume that, this still seems like the best compromise?
  • Named arguments seem like an "override" or "special case", at which point you wouldn't expect caching/might expect some extra overhead. On that basis, if they simply don't benefit but don't get worse (or significantly worse) then that doesn't seem like a problem? Especially if no one is currently complaining about their performance?

So my vote would be to change the code to use a shared cache (per container probably), ignoring it for the named resolution cases, and repeat the benchmarks to see what the outcome is. If the old benchmarks are the same or better, and the new ones are also good I think that's a win even despite the named arguments issue. It's possible if that happens even my use case would gain some performance benefits, whereas with the "cache per call" as it is now it seems to just be adding overhead.
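The shared-cache variant proposed here could look something like this (a hypothetical sketch of the idea, not TinyIoC code; the `compute` delegate stands in for whatever constructibility test the container already performs):

```csharp
using System;
using System.Collections.Concurrent;

class PerContainerResolutionCache
{
    readonly ConcurrentDictionary<Type, bool> _canResolve =
        new ConcurrentDictionary<Type, bool>();

    public bool CanResolve(Type type, string name, Func<Type, bool> compute)
    {
        // Named resolutions can select different registrations, so bypass the
        // cache entirely rather than trying to key on (type, name) pairs.
        if (!string.IsNullOrEmpty(name))
            return compute(type);

        return _canResolve.GetOrAdd(type, compute);
    }

    // Any registration change can invalidate previous answers, so the
    // container would need to clear the cache whenever registrations mutate.
    public void Invalidate() => _canResolve.Clear();
}
```

One cache would live for the container's lifetime, so repeated Resolve calls would benefit as well, at the cost of needing invalidation on re-registration.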

@Yortw
Contributor

Yortw commented Mar 26, 2023

Also, really exciting to see the benchmark stuff I submitted being used to analyse and potentially prevent regressions. Thanks everyone for the effort being put in here.

@JoshuaRogers
Author

That works for me: I don't use named arguments either. I was originally including them for completeness' sake (though at the cost of added complexity).

I’ll go update my PR to do one cache per container and to bypass it when using named arguments.

As for not replacing TinyIoCOriginal.cs, I misunderstood “original” in that context to mean that I shouldn’t touch it (as then it would no longer be original, but instead modified.) I didn’t realize the intent was to copy the current TinyIoC.cs onto it before the test. :p I’ll make sure to update it when running the next set of benchmarks.

@gkarabin
Contributor

I'm not overly worried about the named resolution use case. I do think that named resolution is a pretty common case - I don't think it would have made the short list of TinyIoC features if it wasn't.

My team uses it in our applications, but not in performance sensitive contexts - it tends to be at the beginning of a long-duration scope for identifying a subset of available implementations to use for a particular mode of operation, and not for a lot of resolutions at that.

If it doesn't get much slower, especially compared to the state of things before @Yortw's improvements, I don't see a problem.

@niemyjski
Collaborator

niemyjski commented Mar 27, 2023

I really appreciate the collaboration and help around this. I'm not a real power user of this library, just here to try and help maintain it/provide feedback where I have experience. I mainly use it for simple singletons in the https://github.com/exceptionless/Exceptionless.Net library. I think we should try to avoid hurting the performance of the simplest of use cases that probably will have the most library usage (we have no stats for this without analyzing GitHub and your comments).
